BloGroonga

2020-12-29

Groonga 10.1.0 has been released

Groonga 10.1.0 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • highlight_html Added support for removing leading full width spaces from highlight target.

  • status Added a new item features.

  • status Added a new item apache_arrow.

  • Window function Added support for processing all tables at once even if target tables straddle a shard. (experimental)

  • Added support for sequential search against a reference column.

  • [tokenizers] Added support for a token column to TokenDocumentVectorTFIDF and TokenDocumentVectorBM25.

  • Improved performance in the following case.

    • (column @ "value") && (column @ "value")
  • Ubuntu Added support for Ubuntu 20.10 (Groovy Gorilla).

  • Debian Dropped stretch support.

  • CentOS Dropped CentOS 6.

  • [httpd] Updated bundled nginx to 1.19.6.

  • Fixed a bug that Groonga crashed when we used a drilldown with multiple keys together with multiple accessors.

  • Fixed a bug that the near phrase search did not match when the same phrase occurs multiple times.

Conclusion

Please refer to the following news for more details.

News Release 10.1.0

Let's search by Groonga!

2020-12-01

Groonga 10.0.9 has been released

Groonga 10.0.9 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • select Improved performance when we specify -1 for limit.

  • reference_acquire Added a new option --auto_release_count.

  • Modified the behavior when Groonga evaluates an empty vector or uvector.

    • An empty vector or uvector is evaluated as false in command version 3.
  • Normalizers Added a new Normalizer NormalizerNFKC130 based on Unicode NFKC (Normalization Form Compatibility Composition) for Unicode 13.0.

  • Token filters Added a new TokenFilter TokenFilterNFKC130 based on Unicode NFKC (Normalization Form Compatibility Composition) for Unicode 13.0.

  • select Improved performance for "_score = column - X".

  • reference_acquire Improved so that it doesn't acquire unnecessary references to index columns when we specify the --recursive dependent option.

  • select Added support for ordered near phrase search.

    • Until now, near phrase search only looked for records in which the specified phrases are near each other.
    • This feature looks for records that satisfy the following conditions.

      • The specified phrases are near each other.
      • The specified phrases appear in the specified order.
  • [httpd] Updated bundled nginx to 1.19.5.

  • Groonga HTTP server Fixed a bug that the Groonga HTTP server exited without waiting for all worker threads to finish completely.
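
The two conditions of ordered near phrase search can be sketched in Python as follows. This is only an illustration and not Groonga's implementation: the function name near_phrase_match is hypothetical, each phrase is assumed to be a single token, and the "distance" is assumed to be the difference between the largest and the smallest token positions.

```python
from itertools import product


def near_phrase_match(tokens, phrases, max_interval, ordered=False):
    """Return True if all phrases occur near each other in tokens.

    Assumptions for this sketch: every phrase is a single token, and
    the distance is the largest position minus the smallest position.
    """
    if not phrases:
        return False
    # Collect every position at which each phrase occurs.
    positions = [[i for i, t in enumerate(tokens) if t == p] for p in phrases]
    for combo in product(*positions):
        # Condition 1: the phrases are near each other.
        if max(combo) - min(combo) > max_interval:
            continue
        # Condition 2 (ordered search only): the phrases appear in the
        # same order as they were specified.
        if ordered and not all(x < y for x, y in zip(combo, combo[1:])):
            continue
        return True
    return False
```

With the tokens of "a b c d e", the phrases c and a match an unordered search with a maximum interval of 3, but they match an ordered search only when they are specified as a and then c.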

Conclusion

Please refer to the following news for more details.

News Release 10.0.9

Let's search by Groonga!

2020-11-10

PGroonga (fast full text search module for PostgreSQL) 2.2.7 has been released

PGroonga 2.2.7 has been released! PGroonga makes PostgreSQL a fast full text search platform for all languages.

If you are a new user, see also About PGroonga.

Highlight

Here are the highlights of PGroonga 2.2.7:

  • Provided packages for PostgreSQL 13.

    • Earlier versions already supported PostgreSQL 13. However, we didn't provide packages for PostgreSQL 13.

    • In this release, we have started providing them.

  • [Windows] Upgraded bundled Groonga to 10.0.8.

  • [Ubuntu], [Debian] Fixed a bug that WAL support was disabled.

  • Fixed a bug that PGroonga might crash when PostgreSQL wrote WAL.

    • It only occurs in environments where pgroonga.enable_wal is enabled.

How to upgrade

This version is compatible with previous versions. You can upgrade by following the steps in "Compatible case" in the Upgrade document.

Announce

Session

This session is for people who have already used PGroonga's WAL or will try to use it in the future.

Conclusion

Try PGroonga when you want to perform fast full text search against all languages on PostgreSQL!

2020-10-29

Groonga 10.0.8 has been released

Groonga 10.0.8 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • select Added support for large drilldown keys.

  • select Added support for treating dynamic columns as the same column even if they refer to different tables.

  • select Improved performance when the search result contains a huge number of records.

  • Updated bundled LZ4 to 1.9.2 from 1.8.2.

  • Added support for xxHash 0.8.

  • [httpd] Updated bundled nginx to 1.19.4.

  • Fixed the following bugs related to the browser-based administration tool.

  • between Fixed a bug that between(_key, ...) was always evaluated by sequential search.

Conclusion

Please refer to the following news for more details.

News Release 10.0.8

Let's search by Groonga!

2020-09-29

Groonga 10.0.7 has been released

Groonga 10.0.7 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • [highlight], [highlight_full] Added support for normalizer options.

  • return code Added a new return code GRN_CONNECTION_RESET for resetting connection.

    • It is returned when an existing connection is forcibly closed by the remote host.
  • Dropped Ubuntu 19.10 (Eoan Ermine).

    • Because this version has reached EOL.
  • [httpd] Updated bundled nginx to 1.19.2.

  • grndb Added support for detecting duplicate keys.

    • From this release, grndb check can also detect duplicate keys.
    • This check is valid for all tables except TABLE_NO_KEY tables.
    • If a table in which grndb check detected duplicate keys has only index columns, we can recover it with grndb recover.
  • [table_create], [column_create] Added a new option --path.

  • [dump] Added a new option --dump_paths.

  • Added a new function string_tokenize().

    • It tokenizes the column value that is specified in the second argument with the tokenizer that is specified in the first argument.
  • [tokenizer] Added a new tokenizer TokenDocumentVectorTFIDF (experimental).

    • It automatically generates a document vector by TF-IDF.
  • [tokenizer] Added a new tokenizer TokenDocumentVectorBM25 (experimental).

    • It automatically generates a document vector by BM25.
  • [select] Added support for near search in the same sentence.

  • Fixed a bug that load didn't return a response when we executed it against 257 columns.

    • This bug may occur in 10.0.4 or later.
    • This bug only occurs when we load data by using the [a, b, c, ...] format.

      • If we load data by using the [{...}] format, this bug doesn't occur.
  • [MessagePack] Fixed a bug that a float32 value wasn't unpacked correctly.

  • Fixed the following bugs related to multi-column indexes.

    • _score may be broken with full text search.
    • The records that couldn't hit might hit.
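
As background for the TokenDocumentVectorTFIDF entry above, the following Python sketch shows one common way to build a document vector by TF-IDF. It is not how Groonga computes it: the function name tfidf_vectors is hypothetical, and whitespace tokenization, raw counts for tf, and log(N / df) for idf are all assumptions made for this illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {token: weight} dict per document.

    Assumptions for this sketch: tokens are separated by whitespace,
    tf is the raw count, and idf is log(N / df). Real implementations
    often use other variants of these formulas.
    """
    tokenized = [doc.split() for doc in documents]
    # Document frequency: the number of documents that contain each token.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    n_docs = len(documents)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors
```

In this sketch, a token that appears in every document gets weight 0, and a token that appears often in only one document gets a large weight.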

[highlight], [highlight_full] Added support for normalizer options

  • We can now specify normalizer options in highlight() and highlight_full().
  • Please refer to the following for the options that we can set.

    • https://groonga.org/docs/reference/normalizers/normalizer_nfkc100.html#parameters
  • For example, we can treat hyphens that have different code points as the same character by using unify_hyphen.

    table_create Entries TABLE_NO_KEY
    column_create Entries body COLUMN_SCALAR ShortText
    
    load --table Entries
    [
    {"body": "full-text-search. Use U+002D HYPHEN-MINUS"},
    {"body": "full֊text֊search. Use U+058A ARMENIAN HYPHEN"},
    {"body": "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN"}
    ]
    
    select Entries --output_columns \
      'highlight_full(body, \
                      "NormalizerNFKC121(\\"unify_hyphen\\", true)", \
                      true, \
                      "full-text-search", \
                      "<span class=\\"keyword1\\">", \
                      "</span>")' --output-pretty yes
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            3
          ],
          [
            [
              "highlight_full",
              null
            ]
          ],
          [
            "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS"
          ],
          [
            "<span class=\"keyword1\">full֊text֊search</span>. Use U+058A ARMENIAN HYPHEN"
          ],
          [
            "<span class=\"keyword1\">full˗text˗search</span>. Use U+02D7 MODIFIER LETTER MINUS SIGN"
          ]
        ]
      ]
    ]
    
  • If we don't specify the unify_hyphen option, only {"body": "full-text-search. Use U+002D HYPHEN-MINUS"} is highlighted, as below.

    • This is because the hyphens in the other records have different code points from the hyphen in the search keyword.

    select Entries --output_columns \
      'highlight_full(body, \
                      "NormalizerNFKC121()", \
                      true, \
                      "full-text-search", \
                      "<span class=\\"keyword1\\">", \
                      "</span>")'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            3
          ],
          [
            [
              "highlight_full",
              null
            ]
          ],
          [
            "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS"
          ],
          [
            "full֊text֊search. Use U+058A ARMENIAN HYPHEN"
          ],
          [
            "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN"
          ]
        ]
      ]
    ]
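
The effect of unify_hyphen in the two results above can be imitated with a small Python sketch. This is not Groonga's implementation: the function name highlight and the HYPHEN_LIKE table are hypothetical, the table covers only the two characters from the example, and only the first occurrence of the keyword is highlighted.

```python
# Illustrative subset of hyphen-like characters. This is an assumption,
# not the list that Groonga's unify_hyphen option actually uses.
HYPHEN_LIKE = {
    "\u058a": "-",  # ARMENIAN HYPHEN
    "\u02d7": "-",  # MODIFIER LETTER MINUS SIGN
}
UNIFY_HYPHEN = str.maketrans(HYPHEN_LIKE)


def highlight(text, keyword, open_tag, close_tag, unify_hyphen=False):
    """Wrap the first occurrence of keyword in text with the given tags."""
    # Search in a normalized copy, but slice the original text so that
    # the original characters are kept in the output. The mapping is
    # one character to one character, so the offsets stay the same.
    target = text.translate(UNIFY_HYPHEN) if unify_hyphen else text
    start = target.find(keyword)
    if start < 0:
        return text
    end = start + len(keyword)
    return text[:start] + open_tag + text[start:end] + close_tag + text[end:]
```

If we call highlight() with unify_hyphen=True, the keyword full-text-search also matches text that uses U+058A or U+02D7. Without it, only text that uses U+002D matches and other text is returned unchanged.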
    

[table_create], [column_create] Added a new option --path.

  • We can store a table or a column at any path by using this option.

  • This option is useful if we want to store tables or columns that we use often on fast storage (e.g. SSD) and those that we don't use often on slow storage (e.g. HDD).

  • We can specify either a relative path or an absolute path in this option.

    • If we specify a relative path, it is resolved relative to the path of the groonga process.
  • However, if we specify --path, the result of the dump command includes the --path information.

    • Therefore, if we specify --path, we can't restore the dump on a host with a different environment.
    • If we don't want to include the --path information in a dump, we need to specify --dump_paths no in the dump command.

[dump] Added a new option --dump_paths.

  • The --dump_paths option controls whether --path is dumped or not.

  • Its default value is yes.

  • If we specify --path when we create tables or columns and we don't want to include the --path information in a dump, we specify no for --dump_paths when we execute the dump command.

[select] Added support for near search in the same sentence.

  • Until now, near search couldn't be limited to the same sentence.

  • From this release, it can search within the same sentence as below.

    table_create Memos TABLE_PAT_KEY ShortText
    column_create Memos content COLUMN_SCALAR ShortText
    
    table_create Terms TABLE_PAT_KEY ShortText \
      --default_tokenizer TokenBigram \
      --normalizer NormalizerAuto
    column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
    
    load --table Memos
    [
    {"_key":"alphabets1", "content": "a c d ."},
    {"_key":"alphabets2", "content": "a b c d e f ."},
    {"_key":"alphabets3", "content": "a b x c d e f ."},
    {"_key":"alphabets4", "content": "a b x x c d e f ."}
    ]
    
    select \
      --table Memos \
      --match_columns content \
      --query '*NP3,-1"a c .$"' \
      --output_columns _score,_key,content
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            2
          ],
          [
            [
              "_score",
              "Int32"
            ],
            [
              "_key",
              "ShortText"
            ],
            [
              "content",
              "ShortText"
            ]
          ],
          [
            1,
            "alphabets1",
            "a c d ."
          ],
          [
            1,
            "alphabets2",
            "a b c d e f ."
          ]
        ]
      ]
    ]
    
  • We use the following syntax for near search in the same sentence.

    • '*NP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL}"${FIRST_PHRASE} ${LAST_PHRASE} ${SEPARATOR}$"'

      • If we specify -1 for ${ADDITIONAL_LAST_INTERVAL}, a record hits when the interval between the first phrase and the last phrase is less than or equal to ${MAX_INTERVAL}.

        • In this case, the record hits no matter how far apart the phrases and the separator are.
      • If we specify an integer not smaller than 1 for ${ADDITIONAL_LAST_INTERVAL}, a record hits when it satisfies the following conditions.

        • The interval between the first phrase and the last phrase is less than or equal to ${MAX_INTERVAL}.
        • The interval between the first phrase and the separator is less than or equal to ${MAX_INTERVAL}+${ADDITIONAL_LAST_INTERVAL}.
      • If we specify an integer not smaller than 0 for ${ADDITIONAL_LAST_INTERVAL}, near search behaves the same as before.

        • The default value of ${ADDITIONAL_LAST_INTERVAL} is .
    • We can specify any character in ${SEPARATOR}.
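
The interval conditions above can be written down as a small Python sketch. This is only an illustration and not Groonga's implementation: the function name same_sentence_hit is hypothetical, it takes token positions that are already known, and an "interval" is assumed to be the difference between two token positions.

```python
def same_sentence_hit(first_pos, last_pos, separator_pos,
                      max_interval, additional_last_interval):
    """Apply the interval conditions described above to token positions.

    Assumption for this sketch: an interval is the difference between
    two token positions.
    """
    phrases_are_near = last_pos - first_pos <= max_interval
    if additional_last_interval == -1:
        # The distance to the separator is not checked at all.
        return phrases_are_near
    separator_is_near = (
        separator_pos - first_pos <= max_interval + additional_last_interval
    )
    return phrases_are_near and separator_is_near
```

For example, in "a b c d e f ." the tokens a, c and . are at positions 0, 2 and 6. With a maximum interval of 3, this sketch reports a hit for an additional last interval of -1 or 3, but not for 1, because the separator is then too far from the first phrase.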

Fixed the following bugs related to multi-column indexes.

  • _score may be broken with full text search.
  • The records that couldn't hit might hit.

    • For example, this bug occurred if we executed the following query.

      select TABLE \
        --match_columns 'LEXICON.INDEX[10]' \
        --query 'XXX' \
        --output_columns _score
      
      • In this case, LEXICON.INDEX[0] - LEXICON.INDEX[9] were not initialized.

        • Groonga decides which indexes are search targets by whether the value of each section is 0 or not.
        • Therefore, if LEXICON.INDEX[0] - LEXICON.INDEX[9] were not initialized, Groonga might choose wrong targets.

          • Because the values of LEXICON.INDEX[0] - LEXICON.INDEX[9] are undefined.
        • In addition, these values are used as weights for the score.

          • Therefore, the value of _score is also undefined.
    • However, this bug didn't occur if we executed the following query, because no uninitialized areas arise.

      select TABLE \
        --match_columns 'LEXICON.INDEX[0]' \
        --query 'XXX' \
        --output_columns _score
      
    • In other words, it occurred in the following situations.

      • If we specified sections as below when there is a multi-column index that has source columns a, b, and c.

        • We specified a and c.
        • We specified b and c.
        • We only specified b.
        • We only specified c.
    • However, it didn't occur in the following situations, because no uninitialized areas arise in them.

      • We specified sections by filling them from the first section without gaps, as below.

        • We specified a, b, and c.
        • We specified a and b.
        • We only specified a.
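
The rule that separates these two groups of situations can be sketched in Python as follows. This is only an illustration of the rule as described above, not Groonga's code, and the function name leaves_uninitialized_sections is hypothetical.

```python
def leaves_uninitialized_sections(source_columns, specified):
    """Return True if the specified columns skip an earlier section.

    Sketch of the rule above: the bug could occur unless the specified
    sections were filled from the first section without gaps.
    """
    sections = sorted(source_columns.index(name) for name in specified)
    # A gap-free selection that starts at the first section is 0, 1, 2, ...
    return sections != list(range(len(sections)))
```

For source columns a, b, and c, this sketch returns True for "a and c", "b and c", "only b" and "only c", and False for "a, b, and c", "a and b" and "only a".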

Conclusion

Let's search by Groonga!