BloGroonga

2020-09-29

Groonga 10.0.7 has been released

Groonga 10.0.7 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • [highlight], [highlight_full] Added support for normalizer options.

  • return code Added a new return code GRN_CONNECTION_RESET for resetting connection.

    • It is returned when an existing connection is forcibly closed by the remote host.
  • Dropped Ubuntu 19.10 (Eoan Ermine).

    • Because this version has reached EOL.
  • [httpd] Updated bundled nginx to 1.19.2.

  • grndb Added support for detecting duplicate keys.

    • grndb check is also able to detect duplicate keys since this release.
    • This check is valid for every table except TABLE_NO_KEY tables.
    • If a table in which grndb check detected duplicate keys has only index columns, we can recover it with grndb recover.
  • [table_create], [column_create] Added a new option --path.

  • [dump] Added a new option --dump_paths.

  • Added a new function string_tokenize().

    • It tokenizes the column value that is specified in the second argument with the tokenizer that is specified in the first argument.
  • [tokenizer] Added a new tokenizer TokenDocumentVectorTFIDF (experimental).

    • It automatically generates a document vector by TF-IDF.
  • [tokenizer] Added a new tokenizer TokenDocumentVectorBM25 (experimental).

    • It automatically generates a document vector by BM25.
  • [select] Added support for near search in same sentence.

  • Fixed a bug that load didn't return a response when we executed it against 257 columns.

    • This bug may occur in 10.0.4 or later.
    • This bug only occurs when we load data by using the [a, b, c, ...] format.

      • If we load data by using [{...}], this bug doesn't occur.
  • [MessagePack] Fixed a bug that float32 values weren't unpacked correctly.

  • Fixed the following bugs related to multi column index.

    • _score may be broken with full text search.
    • Records that shouldn't hit might hit.

[highlight], [highlight_full] Added support for normalizer options

  • We can now specify normalizer options in highlight() and highlight_full().
  • Please refer to the following for the options that we can set.

    • https://groonga.org/docs/reference/normalizers/normalizer_nfkc100.html#parameters
  • For example, by using unify_hyphen, we can treat hyphens that have different code points as the same character.

    table_create Entries TABLE_NO_KEY
    column_create Entries body COLUMN_SCALAR ShortText
    
    load --table Entries
    [
    {"body": "full-text-search. Use U+002D HYPHEN-MINUS"},
    {"body": "full֊text֊search. Use U+058A ARMENIAN HYPHEN"},
    {"body": "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN"}
    ]
    
    select Entries --output_columns \
      'highlight_full(body, \
                      "NormalizerNFKC121(\\"unify_hyphen\\", true)", \
                      true, \
                      "full-text-search", \
                      "<span class=\\"keyword1\\">", \
                      "</span>")' --output-pretty yes
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            3
          ],
          [
            [
              "highlight_full",
              null
            ]
          ],
          [
            "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS"
          ],
          [
            "<span class=\"keyword1\">full֊text֊search</span>. Use U+058A ARMENIAN HYPHEN"
          ],
          [
            "<span class=\"keyword1\">full˗text˗search</span>. Use U+02D7 MODIFIER LETTER MINUS SIGN"
          ]
        ]
      ]
    ]
    
  • If we don't specify the unify_hyphen option, only {"body": "full-text-search. Use U+002D HYPHEN-MINUS"} is highlighted, as below.

    • Because the hyphens in the other records have different code points from the hyphen included in the search keyword.
    select Entries --output_columns \
      'highlight_full(body, \
                      "NormalizerNFKC121()", \
                      true, \
                      "full-text-search", \
                      "<span class=\\"keyword1\\">", \
                      "</span>")'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            3
          ],
          [
            [
              "highlight_full",
              null
            ]
          ],
          [
            "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS"
          ],
          [
            "full֊text֊search. Use U+058A ARMENIAN HYPHEN"
          ],
          [
            "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN"
          ]
        ]
      ]
    ]
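
The effect of unify_hyphen on highlighting can be pictured with a small Python sketch. This is not how Groonga implements it. The HYPHEN_LIKE set contains only the two code points from the records above (an assumption made for illustration, not Groonga's real table), and highlight_first is a hypothetical helper that marks only the first match.

```python
# Hyphen-like code points taken from the example records above.
# Assumption: illustration only, not Groonga's real unify_hyphen table.
HYPHEN_LIKE = {"\u058a", "\u02d7"}


def unify_hyphen(text):
    """Replace each hyphen-like character with U+002D HYPHEN-MINUS."""
    return "".join("-" if ch in HYPHEN_LIKE else ch for ch in text)


def highlight_first(body, keyword, open_tag, close_tag, unify):
    """Wrap the first occurrence of keyword in body with the given tags."""
    # The replacement is one character for one character, so offsets found
    # in the normalized text are also valid in the original text.
    normalized = unify_hyphen(body) if unify else body
    start = normalized.find(keyword)
    if start < 0:
        return body  # No match: return the text unchanged.
    end = start + len(keyword)
    return body[:start] + open_tag + body[start:end] + close_tag + body[end:]


armenian = "full\u058atext\u058asearch. Use U+058A ARMENIAN HYPHEN"
# With unification the keyword matches; without it the text stays as is.
print(highlight_first(armenian, "full-text-search", "<b>", "</b>", True))
print(highlight_first(armenian, "full-text-search", "<b>", "</b>", False))
```

The original characters are kept in the output. Only the matching step sees the unified hyphen, which mirrors the highlighted results shown above.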
    

[table_create], [column_create] Added a new option --path.

  • We can store a specified table or column at any path by using this option.

  • This option is useful if we want to store frequently used tables or columns on fast storage (e.g. SSD) and rarely used ones on slow storage (e.g. HDD).

  • We can specify either a relative path or an absolute path in this option.

    • If we specify a relative path, it is resolved with the path of the groonga process as the origin.
  • However, if we specify --path, the result of the dump command includes the --path information.

    • Therefore, if we specify --path, we can't restore the dump to a host in a different environment.
    • If we don't want to include the --path information in a dump, we need to specify --dump_paths no in the dump command.

[dump] Added a new option --dump_paths.

  • The --dump_paths option controls whether --path is dumped or not.

  • Its default value is yes.

  • If we specify --path when we create tables or columns and we don't want to include the --path information in a dump, we specify no for --dump_paths when we execute the dump command.

[select] Added support for near search in same sentence.

  • Until now, near search couldn't search within the same sentence.

  • From this release, it can search within the same sentence as below.

    table_create Memos TABLE_PAT_KEY ShortText
    column_create Memos content COLUMN_SCALAR ShortText
    
    table_create Terms TABLE_PAT_KEY ShortText \
      --default_tokenizer TokenBigram \
      --normalizer NormalizerAuto
    column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
    
    load --table Memos
    [
    {"_key":"alphabets1", "content": "a c d ."},
    {"_key":"alphabets2", "content": "a b c d e f ."},
    {"_key":"alphabets3", "content": "a b x c d e f ."},
    {"_key":"alphabets4", "content": "a b x x c d e f ."}
    ]
    
    select \
      --table Memos \
      --match_columns content \
      --query '*NP3,-1"a c .$"' \
      --output_columns _score,_key,content
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            2
          ],
          [
            [
              "_score",
              "Int32"
            ],
            [
              "_key",
              "ShortText"
            ],
            [
              "content",
              "ShortText"
            ]
          ],
          [
            1,
            "alphabets1",
            "a c d ."
          ],
          [
            1,
            "alphabets2",
            "a b c d e f ."
          ]
        ]
      ]
    ]
    
  • We use the following syntax for near search in the same sentence.

    • '*NP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL}"${FIRST_PHRASE} ${LAST_PHRASE} ${SEPARATOR}$"'

      • If we specify -1 for ${ADDITIONAL_LAST_INTERVAL}, a record hits when the interval between the first phrase and the last phrase is less than or equal to ${MAX_INTERVAL}.

        • In this case, it hits however far apart the phrases and the separator are.
      • If we specify an integer not smaller than 1 for ${ADDITIONAL_LAST_INTERVAL}, a record that satisfies both of the following conditions hits.

        • The interval between the first phrase and the last phrase is less than or equal to ${MAX_INTERVAL}.
        • The interval between the first phrase and the separator is less than or equal to ${MAX_INTERVAL}+${ADDITIONAL_LAST_INTERVAL}.
      • If we specify an integer not smaller than 0 for ${ADDITIONAL_LAST_INTERVAL}, near search behaves the same as before.

        • The default value of ${ADDITIONAL_LAST_INTERVAL} is .
    • We can specify any character in ${SEPARATOR}.
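
The two hit conditions described above can be written down as a small Python sketch. It takes the intervals as plain numbers and does not model how Groonga measures them from tokens. The function name near_same_sentence_hits and its parameters are hypothetical, and values of ${ADDITIONAL_LAST_INTERVAL} other than -1 and integers not smaller than 1 are rejected because their behavior is not spelled out here.

```python
def near_same_sentence_hits(first_to_last, first_to_separator,
                            max_interval, additional_last_interval):
    """Return True when a record hits under the conditions listed above.

    first_to_last: interval between the first phrase and the last phrase.
    first_to_separator: interval between the first phrase and the separator.
    """
    if additional_last_interval == -1:
        # Only the phrases are constrained; the separator may be anywhere.
        return first_to_last <= max_interval
    if additional_last_interval >= 1:
        # Both the phrases and the separator are constrained.
        return (first_to_last <= max_interval
                and first_to_separator <= max_interval + additional_last_interval)
    # Assumption: other values are outside the scope of this sketch.
    raise ValueError("only -1 or an integer not smaller than 1 is modeled")


# The separator is far away, but -1 ignores its distance.
print(near_same_sentence_hits(2, 100, 3, -1))
# The separator must be within 3 + 2 = 5 here.
print(near_same_sentence_hits(2, 5, 3, 2))
print(near_same_sentence_hits(2, 6, 3, 2))
```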

Fixed the following bugs related to multi column index.

  • _score may be broken with full text search.
  • Records that shouldn't hit might hit.

    • For example, if we execute the following query, this bug occurs.

      select TABLE \
        --match_columns 'LEXICON.INDEX[10]' \
        --query 'XXX' \
        --output_columns _score
      
      • In this case, LEXICON.INDEX[0] - LEXICON.INDEX[9] were not initialized.

        • Groonga decides which sections of the index are search targets by whether the value of each section is 0 or not.
        • Therefore, if LEXICON.INDEX[0] - LEXICON.INDEX[9] were not initialized, Groonga may choose the wrong target.

          • Because the values of LEXICON.INDEX[0] - LEXICON.INDEX[9] are indefinite.
        • In addition, these values are used as the weights of the score.

          • Therefore, the value of _score is also indefinite.
    • However, if we execute the following query, this bug doesn't occur, because no indefinite areas exist.

      select TABLE \
        --match_columns 'LEXICON.INDEX[0]' \
        --query 'XXX' \
        --output_columns _score
      
    • In other words, it occurred in the following situations.

      • If we specified sections as below when there is a multi column index that has source columns a, b, and c.

        • We specified a and c.
        • We specified b and c.
        • We only specified b.
        • We only specified c.
      • However, it didn't occur in the following situations, because no indefinite areas exist in them.

        • We specified sections by filling them in from the first section, as below.

          • We specified a, b, and c.
          • We specified a and b.
          • We only specified a.
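
The situations listed above follow one rule: the bug could appear whenever the specified sections skipped an earlier section. A minimal Python sketch of that rule follows. The function name leaves_uninitialized_sections is hypothetical, and the sketch only classifies selections; it does not reproduce Groonga's internals.

```python
def leaves_uninitialized_sections(selected, all_sections):
    """Return True when the selection skips an earlier section.

    selected: names of the source columns specified in the query.
    all_sections: source columns of the index, in section order.
    """
    indexes = sorted(all_sections.index(name) for name in selected)
    # A safe selection fills sections 0, 1, 2, ... without any gap.
    return indexes != list(range(len(indexes)))


sections = ["a", "b", "c"]
for selected in (["a", "c"], ["b", "c"], ["b"], ["c"],
                 ["a", "b", "c"], ["a", "b"], ["a"]):
    print(selected, leaves_uninitialized_sections(selected, sections))
```

The first four selections are reported as affected and the last three as unaffected, matching the two lists above.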

Conclusion

Let's search by Groonga!