Groonga 10.0.7 has been released

BloGroonga

2020-09-29

Groonga 10.0.7 has been released

Groonga 10.0.7 has been released!

How to install: Install

Changes

Here are important changes in this release:

[highlight], [highlight_full] Added support for normalizer options.
return code Added a new return code GRN_CONNECTION_RESET for resetting connection.
- it is returned when an existing connection was forcibly close by the remote host.
Dropped Ubuntu 19.10 (Eoan Ermine).
- Because this version has been EOL.
[httpd] Updated bundled nginx to 1.19.2.
grndb Added support for detecting duplicate keys.
- grndb check is also able to detect duplicate keys since this release.
- This check valid except a table of TABLE_NO_KEY.
- If the table that was detected duplicate keys by grndb check has only index columns, we can recover by grndb recover.
[table_create], [column_create] Added a new option --path.
[dump] Added a new option --dump_paths.
Added a new function string_toknize().
- It tokenizes the column value that is specified in the second argument with the tokenizer that is specified in the first argument.
[tokenizer] Added a new tokenizer TokenDocumentVectorTFIDF (experimental).
- It generates automatically document vector by TF-IDF.
[tokenizer] Added a new tokenizer TokenDocumentVectorBM25 (experimental).
- It generates automatically document vector by BM25.
[select] Added support for near search in same sentence.
Fixed a bug that load didn't a return response when we executed it against 257 columns.
- This bug may occur from 10.0.4 or later.
- This bug only occur when we load data by using [a, b, c, ...] format.
  - If we load data by using [{...}], this bug doesn't occur.
[MessagePack] Fixed a bug that float32 value isn't be unpacked correctly.
Fixed the following bugs related multi column index.
- _score may be broken with full text search.
- The records that couldn't hit might hit.

[highlight], highlight_full Added support for normalizer options

We can also specify normalizer options into highlight() and highlight_full().
Please refer to the following about possible options to set.
- https://groonga.org/docs/reference/normalizers/normalizer_nfkc100.html#parameters

For example, we can identify hyphen that has different code point by using unify_hyphen.

table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText

load --table Entries
[
{"body": "full-text-search. Use U+002D HYPHEN-MINUS"},
{"body": "full֊text֊search. Use U+058A ARMENIAN HYPHEN"},
{"body": "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN"}
]

select Entries --output_columns \
  'highlight_full(body, \
                  "NormalizerNFKC121(\\"unify_hyphen\\", true)", \
                  true, \
                  "full-text-search", \
                  "<span class=\\"keyword1\\">", \
                  "</span>")' --output-pretty yes
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        3
      ],
      [
        [
	  "highlight_full",
	  null
	]
      ],
      [
        "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS"
      ],
      [
        "<span class=\"keyword1\">full֊text֊search</span>. Use U+058A ARMENIAN HYPHEN"
      ],
      [
        "<span class=\"keyword1\">full˗text˗search</span>. Use U+02D7 MODIFIER LETTER MINUS SIGN"
      ]
    ]
  ]
]

If we don't specify unify_hyphen option, {"body": "full-text-search. Use U+002D HYPHEN-MINUS"} is only highlighted as below.

Because the other record different code point from the hyphen that is included the search keyword.

select Entries --output_columns \
  'highlight_full(body, \
                  "NormalizerNFKC121()", \
                  true, \
                  "full-text-search", \
                  "<span class=\\"keyword1\\">", \
                  "</span>")'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
    3
  ],
  [
    [
      "highlight_full",
      null
    ]
  ],
  [
    "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS"
  ],
  [
    "full֊text֊search. Use U+058A ARMENIAN HYPHEN"
  ],
  [
    "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN"
  ]
    ]
  ]
]

table_create, column_create Added a new option `--path`.

We can store specified a table or a column to any path using this option.
This option is useful if we want to store a table or a column that we often use to fast storage (e.g. SSD) and store them that we don't often use to slow storage (e.g. HDD).
We can specify both relative path and absolute path in this option.
- If we specify relative path in this option, the path is resolved the path of groonga process as the origin.
However, if we specify --path, the result of dump command includes --path informations.
- Therefore, if we specify --path, we can't restore to host in different enviroment.
- If we don't want include --path informations to a dump, we need specify --dump_paths no in dump command.

dump Added a new option `--dump_paths`.

--dump_paths option control whether --path is dumped or not.
The default value of it is yes.
If we specify --path when we create tables or columns and we don't want include --path informations to a dump, we specify no into --dump_paths when we execute dump command.

select Added support for near search in same sentence.

the near search can't search in the same sentence until now.

It can search in the same sentence as below from this release.

table_create Memos TABLE_PAT_KEY ShortText
column_create Memos content COLUMN_SCALAR ShortText

table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content

load --table Memos
[
{"_key":"alphabets1", "content": "a c d ."},
{"_key":"alphabets2", "content": "a b c d e f ."},
{"_key":"alphabets3", "content": "a b x c d e f ."},
{"_key":"alphabets4", "content": "a b x x c d e f ."}
]

select \
  --table Memos \
  --match_columns content \
  --query '*NP3,-1"a c .$"' \
  --output_columns _score,_key,content
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        2
      ],
      [
        [
          "_score",
          "Int32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "content",
          "ShortText"
        ]
      ],
      [
        1,
        "alphabets1",
        "a c d ."
      ],
      [
        1,
        "alphabets2",
        "a b x c ."
      ]
    ]
  ]
]

We use the following syntax for using near-search in the same sentence.
- '"NP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL}"'${FIRST_PHRASE},${LASR_PHRASE} ${SEPARATOR}$
  - If we specify -1 into ${ADDITIONAL_LAST_INTERVAL}, a record that the interval the first phrase and the last phrase less or equal than ${MAX_INTERVAL} hit.
    - In this case, however much the phrase and the separator are apart from each other, it hit.
  - If we specify an integer not smaller than 1 into ${ADDITIONAL_LAST_INTERVAL}, the record of the following conditions hit.
    - The interval of the first phrase and the last phrase less or equal than ${MAX_INTERVAL}.
    - The interval of the first phrase and the separator less or equal than ${MAX_INTERVAL}+${ADDITIONAL_LAST_INTERVAL}.
  - If we specify an integer not smaller than 0 into ${ADDITIONAL_LAST_INTERVAL}, the near-search same behavior as before.
    - The default value of ${ADDITIONAL_LAST_INTERVAL} is ０.
- We can specify any character in ${SEPARATOR}.

Fixed the following bugs related multi column index.

_score may be broken with full text search.
The records that couldn't hit might hit.
- For example, if we execute the following query, this bug occur.
```
select TABLE \
  --match_columns 'LEXICON.INDEX[10]' \
  --query 'XXX' \
  --output_columns _score
```
  - In this case, LEXICON.INDEX[0] - LEXICON.INDEX[9] were not initialized.
    - Groonga decide indexed of search target by whether the value of each sections is 0 or not.
    - Therefore, if LEXICON.INDEX[0] - LEXICON.INDEX[9] were not initialized, Groonga may choose wrong target.
      - Because the values of LEXICON.INDEX[0] - LEXICON.INDEX[9] are indefinite.
    - In addition, these values are used weight of score.
      - Therefore, the value of _score are also indefinite.
- However, if we execute the following query, this bug doesn't occur. Because areaes of indefinite don't occur.
```
select TABLE \
  --match_columns 'LEXICON.INDEX[0]' \
  --query 'XXX' \
  --output_columns _score
```
- In other words, it occurred in the following situations.
  - If we specified section as below when there is a multi column index that has source columns a, b, and c.
- We specified a and c.
- We specified b and c.
- We only specified b.
- We only specified c.
  - However, It doesn't occurred in the following situations. Because areaes of indefinite don't occur in the following situations.
- We specify by filling space from the first section as below.
  - We specified a, b, and c.
  - We specified a and b.
  - We only specified a.

Conclusion

Let's search by Groonga!

BloGroonga

Groonga 10.0.7 has been released

Changes

[highlight], highlight_full Added support for normalizer options

table_create, column_create Added a new option `--path`.

dump Added a new option `--dump_paths`.

select Added support for near search in same sentence.

Fixed the following bugs related multi column index.

Conclusion

Links

The Latest Posts

BloGroonga

Groonga 10.0.7 has been released

Changes

[highlight], highlight_full Added support for normalizer options

table_create, column_create Added a new option --path.

dump Added a new option --dump_paths.

select Added support for near search in same sentence.

Fixed the following bugs related multi column index.

Conclusion

Links

The Latest Posts

table_create, column_create Added a new option `--path`.

dump Added a new option `--dump_paths`.