Groonga 10.0.7 has been released
Groonga 10.0.7 has been released!
How to install: Install
Changes
Here are important changes in this release:
-
[highlight], [highlight_full] Added support for normalizer options.
-
return code Added a new return code
GRN_CONNECTION_RESET
for resetting connection.- it is returned when an existing connection was forcibly close by the remote host.
-
Dropped Ubuntu 19.10 (Eoan Ermine).
- Because this version has been EOL.
-
[httpd] Updated bundled nginx to 1.19.2.
-
grndb Added support for detecting duplicate keys.
grndb check
is also able to detect duplicate keys since this release.- This check valid except a table of
TABLE_NO_KEY
. - If the table that was detected duplicate keys by
grndb check
has only index columns, we can recover bygrndb recover
.
-
Added a new function
string_toknize()
.- It tokenizes the column value that is specified in the second argument with the tokenizer that is specified in the first argument.
-
[tokenizer] Added a new tokenizer
TokenDocumentVectorTFIDF
(experimental).- It generates automatically document vector by TF-IDF.
-
[tokenizer] Added a new tokenizer
TokenDocumentVectorBM25
(experimental).- It generates automatically document vector by BM25.
-
Fixed a bug that
load
didn't a return response when we executed it against 257 columns.- This bug may occur from 10.0.4 or later.
-
This bug only occur when we load data by using
[a, b, c, ...]
format.- If we load data by using
[{...}]
, this bug doesn't occur.
- If we load data by using
-
[MessagePack] Fixed a bug that float32 value isn't be unpacked correctly.
-
Fixed the following bugs related multi column index.
_score
may be broken with full text search.- The records that couldn't hit might hit.
[highlight], highlight_full Added support for normalizer options
- We can also specify normalizer options into
highlight()
andhighlight_full()
. -
Please refer to the following about possible options to set.
- https://groonga.org/docs/reference/normalizers/normalizer_nfkc100.html#parameters
-
For example, we can identify hyphen that has different code point by using
unify_hyphen
.table_create Entries TABLE_NO_KEY column_create Entries body COLUMN_SCALAR ShortText load --table Entries [ {"body": "full-text-search. Use U+002D HYPHEN-MINUS"}, {"body": "full֊text֊search. Use U+058A ARMENIAN HYPHEN"}, {"body": "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN"} ] select Entries --output_columns \ 'highlight_full(body, \ "NormalizerNFKC121(\\"unify_hyphen\\", true)", \ true, \ "full-text-search", \ "<span class=\\"keyword1\\">", \ "</span>")' --output-pretty yes [ [ 0, 0.0, 0.0 ], [ [ [ 3 ], [ [ "highlight_full", null ] ], [ "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS" ], [ "<span class=\"keyword1\">full֊text֊search</span>. Use U+058A ARMENIAN HYPHEN" ], [ "<span class=\"keyword1\">full˗text˗search</span>. Use U+02D7 MODIFIER LETTER MINUS SIGN" ] ] ] ]
-
If we don't specify
unify_hyphen
option,{"body": "full-text-search. Use U+002D HYPHEN-MINUS"}
is only highlighted as below.- Because the other record different code point from the hyphen that is included the search keyword.
select Entries --output_columns \ 'highlight_full(body, \ "NormalizerNFKC121()", \ true, \ "full-text-search", \ "<span class=\\"keyword1\\">", \ "</span>")' [ [ 0, 0.0, 0.0 ], [ [ [ 3 ], [ [ "highlight_full", null ] ], [ "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS" ], [ "full֊text֊search. Use U+058A ARMENIAN HYPHEN" ], [ "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN" ] ] ] ]
table_create, column_create Added a new option --path
.
-
We can store specified a table or a column to any path using this option.
-
This option is useful if we want to store a table or a column that we often use to fast storage (e.g. SSD) and store them that we don't often use to slow storage (e.g. HDD).
-
We can specify both relative path and absolute path in this option.
- If we specify relative path in this option, the path is resolved the path of
groonga
process as the origin.
- If we specify relative path in this option, the path is resolved the path of
-
However, if we specify
--path
, the result ofdump
command includes--path
informations.- Therefore, if we specify
--path
, we can't restore to host in different enviroment. - If we don't want include
--path
informations to a dump, we need specify--dump_paths no
indump
command.
- Therefore, if we specify
dump Added a new option --dump_paths
.
-
--dump_paths
option control whether--path
is dumped or not. -
The default value of it is
yes
. -
If we specify
--path
when we create tables or columns and we don't want include--path
informations to a dump, we specifyno
into--dump_paths
when we executedump
command.
select Added support for near search in same sentence.
-
the near search can't search in the same sentence until now.
-
It can search in the same sentence as below from this release.
table_create Memos TABLE_PAT_KEY ShortText column_create Memos content COLUMN_SCALAR ShortText table_create Terms TABLE_PAT_KEY ShortText \ --default_tokenizer TokenBigram \ --normalizer NormalizerAuto column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content load --table Memos [ {"_key":"alphabets1", "content": "a c d ."}, {"_key":"alphabets2", "content": "a b c d e f ."}, {"_key":"alphabets3", "content": "a b x c d e f ."}, {"_key":"alphabets4", "content": "a b x x c d e f ."} ] select \ --table Memos \ --match_columns content \ --query '*NP3,-1"a c .$"' \ --output_columns _score,_key,content [ [ 0, 0.0, 0.0 ], [ [ [ 2 ], [ [ "_score", "Int32" ], [ "_key", "ShortText" ], [ "content", "ShortText" ] ], [ 1, "alphabets1", "a c d ." ], [ 1, "alphabets2", "a b x c ." ] ] ] ]
-
We use the following syntax for using near-search in the same sentence.
-
'"NP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL}"'${FIRST_PHRASE},${LASR_PHRASE} ${SEPARATOR}$
-
If we specify
-1
into ${ADDITIONAL_LAST_INTERVAL}, a record that the interval the first phrase and the last phrase less or equal than ${MAX_INTERVAL} hit.- In this case, however much the phrase and the separator are apart from each other, it hit.
-
If we specify an integer not smaller than
1
into ${ADDITIONAL_LAST_INTERVAL}, the record of the following conditions hit.- The interval of the first phrase and the last phrase less or equal than ${MAX_INTERVAL}.
- The interval of the first phrase and the separator less or equal than ${MAX_INTERVAL}+${ADDITIONAL_LAST_INTERVAL}.
-
If we specify an integer not smaller than
0
into ${ADDITIONAL_LAST_INTERVAL}, the near-search same behavior as before.- The default value of ${ADDITIONAL_LAST_INTERVAL} is
0
.
- The default value of ${ADDITIONAL_LAST_INTERVAL} is
-
-
We can specify any character in ${SEPARATOR}.
-
Fixed the following bugs related multi column index.
_score
may be broken with full text search.-
The records that couldn't hit might hit.
-
For example, if we execute the following query, this bug occur.
select TABLE \ --match_columns 'LEXICON.INDEX[10]' \ --query 'XXX' \ --output_columns _score
-
In this case,
LEXICON.INDEX[0]
-LEXICON.INDEX[9]
were not initialized.- Groonga decide indexed of search target by whether the value of each sections is 0 or not.
-
Therefore, if
LEXICON.INDEX[0]
-LEXICON.INDEX[9]
were not initialized, Groonga may choose wrong target.- Because the values of
LEXICON.INDEX[0]
-LEXICON.INDEX[9]
are indefinite.
- Because the values of
-
In addition, these values are used weight of score.
- Therefore, the value of
_score
are also indefinite.
- Therefore, the value of
-
-
However, if we execute the following query, this bug doesn't occur. Because areaes of indefinite don't occur.
select TABLE \ --match_columns 'LEXICON.INDEX[0]' \ --query 'XXX' \ --output_columns _score
-
In other words, it occurred in the following situations.
- If we specified section as below when there is a multi column index that has source columns
a
,b
, andc
.
- If we specified section as below when there is a multi column index that has source columns
- We specified
a
andc
. - We specified
b
andc
. - We only specified
b
. -
We only specified
c
.- However, It doesn't occurred in the following situations. Because areaes of indefinite don't occur in the following situations.
-
We specify by filling space from the first section as below.
- We specified
a
,b
, andc
. - We specified
a
andb
. - We only specified
a
.
- We specified
-
Conclusion
Let's search by Groonga!