News - 13 series#
Release 13.1.1 - 2024-01-09#
Improvements#
Dropped support for mingw32. [GitHub#1654]
Added support for index search of “vector_column[N] OPERATOR literal” with --match_columns and --query.
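A minimal sketch of what such a query might look like, assuming a hypothetical Memos table with a ShortText vector column tags and an index on it, and assuming the bracketed element syntax from the entry above is accepted in --match_columns:
table_create Memos TABLE_NO_KEY
column_create Memos tags COLUMN_VECTOR ShortText
table_create Tags TABLE_PAT_KEY ShortText
column_create Tags memos_tags COLUMN_INDEX Memos tags
load --table Memos
[
{"tags": ["groonga", "mroonga"]}
]
select Memos --match_columns 'tags[0]' --query groonga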
Fixes#
[Windows] Bundled groonga-normalizer-mysql again. [GitHub#1655]
Groonga 13.1.0 for Windows didn’t include groonga-normalizer-mysql. This problem only occurred in Groonga 13.1.0.
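As a quick sanity check that the bundled normalizer is available again, something like the following should work. This is a hedged sketch: NormalizerMySQLGeneralCI is one of the normalizers provided by groonga-normalizer-mysql, and the plugin is registered with plugin_register.
plugin_register normalizers/mysql
normalize NormalizerMySQLGeneralCI "Ｇｒｏｏｎｇａ"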
Release 13.1.0 - 2023-12-26#
Improvements#
[select] Groonga now also caches the trace log.
Added support for outputting dict<string> in a response of Apache Arrow format.
[Groonga HTTP server] Added support for a new content type application/vnd.apache.arrow.stream.
[query] Added support for empty input as below.
table_create Users TABLE_NO_KEY
column_create Users name COLUMN_SCALAR ShortText
table_create Lexicon TABLE_HASH_KEY ShortText --default_tokenizer TokenBigramSplitSymbolAlphaDigit --normalizer NormalizerAuto
column_create Lexicon users_name COLUMN_INDEX|WITH_POSITION Users name
load --table Users
[
{"name": "Alice"},
{"name": "Alisa"},
{"name": "Bob"}
]
select Users --output_columns name,_score --filter 'query("name", " ")'
[ [ 0, 0.0, 0.0 ], [ [ [ 0 ], [ [ "name", "ShortText" ], [ "_score", "Int32" ] ] ] ] ]
Added support for BFloat16 (experimental).
We can just load and select BFloat16 values. We can’t use arithmetic operations such as bfloat16_value - 1.2.
[column_create] Added a new flag WEIGHT_BFLOAT16.
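A minimal sketch of loading and selecting a BFloat16 value. The schema is hypothetical; it only assumes that BFloat16 can be used as a column value type as described above:
table_create Items TABLE_HASH_KEY ShortText
column_create Items bfloat16_value COLUMN_SCALAR BFloat16
load --table Items
[
{"_key": "item1", "bfloat16_value": 1.5}
]
select Items --output_columns _key,bfloat16_value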
Fixes#
[select] Fixed a bug that, when Groonga cached an output_pretty=yes result, Groonga returned the result with output_pretty even if we sent a query without output_pretty.
Fixed a bug that wrong data could be created.
In general, users can’t trigger this explicitly because the command API doesn’t accept GRN_OBJ_{APPEND,PREPEND}. These may be used internally when a dynamic numeric vector column is created and a temporary result set is created (when OR is used). For example, the following query may create wrong data:
select TABLE \
  --match_columns TEXT_COLUMN \
  --query 'A B OR C' \
  --columns[NUMERIC_DYNAMIC_COLUMN].stage result_set \
  --columns[NUMERIC_DYNAMIC_COLUMN].type Float32 \
  --columns[NUMERIC_DYNAMIC_COLUMN].flags COLUMN_VECTOR
If this happens, NUMERIC_DYNAMIC_COLUMN contains many garbage elements, which also causes excessive memory consumption. Note that this is caused by an uninitialized variable on the stack, so it may or may not happen.
Fixed a bug that valid normalizers/token_filters may fail to be set.
[fuzzy_search] Fixed a crash bug that occurred when the following three conditions were all met:
The query has 2 or more multi-byte characters.
A key of ${ASCII}${ASCII}${MULTIBYTE}* characters exists in a patricia trie table.
WITH_TRANSPOSITION is enabled.
For example, the pair of “aaあ” in a patricia trie table and the query “あああ” has this problem, as below.
table_create Users TABLE_NO_KEY
column_create Users name COLUMN_SCALAR ShortText
table_create Names TABLE_PAT_KEY ShortText
column_create Names user COLUMN_INDEX Users name
load --table Users
[
{"name": "aaあ"},
{"name": "あうi"},
{"name": "あう"},
{"name": "あi"},
{"name": "iう"}
]
select Users --filter 'fuzzy_search(name, "あiう", {"with_transposition": true, "max_distance": 3})' --output_columns 'name, _score' --match_escalation_threshold -1
Release 13.0.9 - 2023-10-29#
Improvements#
[select] Changed the default value of --fuzzy_max_expansions from 0 to 10.
--fuzzy_max_expansions can limit the number of words with a close edit distance that are used in the search process. This argument can help to balance the number of hits and the performance of the search. When --fuzzy_max_expansions is 0, the search uses all words in the vocabulary list whose edit distance is under --fuzzy_max_distance. Because --fuzzy_max_expansions of 0 (unlimited) may slow down a search, the default value of --fuzzy_max_expansions is 10 from this release.
[select] Improved select arguments with the addition of a new argument --fuzzy_with_transposition (experimental).
We can choose edit distance 1 or 2 for the transposition case by using this argument. If this parameter is yes, the edit distance of this case is 1. It’s 2 otherwise.
[select] Improved select arguments with the addition of a new argument --fuzzy_tokenize.
When --fuzzy_tokenize is yes, Groonga uses the tokenizer specified in --default_tokenizer for typo-tolerant search. The default value of --fuzzy_tokenize is no. --fuzzy_tokenize is useful in the following case:
Search targets are only Japanese data.
TokenMecab is specified in --default_tokenizer.
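A minimal sketch combining these fuzzy search arguments. The schema is hypothetical; the argument names are the ones introduced above and in 13.0.8, and the values shown are only illustrative:
table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText
table_create Lexicon TABLE_PAT_KEY ShortText --default_tokenizer TokenNgram --normalizer NormalizerNFKC150
column_create Lexicon memos_content COLUMN_INDEX|WITH_POSITION Memos content
load --table Memos
[
{"content": "This is a pen"}
]
select Memos \
  --match_columns content \
  --query "Thas" \
  --fuzzy_max_distance 1 \
  --fuzzy_max_expansions 10 \
  --fuzzy_with_transposition yes \
  --output_columns content,_score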
[load] Added support for --ifexists even if we specify apache-arrow as input_type.
[Normalizers] Improved NormalizerNFKC* options with the addition of a new option remove_blank_force.
When remove_blank_force is false, the normalizer doesn’t ignore spaces, as below.
table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText
load --table Entries
[
{"body": "Groonga はとても速い"},
{"body": "Groongaはとても速い"}
]
select Entries --output_columns \
  'highlight(body, \
    "gaはとても", "<keyword>", "</keyword>", \
    {"normalizers": "NormalizerNFKC150(\\"remove_blank_force\\", false)"} \
  )'
[ [ 0, 0.0, 0.0 ], [ [ [ 2 ], [ [ "highlight", null ] ], [ "Groonga はとても速い" ], [ "Groon<keyword>gaはとても</keyword>速い" ] ] ] ]
[select] Improved select arguments with the addition of a new argument --output_trace_log (experimental).
If we specify yes in --output_trace_log and --command_version 3, Groonga outputs an additional new log as below.
table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText
table_create Lexicon TABLE_PAT_KEY ShortText --default_tokenizer TokenNgram --normalizer NormalizerNFKC150
column_create Lexicon memos_content COLUMN_INDEX|WITH_POSITION Memos content
load --table Memos
[
{"content": "This is a pen"},
{"content": "That is a pen"},
{"content": "They are pens"}
]
select Memos \
  --match_columns content \
  --query "Thas OR ere" \
  --fuzzy_max_distance 1 \
  --output_columns *,_score \
  --command_version 3 \
  --output_trace_log yes \
  --output_type apache-arrow
return_code: int32
start_time: timestamp[ns]
elapsed_time: double
error_message: string
error_file: string
error_line: uint32
error_function: string
error_input_file: string
error_input_line: int32
error_input_command: string
-- metadata --
GROONGA:data_type: metadata
   return_code  start_time                 elapsed_time  error_message  error_file  error_line  error_function  error_input_file  error_input_line  error_input_command
0  0            1970-01-01T09:00:00+09:00  0.000000      (null)         (null)      (null)      (null)          (null)            (null)            (null)
========================================
depth: uint16
sequence: uint16
name: string
value: dense_union<0: uint32=0, 1: string=1>
elapsed_time: uint64
-- metadata --
GROONGA:data_type: trace_log
   depth  sequence  name                          value  elapsed_time
0  1      0         ii.select.input               Thas   0
1  2      0         ii.select.exact.n_hits        0      1
2  2      0         ii.select.fuzzy.input         Thas   2
3  2      1         ii.select.fuzzy.input.actual  that   3
4  2      2         ii.select.fuzzy.input.actual  this   4
5  2      3         ii.select.fuzzy.n_hits        2      5
6  1      1         ii.select.n_hits              2      6
7  1      0         ii.select.input               ere    7
8  2      0         ii.select.exact.n_hits        2      8
9  1      1         ii.select.n_hits              2      9
========================================
content: string
_score: double
-- metadata --
GROONGA:n_hits: 2
   content        _score
0  This is a pen  1.000000
1  That is a pen  1.000000
--output_trace_log is valid only in command version 3. This will be useful for the following cases:
Detecting the real words used by a fuzzy query.
Measuring elapsed time without reading the query log.
[query] Added support for object literal.
[query_expand] Added support for NPP and ONPP (experimental).
[snippet] Added support for the normalizers option.
We can use a normalizer with options. For example, when we don’t want to ignore spaces in the snippet() function, we use this option as below.
table_create Entries TABLE_NO_KEY
column_create Entries content COLUMN_SCALAR ShortText
load --table Entries
[
{"content": "Groonga and MySQL"},
{"content": "Groonga and My SQL"}
]
select Entries \
  --output_columns \
  'snippet(content, "MySQL", "<keyword>", "</keyword>", {"normalizers": "NormalizerNFKC150(\\"remove_blank_force\\", false)"})'
[ [ 0, 0.0, 0.0 ], [ [ [ 2 ], [ [ "snippet", null ] ], [ [ "Groonga and <keyword>MySQL</keyword>" ] ], [ null ] ] ] ]
Fixes#
Fixed a bug in Time OPERATOR Float{,32} comparison. GH-1624 [Reported by yssrku]
Microsecond (smaller than second) information in Float{,32} isn’t used. This happens only in Time OPERATOR Float{,32}.
This happens in load --ifexists 'A OP B || C OP D' as below.
table_create Reports TABLE_HASH_KEY ShortText
column_create Reports content COLUMN_SCALAR Text
column_create Reports modified_at COLUMN_SCALAR Time
load --table Reports
[
{"_key": "a", "content": "", "modified_at": 1663989875.438}
]
load \
  --table Reports \
  --ifexists 'content == "" && modified_at <= 1663989875.437'
However, this doesn’t happen in select --filter.
Fixed a bug that alnum(a-zA-Z0-9) + blank may be detected.
If the input is 2 characters such as ab and text with some blanks such as a b is matched, a b is detected. However, it should not be detected in this case.
For example, a i is detected when this bug occurs, as below.
table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText
load --table Entries
[
{"body": "Groonga is fast"}
]
select Entries \
  --output_columns 'highlight(body, "ai", "<keyword>", "</keyword>")'
[ [ 0,0.0,0.0 ], [ [ [ 1 ], [ [ "highlight", null ] ], [ "Groong<keyword>a i</keyword>s fast" ] ] ] ]
However, the above result is unexpected. We don’t want to detect a i in this case.
Thanks#
yssrku
Release 13.0.8 - 2023-09-29#
Improvements#
[column_create] Improved column_create flags with the addition of the new flags COLUMN_FILTER_SHUFFLE, COLUMN_FILTER_BYTE_DELTA, COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE, and COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES.
Added a new bundled library, Blosc. The COLUMN_FILTER_SHUFFLE, COLUMN_FILTER_BYTE_DELTA, COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE, and COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES flags require Blosc.
[status] Improved status output with the addition of a new feature entry "blosc".
[groonga executable file] Improved groonga --version output with the addition of a new value blosc.
[select] Improved select arguments with the addition of a new argument --fuzzy_max_distance (experimental).
[select] Improved select arguments with the addition of a new argument --fuzzy_max_expansions (experimental).
[select] Improved select arguments with the addition of a new argument --fuzzy_max_distance_ratio (experimental).
[select] Improved select arguments with the addition of a new argument --fuzzy_prefix_length (experimental).
[cast] Added support for casting "[0.0, 1.0, 1.1, ...]" to a Float/Float32 vector.
[fuzzy_search] Renamed the max_expansion option to max_expansions. The max_expansion option is deprecated since this release. However, max_expansion can still be used for backward compatibility.
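A hedged sketch of the renamed option in fuzzy_search(), following the option-object style used elsewhere in these notes (the Users/name schema is hypothetical):
select Users \
  --filter 'fuzzy_search(name, "groona", {"max_distance": 2, "max_expansions": 10})' \
  --output_columns 'name, _score'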
Renamed the master branch to the main branch.
[RPM] Use CMake for building.
[Debian] Added support for Debian trixie.
Fixes#
[fuzzy_search] Fixed a bug that Groonga may return records that should not match.
[Near phrase search condition][Near phrase search operator] Fixed a bug that Groonga crashed when the first phrase group doesn’t match anything, as below.
table_create Entries TABLE_NO_KEY column_create Entries content COLUMN_SCALAR Text table_create Terms TABLE_PAT_KEY ShortText \ --default_tokenizer TokenNgram \ --normalizer NormalizerNFKC121 column_create Terms entries_content COLUMN_INDEX|WITH_POSITION \ Entries content load --table Entries [ {"content": "x y z"} ] select Entries \ --match_columns Terms.entries_content.content \ --query '*NPP1"(NONEXISTENT) (z)"' \ --output_columns '_score, content'
Release 13.0.7 - 2023-09-12#
Fixes#
[normalize] Fixed a bug that the normalize command doesn’t output the last offset and type.
The normalize command can output the offset and type of the string after normalization as below, but the normalize command didn’t output the last offset and type because of this bug.
table_create Normalizations TABLE_PAT_KEY ShortText
column_create Normalizations normalized COLUMN_SCALAR ShortText
load --table Normalizations
[
{"_key": "あ", "normalized": "<あ>"}
]
normalize 'NormalizerNFKC130("unify_kana", true, "report_source_offset", true), NormalizerTable("normalized", "Normalizations.normalized", "report_source_offset", true)' "お あ a ア i ア オ" REMOVE_BLANK|WITH_TYPES|WITH_CHECKS
[ [ 0, 0.0, 0.0 ], { "normalized": "お<あ>a<あ>i<あ>お", "types": [ "hiragana", "symbol", "hiragana", "symbol", "alpha", "symbol", "hiragana", "symbol", "alpha", "symbol", "hiragana", "symbol", "hiragana" ], "checks": [ 3, 0, 0, 4, -1, 0, 0, -1, 4, 4, -1, 0, 0, -1, 4, 4, -1, 0, 0, -1, 4, 0, 0 ], "offsets": [ 0, 4, 4, 4, 8, 12, 12, 12, 16, 20, 20, 20, 24 ] } ]
[Normalizers] Fixed a bug that the last offset value may be invalid when we use multiple normalizers.
In the following example, the last offset value should be 27, but it is 17 because of this bug.
table_create Normalizations TABLE_PAT_KEY ShortText column_create Normalizations normalized COLUMN_SCALAR ShortText load --table Normalizations [ {"_key": "あ", "normalized": "<あ>"} ] normalize 'NormalizerNFKC130("unify_kana", true, "report_source_offset", true), NormalizerTable("normalized", "Normalizations.normalized", "report_source_offset", true)' "お あ a ア i ア オ" REMOVE_BLANK|WITH_TYPES|WITH_CHECKS [ [ 0, 0.0, 0.0 ], { "normalized": "お<あ>a<あ>i<あ>お", "types": [ "hiragana", "symbol", "hiragana", "symbol", "alpha", "symbol", "hiragana", "symbol", "alpha", "symbol", "hiragana", "symbol", "hiragana", "null" ], "checks": [ 3, 0, 0, 4, -1, 0, 0, -1, 4, 4, -1, 0, 0, -1, 4, 4, -1, 0, 0, -1, 4, 0, 0 ], "offsets": [ 0, 4, 4, 4, 8, 12, 12, 12, 16, 20, 20, 20, 24, 17 ] } ]
Release 13.0.6 - 2023-08-31#
Improvements#
[highlight_html] Don’t report an error when we specify an empty string to highlight_html(), as below. highlight_html() just returns an empty text.
table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer 'TokenNgram("report_source_location", true)' \
  --normalizer 'NormalizerNFKC150'
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
load --table Entries
[
{"body": "ab cd ed gh"}
]
select Entries \
  --match_columns body \
  --query 'ab' \
  --output_columns 'highlight_html("", Terms)'
[ [ 0, 0.0, 0.0 ], [ [ [ 1 ], [ [ "highlight_html",null ] ], [ "" ] ] ] ]
Added support for aggregator_* for dynamic columns and pseudo columns.
A pseudo column is a column with a _ prefix (e.g. _id, _nsubrecs, …).
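A hedged sketch of using an aggregator with a pseudo column in a labeled drilldown. It assumes a hypothetical Entries table with a tag column and follows the drilldowns[LABEL].columns[NAME] style that Groonga uses for dynamic columns; the column names here are illustrative:
select Entries \
  --drilldowns[tag].keys tag \
  --drilldowns[tag].columns[max_id].stage group \
  --drilldowns[tag].columns[max_id].type UInt32 \
  --drilldowns[tag].columns[max_id].flags COLUMN_SCALAR \
  --drilldowns[tag].columns[max_id].value 'aggregator_max(_id)' \
  --drilldowns[tag].output_columns _key,max_id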
Fixes#
[CMake] Fixed a build error with CMake when both of msgpack and msgpackc-cxx are installed.
Please refer to the comment in groonga/groonga#1601 for details.
Fixed a parse bug when we use x OR <0.0y with QUERY_NO_SYNTAX_ERROR. Records that should match may not be matched.
For example, if we execute the following query, {"_key": "name yyy"} should match, but {"_key": "name yyy"} is not matched.
table_create Names TABLE_PAT_KEY ShortText
table_create Tokens TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto
column_create Tokens names_key COLUMN_INDEX|WITH_POSITION Names _key
load --table Names
[
{"_key": "name yyy"}
]
select Names \
  --match_columns "_key" \
  --query "xxx OR <0.0yyy" \
  --query_flags ALLOW_PRAGMA|ALLOW_COLUMN|QUERY_NO_SYNTAX_ERROR
[ [ 0, 0.0, 0.0 ], [ [ [ 0 ], [ [ "_id", "UInt32" ], [ "_key", "ShortText" ] ] ] ] ]
[highlight_html] Fixed a bug that highlight position may be incorrect.
For example, this bug occurs when we specify both a one-character keyword and a two-character keyword as highlight targets.
Release 13.0.5 - 2023-08-02#
Fixes#
Fixed a bug that index creation may fail.
Groonga v13.0.2, v13.0.3, and v13.0.4 have this bug. Therefore, if you have already used one of these versions, we highly recommend that you use Groonga v13.0.5 or later.
[Near phrase search condition][Near phrase search operator] Fixed a bug that Groonga may crash when we specify a query with invalid syntax.
For example, Groonga is supposed to report an error in the following case. However, with this bug, Groonga crashed instead.
(Note that there is one extra closing parenthesis in the value of --query in the following case.)
table_create Entries TABLE_NO_KEY
[[0,0.0,0.0],true]
column_create Entries content COLUMN_SCALAR Text
[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenNgram --normalizer NormalizerNFKC121
[[0,0.0,0.0],true]
column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content
[[0,0.0,0.0],true]
load --table Entries
[
{"content": "a b c"}
]
[[0,0.0,0.0],1]
select Entries --match_columns content --query '*NPP2"(a b))"' --output_columns '_score, content'
Release 13.0.4 - 2023-07-26#
Improvements#
[Windows] Stopped providing 32-bit packages.
Fixes#
[Debian GNU/Linux] [Ubuntu] Fixed the default configuration file path for QueryExpanderTSV.
[CMake] Fixed a bug that some errors may be reported when CMake 3.16 or 3.17 are used.
Release 13.0.3 - 2023-07-24#
Improvements#
[groonga-httpd] Extracted to groonga-nginx. We stopped providing the groonga-httpd package.
If you’re a user of Debian GNU/Linux 12+ or Ubuntu 23.10+, you can use the libnginx-mod-http-groonga package with the default nginx package. See groonga-nginx’s README for details.
If you’re a user of an old Debian/Ubuntu or a RHEL-related distribution, you can’t use any groonga-httpd equivalent package. You can use Groonga HTTP server instead. If Groonga HTTP server isn’t suitable for your use case, please report it to Discussions with your use case.
[Ubuntu] Added support for Ubuntu 23.10 (Mantic Minotaur).
[Debian GNU/Linux] Enabled xxHash support.
[Ubuntu] Enabled xxHash support.
Fixes#
Fixed a bug that the source archive can’t be built with CMake.
Release 13.0.2 - 2023-07-12#
Improvements#
[Ubuntu] Dropped support for Ubuntu 18.04 (Bionic Beaver).
[Ubuntu] Added support for Ubuntu 23.04 (Lunar Lobster).
[Debian GNU/Linux] Added support for Debian GNU/Linux 12 (bookworm).
[Oracle Linux] Dropped support for Oracle Linux. Use AlmaLinux packages instead.
[grn_highlighter] Added support for changing the tag. GH-1453 [Reported by askdkc]
[grn_highlighter] Added support for customizing normalizers.
[grn_highlighter] Added support for customizing HTML mode.
[grn_highlighter] Added support for multiple tags.
[highlight] Added the sequential_class_tag_mode option.
[highlight_html] Added the sequential_class_tag_mode option.
[reference_acquire] Changed to also refer to index columns for the target object when --recursive dependent is used.
If the target object is a column, index columns for the column are also referred to.
If the target object is a table, index columns for the table are also referred to.
If the target object is a DB, all tables are processed as target objects.
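A hedged sketch of how --recursive dependent is used with these commands, assuming a hypothetical Users table with a name column; the point is only that indexes referencing the target are also covered by the acquisition:
reference_acquire --target_name Users.name --recursive dependent
# ... operations that must not be affected by reference releases ...
reference_release --target_name Users.name --recursive dependent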
[CMake] Changed to require CMake 3.16 or later. We’ll use CMake instead of GNU Autotools as our recommended build tool in near future and drop support for GNU Autotools eventually.
[CMake] Added support for a CMake package. You can use it with find_package(Groonga).
[Packaging] Changed to use newer GNU Autotools to generate configure in the source archive. [ranguba/rroonga#220 <https://github.com/ranguba/rroonga/issues/220>] [Reported by ZangRuochen]
[reference_acquire] Optimized the reference count implementation for built-in objects.
Added support for logging a backtrace on SIGABRT.
Fixes#
[Ordered near phrase product search condition] [Ordered near phrase product search operator] Fixed a search bug that records that should be matched may not be matched.
It happens when multiple 3+ character tokens overlap in the query. For example, abc and abcd are an invalid combination. If the shorter one (abc) appears before the longer one (abcd), this bug happens. For example, ONPP1 "(abcd abc) (1 2)" works but ONPP1 "(abc abcd) (1 2)" doesn’t work.
Fixed a bug that an invalid weight may be used with multiple adjusts.
[Query syntax] Fixed a typo. [GH-1560 <https://github.com/groonga/groonga/issues/1560>] [Patch by Dylan Golow]
[Near phrase search condition] [Near phrase search operator] [Near phrase product search condition] [Near phrase product search operator] [Ordered near phrase search condition] [Ordered near phrase search operator] [Ordered near phrase product search condition] [Ordered near phrase product search operator] Fixed an invalid interval calculation when additional_last_interval is used.
For example, let’s think about *NP3,-1"aaa bbb .$" against aaaxxxbbbcdefghi. In this case, the number of tokens between aaa and bbb must be 3, but it was 7.
[Near phrase product search condition] [Near phrase product search operator] [Ordered near phrase product search condition] [Ordered near phrase product search operator] Fixed infinite loop bugs when the same phrase exists in the same phrase group.
For example, *NPP1 "(abcd abc abcd bcde) (efghi)" is a bad query because the first phrase group has two abcd phrases.
For example, *NPP1 "(abcde \"abc de\") (efghi)" is a bad query because the first phrase group has abcde and "abc de". They are “logically” the same phrases.
Fixed a bug that the internal lock count may not be decreased when lock acquisition fails. In normal use cases, this will not be a real problem.
Thanks#
askdkc
Dylan Golow
ZangRuochen
Release 13.0.1 - 2023-03-24#
Improvements#
[highlight_html] Added support for prefix search.
We can now use prefix search in highlight_html.
Note that the highlight keyword is highlighted not only at the beginning but also in the middle or at the end.
table_create Tags TABLE_NO_KEY column_create Tags name COLUMN_SCALAR ShortText table_create Terms TABLE_PAT_KEY ShortText \ --normalizer 'NormalizerNFKC150' column_create Terms tags_name COLUMN_INDEX Tags name load --table Tags [ {"name": "Groonga"} ] select Tags \ --query "name:^g" \ --output_columns "highlight_html(name)" # [ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 1 # ], # [ # [ # "highlight_html", # null # ] # ], # [ # "<span class=\"keyword\">G</span>roon<span class=\"keyword\">g</span>a" # ] # ] # ] # ]
[Normalizers] Added new options for NormalizerNFKC*.
unify_kana_prolonged_sound_mark
We can now normalize the prolonged sound mark with this option as below.
ァー -> ァア, アー -> アア, ヵー -> ヵア, カー -> カア, ガー -> ガア, サー -> サア, ザー -> ザア, ター -> タア, ダー -> ダア, ナー -> ナア, ハー -> ハア, バー -> バア, パー -> パア, マー -> マア, ャー -> ャア, ヤー -> ヤア, ラー -> ラア, ヮー -> ヮア, ワー -> ワア, ヷー -> ヷア, ィー -> ィイ, イー -> イイ, キー -> キイ, ギー -> ギイ, シー -> シイ, ジー -> ジイ, チー -> チイ, ヂー -> ヂイ, ニー -> ニイ, ヒー -> ヒイ, ビー -> ビイ, ピー -> ピイ, ミー -> ミイ, リー -> リイ, ヰー -> ヰイ, ヸー -> ヸイ, ゥー -> ゥウ, ウー -> ウウ, クー -> クウ, グー -> グウ, スー -> スウ, ズー -> ズウ, ツー -> ツウ, ヅー -> ヅウ, ヌー -> ヌウ, フー -> フウ, ブー -> ブウ, プー -> プウ, ムー -> ムウ, ュー -> ュウ, ユー -> ユウ, ルー -> ルウ, ヱー -> ヱウ, ヴー -> ヴウ, ェー -> ェエ, エー -> エエ, ヶー -> ヶエ, ケー -> ケエ, ゲー -> ゲエ, セー -> セエ, ゼー -> ゼエ, テー -> テエ, デー -> デエ, ネー -> ネエ, ヘー -> ヘエ, ベー -> ベエ, ペー -> ペエ, メー -> メエ, レー -> レエ, ヹー -> ヹエ, ォー -> ォオ, オー -> オオ, コー -> コオ, ゴー -> ゴオ, ソー -> ソオ, ゾー -> ゾオ, トー -> トオ, ドー -> ドオ, ノー -> ノオ, ホー -> ホオ, ボー -> ボオ, ポー -> ポオ, モー -> モオ, ョー -> ョオ, ヨー -> ヨオ, ロー -> ロオ, ヲー -> ヲオ, ヺー -> ヺオ, ンー -> ンン ぁー -> ぁあ, あー -> ああ, ゕー -> ゕあ, かー -> かあ, がー -> があ, さー -> さあ, ざー -> ざあ, たー -> たあ, だー -> だあ, なー -> なあ, はー -> はあ, ばー -> ばあ, ぱー -> ぱあ, まー -> まあ, ゃー -> ゃあ, やー -> やあ, らー -> らあ, ゎー -> ゎあ, わー -> わあ ぃー -> ぃい, いー -> いい, きー -> きい, ぎー -> ぎい, しー -> しい, じー -> じい, ちー -> ちい, ぢー -> ぢい, にー -> にい, ひー -> ひい, びー -> びい, ぴー -> ぴい, みー -> みい, りー -> りい, ゐー -> ゐい ぅー -> ぅう, うー -> うう, くー -> くう, ぐー -> ぐう, すー -> すう, ずー -> ずう, つー -> つう, づー -> づう, ぬー -> ぬう, ふー -> ふう, ぶー -> ぶう, ぷー -> ぷう, むー -> むう, ゅー -> ゅう, ゆー -> ゆう, るー -> るう, ゑー -> ゑう, ゔー -> ゔう ぇー -> ぇえ, えー -> ええ, ゖー -> ゖえ, けー -> けえ, げー -> げえ, せー -> せえ, ぜー -> ぜえ, てー -> てえ, でー -> でえ, ねー -> ねえ, へー -> へえ, べー -> べえ, ぺー -> ぺえ, めー -> めえ, れー -> れえ ぉー -> ぉお, おー -> おお, こー -> こお, ごー -> ごお, そー -> そお, ぞー -> ぞお, とー -> とお, どー -> どお, のー -> のお, ほー -> ほお, ぼー -> ぼお, ぽー -> ぽお, もー -> もお, ょー -> ょお, よー -> よお, ろー -> ろお, をー -> をお んー -> んん
Here is an example of unify_kana_prolonged_sound_mark.
table_create --name Animals --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Animals --name name --type ShortText
column_create --table Animals --name sound --type ShortText
load --table Animals
[
{"_key":"1","name":"羊", "sound":"メーメー"},
]
table_create \
  --name idx_animals_sound \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_kana_prolonged_sound_mark", true)'
column_create --table idx_animals_sound --name animals_sound --flags COLUMN_INDEX|WITH_POSITION --type Animals --source sound
select --table Animals --query sound:@メエメエ
# [ # [ # 0, # 1677829950.652696, # 0.01971983909606934 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ], # [ # "sound", # "ShortText" # ] # ], # [ # 1, # "1", # "羊", # "メーメー" # ] # ] # ] # ]
unify_kana_hyphen
We can now normalize hyphen with this option as below.
ァ- -> ァア, ア- -> アア, ヵ- -> ヵア, カ- -> カア, ガ- -> ガア, サ- -> サア, ザ- -> ザア, タ- -> タア, ダ- -> ダア, ナ- -> ナア, ハ- -> ハア, バ- -> バア, パ- -> パア, マ- -> マア, ャ- -> ャア, ヤ- -> ヤア, ラ- -> ラア, ヮ- -> ヮア, ワ- -> ワア, ヷ- -> ヷア, ィ- -> ィイ, イ- -> イイ, キ- -> キイ, ギ- -> ギイ, シ- -> シイ, ジ- -> ジイ, チ- -> チイ, ヂ- -> ヂイ, ニ- -> ニイ, ヒ- -> ヒイ, ビ- -> ビイ, ピ- -> ピイ, ミ- -> ミイ, リ- -> リイ, ヰ- -> ヰイ, ヸ- -> ヸイ, ゥ- -> ゥウ, ウ- -> ウウ, ク- -> クウ, グ- -> グウ, ス- -> スウ, ズ- -> ズウ, ツ- -> ツウ, ヅ- -> ヅウ, ヌ- -> ヌウ, フ- -> フウ, ブ- -> ブウ, プ- -> プウ, ム- -> ムウ, ュ- -> ュウ, ユ- -> ユウ, ル- -> ルウ, ヱ- -> ヱウ, ヴ- -> ヴウ, ェ- -> ェエ, エ- -> エエ, ヶ- -> ヶエ, ケ- -> ケエ, ゲ- -> ゲエ, セ- -> セエ, ゼ- -> ゼエ, テ- -> テエ, デ- -> デエ, ネ- -> ネエ, ヘ- -> ヘエ, ベ- -> ベエ, ペ- -> ペエ, メ- -> メエ, レ- -> レエ, ヹ- -> ヹエ, ォ- -> ォオ, オ- -> オオ, コ- -> コオ, ゴ- -> ゴオ, ソ- -> ソオ, ゾ- -> ゾオ, ト- -> トオ, ド- -> ドオ, ノ- -> ノオ, ホ- -> ホオ, ボ- -> ボオ, ポ- -> ポオ, モ- -> モオ, ョ- -> ョオ, ヨ- -> ヨオ, ロ- -> ロオ, ヲ- -> ヲオ, ヺ- -> ヺオ, ン- -> ンン ぁ- -> ぁあ, あ- -> ああ, ゕ- -> ゕあ, か- -> かあ, が- -> があ, さ- -> さあ, ざ- -> ざあ, た- -> たあ, だ- -> だあ, な- -> なあ, は- -> はあ, ば- -> ばあ, ぱ- -> ぱあ, ま- -> まあ, ゃ- -> ゃあ, や- -> やあ, ら- -> らあ, ゎ- -> ゎあ, わ- -> わあ ぃ- -> ぃい, い- -> いい, き- -> きい, ぎ- -> ぎい, し- -> しい, じ- -> じい, ち- -> ちい, ぢ- -> ぢい, に- -> にい, ひ- -> ひい, び- -> びい, ぴ- -> ぴい, み- -> みい, り- -> りい, ゐ- -> ゐい ぅ- -> ぅう, う- -> うう, く- -> くう, ぐ- -> ぐう, す- -> すう, ず- -> ずう, つ- -> つう, づ- -> づう, ぬ- -> ぬう, ふ- -> ふう, ぶ- -> ぶう, ぷ- -> ぷう, む- -> むう, ゅ- -> ゅう, ゆ- -> ゆう, る- -> るう, ゑ- -> ゑう, ゔ- -> ゔう ぇ- -> ぇえ, え- -> ええ, ゖ- -> ゖえ, け- -> けえ, げ- -> げえ, せ- -> せえ, ぜ- -> ぜえ, て- -> てえ, で- -> でえ, ね- -> ねえ, へ- -> へえ, べ- -> べえ, ぺ- -> ぺえ, め- -> めえ, れ- -> れえ ぉ- -> ぉお, お- -> おお, こ- -> こお, ご- -> ごお, そ- -> そお, ぞ- -> ぞお, と- -> とお, ど- -> どお, の- -> のお, ほ- -> ほお, ぼ- -> ぼお, ぽ- -> ぽお, も- -> もお, ょ- -> ょお, よ- -> よお, ろ- -> ろお, を- -> をお ん- -> んん
Here is an example of unify_kana_hyphen.
table_create --name Animals --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Animals --name name --type ShortText
column_create --table Animals --name sound --type ShortText
load --table Animals
[
{"_key":"1","name":"羊", "sound":"メ-メ-"},
]
table_create \
  --name idx_animals_sound \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_kana_hyphen", true)'
column_create --table idx_animals_sound --name animals_sound --flags COLUMN_INDEX|WITH_POSITION --type Animals --source sound
select --table Animals --query sound:@メエメエ
# [ # [ # 0,1677829950.652696, # 0.01971983909606934 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ], # [ # "sound", # "ShortText" # ] # ], # [ # 1, # "1", # "羊", # "メ-メ-" # ] # ] # ] # ]
[Near search condition][Near search operator] Added a new option ${MIN_INTERVAL} for the near-search family.
We can now specify the minimum interval between phrases (words) with ${MIN_INTERVAL}. The interval between phrases (words) must be at least this value.
Here is the new syntax:
*N${MAX_INTERVAL},${MAX_TOKEN_INTERVAL_1}|${MAX_TOKEN_INTERVAL_2}|...,${MIN_INTERVAL} "word1 word2 ..."
*NP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "phrase1 phrase2 ..."
*NPP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "(phrase1-1 phrase1-2 ...) (phrase2-1 phrase2-2 ...) ..."
*ONP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "phrase1 phrase2 ..."
*ONPP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "(phrase1-1 phrase1-2 ...) (phrase2-1 phrase2-2 ...) ..."
The default value of ${MIN_INTERVAL} is INT32_MIN (-2147483648). We use the default value when ${MIN_INTERVAL} is omitted.
This option is useful when we want to ignore overlapped phrases.
The interval for *NP is calculated as interval between the top tokens of phrases - tokens in the left phrase + 1.
When the tokenizer is Bigram, for example, 東京 has one token 東京, and 京都 also has one token 京都.
Considering 東京都 as a target value of *NP "東京 京都":
interval between the top tokens of phrases: 1 (the interval between 東京 and 京都)
tokens in the left phrase: 1 (東京)
The interval for *NP of 東京都 is 1 - 1 + 1 = 1.
As a result, the interval for *NP is greater than 1 when 東京 and 京都 are not overlapped.
Here is an example for ignoring overlapped phrases.
table_create Entries TABLE_NO_KEY column_create Entries content COLUMN_SCALAR Text table_create Terms TABLE_PAT_KEY ShortText \ --default_tokenizer 'TokenNgram("unify_alphabet", false, \ "unify_digit", false)' \ --normalizer NormalizerNFKC150 column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content load --table Entries [ {"content": "東京都"}, {"content": "東京京都"} ] select Entries \ --match_columns content \ --query '*NP-1,0,,2"東京 京都"' \ --output_columns '_score, content' # [ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_score", # "Int32" # ], # [ # "content", # "Text" # ] # ], # [ # 1, # "東京京都" # ] # ] # ] # ]
In the example above, 東京都 is not matched as the interval is 1, but 東京京都 is matched as the interval is 2.
[Normalizers] Added support for new values in the unify_katakana_trailing_o option.
We added support for normalizing the following new values in the unify_katakana_trailing_o option because the vowel of their left letters is O.
ォオ -> ォウ
ョオ -> ョウ
ヺオ -> ヺウ
Added support for MessagePack v6.0.0. [GitHub#1536][Reported by Carlo Cabrera]
Until now, Groonga could not find MessagePack v6.0.0 or later when we executed configure or cmake. Since this release, Groonga can find MessagePack even if the version of MessagePack is v6.0.0 or later.
Fixes#
[Normalizers] Fixed a bug that NormalizerNFKC* did incorrect normalization.
This bug occurred when unify_kana_case and unify_katakana_v_sounds were used at the same time.
For example, ヴァ was normalized to バア with unify_kana_case and unify_katakana_v_sounds, but ヴァ should be normalized to バ.
This was because ヴァ was normalized to ヴア with unify_kana_case, and after that, ヴア was normalized to バア with unify_katakana_v_sounds. We fixed this to normalize characters with unify_katakana_v_sounds before unify_kana_case.
Here is an example of the bug in the previous version.
normalize \ 'NormalizerNFKC150("unify_katakana_v_sounds", true, \ "unify_kana_case", true)' \ "ヴァーチャル" #[ # [ # 0, # 1678097412.913053, # 0.00019073486328125 # ], # { # "normalized":"ブアーチヤル", # "types":[], # "checks":[] # } #]
From this version, ヴァーチャル is normalized to バーチヤル.
[Ordered near phrase search condition][Ordered near phrase search operator] Fixed a bug that ${MAX_PHRASE_INTERVALS} doesn’t work correctly.
When this bug occurred, intervals were regarded as 0. Therefore, if this bug occurs, too many records may hit.
This bug occurred when:
*ONP is used with ${MAX_PHRASE_INTERVALS}.
The number of tokens in the matched left phrase is greater than or equal to the number of ${MAX_PHRASE_INTERVALS} elements.
Here is an example of this bug.
table_create Entries TABLE_NO_KEY column_create Entries content COLUMN_SCALAR Text table_create Terms TABLE_PAT_KEY ShortText \ --default_tokenizer 'TokenNgram("unify_alphabet", false, \ "unify_digit", false)' \ --normalizer NormalizerNFKC150 column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content load --table Entries [ {"content": "abcXYZdef"}, {"content": "abcdef"}, {"content": "abc123456789def"}, {"content": "abc12345678def"}, {"content": "abc1de2def"} ] select Entries --filter 'content *ONP-1,0,1 "abc def"' --output_columns '_score, content' #[ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 5 # ], # [ # [ # "_score", # "Int32" # ], # [ # "content", # "Text" # ] # ], # [ # 1, # "abcXYZdef" # ], # [ # 1, # "abcdef" # ], # [ # 1, # "abc123456789def" # ], # [ # 1, # "abc12345678def" # ], # [ # 1, # "abc1de2def" # ] # ] # ] #]
In the example above, the first element interval is specified as 1 with *ONP-1,0,1 "abc def", but all content is matched, including those farther than 1.
This is because the example satisfies the conditions for the bug and the interval is regarded as 0.
*ONP is used with ${MAX_PHRASE_INTERVALS}:
*ONP-1,0,1 "abc def" specifies ${MAX_PHRASE_INTERVALS}.
The number of tokens in the matched left phrase is greater than or equal to the number of ${MAX_PHRASE_INTERVALS} elements:
The matched left phrase: abc
Included tokens: ab, bc (tokenized with TokenNgram("unify_alphabet", false, "unify_digit", false))
The number of elements specified with max_element_intervals: 1 (the 1 of *ONP-1,0,1 "abc def")
The number of tokens in the left phrase (2) > the number of elements specified with max_element_intervals (1)
[Near phrase search condition][Near phrase search operator] Fixed a bug that phrases specified as last weren’t used as last in the near-phrase family.
When this bug occurred, ${ADDITIONAL_LAST_INTERVAL} was ignored and only ${MAX_INTERVAL} was used.
This bug occurred when:
A phrase specified as last contains multiple tokens.
The size of the last token of the phrase is smaller than or equal to the sizes of other tokens in the phrase.
The token size is the number of times the token appears in all records.
Here is an example of this bug.
table_create Entries TABLE_NO_KEY column_create Entries content COLUMN_SCALAR Text table_create Terms TABLE_PAT_KEY ShortText \ --default_tokenizer 'TokenNgram("unify_alphabet", false, \ "unify_digit", false)' \ --normalizer NormalizerNFKC150 column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content load --table Entries [ {"content": "abc123456789defg"}, {"content": "dededede"} ] select Entries \ --filter 'content *NP10,1"abc defg$"' \ --output_columns '_score, content' #[ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 0 # ], # [ # [ # "_score", # "Int32" # ], # [ # "content", # "Text" # ] # ] # ] # ] #]
In the example above, for abc123456789defg, the interval from abc to defg is 11. ${MAX_INTERVAL} is 10 and ${ADDITIONAL_LAST_INTERVAL} is 1, so the threshold for matching the last phrase is 11. So it should be matched, but isn’t.
This is because the example satisfies the conditions for the bug as below, and only ${MAX_INTERVAL} is used.
A phrase specified as last contains multiple tokens:
defg$ is specified as last because the suffix is $. defg$ is tokenized to de, ef, fg with TokenNgram("unify_alphabet", false, "unify_digit", false).
The size of the last token of the phrase is smaller than or equal to the sizes of other tokens in the phrase:
fg is the last token of defg$. abc123456789defg contains one fg and one de, and dededede contains 4 de. So, the size of fg is 1 and the size of de is 5.
[Near phrase search condition] Fixed the interval calculation.
If we use near phrase search, too many records may hit because of this bug.
[highlight_html] Fixed a bug that the highlight position may shift when we use loose_symbol=true.
Thanks#
Carlo Cabrera
Release 13.0.0 - 2023-02-09#
This is a major version up! But it keeps backward compatibility. We can upgrade to 13.0.0 without rebuilding the database.
First of all, we introduce the main changes in 13.0.0. Then, we introduce the highlights and a summary of changes from Groonga 12.0.0 to 12.1.2.
Improvements#
[Normalizers] Added a new Normalizer NormalizerNFKC150 based on Unicode NFKC (Normalization Form Compatibility Composition) for Unicode 15.0.
[Token filters] Added a new TokenFilter TokenFilterNFKC150 based on Unicode NFKC (Normalization Form Compatibility Composition) for Unicode 15.0.
[NormalizerNFKC150] Added new options for NormalizerNFKC* as below.
unify_katakana_gu_small_sounds
We can normalize “グァ -> ガ”, “グィ -> ギ”, “グェ -> ゲ”, and “グォ -> ゴ” with this option.
Here is an example of the unify_katakana_gu_small_sounds option.
table_create --name Countries --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Countries --name name --type ShortText
load --table Countries
[
{"_key":"JP","name":"日本"},
{"_key":"GT","name":"グァテマラ共和国"},
]
table_create \
  --name idx_contry_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_gu_small_sounds", true)'
column_create --table idx_contry_name --name contry_name --flags COLUMN_INDEX|WITH_POSITION --type Countries --source name
select --table Countries --query name:@ガテマラ共和国
# [ # [0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 2, # "GT", # "グァテマラ共和国" # ] # ] # ] # ]
unify_katakana_di_sound
We can normalize “ヂ -> ジ” with this option.
Here is an example of the unify_katakana_di_sound option.
table_create --name Foods --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Foods --name name --type ShortText
load --table Foods
[
{"_key":"1","name":"チジミ"},
{"_key":"2","name":"パジョン"},
]
table_create \
  --name idx_food_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_di_sound", true)'
column_create --table idx_food_name --name food_name --flags COLUMN_INDEX|WITH_POSITION --type Foods --source name
select --table Foods --query name:@チヂミ
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "1", # "チジミ" # ] # ] # ] # ]
unify_katakana_wo_sound
We can normalize “ヲ -> オ” with this option.
Here is an example of the unify_katakana_wo_sound option.
table_create --name Foods --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Foods --name name --type ShortText
load --table Foods
[
{"_key":"1","name":"アヲハタ"},
{"_key":"2","name":"ヴェルデ"},
{"_key":"3","name":"ランプ"},
]
table_create \
  --name idx_food_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_wo_sound", true)'
column_create --table idx_food_name --name food_name --flags COLUMN_INDEX|WITH_POSITION --type Foods --source name
select --table Foods --query name:@アオハタ
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "1", # "アヲハタ" # ] # ] # ] # ]
unify_katakana_zu_small_sounds
We can normalize “ズァ -> ザ”, “ズィ -> ジ”, “ズェ -> ゼ”, and “ズォ -> ゾ” with this option.
Here is an example of the unify_katakana_zu_small_sounds option.
table_create --name Cities --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Cities --name name --type ShortText
load --table Cities
[
{"_key":"1","name":"ガージヤーバード"},
{"_key":"2","name":"デリー"},
]
table_create \
  --name idx_city_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_zu_small_sounds", true)'
column_create --table idx_city_name --name city_name --flags COLUMN_INDEX|WITH_POSITION --type Cities --source name
select --table Cities --query name:@ガーズィヤーバード
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "1", # "ガージヤーバード" # ] # ] # ] # ]
unify_katakana_du_sound
We can normalize “ヅ -> ズ” with this option.
Here is an example of the unify_katakana_du_sound option.
table_create --name Plants --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Plants --name name --type ShortText
load --table Plants
[
{"_key":"1","name":"ハスノカヅラ"},
{"_key":"2","name":"オオツヅラフジ"},
{"_key":"3","name":"アオツヅラフジ"},
]
table_create \
  --name idx_plant_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_du_sound", true)'
column_create --table idx_plant_name --name plant_name --flags COLUMN_INDEX|WITH_POSITION --type Plants --source name
select --table Plants --query name:@ツズラ
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 2 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 2, # "2", # "オオツヅラフジ" # ], # [ # 3, # "3", # "アオツヅラフジ" # ] # ] # ] # ]
unify_katakana_trailing_o
We can normalize following characters with this option.
“オオ -> オウ”
“コオ -> コウ”
“ソオ -> ソウ”
“トオ -> トウ”
“ノオ -> ノウ”
“ホオ -> ホウ”
“モオ -> モウ”
“ヨオ -> ヨウ”
“ロオ -> ロウ”
“ゴオ -> ゴウ”
“ゾオ -> ゾウ”
“ドオ -> ドウ”
“ボオ -> ボウ”
“ポオ -> ポウ”
Here is an example of the unify_katakana_trailing_o option.
table_create --name Sharks --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Sharks --name name --type ShortText
load --table Sharks
[
{"_key":"1","name":"ホオジロザメ"},
{"_key":"2","name":"ジンベイザメ"},
]
table_create \
  --name idx_shark_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_trailing_o", true)'
column_create --table idx_shark_name --name shark_name --flags COLUMN_INDEX|WITH_POSITION --type Sharks --source name
select --table Sharks --query name:@ホウジロザメ
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "1", # "ホオジロザメ" # ] # ] # ] # ]
unify_katakana_du_small_sounds
We can normalize “ヅァ -> ザ”, “ヅィ -> ジ”, “ヅェ -> ゼ”, and “ヅォ -> ゾ” with this option.
Here is an example of the unify_katakana_du_small_sounds option.
table_create --name Airports --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Airports --name name --type ShortText
load --table Airports
[
{"_key":"HER","name":"イラクリオ・ニコスカザンヅァキス国際空港"},
{"_key":"ATH","name":"アテネ国際空港"},
]
table_create \
  --name idx_airport_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_du_small_sounds", true)'
column_create --table idx_airport_name --name airport_name --flags COLUMN_INDEX|WITH_POSITION --type Airports --source name
select --table Airports --query name:@ニコスカザンザキス
# [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "HER", # "イラクリオ・ニコスカザンヅァキス国際空港" # ] # ] # ]
[Oracle Linux] Added newly support for Oracle Linux 8 and 9.
Thanks#
Atsushi Shinoda
i10a
naoa
shinonon
Zhanzhao (Deo) Liang
David CARLIER
Highlight and Summary of changes from 12.0.0 to 12.1.2#
Highlight#
[Normalizers] Added NormalizerHTML. (Experimental)
NormalizerHTML is a normalizer for HTML.
Currently NormalizerHTML supports removing tags like <span> or </span> and expanding character references like &amp; or &#38;.
Here are sample queries for NormalizerHTML.
normalize NormalizerHTML "<span> Groonga &amp; Mroonga &#38; Rroonga </span>"
# [[0,1666923364.883798,0.0005481243133544922],{"normalized":" Groonga & Mroonga & Rroonga ","types":[],"checks":[]}]
In this sample, <span> and </span> are removed, and &amp; and &#38; are expanded to &.
We can specify whether to remove the tags with the remove_tag option. (The default value of the remove_tag option is true.)
normalize 'NormalizerHTML("remove_tag", false)' "<span> Groonga &amp; Mroonga &#38; Rroonga </span>"
# [[0,1666924069.278549,0.0001978874206542969],{"normalized":"<span> Groonga & Mroonga & Rroonga </span>","types":[],"checks":[]}]
In this sample, <span> and </span> are not removed.
We can specify whether to expand the character references with the expand_character_reference option. (The default value of the expand_character_reference option is true.)
normalize 'NormalizerHTML("expand_character_reference", false)' "<span> Groonga &amp; Mroonga &#38; Rroonga </span>"
# [[0,1666924357.099782,0.0002346038818359375],{"normalized":" Groonga &amp; Mroonga &#38; Rroonga ","types":[],"checks":[]}]
In this sample, &amp; and &#38; are not expanded.
[snippet],[snippet_html] Added support for text vector as input. [groonga-dev,04956][Reported by shinonon]
For example, we can extract snippets of the target text around search keywords against a vector in JSON data as below.
table_create Entries TABLE_NO_KEY column_create Entries title COLUMN_SCALAR ShortText column_create Entries contents COLUMN_VECTOR ShortText table_create Tokens TABLE_PAT_KEY ShortText --default_tokenizer TokenNgram --normalizer NormalizerNFKC130 column_create Tokens entries_title COLUMN_INDEX|WITH_POSITION Entries title column_create Tokens entries_contents COLUMN_INDEX|WITH_SECTION|WITH_POSITION Entries contents load --table Entries [ { "title": "Groonga and MySQL", "contents": [ "Groonga is a full text search engine", "MySQL is a RDBMS", "Mroonga is a MySQL storage engine based on Groonga" ] } ] select Entries\ --output_columns 'snippet_html(contents), contents'\ --match_columns 'title'\ --query Groonga # [ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 1 # ], # [ # [ # "snippet_html", # null # ], # [ # "contents", # "ShortText" # ] # ], # [ # [ # "<span class=\"keyword\">Groonga</span> is a full text search engine", # "Mroonga is a MySQL storage engine based on <span class=\"keyword\">Groonga</span>" # ], # [ # "Groonga is a full text search engine", # "MySQL is a RDBMS", # "Mroonga is a MySQL storage engine based on Groonga" # ] # ] # ] # ] # ]
Until now, if we specified snippet* like --output_columns 'snippet_html(contents[1])', we could extract snippets of the target text around search keywords against the vector as below. However, we didn’t know which elements we should output, because we didn’t know which element was hit by the search.
select Entries\
  --output_columns 'snippet_html(contents[0]), contents'\
  --match_columns 'title'\
  --query Groonga
# [ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 1 # ], # [ # [ # "snippet_html", # null # ], # [ # "contents", # "ShortText" # ] # ], # [ # [ # "<span class=\"keyword\">Groonga</span> is a full text search engine" # ], # [ # "Groonga is a full text search engine", # "MySQL is a RDBMS", # "Mroonga is a MySQL storage engine based on Groonga" # ] # ] # ] # ] # ]
[query_expand] Added support for synonym groups.
Until now, we had to define a keyword and its synonyms one by one as below when we use synonym search.
table_create Thesaurus TABLE_PAT_KEY ShortText --normalizer NormalizerAuto # [[0, 1337566253.89858, 0.000355720520019531], true] column_create Thesaurus synonym COLUMN_VECTOR ShortText # [[0, 1337566253.89858, 0.000355720520019531], true] load --table Thesaurus [ {"_key": "mroonga", "synonym": ["mroonga", "tritonn", "groonga mysql"]}, {"_key": "groonga", "synonym": ["groonga", "senna"]} ]
In the above case, if we search mroonga, Groonga searches mroonga OR tritonn OR "groonga mysql" as we intended. However, if we search tritonn, Groonga searches only tritonn. If we want to search tritonn OR mroonga OR "groonga mysql" even if we search tritonn, we needed to add a definition as below.
load --table Thesaurus
[
{"_key": "tritonn", "synonym": ["tritonn", "mroonga", "groonga mysql"]},
]
In many cases, if we expand mroonga to mroonga OR tritonn OR "groonga mysql", we also want to expand tritonn and "groonga mysql" to mroonga OR tritonn OR "groonga mysql". However, until now, we needed additional definitions in such a case. Therefore, if there are many target keywords for synonyms, defining synonyms is troublesome, because we need to define many similar definitions. In addition, removing synonyms is troublesome, because we need to remove many records.
Since this release, we can make a group by deciding on a representative synonym record. For example, all of the following keywords are in the “mroonga” group.
load --table Synonyms
[
{"_key": "mroonga", "representative": "mroonga"}
]
load --table Synonyms
[
{"_key": "tritonn", "representative": "mroonga"},
{"_key": "groonga mysql", "representative": "mroonga"}
]
In this case, mroonga is expanded to mroonga OR tritonn OR "groonga mysql". In addition, tritonn and "groonga mysql" are also expanded to mroonga OR tritonn OR "groonga mysql".
When we want to remove synonyms, we just remove the target record. For example, if we want to remove "groonga mysql" from the synonyms, we just remove {"_key": "groonga mysql", "representative": "mroonga"}.
[index_column_have_source_record] Added a new function index_column_have_source_record().
We can confirm whether a token that exists in the index is included in any of the records that are registered in Groonga or not.
Groonga does not remove a token even if the token is no longer used in any record after records are updated. Therefore, for example, when we use the autocomplete feature, Groonga may return a token that is not included in any record as a candidate for search words. By using this function, we can avoid returning such needless tokens, because this function can detect a token that is not included in any of the records.
[select] Added new arguments drilldown_max_n_target_records and drilldown[${LABEL}].max_n_target_records.
We can specify the maximum number of records of the drilldown target table (the filtered result) to be used for drilldown. If the number of records in the filtered result is larger than the specified value, some records in the filtered result aren’t used for drilldown. The default value of these arguments is -1. If these arguments are set to -1, Groonga uses all records for drilldown.
This argument is useful when the filtered result may be very large, because a drilldown against a large filtered result may be slow. We can limit the maximum number of records to be used for drilldown with this feature.
Here is an example limiting the maximum number of records to be used for drilldown. The last 2 records, {"_id": 4, "tag": "Senna"} and {"_id": 5, "tag": "Senna"}, aren’t used.
table_create Entries TABLE_HASH_KEY ShortText
column_create Entries content COLUMN_SCALAR Text
column_create Entries n_likes COLUMN_SCALAR UInt32
column_create Entries tag COLUMN_SCALAR ShortText
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
column_create Terms entries_key_index COLUMN_INDEX|WITH_POSITION Entries _key
column_create Terms entries_content_index COLUMN_INDEX|WITH_POSITION Entries content
load --table Entries
[
{"_key": "The first post!", "content": "Welcome! This is my first post!", "n_likes": 5, "tag": "Hello"},
{"_key": "Groonga", "content": "I started to use Groonga. It's very fast!", "n_likes": 10, "tag": "Groonga"},
{"_key": "Mroonga", "content": "I also started to use Mroonga. It's also very fast! Really fast!", "n_likes": 15, "tag": "Groonga"},
{"_key": "Good-bye Senna", "content": "I migrated all Senna system!", "n_likes": 3, "tag": "Senna"},
{"_key": "Good-bye Tritonn", "content": "I also migrated all Tritonn system!", "n_likes": 3, "tag": "Senna"}
]
select Entries \
  --limit -1 \
  --output_columns _id,tag \
  --drilldown tag \
  --drilldown_max_n_target_records 3
# [ # [ # 0, # 1337566253.89858, # 0.000355720520019531 # ], # [ # [ # [ # 5 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "tag", # "ShortText" # ] # ], # [ # 1, # "Hello" # ], # [ # 2, # "Groonga" # ], # [ # 3, # "Groonga" # ], # [ # 4, # "Senna" # ], # [ # 5, # "Senna" # ] # ], # [ # [ # 2 # ], # [ # [ # "_key", # "ShortText" # ], # [ # "_nsubrecs", # "Int32" # ] # ], # [ # "Hello", # 1 # ], # [ # "Groonga", # 2 # ] # ] # ] # ]
Summary#
Improvements#
[httpd] Updated bundled nginx to 1.23.3.
[select][POWER_SET] Vector’s power set is now able to aggregate with the drilldowns.
[select] A specific element of a vector column can now be a search target.
[load] Added support for the YYYY-MM-DD time format.
[load] Added support for slow log output of load.
[API] Added a new API grn_is_reference_count_enable().
[status] Added new items: back_trace and reference_count.
[AlmaLinux] Added support for AlmaLinux 9.
[escalate] Added a document for the escalate() function.
[Normalizers] Added NormalizerHTML. (Experimental)
[httpd] Updated bundled nginx to 1.23.2.
Suppressed logging a lot of same messages when no memory is available.
Changed the specification of the escalate() function (Experimental) to make it easier to use.
[Others: Build with CMake] Added a document about how to build Groonga with CMake.
[Others] Added descriptions about how to enable/disable Apache Arrow support when building with GNU Autotools.
[select] Added a document about drilldowns[${LABEL}].table.
[I18N] Updated the translation procedure.
Added a new function escalate(). (experimental)
[httpd] Updated bundled nginx to 1.23.1.
[select] Added a document for the --n_workers option.
Added new Munin plugins for groonga-delta.
[column_copy] Added support for weight vector.
[Ubuntu] Dropped support for Ubuntu 21.10 (Impish Indri).
[Debian GNU/Linux] Dropped Debian 10 (buster) support.
[select] Improved a little bit of performance for prefix search by search escalation.
[select] Added support for specifying a reference vector column with weight in drilldowns[LABEL]._key.
[select] Added support for doing drilldown with a reference vector with weight even if we use query, filter, or post_filter.
[Ubuntu] Added support for Ubuntu 22.04 (Jammy Jellyfish).
We don’t provide groonga-benchmark anymore.
[status] Added a new item memory_map_size.
[logical_count] Improved memory usage while logical_count is executed.
[dump] Added support for MISSING_IGNORE/MISSING_NIL.
[snippet],[snippet_html] Added support for text vector as input.
[vector_join] Added a new function vector_join().
[Indexing] Ignore too large tokens, as online index construction does.
[logical_range_filter] Added support for reducing reference immediately after processing a shard.
We increased the stability of the feature of recovering on crashes.
Improved performance for mmap if anonymous mmap available.
[Indexing] Added support for the static index construction against the following types of columns.
[column_create] Added new flags MISSING_* and INVALID_*.
[dump][column_list] Added support for MISSING_* and INVALID_* flags.
[schema] Added support for MISSING_* and INVALID_* flags.
We provided a package for Amazon Linux 2.
[Windows] Dropped support for building with Visual Studio 2017.
[query_expand] Added a support for synonym group.
[query_expand] Added a support for text vector and index.
Added support for disabling a backtrace by the environment variable.
[select] Improved performance for --slices.
[Windows] Added support for Visual Studio 2022.
[select] Added support for specifying max intervals for each element in near search.
[Groonga HTTP server] We can now use groonga-server-http even with the Groonga RPM packages.
[sub_filter] Added a new option pre_filter_threshold.
[index_column_have_source_record] Added a new function index_column_have_source_record().
[NormalizerNFKC130] Added a new option strip.
[select] Added new arguments drilldown_max_n_target_records and drilldown[${LABEL}].max_n_target_records.
[httpd] Updated bundled nginx to 1.21.6.
Fixes#
[select] Fixed a bug displaying a wrong label in drilldown results when command_version is 3.
[NormalizerTable] Fixed a bug that Groonga crashes with a specific definition setting in NormalizerTable.
[select][Vector column] Fixed a bug displaying an integer in the results when a weight vector column specifies WEIGHT_FLOAT32.
[select] Fixed a bug that Groonga could crash or return incorrect results when specifying n_workers.
Fixed a bug that Groonga could return incorrect results when we use NormalizerTable and it contains a non-idempotent (results can be changed when executed repeatedly) definition.
Fixed a bug that Groonga’s response may be slow when we execute request_cancel while executing an OR search.
Fixed a bug that Groonga may crash when we execute drilldown in parallel with the n_workers option.
[select] Fixed a bug that a syntax error occurred when we specify a very long expression in --filter.
Fixed a bug that Groonga’s response may be slow when we execute request_cancel while executing a search.
Fixed a bug that a string list can’t be cast to an int32 vector.
Fixed a bug that Groonga Munin Plugins do not work on AlmaLinux 8 and CentOS 7.
Fixed a bug that we may be not able to add a key to a table of patricia trie.