News - 13 series#
Release 13.1.1 - 2024-01-09#
Improvements#
Dropped support for mingw32. [GitHub#1654]
Added support for index search of “vector_column[N] OPERATOR literal” with --match_columns and --query.
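A minimal sketch of what such a query might look like, assuming a hypothetical Memos table with a ShortText vector column tags and an index on it, and assuming the bracketed element syntax from the entry above is accepted in --match_columns:
table_create Memos TABLE_NO_KEY
column_create Memos tags COLUMN_VECTOR ShortText
table_create Tags TABLE_PAT_KEY ShortText
column_create Tags memos_tags COLUMN_INDEX Memos tags
load --table Memos
[
{"tags": ["groonga", "mroonga"]}
]
select Memos --match_columns 'tags[0]' --query groonga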
Fixes#
[Windows] Bundled groonga-normalizer-mysql again. [GitHub#1655]
Groonga 13.1.0 for Windows didn’t include groonga-normalizer-mysql. This problem only occurred in Groonga 13.1.0.
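As a quick sanity check that the bundled normalizer is available again, something like the following should work. This is a hedged sketch: NormalizerMySQLGeneralCI is one of the normalizers provided by groonga-normalizer-mysql, and the plugin is registered with plugin_register.
plugin_register normalizers/mysql
normalize NormalizerMySQLGeneralCI "Ｇｒｏｏｎｇａ"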
Release 13.1.0 - 2023-12-26#
Improvements#
[select] Groonga now also caches the trace log.
Added support for outputting dict<string> in a response of Apache Arrow format.
[Groonga HTTP server] Added support for a new content type application/vnd.apache.arrow.stream.
[query] Added support for empty input as below.
table_create Users TABLE_NO_KEY
column_create Users name COLUMN_SCALAR ShortText
table_create Lexicon TABLE_HASH_KEY ShortText --default_tokenizer TokenBigramSplitSymbolAlphaDigit --normalizer NormalizerAuto
column_create Lexicon users_name COLUMN_INDEX|WITH_POSITION Users name
load --table Users
[
{"name": "Alice"},
{"name": "Alisa"},
{"name": "Bob"}
]
select Users --output_columns name,_score --filter 'query("name", " ")'
[ [ 0, 0.0, 0.0 ], [ [ [ 0 ], [ [ "name", "ShortText" ], [ "_score", "Int32" ] ] ] ] ]
Added support for BFloat16 (experimental).
We can just load and select BFloat16 values. We can’t use arithmetic operations such as bfloat16_value - 1.2.
[column_create] Added a new flag WEIGHT_BFLOAT16.
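A minimal sketch of loading and selecting a BFloat16 value. The schema is hypothetical; it only assumes that BFloat16 can be used as a column value type as described above:
table_create Items TABLE_HASH_KEY ShortText
column_create Items bfloat16_value COLUMN_SCALAR BFloat16
load --table Items
[
{"_key": "item1", "bfloat16_value": 1.5}
]
select Items --output_columns _key,bfloat16_value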
Fixes#
[select] Fixed a bug that, when Groonga cached an output_pretty=yes result, Groonga returned the result with output_pretty even if we sent a query without output_pretty.
Fixed a bug that wrong data could be created.
In general, users can’t trigger this explicitly because the command API doesn’t accept GRN_OBJ_{APPEND,PREPEND}. These may be used internally when a dynamic numeric vector column is created and a temporary result set is created (when OR is used). For example, the following query may create wrong data:
select TABLE \
  --match_columns TEXT_COLUMN \
  --query 'A B OR C' \
  --columns[NUMERIC_DYNAMIC_COLUMN].stage result_set \
  --columns[NUMERIC_DYNAMIC_COLUMN].type Float32 \
  --columns[NUMERIC_DYNAMIC_COLUMN].flags COLUMN_VECTOR
If this happens, NUMERIC_DYNAMIC_COLUMN contains many garbage elements, which also causes excessive memory consumption. Note that this is caused by an uninitialized variable on the stack, so it may or may not happen.
Fixed a bug that valid normalizers/token_filters may fail to be set.
[fuzzy_search] Fixed a crash bug that occurred when the following three conditions were all met:
The query has 2 or more multi-byte characters.
A key of ${ASCII}${ASCII}${MULTIBYTE}* characters exists in a patricia trie table.
WITH_TRANSPOSITION is enabled.
For example, the pair of “aaあ” in a patricia trie table and the query “あああ” has this problem, as below.
table_create Users TABLE_NO_KEY
column_create Users name COLUMN_SCALAR ShortText
table_create Names TABLE_PAT_KEY ShortText
column_create Names user COLUMN_INDEX Users name
load --table Users
[
{"name": "aaあ"},
{"name": "あうi"},
{"name": "あう"},
{"name": "あi"},
{"name": "iう"}
]
select Users --filter 'fuzzy_search(name, "あiう", {"with_transposition": true, "max_distance": 3})' --output_columns 'name, _score' --match_escalation_threshold -1
Release 13.0.9 - 2023-10-29#
Improvements#
[select] Changed the default value of --fuzzy_max_expansions from 0 to 10.
--fuzzy_max_expansions can limit the number of words with a close edit distance that are used in the search process. This argument can help to balance the number of hits and the performance of the search. When --fuzzy_max_expansions is 0, the search uses all words in the vocabulary list whose edit distance is under --fuzzy_max_distance. Because --fuzzy_max_expansions of 0 (unlimited) may slow down a search, the default value of --fuzzy_max_expansions is 10 from this release.
[select] Improved select arguments with the addition of a new argument --fuzzy_with_transposition (experimental).
We can choose edit distance 1 or 2 for the transposition case by using this argument. If this parameter is yes, the edit distance of this case is 1. It’s 2 otherwise.
[select] Improved select arguments with the addition of a new argument --fuzzy_tokenize.
When --fuzzy_tokenize is yes, Groonga uses the tokenizer specified in --default_tokenizer for typo-tolerant search. The default value of --fuzzy_tokenize is no. --fuzzy_tokenize is useful in the following case:
Search targets are only Japanese data.
TokenMecab is specified in --default_tokenizer.
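A minimal sketch combining these fuzzy search arguments. The schema is hypothetical; the argument names are the ones introduced above and in 13.0.8, and the values shown are only illustrative:
table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText
table_create Lexicon TABLE_PAT_KEY ShortText --default_tokenizer TokenNgram --normalizer NormalizerNFKC150
column_create Lexicon memos_content COLUMN_INDEX|WITH_POSITION Memos content
load --table Memos
[
{"content": "This is a pen"}
]
select Memos \
  --match_columns content \
  --query "Thas" \
  --fuzzy_max_distance 1 \
  --fuzzy_max_expansions 10 \
  --fuzzy_with_transposition yes \
  --output_columns content,_score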
[load] Added support for --ifexists even if we specify apache-arrow as input_type.
[Normalizers] Improved NormalizerNFKC* options with the addition of a new option remove_blank_force.
When remove_blank_force is false, the normalizer doesn’t ignore spaces, as below.
table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText
load --table Entries
[
{"body": "Groonga はとても速い"},
{"body": "Groongaはとても速い"}
]
select Entries --output_columns \
  'highlight(body, \
    "gaはとても", "<keyword>", "</keyword>", \
    {"normalizers": "NormalizerNFKC150(\\"remove_blank_force\\", false)"} \
  )'
[ [ 0, 0.0, 0.0 ], [ [ [ 2 ], [ [ "highlight", null ] ], [ "Groonga はとても速い" ], [ "Groon<keyword>gaはとても</keyword>速い" ] ] ] ]
[select] Improved select arguments with the addition of a new argument --output_trace_log (experimental).
If we specify yes in --output_trace_log and --command_version 3, Groonga outputs an additional new log as below.
table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText
table_create Lexicon TABLE_PAT_KEY ShortText --default_tokenizer TokenNgram --normalizer NormalizerNFKC150
column_create Lexicon memos_content COLUMN_INDEX|WITH_POSITION Memos content
load --table Memos
[
{"content": "This is a pen"},
{"content": "That is a pen"},
{"content": "They are pens"}
]
select Memos \
  --match_columns content \
  --query "Thas OR ere" \
  --fuzzy_max_distance 1 \
  --output_columns *,_score \
  --command_version 3 \
  --output_trace_log yes \
  --output_type apache-arrow
return_code: int32
start_time: timestamp[ns]
elapsed_time: double
error_message: string
error_file: string
error_line: uint32
error_function: string
error_input_file: string
error_input_line: int32
error_input_command: string
-- metadata --
GROONGA:data_type: metadata
   return_code  start_time                 elapsed_time  error_message  error_file  error_line  error_function  error_input_file  error_input_line  error_input_command
0  0            1970-01-01T09:00:00+09:00  0.000000      (null)         (null)      (null)      (null)          (null)            (null)            (null)
========================================
depth: uint16
sequence: uint16
name: string
value: dense_union<0: uint32=0, 1: string=1>
elapsed_time: uint64
-- metadata --
GROONGA:data_type: trace_log
   depth  sequence  name                          value  elapsed_time
0  1      0         ii.select.input               Thas   0
1  2      0         ii.select.exact.n_hits        0      1
2  2      0         ii.select.fuzzy.input         Thas   2
3  2      1         ii.select.fuzzy.input.actual  that   3
4  2      2         ii.select.fuzzy.input.actual  this   4
5  2      3         ii.select.fuzzy.n_hits        2      5
6  1      1         ii.select.n_hits              2      6
7  1      0         ii.select.input               ere    7
8  2      0         ii.select.exact.n_hits        2      8
9  1      1         ii.select.n_hits              2      9
========================================
content: string
_score: double
-- metadata --
GROONGA:n_hits: 2
   content        _score
0  This is a pen  1.000000
1  That is a pen  1.000000
--output_trace_log is valid only in command version 3. This will be useful for the following cases:
Detecting the real words used by a fuzzy query.
Measuring elapsed time without reading the query log.
[query] Added support for object literal.
[query_expand] Added support for NPP and ONPP (experimental).
[snippet] Added support for the normalizers option.
We can use a normalizer with options. For example, when we don’t want to ignore spaces in the snippet() function, we use this option as below.
table_create Entries TABLE_NO_KEY
column_create Entries content COLUMN_SCALAR ShortText
load --table Entries
[
{"content": "Groonga and MySQL"},
{"content": "Groonga and My SQL"}
]
select Entries \
  --output_columns \
  'snippet(content, "MySQL", "<keyword>", "</keyword>", {"normalizers": "NormalizerNFKC150(\\"remove_blank_force\\", false)"})'
[ [ 0, 0.0, 0.0 ], [ [ [ 2 ], [ [ "snippet", null ] ], [ [ "Groonga and <keyword>MySQL</keyword>" ] ], [ null ] ] ] ]
Fixes#
Fixed a bug in Time OPERATOR Float{,32} comparison. GH-1624 [Reported by yssrku]
Microsecond (smaller than second) information in Float{,32} isn’t used. This happens only in Time OPERATOR Float{,32}.
This happens in load --ifexists 'A OP B || C OP D' as below.
table_create Reports TABLE_HASH_KEY ShortText
column_create Reports content COLUMN_SCALAR Text
column_create Reports modified_at COLUMN_SCALAR Time
load --table Reports
[
{"_key": "a", "content": "", "modified_at": 1663989875.438}
]
load \
  --table Reports \
  --ifexists 'content == "" && modified_at <= 1663989875.437'
However, this doesn’t happen in select --filter.
Fixed a bug that alnum(a-zA-Z0-9) + blank may be detected.
If the input is 2 characters such as ab and text with some blanks such as a b is matched, a b is detected. However, it should not be detected in this case.
For example, a i is detected when this bug occurs, as below.
table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText
load --table Entries
[
{"body": "Groonga is fast"}
]
select Entries \
  --output_columns 'highlight(body, "ai", "<keyword>", "</keyword>")'
[ [ 0,0.0,0.0 ], [ [ [ 1 ], [ [ "highlight", null ] ], [ "Groong<keyword>a i</keyword>s fast" ] ] ] ]
However, the above result is unexpected. We don’t want to detect a i in this case.
Thanks#
yssrku
Release 13.0.8 - 2023-09-29#
Improvements#
[column_create] Improved column_create flags with the addition of the new flags COLUMN_FILTER_SHUFFLE, COLUMN_FILTER_BYTE_DELTA, COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE, and COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES.
Added a new bundled library, Blosc. The COLUMN_FILTER_SHUFFLE, COLUMN_FILTER_BYTE_DELTA, COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE, and COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES flags require Blosc.
[status] Improved status output with the addition of a new feature entry "blosc".
[groonga executable file] Improved groonga --version output with the addition of a new value blosc.
[select] Improved select arguments with the addition of a new argument --fuzzy_max_distance (experimental).
[select] Improved select arguments with the addition of a new argument --fuzzy_max_expansions (experimental).
[select] Improved select arguments with the addition of a new argument --fuzzy_max_distance_ratio (experimental).
[select] Improved select arguments with the addition of a new argument --fuzzy_prefix_length (experimental).
[cast] Added support for casting "[0.0, 1.0, 1.1, ...]" to a Float/Float32 vector.
[fuzzy_search] Renamed the max_expansion option to max_expansions. The max_expansion option is deprecated since this release. However, max_expansion can still be used for backward compatibility.
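A hedged sketch of the renamed option in fuzzy_search(), following the option-object style used elsewhere in these notes (the Users/name schema is hypothetical):
select Users \
  --filter 'fuzzy_search(name, "groona", {"max_distance": 2, "max_expansions": 10})' \
  --output_columns 'name, _score'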
Renamed the master branch to the main branch.
[RPM] Use CMake for building.
[Debian] Added support for Debian trixie.
Fixes#
[fuzzy_search] Fixed a bug that Groonga may return records that should not match.
[Near phrase search condition][Near phrase search operator] Fixed a bug that Groonga crashed when the first phrase group doesn’t match anything, as below.
table_create Entries TABLE_NO_KEY column_create Entries content COLUMN_SCALAR Text table_create Terms TABLE_PAT_KEY ShortText \ --default_tokenizer TokenNgram \ --normalizer NormalizerNFKC121 column_create Terms entries_content COLUMN_INDEX|WITH_POSITION \ Entries content load --table Entries [ {"content": "x y z"} ] select Entries \ --match_columns Terms.entries_content.content \ --query '*NPP1"(NONEXISTENT) (z)"' \ --output_columns '_score, content'
Release 13.0.7 - 2023-09-12#
Fixes#
[normalize] Fixed a bug that the normalize command doesn’t output the last offset and type.
The normalize command can output the offset and type of the string after normalization as below, but the normalize command didn’t output the last offset and type because of this bug.
table_create Normalizations TABLE_PAT_KEY ShortText
column_create Normalizations normalized COLUMN_SCALAR ShortText
load --table Normalizations
[
{"_key": "あ", "normalized": "<あ>"}
]
normalize 'NormalizerNFKC130("unify_kana", true, "report_source_offset", true), NormalizerTable("normalized", "Normalizations.normalized", "report_source_offset", true)' "お あ a ア i ア オ" REMOVE_BLANK|WITH_TYPES|WITH_CHECKS
[ [ 0, 0.0, 0.0 ], { "normalized": "お<あ>a<あ>i<あ>お", "types": [ "hiragana", "symbol", "hiragana", "symbol", "alpha", "symbol", "hiragana", "symbol", "alpha", "symbol", "hiragana", "symbol", "hiragana" ], "checks": [ 3, 0, 0, 4, -1, 0, 0, -1, 4, 4, -1, 0, 0, -1, 4, 4, -1, 0, 0, -1, 4, 0, 0 ], "offsets": [ 0, 4, 4, 4, 8, 12, 12, 12, 16, 20, 20, 20, 24 ] } ]
[Normalizers] Fixed a bug that the last offset value may be invalid when we use multiple normalizers.
In the following example, the last offset value should be 27, but it is 17 because of this bug.
table_create Normalizations TABLE_PAT_KEY ShortText column_create Normalizations normalized COLUMN_SCALAR ShortText load --table Normalizations [ {"_key": "あ", "normalized": "<あ>"} ] normalize 'NormalizerNFKC130("unify_kana", true, "report_source_offset", true), NormalizerTable("normalized", "Normalizations.normalized", "report_source_offset", true)' "お あ a ア i ア オ" REMOVE_BLANK|WITH_TYPES|WITH_CHECKS [ [ 0, 0.0, 0.0 ], { "normalized": "お<あ>a<あ>i<あ>お", "types": [ "hiragana", "symbol", "hiragana", "symbol", "alpha", "symbol", "hiragana", "symbol", "alpha", "symbol", "hiragana", "symbol", "hiragana", "null" ], "checks": [ 3, 0, 0, 4, -1, 0, 0, -1, 4, 4, -1, 0, 0, -1, 4, 4, -1, 0, 0, -1, 4, 0, 0 ], "offsets": [ 0, 4, 4, 4, 8, 12, 12, 12, 16, 20, 20, 20, 24, 17 ] } ]
Release 13.0.6 - 2023-08-31#
Improvements#
[highlight_html] Don’t report an error when we specify an empty string to highlight_html(), as below. highlight_html() just returns an empty text.
table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer 'TokenNgram("report_source_location", true)' \
  --normalizer 'NormalizerNFKC150'
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
load --table Entries
[
{"body": "ab cd ed gh"}
]
select Entries \
  --match_columns body \
  --query 'ab' \
  --output_columns 'highlight_html("", Terms)'
[ [ 0, 0.0, 0.0 ], [ [ [ 1 ], [ [ "highlight_html",null ] ], [ "" ] ] ] ]
Added support for aggregator_* for dynamic columns and pseudo columns.
A pseudo column is a column with a _ prefix (e.g. _id, _nsubrecs, …).
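A hedged sketch of using an aggregator with a pseudo column in a labeled drilldown. It assumes a hypothetical Entries table with a tag column and follows the drilldowns[LABEL].columns[NAME] style that Groonga uses for dynamic columns; the column names here are illustrative:
select Entries \
  --drilldowns[tag].keys tag \
  --drilldowns[tag].columns[max_id].stage group \
  --drilldowns[tag].columns[max_id].type UInt32 \
  --drilldowns[tag].columns[max_id].flags COLUMN_SCALAR \
  --drilldowns[tag].columns[max_id].value 'aggregator_max(_id)' \
  --drilldowns[tag].output_columns _key,max_id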
Fixes#
[CMake] Fixed a build error with CMake when both of msgpack and msgpackc-cxx are installed.
Please refer to the comment in groonga/groonga#1601 for details.
Fixed a parse bug when we use x OR <0.0y with QUERY_NO_SYNTAX_ERROR. Records that should match may not be matched.
For example, if we execute the following query, {"_key": "name yyy"} should match, but {"_key": "name yyy"} is not matched.
table_create Names TABLE_PAT_KEY ShortText
table_create Tokens TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto
column_create Tokens names_key COLUMN_INDEX|WITH_POSITION Names _key
load --table Names
[
{"_key": "name yyy"}
]
select Names \
  --match_columns "_key" \
  --query "xxx OR <0.0yyy" \
  --query_flags ALLOW_PRAGMA|ALLOW_COLUMN|QUERY_NO_SYNTAX_ERROR
[ [ 0, 0.0, 0.0 ], [ [ [ 0 ], [ [ "_id", "UInt32" ], [ "_key", "ShortText" ] ] ] ] ]
[highlight_html] Fixed a bug that highlight position may be incorrect.
For example, this bug occurs when we specify both a one-character keyword and a two-character keyword as highlight targets.
Release 13.0.5 - 2023-08-02#
Fixes#
Fixed a bug that index creation may fail.
Groonga v13.0.2, v13.0.3, and v13.0.4 have this bug. Therefore, if you have already used one of these versions, we highly recommend that you use Groonga v13.0.5 or later.
[Near phrase search condition][Near phrase search operator] Fixed a bug that Groonga may crash when we specify a query with invalid syntax.
For example, Groonga is supposed to report an error in the following case. However, with this bug, Groonga crashed instead.
(Note that there is one extra closing parenthesis in the value of --query in the following case.)
table_create Entries TABLE_NO_KEY
[[0,0.0,0.0],true]
column_create Entries content COLUMN_SCALAR Text
[[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenNgram --normalizer NormalizerNFKC121
[[0,0.0,0.0],true]
column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content
[[0,0.0,0.0],true]
load --table Entries
[
{"content": "a b c"}
]
[[0,0.0,0.0],1]
select Entries --match_columns content --query '*NPP2"(a b))"' --output_columns '_score, content'
Release 13.0.4 - 2023-07-26#
Improvements#
[Windows] Stopped providing 32-bit packages.
Fixes#
[Debian GNU/Linux] [Ubuntu] Fixed the default configuration file path for QueryExpanderTSV.
[CMake] Fixed a bug that some errors may be reported when CMake 3.16 or 3.17 are used.
Release 13.0.3 - 2023-07-24#
Improvements#
[groonga-httpd] Extracted to groonga-nginx. We stopped providing the groonga-httpd package.
If you’re a user of Debian GNU/Linux 12+ or Ubuntu 23.10+, you can use the libnginx-mod-http-groonga package with the default nginx package. See groonga-nginx’s README for details.
If you’re a user of an old Debian/Ubuntu or a RHEL-related distribution, you can’t use any groonga-httpd equivalent package. You can use Groonga HTTP server instead. If Groonga HTTP server isn’t suitable for your use case, please report it to Discussions with your use case.
[Ubuntu] Added support for Ubuntu 23.10 (Mantic Minotaur).
[Debian GNU/Linux] Enabled xxHash support.
[Ubuntu] Enabled xxHash support.
Fixes#
Fixed a bug that the source archive can’t be built with CMake.
Release 13.0.2 - 2023-07-12#
Improvements#
[Ubuntu] Dropped support for Ubuntu 18.04 (Bionic Beaver).
[Ubuntu] Added support for Ubuntu 23.04 (Lunar Lobster).
[Debian GNU/Linux] Added support for Debian GNU/Linux 12 (bookworm).
[Oracle Linux] Dropped support for Oracle Linux. Use AlmaLinux packages instead.
[grn_highlighter] Added support for changing the tag. GH-1453 [Reported by askdkc]
[grn_highlighter] Added support for customizing normalizers.
[grn_highlighter] Added support for customizing HTML mode.
[grn_highlighter] Added support for multiple tags.
[highlight] Added the sequential_class_tag_mode option.
[highlight_html] Added the sequential_class_tag_mode option.
[reference_acquire] Changed to also refer to index columns for the target object when --recursive dependent is used.
If the target object is a column, index columns for the column are also referred to.
If the target object is a table, index columns for the table are also referred to.
If the target object is a DB, all tables are processed as target objects.
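A hedged sketch of how --recursive dependent is used with these commands, assuming a hypothetical Users table with a name column; the point is only that indexes referencing the target are also covered by the acquisition:
reference_acquire --target_name Users.name --recursive dependent
# ... operations that must not be affected by reference releases ...
reference_release --target_name Users.name --recursive dependent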
[CMake] Changed to require CMake 3.16 or later. We’ll use CMake instead of GNU Autotools as our recommended build tool in near future and drop support for GNU Autotools eventually.
[CMake] Added support for a CMake package. You can use it with find_package(Groonga).
[Packaging] Changed to use newer GNU Autotools to generate configure in the source archive. [ranguba/rroonga#220 <https://github.com/ranguba/rroonga/issues/220>] [Reported by ZangRuochen]
[reference_acquire] Optimized the reference count implementation for built-in objects.
Added support for logging a backtrace on SIGABRT.
Fixes#
[Ordered near phrase product search condition] [Ordered near phrase product search operator] Fixed a search bug that records that should be matched may not be matched.
It happens when multiple 3+ character tokens overlap in the query. For example, abc and abcd are an invalid combination. If the shorter one (abc) appears before the longer one (abcd), this bug happens. For example, ONPP1 "(abcd abc) (1 2)" works but ONPP1 "(abc abcd) (1 2)" doesn’t work.
Fixed a bug that an invalid weight may be used with multiple adjusts.
[Query syntax] Fixed a typo. [GH-1560 <https://github.com/groonga/groonga/issues/1560>] [Patch by Dylan Golow]
[Near phrase search condition] [Near phrase search operator] [Near phrase product search condition] [Near phrase product search operator] [Ordered near phrase search condition] [Ordered near phrase search operator] [Ordered near phrase product search condition] [Ordered near phrase product search operator] Fixed an invalid interval calculation when additional_last_interval is used.
For example, let’s think about *NP3,-1"aaa bbb .$" against aaaxxxbbbcdefghi. In this case, the number of tokens between aaa and bbb must be 3, but it was 7.
[Near phrase product search condition] [Near phrase product search operator] [Ordered near phrase product search condition] [Ordered near phrase product search operator] Fixed infinite loop bugs when the same phrase exists in the same phrase group.
For example, *NPP1 "(abcd abc abcd bcde) (efghi)" is a bad query because the first phrase group has two abcd phrases.
For example, *NPP1 "(abcde \"abc de\") (efghi)" is a bad query because the first phrase group has abcde and "abc de". They are “logically” the same phrases.
Fixed a bug that the internal lock count may not be decreased when lock acquisition fails. In normal use cases, this will not be a real problem.
Thanks#
askdkc
Dylan Golow
ZangRuochen
Release 13.0.1 - 2023-03-24#
Improvements#
[highlight_html] Added support for prefix search.
We can now use prefix search in highlight_html.
Note that the highlight keyword is highlighted not only at the beginning but also in the middle or at the end.
table_create Tags TABLE_NO_KEY column_create Tags name COLUMN_SCALAR ShortText table_create Terms TABLE_PAT_KEY ShortText \ --normalizer 'NormalizerNFKC150' column_create Terms tags_name COLUMN_INDEX Tags name load --table Tags [ {"name": "Groonga"} ] select Tags \ --query "name:^g" \ --output_columns "highlight_html(name)" # [ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 1 # ], # [ # [ # "highlight_html", # null # ] # ], # [ # "<span class=\"keyword\">G</span>roon<span class=\"keyword\">g</span>a" # ] # ] # ] # ]
[Normalizers] Added new options for NormalizerNFKC*.
unify_kana_prolonged_sound_mark
We can now normalize the prolonged sound mark with this option as below.
ァー -> ァア, アー -> アア, ヵー -> ヵア, カー -> カア, ガー -> ガア, サー -> サア, ザー -> ザア, ター -> タア, ダー -> ダア, ナー -> ナア, ハー -> ハア, バー -> バア, パー -> パア, マー -> マア, ャー -> ャア, ヤー -> ヤア, ラー -> ラア, ヮー -> ヮア, ワー -> ワア, ヷー -> ヷア, ィー -> ィイ, イー -> イイ, キー -> キイ, ギー -> ギイ, シー -> シイ, ジー -> ジイ, チー -> チイ, ヂー -> ヂイ, ニー -> ニイ, ヒー -> ヒイ, ビー -> ビイ, ピー -> ピイ, ミー -> ミイ, リー -> リイ, ヰー -> ヰイ, ヸー -> ヸイ, ゥー -> ゥウ, ウー -> ウウ, クー -> クウ, グー -> グウ, スー -> スウ, ズー -> ズウ, ツー -> ツウ, ヅー -> ヅウ, ヌー -> ヌウ, フー -> フウ, ブー -> ブウ, プー -> プウ, ムー -> ムウ, ュー -> ュウ, ユー -> ユウ, ルー -> ルウ, ヱー -> ヱウ, ヴー -> ヴウ, ェー -> ェエ, エー -> エエ, ヶー -> ヶエ, ケー -> ケエ, ゲー -> ゲエ, セー -> セエ, ゼー -> ゼエ, テー -> テエ, デー -> デエ, ネー -> ネエ, ヘー -> ヘエ, ベー -> ベエ, ペー -> ペエ, メー -> メエ, レー -> レエ, ヹー -> ヹエ, ォー -> ォオ, オー -> オオ, コー -> コオ, ゴー -> ゴオ, ソー -> ソオ, ゾー -> ゾオ, トー -> トオ, ドー -> ドオ, ノー -> ノオ, ホー -> ホオ, ボー -> ボオ, ポー -> ポオ, モー -> モオ, ョー -> ョオ, ヨー -> ヨオ, ロー -> ロオ, ヲー -> ヲオ, ヺー -> ヺオ, ンー -> ンン ぁー -> ぁあ, あー -> ああ, ゕー -> ゕあ, かー -> かあ, がー -> があ, さー -> さあ, ざー -> ざあ, たー -> たあ, だー -> だあ, なー -> なあ, はー -> はあ, ばー -> ばあ, ぱー -> ぱあ, まー -> まあ, ゃー -> ゃあ, やー -> やあ, らー -> らあ, ゎー -> ゎあ, わー -> わあ ぃー -> ぃい, いー -> いい, きー -> きい, ぎー -> ぎい, しー -> しい, じー -> じい, ちー -> ちい, ぢー -> ぢい, にー -> にい, ひー -> ひい, びー -> びい, ぴー -> ぴい, みー -> みい, りー -> りい, ゐー -> ゐい ぅー -> ぅう, うー -> うう, くー -> くう, ぐー -> ぐう, すー -> すう, ずー -> ずう, つー -> つう, づー -> づう, ぬー -> ぬう, ふー -> ふう, ぶー -> ぶう, ぷー -> ぷう, むー -> むう, ゅー -> ゅう, ゆー -> ゆう, るー -> るう, ゑー -> ゑう, ゔー -> ゔう ぇー -> ぇえ, えー -> ええ, ゖー -> ゖえ, けー -> けえ, げー -> げえ, せー -> せえ, ぜー -> ぜえ, てー -> てえ, でー -> でえ, ねー -> ねえ, へー -> へえ, べー -> べえ, ぺー -> ぺえ, めー -> めえ, れー -> れえ ぉー -> ぉお, おー -> おお, こー -> こお, ごー -> ごお, そー -> そお, ぞー -> ぞお, とー -> とお, どー -> どお, のー -> のお, ほー -> ほお, ぼー -> ぼお, ぽー -> ぽお, もー -> もお, ょー -> ょお, よー -> よお, ろー -> ろお, をー -> をお んー -> んん
Here is an example of unify_kana_prolonged_sound_mark.
table_create --name Animals --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Animals --name name --type ShortText
column_create --table Animals --name sound --type ShortText
load --table Animals
[
{"_key":"1","name":"羊", "sound":"メーメー"},
]
table_create \
  --name idx_animals_sound \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_kana_prolonged_sound_mark", true)'
column_create --table idx_animals_sound --name animals_sound --flags COLUMN_INDEX|WITH_POSITION --type Animals --source sound
select --table Animals --query sound:@メエメエ
# [ # [ # 0, # 1677829950.652696, # 0.01971983909606934 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ], # [ # "sound", # "ShortText" # ] # ], # [ # 1, # "1", # "羊", # "メーメー" # ] # ] # ] # ]
unify_kana_hyphen
We can now normalize hyphen with this option as below.
ァ- -> ァア, ア- -> アア, ヵ- -> ヵア, カ- -> カア, ガ- -> ガア, サ- -> サア, ザ- -> ザア, タ- -> タア, ダ- -> ダア, ナ- -> ナア, ハ- -> ハア, バ- -> バア, パ- -> パア, マ- -> マア, ャ- -> ャア, ヤ- -> ヤア, ラ- -> ラア, ヮ- -> ヮア, ワ- -> ワア, ヷ- -> ヷア, ィ- -> ィイ, イ- -> イイ, キ- -> キイ, ギ- -> ギイ, シ- -> シイ, ジ- -> ジイ, チ- -> チイ, ヂ- -> ヂイ, ニ- -> ニイ, ヒ- -> ヒイ, ビ- -> ビイ, ピ- -> ピイ, ミ- -> ミイ, リ- -> リイ, ヰ- -> ヰイ, ヸ- -> ヸイ, ゥ- -> ゥウ, ウ- -> ウウ, ク- -> クウ, グ- -> グウ, ス- -> スウ, ズ- -> ズウ, ツ- -> ツウ, ヅ- -> ヅウ, ヌ- -> ヌウ, フ- -> フウ, ブ- -> ブウ, プ- -> プウ, ム- -> ムウ, ュ- -> ュウ, ユ- -> ユウ, ル- -> ルウ, ヱ- -> ヱウ, ヴ- -> ヴウ, ェ- -> ェエ, エ- -> エエ, ヶ- -> ヶエ, ケ- -> ケエ, ゲ- -> ゲエ, セ- -> セエ, ゼ- -> ゼエ, テ- -> テエ, デ- -> デエ, ネ- -> ネエ, ヘ- -> ヘエ, ベ- -> ベエ, ペ- -> ペエ, メ- -> メエ, レ- -> レエ, ヹ- -> ヹエ, ォ- -> ォオ, オ- -> オオ, コ- -> コオ, ゴ- -> ゴオ, ソ- -> ソオ, ゾ- -> ゾオ, ト- -> トオ, ド- -> ドオ, ノ- -> ノオ, ホ- -> ホオ, ボ- -> ボオ, ポ- -> ポオ, モ- -> モオ, ョ- -> ョオ, ヨ- -> ヨオ, ロ- -> ロオ, ヲ- -> ヲオ, ヺ- -> ヺオ, ン- -> ンン ぁ- -> ぁあ, あ- -> ああ, ゕ- -> ゕあ, か- -> かあ, が- -> があ, さ- -> さあ, ざ- -> ざあ, た- -> たあ, だ- -> だあ, な- -> なあ, は- -> はあ, ば- -> ばあ, ぱ- -> ぱあ, ま- -> まあ, ゃ- -> ゃあ, や- -> やあ, ら- -> らあ, ゎ- -> ゎあ, わ- -> わあ ぃ- -> ぃい, い- -> いい, き- -> きい, ぎ- -> ぎい, し- -> しい, じ- -> じい, ち- -> ちい, ぢ- -> ぢい, に- -> にい, ひ- -> ひい, び- -> びい, ぴ- -> ぴい, み- -> みい, り- -> りい, ゐ- -> ゐい ぅ- -> ぅう, う- -> うう, く- -> くう, ぐ- -> ぐう, す- -> すう, ず- -> ずう, つ- -> つう, づ- -> づう, ぬ- -> ぬう, ふ- -> ふう, ぶ- -> ぶう, ぷ- -> ぷう, む- -> むう, ゅ- -> ゅう, ゆ- -> ゆう, る- -> るう, ゑ- -> ゑう, ゔ- -> ゔう ぇ- -> ぇえ, え- -> ええ, ゖ- -> ゖえ, け- -> けえ, げ- -> げえ, せ- -> せえ, ぜ- -> ぜえ, て- -> てえ, で- -> でえ, ね- -> ねえ, へ- -> へえ, べ- -> べえ, ぺ- -> ぺえ, め- -> めえ, れ- -> れえ ぉ- -> ぉお, お- -> おお, こ- -> こお, ご- -> ごお, そ- -> そお, ぞ- -> ぞお, と- -> とお, ど- -> どお, の- -> のお, ほ- -> ほお, ぼ- -> ぼお, ぽ- -> ぽお, も- -> もお, ょ- -> ょお, よ- -> よお, ろ- -> ろお, を- -> をお ん- -> んん
Here is an example of unify_kana_hyphen.
table_create --name Animals --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Animals --name name --type ShortText
column_create --table Animals --name sound --type ShortText
load --table Animals
[
{"_key":"1","name":"羊", "sound":"メ-メ-"},
]
table_create \
  --name idx_animals_sound \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_kana_hyphen", true)'
column_create --table idx_animals_sound --name animals_sound --flags COLUMN_INDEX|WITH_POSITION --type Animals --source sound
select --table Animals --query sound:@メエメエ
# [ # [ # 0,1677829950.652696, # 0.01971983909606934 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ], # [ # "sound", # "ShortText" # ] # ], # [ # 1, # "1", # "羊", # "メ-メ-" # ] # ] # ] # ]
[Near search condition][Near search operator] Added a new option ${MIN_INTERVAL} for the near-search family.
We can now specify the minimum interval between phrases (words) with ${MIN_INTERVAL}. The interval between phrases (words) must be at least this value.
Here is the new syntax:
*N${MAX_INTERVAL},${MAX_TOKEN_INTERVAL_1}|${MAX_TOKEN_INTERVAL_2}|...,${MIN_INTERVAL} "word1 word2 ..."
*NP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "phrase1 phrase2 ..."
*NPP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "(phrase1-1 phrase1-2 ...) (phrase2-1 phrase2-2 ...) ..."
*ONP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "phrase1 phrase2 ..."
*ONPP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL},${MAX_PHRASE_INTERVAL_1}|${MAX_PHRASE_INTERVAL_2}|...,${MIN_INTERVAL} "(phrase1-1 phrase1-2 ...) (phrase2-1 phrase2-2 ...) ..."
The default value of ${MIN_INTERVAL} is INT32_MIN (-2147483648). We use the default value when ${MIN_INTERVAL} is omitted.
This option is useful when we want to ignore overlapped phrases.
The interval for *NP is calculated as interval between the top tokens of phrases - tokens in the left phrase + 1.
When the tokenizer is Bigram, for example, 東京 has one token 東京, and 京都 also has one token 京都.
Considering 東京都 as a target value of *NP "東京 京都":
interval between the top tokens of phrases: 1 (the interval between 東京 and 京都)
tokens in the left phrase: 1 (東京)
The interval for *NP of 東京都 is 1 - 1 + 1 = 1.
As a result, the interval for *NP is greater than 1 when 東京 and 京都 are not overlapped.
Here is an example for ignoring overlapped phrases.
table_create Entries TABLE_NO_KEY column_create Entries content COLUMN_SCALAR Text table_create Terms TABLE_PAT_KEY ShortText \ --default_tokenizer 'TokenNgram("unify_alphabet", false, \ "unify_digit", false)' \ --normalizer NormalizerNFKC150 column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content load --table Entries [ {"content": "東京都"}, {"content": "東京京都"} ] select Entries \ --match_columns content \ --query '*NP-1,0,,2"東京 京都"' \ --output_columns '_score, content' # [ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_score", # "Int32" # ], # [ # "content", # "Text" # ] # ], # [ # 1, # "東京京都" # ] # ] # ] # ]
In the example above, 東京都 is not matched as the interval is 1, but 東京京都 is matched as the interval is 2.
[Normalizers] Added support for new values in the unify_katakana_trailing_o option.
We added support for normalizing the following new values in the unify_katakana_trailing_o option because the vowel of their left letters is O.
ォオ -> ォウ
ョオ -> ョウ
ヺオ -> ヺウ
Added support for MessagePack v6.0.0. [GitHub#1536][Reported by Carlo Cabrera]
Until now, Groonga could not find MessagePack v6.0.0 or later when we executed configure or cmake. Since this release, Groonga can find MessagePack even if the version of MessagePack is v6.0.0 or later.
Fixes#
[Normalizers] Fixed a bug that NormalizerNFKC* did incorrect normalization.
This bug occurred when unify_kana_case and unify_katakana_v_sounds were used at the same time.
For example, ヴァ was normalized to バア with unify_kana_case and unify_katakana_v_sounds, but ヴァ should be normalized to バ.
This was because ヴァ was normalized to ヴア with unify_kana_case, and after that, ヴア was normalized to バア with unify_katakana_v_sounds. We fixed this to normalize characters with unify_katakana_v_sounds before unify_kana_case.
Here is an example of the bug in the previous version.
normalize \ 'NormalizerNFKC150("unify_katakana_v_sounds", true, \ "unify_kana_case", true)' \ "ヴァーチャル" #[ # [ # 0, # 1678097412.913053, # 0.00019073486328125 # ], # { # "normalized":"ブアーチヤル", # "types":[], # "checks":[] # } #]
From this version, ヴァーチャル is normalized to バーチヤル.
[Ordered near phrase search condition][Ordered near phrase search operator] Fixed a bug that ${MAX_PHRASE_INTERVALS} doesn’t work correctly.
When this bug occurred, intervals were regarded as 0. Therefore, if this bug occurs, too many records may hit.
This bug occurred when:
*ONP is used with ${MAX_PHRASE_INTERVALS}.
The number of tokens in the matched left phrase is greater than or equal to the number of ${MAX_PHRASE_INTERVALS} elements.
Here is an example of this bug.
table_create Entries TABLE_NO_KEY column_create Entries content COLUMN_SCALAR Text table_create Terms TABLE_PAT_KEY ShortText \ --default_tokenizer 'TokenNgram("unify_alphabet", false, \ "unify_digit", false)' \ --normalizer NormalizerNFKC150 column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content load --table Entries [ {"content": "abcXYZdef"}, {"content": "abcdef"}, {"content": "abc123456789def"}, {"content": "abc12345678def"}, {"content": "abc1de2def"} ] select Entries --filter 'content *ONP-1,0,1 "abc def"' --output_columns '_score, content' #[ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 5 # ], # [ # [ # "_score", # "Int32" # ], # [ # "content", # "Text" # ] # ], # [ # 1, # "abcXYZdef" # ], # [ # 1, # "abcdef" # ], # [ # 1, # "abc123456789def" # ], # [ # 1, # "abc12345678def" # ], # [ # 1, # "abc1de2def" # ] # ] # ] #]
In the example above, the first element interval is specified as 1 with *ONP-1,0,1 "abc def", but all content is matched, including those farther than 1.
This is because the example satisfies the conditions for the bug and the interval is regarded as 0.
*ONP is used with ${MAX_PHRASE_INTERVALS}:
*ONP-1,0,1 "abc def" specifies ${MAX_PHRASE_INTERVALS}.
The number of tokens in the matched left phrase is greater than or equal to the number of ${MAX_PHRASE_INTERVALS} elements:
The matched left phrase: abc
Included tokens: ab, bc (tokenized with TokenNgram("unify_alphabet", false, "unify_digit", false))
The number of elements specified with max_element_intervals: 1 (the 1 of *ONP-1,0,1 "abc def")
The number of tokens in the left phrase (2) > the number of elements specified with max_element_intervals (1)
[Near phrase search condition][Near phrase search operator] Fixed a bug that phrases specified as last weren’t used as last in the near-phrase family.
When this bug occurred, ${ADDITIONAL_LAST_INTERVAL} was ignored and only ${MAX_INTERVAL} was used.
This bug occurred when:
A phrase specified as last contains multiple tokens.
The size of the last token of the phrase is smaller than or equal to the sizes of other tokens in the phrase.
The token size is the number of times the token appears in all records.
Here is an example of this bug.
table_create Entries TABLE_NO_KEY column_create Entries content COLUMN_SCALAR Text table_create Terms TABLE_PAT_KEY ShortText \ --default_tokenizer 'TokenNgram("unify_alphabet", false, \ "unify_digit", false)' \ --normalizer NormalizerNFKC150 column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content load --table Entries [ {"content": "abc123456789defg"}, {"content": "dededede"} ] select Entries \ --filter 'content *NP10,1"abc defg$"' \ --output_columns '_score, content' #[ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 0 # ], # [ # [ # "_score", # "Int32" # ], # [ # "content", # "Text" # ] # ] # ] # ] #]
In the example above, for abc123456789defg, the interval from abc to defg is 11. ${MAX_INTERVAL} is 10 and ${ADDITIONAL_LAST_INTERVAL} is 1, so the threshold for matching the last phrase is 11. So it should be matched, but isn’t.
This is because the example satisfies the conditions for the bug as below, and only ${MAX_INTERVAL} is used.
A phrase specified as last contains multiple tokens:
defg$ is specified as last because the suffix is $. defg$ is tokenized to de, ef, fg with TokenNgram("unify_alphabet", false, "unify_digit", false).
The size of the last token of the phrase is smaller than or equal to the sizes of other tokens in the phrase:
fg is the last token of defg$. abc123456789defg contains one fg and one de, and dededede contains 4 de. So, the size of fg is 1 and the size of de is 5.
[Near phrase search condition] Fixed the interval calculation.
If we use near phrase search, too many records may hit because of this bug.
[highlight_html] Fixed a bug that the highlight position may shift when we use loose_symbol=true.
Thanks#
Carlo Cabrera
Release 13.0.0 - 2023-02-09#
This is a major version up! But it keeps backward compatibility. We can upgrade to 13.0.0 without rebuilding the database.
First of all, we introduce the main changes in 13.0.0. Then, we introduce the highlights and a summary of changes from Groonga 12.0.0 to 12.1.2.
Improvements#
[Normalizers] Added a new Normalizer NormalizerNFKC150 based on Unicode NFKC (Normalization Form Compatibility Composition) for Unicode 15.0.
[Token filters] Added a new TokenFilter TokenFilterNFKC150 based on Unicode NFKC (Normalization Form Compatibility Composition) for Unicode 15.0.
[NormalizerNFKC150] Added new options for NormalizerNFKC* as below.
unify_katakana_gu_small_sounds
We can normalize “グァ -> ガ”, “グィ -> ギ”, “グェ -> ゲ”, and “グォ -> ゴ” with this option.
Here is an example of the unify_katakana_gu_small_sounds option.
table_create --name Countries --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Countries --name name --type ShortText
load --table Countries
[
{"_key":"JP","name":"日本"},
{"_key":"GT","name":"グァテマラ共和国"},
]
table_create \
  --name idx_contry_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_gu_small_sounds", true)'
column_create --table idx_contry_name --name contry_name --flags COLUMN_INDEX|WITH_POSITION --type Countries --source name
select --table Countries --query name:@ガテマラ共和国
# [ # [0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 2, # "GT", # "グァテマラ共和国" # ] # ] # ] # ]
unify_katakana_di_sound
We can normalize “ヂ -> ジ” with this option.
Here is an example of the unify_katakana_di_sound option.
table_create --name Foods --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Foods --name name --type ShortText
load --table Foods
[
{"_key":"1","name":"チジミ"},
{"_key":"2","name":"パジョン"},
]
table_create \
  --name idx_food_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_di_sound", true)'
column_create --table idx_food_name --name food_name --flags COLUMN_INDEX|WITH_POSITION --type Foods --source name
select --table Foods --query name:@チヂミ
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "1", # "チジミ" # ] # ] # ] # ]
unify_katakana_wo_sound
We can normalize “ヲ -> オ” with this option.
Here is an example of the unify_katakana_wo_sound option.
table_create --name Foods --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Foods --name name --type ShortText
load --table Foods
[
{"_key":"1","name":"アヲハタ"},
{"_key":"2","name":"ヴェルデ"},
{"_key":"3","name":"ランプ"},
]
table_create \
  --name idx_food_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_wo_sound", true)'
column_create --table idx_food_name --name food_name --flags COLUMN_INDEX|WITH_POSITION --type Foods --source name
select --table Foods --query name:@アオハタ
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "1", # "アヲハタ" # ] # ] # ] # ]
unify_katakana_zu_small_sounds
We can normalize “ズァ -> ザ”, “ズィ -> ジ”, “ズェ -> ゼ”, and “ズォ -> ゾ” with this option.
Here is an example of the unify_katakana_zu_small_sounds option.
table_create --name Cities --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Cities --name name --type ShortText
load --table Cities
[
{"_key":"1","name":"ガージヤーバード"},
{"_key":"2","name":"デリー"},
]
table_create \
  --name idx_city_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_zu_small_sounds", true)'
column_create --table idx_city_name --name city_name --flags COLUMN_INDEX|WITH_POSITION --type Cities --source name
select --table Cities --query name:@ガーズィヤーバード
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "1", # "ガージヤーバード" # ] # ] # ] # ]
unify_katakana_du_sound
We can normalize “ヅ -> ズ” with this option.
Here is an example of the unify_katakana_du_sound option.
table_create --name Plants --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Plants --name name --type ShortText
load --table Plants
[
{"_key":"1","name":"ハスノカヅラ"},
{"_key":"2","name":"オオツヅラフジ"},
{"_key":"3","name":"アオツヅラフジ"},
]
table_create \
  --name idx_plant_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_du_sound", true)'
column_create --table idx_plant_name --name plant_name --flags COLUMN_INDEX|WITH_POSITION --type Plants --source name
select --table Plants --query name:@ツズラ
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 2 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 2, # "2", # "オオツヅラフジ" # ], # [ # 3, # "3", # "アオツヅラフジ" # ] # ] # ] # ]
unify_katakana_trailing_o
We can normalize following characters with this option.
“オオ -> オウ”
“コオ -> コウ”
“ソオ -> ソウ”
“トオ -> トウ”
“ノオ -> ノウ”
“ホオ -> ホウ”
“モオ -> モウ”
“ヨオ -> ヨウ”
“ロオ -> ロウ”
“ゴオ -> ゴウ”
“ゾオ -> ゾウ”
“ドオ -> ドウ”
“ボオ -> ボウ”
“ポオ -> ポウ”
Here is an example of the unify_katakana_trailing_o option.
table_create --name Sharks --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Sharks --name name --type ShortText
load --table Sharks
[
{"_key":"1","name":"ホオジロザメ"},
{"_key":"2","name":"ジンベイザメ"},
]
table_create \
  --name idx_shark_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_trailing_o", true)'
column_create --table idx_shark_name --name shark_name --flags COLUMN_INDEX|WITH_POSITION --type Sharks --source name
select --table Sharks --query name:@ホウジロザメ
# [ # [ # 0, # 0, # 0 # ], # [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "1", # "ホオジロザメ" # ] # ] # ] # ]
unify_katakana_du_small_sounds
We can normalize “ヅァ -> ザ”, “ヅィ -> ジ”, “ヅェ -> ゼ”, and “ヅォ -> ゾ” with this option.
Here is an example of the unify_katakana_du_small_sounds option.
table_create --name Airports --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Airports --name name --type ShortText
load --table Airports
[
{"_key":"HER","name":"イラクリオ・ニコスカザンヅァキス国際空港"},
{"_key":"ATH","name":"アテネ国際空港"},
]
table_create \
  --name idx_airport_name \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_katakana_du_small_sounds", true)'
column_create --table idx_airport_name --name airport_name --flags COLUMN_INDEX|WITH_POSITION --type Airports --source name
select --table Airports --query name:@ニコスカザンザキス
# [ # [ # [ # 1 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "_key", # "ShortText" # ], # [ # "name", # "ShortText" # ] # ], # [ # 1, # "HER", # "イラクリオ・ニコスカザンヅァキス国際空港" # ] # ] # ]
[Oracle Linux] Added newly support for Oracle Linux 8 and 9.
Thanks#
Atsushi Shinoda
i10a
naoa
shinonon
Zhanzhao (Deo) Liang
David CARLIER
Highlight and Summary of changes from 12.0.0 to 12.1.2#
Highlight#
[Normalizers] Added NormalizerHTML. (Experimental)
NormalizerHTML is a normalizer for HTML.
Currently NormalizerHTML supports removing tags like <span> or </span> and expanding character references like &amp; or &#38;.
Here are sample queries for NormalizerHTML.
normalize NormalizerHTML "<span> Groonga &amp; Mroonga &#38; Rroonga </span>"
# [[0,1666923364.883798,0.0005481243133544922],{"normalized":" Groonga & Mroonga & Rroonga ","types":[],"checks":[]}]
In this sample, <span> and </span> are removed, and &amp; and &#38; are expanded to &.
We can specify whether to remove the tags with the remove_tag option. (The default value of the remove_tag option is true.)
normalize 'NormalizerHTML("remove_tag", false)' "<span> Groonga &amp; Mroonga &#38; Rroonga </span>"
# [[0,1666924069.278549,0.0001978874206542969],{"normalized":"<span> Groonga & Mroonga & Rroonga </span>","types":[],"checks":[]}]
In this sample, <span> and </span> are not removed.
We can specify whether to expand the character references with the expand_character_reference option. (The default value of the expand_character_reference option is true.)
normalize 'NormalizerHTML("expand_character_reference", false)' "<span> Groonga &amp; Mroonga &#38; Rroonga </span>"
# [[0,1666924357.099782,0.0002346038818359375],{"normalized":" Groonga &amp; Mroonga &#38; Rroonga ","types":[],"checks":[]}]
In this sample, &amp; and &#38; are not expanded.
[snippet],[snippet_html] Added support for text vector as input. [groonga-dev,04956][Reported by shinonon]
For example, we can extract snippets of the target text around search keywords against a vector in JSON data as below.
table_create Entries TABLE_NO_KEY column_create Entries title COLUMN_SCALAR ShortText column_create Entries contents COLUMN_VECTOR ShortText table_create Tokens TABLE_PAT_KEY ShortText --default_tokenizer TokenNgram --normalizer NormalizerNFKC130 column_create Tokens entries_title COLUMN_INDEX|WITH_POSITION Entries title column_create Tokens entries_contents COLUMN_INDEX|WITH_SECTION|WITH_POSITION Entries contents load --table Entries [ { "title": "Groonga and MySQL", "contents": [ "Groonga is a full text search engine", "MySQL is a RDBMS", "Mroonga is a MySQL storage engine based on Groonga" ] } ] select Entries\ --output_columns 'snippet_html(contents), contents'\ --match_columns 'title'\ --query Groonga # [ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 1 # ], # [ # [ # "snippet_html", # null # ], # [ # "contents", # "ShortText" # ] # ], # [ # [ # "<span class=\"keyword\">Groonga</span> is a full text search engine", # "Mroonga is a MySQL storage engine based on <span class=\"keyword\">Groonga</span>" # ], # [ # "Groonga is a full text search engine", # "MySQL is a RDBMS", # "Mroonga is a MySQL storage engine based on Groonga" # ] # ] # ] # ] # ]
Until now, if we specified snippet* like --output_columns 'snippet_html(contents[1])', we could extract snippets of the target text around search keywords against the vector as below. However, we didn’t know which elements we should output, because we didn’t know which element was hit by the search.
select Entries\
  --output_columns 'snippet_html(contents[0]), contents'\
  --match_columns 'title'\
  --query Groonga
# [ # [ # 0, # 0.0, # 0.0 # ], # [ # [ # [ # 1 # ], # [ # [ # "snippet_html", # null # ], # [ # "contents", # "ShortText" # ] # ], # [ # [ # "<span class=\"keyword\">Groonga</span> is a full text search engine" # ], # [ # "Groonga is a full text search engine", # "MySQL is a RDBMS", # "Mroonga is a MySQL storage engine based on Groonga" # ] # ] # ] # ] # ]
[query_expand] Added support for synonym groups.
Until now, we had to define a keyword and its synonyms one by one as below when we use synonym search.
table_create Thesaurus TABLE_PAT_KEY ShortText --normalizer NormalizerAuto # [[0, 1337566253.89858, 0.000355720520019531], true] column_create Thesaurus synonym COLUMN_VECTOR ShortText # [[0, 1337566253.89858, 0.000355720520019531], true] load --table Thesaurus [ {"_key": "mroonga", "synonym": ["mroonga", "tritonn", "groonga mysql"]}, {"_key": "groonga", "synonym": ["groonga", "senna"]} ]
In the above case, if we search mroonga, Groonga searches mroonga OR tritonn OR "groonga mysql" as we intended. However, if we search tritonn, Groonga searches only tritonn. If we want to search tritonn OR mroonga OR "groonga mysql" even if we search tritonn, we needed to add a definition as below.
load --table Thesaurus
[
{"_key": "tritonn", "synonym": ["tritonn", "mroonga", "groonga mysql"]},
]
In many cases, if we expand mroonga to mroonga OR tritonn OR "groonga mysql", we also want to expand tritonn and "groonga mysql" to mroonga OR tritonn OR "groonga mysql". However, until now, we needed additional definitions in such a case. Therefore, if there are many target keywords for synonyms, defining synonyms is troublesome, because we need to define many similar definitions. In addition, removing synonyms is troublesome, because we need to remove many records.
Since this release, we can make a group by deciding on a representative synonym record. For example, all of the following keywords are in the “mroonga” group.
load --table Synonyms
[
{"_key": "mroonga", "representative": "mroonga"}
]
load --table Synonyms
[
{"_key": "tritonn", "representative": "mroonga"},
{"_key": "groonga mysql", "representative": "mroonga"}
]
In this case, mroonga is expanded to mroonga OR tritonn OR "groonga mysql". In addition, tritonn and "groonga mysql" are also expanded to mroonga OR tritonn OR "groonga mysql".
When we want to remove synonyms, we just remove the target record. For example, if we want to remove "groonga mysql" from the synonyms, we just remove {"_key": "groonga mysql", "representative": "mroonga"}.
[index_column_have_source_record] Added a new function index_column_have_source_record().
We can confirm whether a token that exists in the index is included in any of the records that are registered in Groonga or not.
Groonga does not remove a token even if the token is no longer used in any record after records are updated. Therefore, for example, when we use the autocomplete feature, Groonga may return a token that is not included in any record as a candidate for search words. By using this function, we can avoid returning such needless tokens, because this function can detect a token that is not included in any of the records.
[select] Added new arguments drilldown_max_n_target_records and drilldown[${LABEL}].max_n_target_records.
We can specify the maximum number of records of the drilldown target table (the filtered result) to be used for drilldown. If the number of records in the filtered result is larger than the specified value, some records in the filtered result aren’t used for drilldown. The default value of these arguments is -1. If these arguments are set to -1, Groonga uses all records for drilldown.
This argument is useful when the filtered result may be very large, because a drilldown against a large filtered result may be slow. We can limit the maximum number of records to be used for drilldown with this feature.
Here is an example limiting the maximum number of records to be used for drilldown. The last 2 records, {"_id": 4, "tag": "Senna"} and {"_id": 5, "tag": "Senna"}, aren’t used.
table_create Entries TABLE_HASH_KEY ShortText
column_create Entries content COLUMN_SCALAR Text
column_create Entries n_likes COLUMN_SCALAR UInt32
column_create Entries tag COLUMN_SCALAR ShortText
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
column_create Terms entries_key_index COLUMN_INDEX|WITH_POSITION Entries _key
column_create Terms entries_content_index COLUMN_INDEX|WITH_POSITION Entries content
load --table Entries
[
{"_key": "The first post!", "content": "Welcome! This is my first post!", "n_likes": 5, "tag": "Hello"},
{"_key": "Groonga", "content": "I started to use Groonga. It's very fast!", "n_likes": 10, "tag": "Groonga"},
{"_key": "Mroonga", "content": "I also started to use Mroonga. It's also very fast! Really fast!", "n_likes": 15, "tag": "Groonga"},
{"_key": "Good-bye Senna", "content": "I migrated all Senna system!", "n_likes": 3, "tag": "Senna"},
{"_key": "Good-bye Tritonn", "content": "I also migrated all Tritonn system!", "n_likes": 3, "tag": "Senna"}
]
select Entries \
  --limit -1 \
  --output_columns _id,tag \
  --drilldown tag \
  --drilldown_max_n_target_records 3
# [ # [ # 0, # 1337566253.89858, # 0.000355720520019531 # ], # [ # [ # [ # 5 # ], # [ # [ # "_id", # "UInt32" # ], # [ # "tag", # "ShortText" # ] # ], # [ # 1, # "Hello" # ], # [ # 2, # "Groonga" # ], # [ # 3, # "Groonga" # ], # [ # 4, # "Senna" # ], # [ # 5, # "Senna" # ] # ], # [ # [ # 2 # ], # [ # [ # "_key", # "ShortText" # ], # [ # "_nsubrecs", # "Int32" # ] # ], # [ # "Hello", # 1 # ], # [ # "Groonga", # 2 # ] # ] # ] # ]
Summary#
Improvements#
[httpd] Updated bundled nginx to 1.23.3.
[select][POWER_SET] Vector’s power set is now able to aggregate with the drilldowns.
[select] A specific element of a vector column can now be a search target.
[load] Added support for the YYYY-MM-DD time format.
[load] Added support for slow log output of load.
[API] Added a new API grn_is_reference_count_enable().
[status] Added new items: back_trace and reference_count.
[AlmaLinux] Added support for AlmaLinux 9.
[escalate] Added a document for the escalate() function.
[Normalizers] Added NormalizerHTML. (Experimental)
[httpd] Updated bundled nginx to 1.23.2.
Suppressed logging a lot of same messages when no memory is available.
Changed the specification of the escalate() function (Experimental) to make it easier to use.
[Others: Build with CMake] Added a document about how to build Groonga with CMake.
[Others] Added descriptions about how to enable/disable Apache Arrow support when building with GNU Autotools.
[select] Added a document about drilldowns[${LABEL}].table.
[I18N] Updated the translation procedure.
Added a new function escalate(). (experimental)
[httpd] Updated bundled nginx to 1.23.1.
[select] Added a document for the --n_workers option.
Added new Munin plugins for groonga-delta.
[column_copy] Added support for weight vector.
[Ubuntu] Dropped support for Ubuntu 21.10 (Impish Indri).
[Debian GNU/Linux] Dropped Debian 10 (buster) support.
[select] Improved a little bit of performance for prefix search by search escalation.
[select] Added support for specifying a reference vector column with weight in drilldowns[LABEL]._key.
[select] Added support for doing drilldown with a reference vector with weight even if we use query, filter, or post_filter.
[Ubuntu] Added support for Ubuntu 22.04 (Jammy Jellyfish).
We don’t provide groonga-benchmark anymore.
[status] Added a new item memory_map_size.
[logical_count] Improved memory usage while logical_count is executed.
[dump] Added support for MISSING_IGNORE/MISSING_NIL.
[snippet],[snippet_html] Added support for text vector as input.
[vector_join] Added a new function vector_join().
[Indexing] Ignore too large tokens, as online index construction does.
[logical_range_filter] Added support for reducing reference immediately after processing a shard.
We increased the stability of the feature of recovering on crashes.
Improved performance for mmap if anonymous mmap available.
[Indexing] Added support for the static index construction against the following types of columns.
[column_create] Added new flags MISSING_* and INVALID_*.
[dump][column_list] Added support for MISSING_* and INVALID_* flags.
[schema] Added support for MISSING_* and INVALID_* flags.
We provided a package for Amazon Linux 2.
[Windows] Dropped support for building with Visual Studio 2017.
[query_expand] Added a support for synonym group.
[query_expand] Added a support for text vector and index.
Added support for disabling a backtrace by the environment variable.
[select] Improved performance for --slices.
[Windows] Added support for Visual Studio 2022.
[select] Added support for specifying max intervals for each element in near search.
[Groonga HTTP server] We can now use groonga-server-http even with the Groonga RPM packages.
[sub_filter] Added a new option pre_filter_threshold.
[index_column_have_source_record] Added a new function index_column_have_source_record().
[NormalizerNFKC130] Added a new option strip.
[select] Added new arguments drilldown_max_n_target_records and drilldown[${LABEL}].max_n_target_records.
[httpd] Updated bundled nginx to 1.21.6.
Fixes#
[select] Fixed a bug displaying a wrong label in drilldown results when command_version is 3.
[NormalizerTable] Fixed a bug that Groonga crashes with a specific definition setting in NormalizerTable.
[select][Vector column] Fixed a bug displaying an integer in the results when a weight vector column specifies WEIGHT_FLOAT32.
[select] Fixed a bug that Groonga could crash or return incorrect results when specifying n_workers.
Fixed a bug that Groonga could return incorrect results when we use NormalizerTable and it contains a non-idempotent (results can be changed when executed repeatedly) definition.
Fixed a bug that Groonga’s response may be slow when we execute request_cancel while executing an OR search.
Fixed a bug that Groonga may crash when we execute drilldown in parallel with the n_workers option.
[select] Fixed a bug that a syntax error occurred when we specify a very long expression in --filter.
Fixed a bug that Groonga’s response may be slow when we execute request_cancel while executing a search.
Fixed a bug that a string list can’t be cast to an int32 vector.
Fixed a bug that Groonga Munin Plugins do not work on AlmaLinux 8 and CentOS 7.
Fixed a bug that we may be not able to add a key to a table of patricia trie.