Groonga 13.0.9 has been released

BloGroonga

2023-10-31

Groonga 13.0.9 has been released!

How to install: Install

Changes

Here are important changes in this release:

Improvements

[select] Changed the default value of --fuzzy_max_expansions from 0 to 10.

--fuzzy_max_expansions limits the number of words with a close edit distance that are used in the search process. This argument helps balance the number of hits against search performance. When --fuzzy_max_expansions is 0, the search uses all words in the vocabulary list whose edit distance is within --fuzzy_max_distance.

Setting --fuzzy_max_expansions to 0 (unlimited) may slow down a search. Therefore, the default value of --fuzzy_max_expansions is 10 as of this release.
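The trade-off can be pictured with a small Python sketch. This is not Groonga's implementation: the function names, the sample vocabulary, and the rule for choosing which candidates to keep (closest first, then alphabetical) are assumptions made only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def expand(term, vocabulary, max_distance, max_expansions):
    """Return vocabulary words close to `term`.

    Hypothetical helper: max_expansions == 0 means "no limit",
    mirroring the meaning of 0 described in the text.
    """
    scored = sorted((levenshtein(term, word), word) for word in vocabulary)
    candidates = [word for distance, word in scored if distance <= max_distance]
    if max_expansions == 0:
        return candidates
    return candidates[:max_expansions]


# Made-up vocabulary for illustration.
vocabulary = ["than", "that", "they", "this", "thus"]
unlimited = expand("thas", vocabulary, max_distance=1, max_expansions=0)
limited = expand("thas", vocabulary, max_distance=1, max_expansions=2)
print(unlimited)
print(limited)
```

With no limit, every word within the distance is searched. With a limit, fewer words are searched, so some hits may be dropped in exchange for less work.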

[select] Added a new argument, --fuzzy_with_transposition (experimental).

This argument lets us choose whether a transposition of two adjacent characters counts as an edit distance of 1 or 2.

If this parameter is yes, the edit distance of a transposition is 1. Otherwise, it is 2.
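The difference between the two settings can be sketched in Python as follows. This is an illustrative implementation of the standard edit distance with an optional adjacent-transposition step, not Groonga's code. The function name and the with_transposition parameter are assumptions chosen to mirror the argument name.

```python
def edit_distance(a: str, b: str, with_transposition: bool) -> int:
    """Edit distance where swapping two adjacent characters
    costs 1 if with_transposition is True, and 2 otherwise
    (because it is then counted as two substitutions)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (
                with_transposition
                and i > 1
                and j > 1
                and a[i - 1] == b[j - 2]
                and a[i - 2] == b[j - 1]
            ):
                # Adjacent characters swapped: count it as one edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(edit_distance("form", "from", with_transposition=True))
print(edit_distance("form", "from", with_transposition=False))
```

In this sketch, "form" and "from" are 1 edit apart when transposition is counted as a single edit, and 2 edits apart otherwise.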

[select] Added a new argument, --fuzzy_tokenize.

When --fuzzy_tokenize is yes, Groonga uses the tokenizer specified in --default_tokenizer for typo-tolerant search.

The default value of --fuzzy_tokenize is no. --fuzzy_tokenize is useful in the following case:
- The search targets contain only Japanese data.
- TokenMecab is specified in --default_tokenizer.

[load] Added support for --ifexists even when apache-arrow is specified as input_type.

[normalizers] Added a new option, remove_blank_force, to NormalizerNFKC*.

When remove_blank_force is false, the normalizer doesn't ignore spaces, as shown below.

table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText

load --table Entries
[
{"body": "Groonga はとても速い"},
{"body": "Groongaはとても速い"}
]

select Entries --output_columns \
  'highlight(body, \
    "gaはとても", "<keyword>", "</keyword>", \
    {"normalizers": "NormalizerNFKC150(\\"remove_blank_force\\", false)"} \
  )'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        2
      ],
      [
        [
          "highlight",
          null
        ]
      ],
      [
        "Groonga はとても速い"
      ],
      [
        "Groon<keyword>gaはとても</keyword>速い"
      ]
    ]
  ]
]

[select] Added a new argument, --output_trace_log (experimental).

If we specify yes for --output_trace_log together with --command_version 3, Groonga outputs an additional trace log as below.

table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText

table_create Lexicon TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenNgram \
  --normalizer NormalizerNFKC150
column_create Lexicon memos_content \
  COLUMN_INDEX|WITH_POSITION Memos content

load --table Memos
[
{"content": "This is a pen"},
{"content": "That is a pen"},
{"content": "They are pens"}
]

select Memos \
  --match_columns content \
  --query "Thas OR ere" \
  --fuzzy_max_distance 1 \
  --output_columns *,_score \
  --command_version 3 \
  --output_trace_log yes \
  --output_type apache-arrow

return_code: int32
start_time: timestamp[ns]
elapsed_time: double
error_message: string
error_file: string
error_line: uint32
error_function: string
error_input_file: string
error_input_line: int32
error_input_command: string
-- metadata --
GROONGA:data_type: metadata
	return_code	               start_time	elapsed_time	error_message	error_file	error_line	error_function	error_input_file	error_input_line	error_input_command
0	          0	1970-01-01T09:00:00+09:00	    0.000000	       (null)	    (null)	    (null)	        (null)	          (null)	          (null)	             (null)
========================================
depth: uint16
sequence: uint16
name: string
value: dense_union<0: uint32=0, 1: string=1>
elapsed_time: uint64
-- metadata --
GROONGA:data_type: trace_log
	depth	sequence	name	value	elapsed_time
 0	    1	       0	ii.select.input	Thas 	           0
 1	    2	       0	ii.select.exact.n_hits	    0	           1
 2	    2	       0	ii.select.fuzzy.input	Thas 	           2
 3	    2	       1	ii.select.fuzzy.input.actual	that 	           3
 4	    2	       2	ii.select.fuzzy.input.actual	this 	           4
 5	    2	       3	ii.select.fuzzy.n_hits	    2	           5
 6	    1	       1	ii.select.n_hits	    2	           6
 7	    1	       0	ii.select.input	ere  	           7
 8	    2	       0	ii.select.exact.n_hits	    2	           8
 9	    1	       1	ii.select.n_hits	    2	           9
========================================
content: string
_score: double
-- metadata --
GROONGA:n_hits: 2
	content	    _score
0	This is a pen	  1.000000
1	That is a pen	  1.000000

--output_trace_log is valid only in command version 3.

This will be useful for the following cases:

- Detect the actual words used by a fuzzy query.
- Measure elapsed time without checking the query log.
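As a rough illustration of the first case, the following Python sketch picks out the words actually used for each input from trace log rows. The rows are transcribed by hand from the sample output above. The tuple layout and the helper name are assumptions for this sketch, and a real client would read the Apache Arrow output with an Arrow library instead.

```python
# (depth, sequence, name, value, elapsed_time), copied from the sample output.
TRACE_LOG = [
    (1, 0, "ii.select.input", "Thas", 0),
    (2, 0, "ii.select.exact.n_hits", 0, 1),
    (2, 0, "ii.select.fuzzy.input", "Thas", 2),
    (2, 1, "ii.select.fuzzy.input.actual", "that", 3),
    (2, 2, "ii.select.fuzzy.input.actual", "this", 4),
    (2, 3, "ii.select.fuzzy.n_hits", 2, 5),
    (1, 1, "ii.select.n_hits", 2, 6),
    (1, 0, "ii.select.input", "ere", 7),
    (2, 0, "ii.select.exact.n_hits", 2, 8),
    (1, 1, "ii.select.n_hits", 2, 9),
]


def actual_fuzzy_words(rows):
    """Map each search input to the words the fuzzy search expanded it to.

    Hypothetical helper: it relies only on the entry names that
    appear in the sample trace log.
    """
    words = {}
    current_input = None
    for _depth, _sequence, name, value, _elapsed_time in rows:
        if name == "ii.select.input":
            current_input = value
            words.setdefault(current_input, [])
        elif name == "ii.select.fuzzy.input.actual":
            words.setdefault(current_input, []).append(value)
    return words


result = actual_fuzzy_words(TRACE_LOG)
print(result)
```

In the sample, "Thas" had no exact hits and was expanded to "that" and "this", while "ere" matched exactly and needed no fuzzy expansion.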

[snippet] Added support for the normalizers option.

We can now use a normalizer with options. For example, when we don't want to ignore spaces in the snippet() function, we use this option as below.

table_create Entries TABLE_NO_KEY
column_create Entries content COLUMN_SCALAR ShortText

load --table Entries
[
{"content": "Groonga and MySQL"},
{"content": "Groonga and My SQL"}
]

select Entries \
  --output_columns \
    'snippet(content, "MySQL", "<keyword>", "</keyword>", {"normalizers": "NormalizerNFKC150(\\"remove_blank_force\\", false)"})'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        2
      ],
      [
        [
          "snippet",
          null
        ]
      ],
      [
        [
          "Groonga and <keyword>MySQL</keyword>"
        ]
      ],
      [
        null
      ]
    ]
  ]
]

Fixes

Fixed a bug in Time OPERATOR Float{,32} comparison. GH-1624 [Reported by yssrku]

Microsecond (smaller than a second) information in Float{,32} wasn't used. This happened only with Time OPERATOR Float{,32}.

This happened in load --ifexists 'A OP B || C OP D' as below.

table_create Reports TABLE_HASH_KEY ShortText
column_create Reports content COLUMN_SCALAR Text
column_create Reports modified_at COLUMN_SCALAR Time

load --table Reports
[
{"_key": "a", "content": "", "modified_at": 1663989875.438}
]

load \
  --table Reports \
  --ifexists 'content == "" && modified_at <= 1663989875.437'

However, this didn't happen in select --filter.
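The effect of ignoring the sub-second part can be sketched in Python with the values from the example above. This only models the symptom and is not Groonga's code: the function names are hypothetical, and holding the Time value as integer microseconds is an assumption of this sketch.

```python
USEC_PER_SEC = 1_000_000


def le_dropping_fraction(time_usec: int, value: float) -> bool:
    """Compare at whole-second precision, ignoring the fraction of the float."""
    return time_usec // USEC_PER_SEC <= int(value)


def le_keeping_fraction(time_usec: int, value: float) -> bool:
    """Compare at microsecond precision, keeping the fraction of the float."""
    return time_usec <= round(value * USEC_PER_SEC)


# modified_at from the example: 1663989875.438, written here in microseconds.
stored_modified_at = 1_663_989_875_438_000
# Right-hand side of the --ifexists condition in the example.
threshold = 1663989875.437

buggy = le_dropping_fraction(stored_modified_at, threshold)
fixed = le_keeping_fraction(stored_modified_at, threshold)
print(buggy, fixed)
```

When the fraction is dropped, both sides become 1663989875 and the condition is wrongly true. When the fraction is kept, .438 is not less than or equal to .437, so the condition is false.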

Fixed a bug that alnum (a-zA-Z0-9) + blank may be detected.

If the input has 2 characters, such as ab, and the text contains them separated by blanks, such as a b, then a b was detected. However, it should not be detected in this case.

For example, a i is detected when this bug occurs, as below.

table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText

load --table Entries
[
{"body": "Groonga is fast"}
]

select Entries \
  --output_columns 'highlight(body, "ai", "<keyword>", "</keyword>")'

[
  [
    0,0.0,0.0
  ],
  [
    [
      [
        1
      ],
      [
        [
          "highlight",
          null
        ]
      ],
      [
        "Groong<keyword>a i</keyword>s fast"
      ]
    ]
  ]
]

However, the above result is unexpected. We don't want to detect a i in the above case.

Conclusion

Please refer to the following news for more details: News Release 13.0.9

Let's search by Groonga!
