BloGroonga

2023-10-31

Groonga 13.0.9 has been released

Groonga 13.0.9 has been released!

How to install: Install

Changes

Here are important changes in this release:

Improvements

  • [select] Changed the default value of --fuzzy_max_expansions from 0 to 10.

    --fuzzy_max_expansions limits the number of words with a close edit distance that are used in the search. This argument helps to balance the number of hits against search performance. When --fuzzy_max_expansions is 0, the search uses all words in the vocabulary list whose edit distance is within --fuzzy_max_distance.

    Setting --fuzzy_max_expansions to 0 (unlimited) may slow down a search. Therefore, the default value of --fuzzy_max_expansions is 10 as of this release.
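
To make the effect of the limit concrete, here is a minimal Python sketch. It is not Groonga's implementation: it assumes a plain Levenshtein distance and assumes that the closest words are kept first, and the names `levenshtein` and `expand_fuzzy` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def expand_fuzzy(term, vocabulary, max_distance, max_expansions):
    """Return the vocabulary words within max_distance of term.

    max_expansions == 0 means "no limit". Otherwise only the first
    max_expansions words are kept. Ordering by distance is an
    assumption of this sketch, not a documented Groonga behavior.
    """
    scored = sorted((levenshtein(term, word), word) for word in vocabulary)
    words = [word for distance, word in scored if distance <= max_distance]
    return words if max_expansions == 0 else words[:max_expansions]


vocabulary = ["that", "this", "they", "pen", "pens", "is", "a", "are"]
print(expand_fuzzy("thas", vocabulary, 1, 0))  # every word within distance 1
print(expand_fuzzy("thas", vocabulary, 1, 1))  # at most one word
```

With a large vocabulary, the unlimited case has to search with every close word, which is where the slowdown described above comes from.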

  • [select] Added a new argument --fuzzy_with_transposition (experimental).

    With this argument, we can choose whether a transposition (two adjacent characters swapped) counts as an edit distance of 1 or 2.

    If this argument is yes, the edit distance of a transposition is 1. Otherwise, it is 2.
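
The difference between the two settings can be sketched in Python as follows. This is an illustration under assumptions, not Groonga's code: it uses the optimal string alignment variant of edit distance, and the function name `edit_distance` and its `with_transposition` parameter are hypothetical stand-ins for the behavior described above.

```python
def edit_distance(a: str, b: str, with_transposition: bool) -> int:
    """Edit distance where an adjacent swap costs 1 only if enabled."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (with_transposition and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # Two adjacent characters are swapped: count it as one edit.
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("teh", "the", with_transposition=True))   # 1
print(edit_distance("teh", "the", with_transposition=False))  # 2
```

Without the transposition rule, the swap is counted as two substitutions, which is why the distance becomes 2.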

  • [select] Added a new argument --fuzzy_tokenize.

    When --fuzzy_tokenize is yes, Groonga uses the tokenizer specified in --default_tokenizer for typo-tolerant search.

    The default value of --fuzzy_tokenize is no. --fuzzy_tokenize is useful in the following case:

    • The search targets contain only Japanese data.
    • TokenMecab is specified in --default_tokenizer.

  • [load] Added support for --ifexists even when apache-arrow is specified as input_type.

  • [normalizers] Added a new option remove_blank_force to NormalizerNFKC*.

    When remove_blank_force is false, the normalizer doesn't ignore spaces, as shown below.

    table_create Entries TABLE_NO_KEY
    column_create Entries body COLUMN_SCALAR ShortText
    
    load --table Entries
    [
    {"body": "Groonga はとても速い"},
    {"body": "Groongaはとても速い"}
    ]
    
    select Entries --output_columns \
      'highlight(body, \
        "gaはとても", "<keyword>", "</keyword>", \
        {"normalizers": "NormalizerNFKC150(\\"remove_blank_force\\", false)"} \
      )'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            2
          ],
          [
            [
              "highlight",
              null
            ]
          ],
          [
            "Groonga はとても速い"
          ],
          [
            "Groon<keyword>gaはとても</keyword>速い"
          ]
        ]
      ]
    ]
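
The idea behind the result above can be mimicked with a small Python sketch. It only illustrates "match with or without skipping spaces". It is not how Groonga's normalizers work internally, and the function `highlight` with its `ignore_blank` parameter is a hypothetical stand-in.

```python
def highlight(text, keyword, open_tag, close_tag, ignore_blank):
    """Wrap the first occurrence of keyword in text with the given tags.

    When ignore_blank is True, spaces in text are skipped while matching
    but kept in the output.
    """
    if not keyword:
        return text
    if not ignore_blank:
        start = text.find(keyword)
        if start < 0:
            return text
        end = start + len(keyword)
    else:
        # Positions of the non-space characters in the original text.
        positions = [i for i, ch in enumerate(text) if ch != " "]
        compact = "".join(text[i] for i in positions)
        hit = compact.find(keyword)
        if hit < 0:
            return text
        start = positions[hit]
        end = positions[hit + len(keyword) - 1] + 1
    return text[:start] + open_tag + text[start:end] + close_tag + text[end:]


for body in ["Groonga はとても速い", "Groongaはとても速い"]:
    print(highlight(body, "gaはとても", "<keyword>", "</keyword>", ignore_blank=False))
```

With `ignore_blank=False`, only the second text is tagged, which matches the shape of the output shown above.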
    
  • [select] Added a new argument --output_trace_log (experimental).

    If we specify --output_trace_log yes together with --command_version 3, Groonga outputs an additional trace log, as shown below.

    table_create Memos TABLE_NO_KEY
    column_create Memos content COLUMN_SCALAR ShortText
    
    table_create Lexicon TABLE_PAT_KEY ShortText --default_tokenizer TokenNgram --normalizer NormalizerNFKC150
    column_create Lexicon memos_content COLUMN_INDEX|WITH_POSITION Memos content
    
    load --table Memos
    [
    {"content": "This is a pen"},
    {"content": "That is a pen"},
    {"content": "They are pens"}
    ]
    
    select Memos \
      --match_columns content \
      --query "Thas OR ere" \
      --fuzzy_max_distance 1 \
      --output_columns *,_score \
      --command_version 3 \
      --output_trace_log yes \
      --output_type apache-arrow
    
    return_code: int32
    start_time: timestamp[ns]
    elapsed_time: double
    error_message: string
    error_file: string
    error_line: uint32
    error_function: string
    error_input_file: string
    error_input_line: int32
    error_input_command: string
    -- metadata --
    GROONGA:data_type: metadata
    	return_code	               start_time	elapsed_time	error_message	error_file	error_line	error_function	error_input_file	error_input_line	error_input_command
    0	          0	1970-01-01T09:00:00+09:00	    0.000000	       (null)	    (null)	    (null)	        (null)	          (null)	          (null)	             (null)
    ========================================
    depth: uint16
    sequence: uint16
    name: string
    value: dense_union<0: uint32=0, 1: string=1>
    elapsed_time: uint64
    -- metadata --
    GROONGA:data_type: trace_log
    	depth	sequence	name	value	elapsed_time
     0	    1	       0	ii.select.input	Thas 	           0
     1	    2	       0	ii.select.exact.n_hits	    0	           1
     2	    2	       0	ii.select.fuzzy.input	Thas 	           2
     3	    2	       1	ii.select.fuzzy.input.actual	that 	           3
     4	    2	       2	ii.select.fuzzy.input.actual	this 	           4
     5	    2	       3	ii.select.fuzzy.n_hits	    2	           5
     6	    1	       1	ii.select.n_hits	    2	           6
     7	    1	       0	ii.select.input	ere  	           7
     8	    2	       0	ii.select.exact.n_hits	    2	           8
     9	    1	       1	ii.select.n_hits	    2	           9
    ========================================
    content: string
    _score: double
    -- metadata --
    GROONGA:n_hits: 2
    	content	    _score
    0	This is a pen	  1.000000
    1	That is a pen	  1.000000
    

    --output_trace_log is valid only in command version 3.

    This is useful in the following cases:

    • Detecting the actual words used by a fuzzy query.
    • Measuring elapsed time without checking the query log.
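
Both use cases can be sketched in Python against the trace log rows shown above. The rows are copied by hand from that output (with the padding around values trimmed). The helper names `actual_fuzzy_words` and `elapsed_per_input` are hypothetical, and the unit of elapsed_time is not stated in the output, so the numbers are treated as opaque ticks.

```python
# (depth, sequence, name, value, elapsed_time), copied from the trace log above.
TRACE_LOG = [
    (1, 0, "ii.select.input", "Thas", 0),
    (2, 0, "ii.select.exact.n_hits", 0, 1),
    (2, 0, "ii.select.fuzzy.input", "Thas", 2),
    (2, 1, "ii.select.fuzzy.input.actual", "that", 3),
    (2, 2, "ii.select.fuzzy.input.actual", "this", 4),
    (2, 3, "ii.select.fuzzy.n_hits", 2, 5),
    (1, 1, "ii.select.n_hits", 2, 6),
    (1, 0, "ii.select.input", "ere", 7),
    (2, 0, "ii.select.exact.n_hits", 2, 8),
    (1, 1, "ii.select.n_hits", 2, 9),
]


def actual_fuzzy_words(rows):
    """Map each input term to the words the fuzzy search actually used."""
    result = {}
    current = None
    for _depth, _sequence, name, value, _elapsed in rows:
        if name == "ii.select.input":
            current = value
            result[current] = []
        elif name == "ii.select.fuzzy.input.actual" and current is not None:
            result[current].append(value)
    return result


def elapsed_per_input(rows):
    """Elapsed ticks between each ii.select.input and its ii.select.n_hits."""
    spans = []
    start = None
    for _depth, _sequence, name, value, elapsed in rows:
        if name == "ii.select.input":
            start = (value, elapsed)
        elif name == "ii.select.n_hits" and start is not None:
            spans.append((start[0], elapsed - start[1]))
            start = None
    return spans


print(actual_fuzzy_words(TRACE_LOG))
print(elapsed_per_input(TRACE_LOG))
```

In this log, "Thas" had no exact hit and was expanded to "that" and "this", while "ere" matched exactly and needed no fuzzy expansion.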

  • [snippet] Added support for the normalizers option.

    We can use a normalizer with options. For example, when we don't want to ignore spaces in the snippet() function, we use this option as shown below.

    table_create Entries TABLE_NO_KEY
    column_create Entries content COLUMN_SCALAR ShortText
    
    load --table Entries
    [
    {"content": "Groonga and MySQL"},
    {"content": "Groonga and My SQL"}
    ]
    
    select Entries \
      --output_columns \
        'snippet(content, "MySQL", "<keyword>", "</keyword>", {"normalizers": "NormalizerNFKC150(\\"remove_blank_force\\", false)"})'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            2
          ],
          [
            [
              "snippet",
              null
            ]
          ],
          [
            [
              "Groonga and <keyword>MySQL</keyword>"
            ]
          ],
          [
            null
          ]
        ]
      ]
    ]
    

Fixes

  • Fixed a bug in Time OPERATOR Float{,32} comparison. GH-1624 [Reported by yssrku]

    The sub-second (smaller than one second) part of a Float{,32} value wasn't used in the comparison. This happened only with Time OPERATOR Float{,32}.

    This happened in load --ifexists 'A OP B || C OP D', as shown below.

    table_create Reports TABLE_HASH_KEY ShortText
    column_create Reports content COLUMN_SCALAR Text
    column_create Reports modified_at COLUMN_SCALAR Time
    
    load --table Reports
    [
    {"_key": "a", "content": "", "modified_at": 1663989875.438}
    ]
    
    load \
      --table Reports \
      --ifexists 'content == "" && modified_at <= 1663989875.437'
    

    However, this didn't happen in select --filter.
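
To see why the sub-second part matters for the values in this example, consider the following Python sketch. It is an illustration only: how Groonga evaluated the comparison before the fix is not described here, so `le_whole_seconds` is a hypothetical model of "the fractional part is ignored", not a reproduction of the bug. All function names are made up for this sketch.

```python
def to_microseconds(seconds: float) -> int:
    """Convert seconds (with a fractional part) to whole microseconds."""
    return round(seconds * 1_000_000)


def le_with_subseconds(time_seconds: float, value: float) -> bool:
    """time <= value, compared at microsecond resolution."""
    return to_microseconds(time_seconds) <= to_microseconds(value)


def le_whole_seconds(time_seconds: float, value: float) -> bool:
    """time <= value, with the sub-second part of both sides ignored.

    Illustration only: this is not claimed to be what Groonga did.
    """
    return int(time_seconds) <= int(value)


modified_at = 1663989875.438  # value stored in the Time column above
threshold = 1663989875.437    # Float literal used in --ifexists above

print(le_with_subseconds(modified_at, threshold))  # False
print(le_whole_seconds(modified_at, threshold))    # True
```

The two values differ only by one millisecond, so any comparison that drops the fractional part cannot tell them apart.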

  • Fixed a bug that alnum (a-zA-Z0-9) + blank might be detected.

    If the input is two characters long, such as ab, and the target text contains the same characters separated by blanks, such as a b, then a b was detected. However, it should not be detected in this case.

    For example, a i was detected when this bug occurred, as shown below.

    table_create Entries TABLE_NO_KEY
    column_create Entries body COLUMN_SCALAR ShortText
    
    load --table Entries
    [
    {"body": "Groonga is fast"}
    ]
    
    select Entries \
      --output_columns 'highlight(body, "ai", "<keyword>", "</keyword>")'
    
    [
      [
        0,0.0,0.0
      ],
      [
        [
          [
            1
          ],
          [
            [
              "highlight",
              null
            ]
          ],
          [
            "Groong<keyword>a i</keyword>s fast"
          ]
        ]
      ]
    ]
    

    However, the above result is unexpected. We don't want to detect a i in this case.

Conclusion

Please refer to the following news for more details: News Release 13.0.9

Let's search by Groonga!