News - 14 series#

Release 14.0.4 - 2024-05-29#

Fixes#

  • [query_parallel_or] Fixed a bug that the match_escalation_threshold or force_match_escalation options were ignored when using query_parallel_or().

    Before the fix, even when match_escalation_threshold was set to disable match escalation, the results still escalated when we use query_parallel_or(). This problem occurred only query_parallel_or(). query don’t occur this problem.

    Generally, we don’t disable match escalation. Because we want to get something search results. The number of hits is 0 is the unwelcome result of us. Therefore, this problem has no effect by many users. However, it has effect by user who use stop word as below.

    plugin_register token_filters/stop_word
    
    table_create Memos TABLE_NO_KEY
    column_create Memos content COLUMN_SCALAR ShortText
    
    table_create Terms TABLE_PAT_KEY ShortText \
      --default_tokenizer TokenBigram \
      --normalizer NormalizerAuto \
      --token_filters TokenFilterStopWord
    column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
    column_create Terms is_stop_word COLUMN_SCALAR Bool
    load --table Terms
    [
    {"_key": "and", "is_stop_word": true}
    ]
    
    load --table Memos
    [
    {"content": "Hello"},
    {"content": "Hello and Good-bye"},
    {"content": "Good-bye"}
    ]
    
    select Memos \
      --filter 'query_parallel_or(["content", "content", "content", "content"], \
                "and", \
                {"options": {"TokenFilterStopWord.enable": true}})' \
      --match_escalation_threshold -1 \
      --sort_keys -_score
    

    We don’t want to match a keyword that is registered as stopword. Therefore, we set -1 to match_escalation_threshold in the above example. We expect that Groonga doesn’t return records in the above example because of escalation disable and search keyword(and) is registered as stopword. However, If this problem occur, Groonga returns match record. Because if we use query_parallel_or(), match_escalation_threshold doesn’t work.

  • Fixed a bug that full-text search againt a reference column of a vector didn’t work.

    This problem has occured Groonga v14.0.0 or later. This problem has effect if we execute full-text search against a reference column of a vector.

    We expected that Groonga returns [1, "Linux MySQL"] and [2, "MySQL Groonga"] as below example. However, before the fix, Groonga always returned 0 hits as below because of we executed full-text search on a reference column of a vector.

    table_create bugs TABLE_PAT_KEY UInt32
    
    table_create tags TABLE_PAT_KEY ShortText --default_tokenizer TokenDelimit
    column_create tags name COLUMN_SCALAR ShortText
    
    column_create bugs tags COLUMN_VECTOR tags
    
    load --table bugs
    [
    ["_key", "tags"],
    [1, "Linux MySQL"],
    [2, "MySQL Groonga"],
    [3, "Mroonga"]
    ]
    
    column_create tags bugs_tags_index COLUMN_INDEX bugs tags
    
    select --table bugs --filter 'tags @ "MySQL"'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            0
          ],
          [
            [
              "_id",
              "UInt32"
            ],
            [
              "_key",
              "UInt32"
            ],
            [
              "tags",
              "tags"
            ]
          ]
        ]
      ]
    ]
    

Release 14.0.3 - 2024-05-09#

Improvements#

  • We optimized performance as below.

    • We optimized performance of OR and AND search when the number of hits were many.

    • We optimized performance of prefix search(@^).

    • We optimized performance of AND search when the number of records of A more than B in condition of A AND B.

    • We optimized performance of search when we used many dynamic columns.

  • [TokenNgram] Added new option ignore_blank.

    We can replace TokenBigramIgnoreBlank with TokenNgram("ignore_blank", true) as below.

    Here is example of use TokenBigram.

    tokenize TokenBigram "! ! !" NormalizerAuto
    [
      [
        0,
        1715155644.64263,
        0.001013517379760742
      ],
      [
        {
          "value": "!",
          "position": 0,
          "force_prefix": false,
          "force_prefix_search": false
        },
        {
          "value": "!",
          "position": 1,
          "force_prefix": false,
          "force_prefix_search": false
        },
        {
          "value": "!",
          "position": 2,
          "force_prefix": false,
          "force_prefix_search": false
        }
      ]
    ]
    

    Here is example of use TokenBigramIgnoreBlank.

    tokenize TokenBigramIgnoreBlank "! ! !" NormalizerAuto
    [
      [
        0,
        1715155680.323451,
        0.0009913444519042969
      ],
      [
        {
          "value": "!!!",
          "position": 0,
          "force_prefix": false,
          "force_prefix_search": false
        }
      ]
    ]
    

    Here is example of use TokenNgram("ignore_blank", true).

    tokenize 'TokenNgram("ignore_blank", true)' "! ! !" NormalizerAuto
    [
      [
        0,
        1715155762.340685,
        0.001041412353515625
      ],
      [
        {
          "value": "!!!",
          "position": 0,
          "force_prefix": false,
          "force_prefix_search": false
        }
      ]
    ]
    
  • [Ubuntu] Add support for Ubuntu 24.04 LTS (Noble Numbat).

Fixes#

  • [request_cancel] Fix a bug that Groonga may crash when we execute request_cancel command while we execute the other query.

  • Fixed the unexpected error when using --post_filter with --offset greater than the post-filtered result

    In the same situation, using --filter with --offset doesn’t raise the error. This inconsistency in behavior between --filter and --post-filter has now been resolved.

    table_create Users TABLE_PAT_KEY ShortText
    column_create Users age COLUMN_SCALAR UInt32
    load --table Users
    [
      ["_key", "age"],
      ["Alice", 21],
      ["Bob", 22],
      ["Chris", 23],
      ["Diana", 24],
      ["Emily", 25]
    ]
    select Users \
      --filter 'age >= 22' \
      --post_filter 'age <= 24' \
      --offset 3 \
      --sort_keys -age --output_pretty yes
    [
      [
        -68,
        1715224057.317582,
        0.001833438873291016,
        "[table][sort] grn_output_range_normalize failed",
        [
          [
            "grn_table_sort",
            "/home/horimoto/Work/free-software/groonga.tag/lib/sort.c",
            1052
          ]
        ]
      ]
    ]
    
  • Fixed a bug where incorrect search result could be returned when not all phrases within (...) matched using near phrase product.

    For example, there is no record which matched (2) condition using --query '*NPP1"(a) (2)"'. In this case, the expected behavior would be return no record. However, the actual behavior was equal to the query --query '*NPP and "(a)" as below. This means that despite no records matched (2), records like ax1 and axx1 were incorrectly returned.

    table_create Entries TABLE_NO_KEY
    column_create Entries content COLUMN_SCALAR Text
    
    table_create Terms TABLE_PAT_KEY ShortText   --default_tokenizer TokenNgram
    column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content
    load --table Entries
    [
    {"content": "ax1"},
    {"content": "axx1"}
    ]
    
    select Entries \
      --match_columns content \
      --query '*NPP1"(a) (2)"' \
      --output_columns 'content'
    [
      [
        0,
        1715224211.050228,
        0.001366376876831055
      ],
      [
        [
          [
            2
          ],
          [
            [
              "content",
              "Text"
            ]
          ],
          [
            "ax1"
          ],
          [
            "axx1"
          ]
        ]
      ]
    ]
    
  • Fixed a bug that rehash failed or data in a table broke when rehash occurred that the table with TABLE_HASH_KEY has 2^28 or more records.

  • Fixed a bug that highlight position slipped out of place in the following cases.

    • If full width space existed before highlight target characters as below.

      We expected that Groonga returned "Groonga <span class=\"keyword\">高</span>速!". However, Groonga returned "Groonga <span class=\"keyword\">高速</span>!" as below.

      table_create Entries TABLE_NO_KEY
      column_create Entries body COLUMN_SCALAR ShortText
      
      table_create Terms TABLE_PAT_KEY ShortText \
        --default_tokenizer 'TokenNgram("report_source_location", true)' \
        --normalizer 'NormalizerNFKC150("report_source_offset", true)'
      column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
      
      load --table Entries
      [
      {"body": "Groonga 高速!"}
      ]
      select Entries \
        --output_columns \
        --match_columns body \
        --query '高' \
        --output_columns 'highlight_html(body, Terms)'
      [
        [
          0,
          1715215640.979517,
          0.001608610153198242
        ],
        [
          [
            [
              1
            ],
            [
              [
                "highlight_html",
                null
              ]
            ],
            [
              "Groonga <span class=\"keyword\">高速</span>!"
            ]
          ]
        ]
      ]
      
    • If we used TokenNgram("loose_blank", true) and if highlight target characters included full width space as below.

      We expected that Groonga returned "<span class=\"keyword\">山田 太郎</span>". However, Groonga returned "<span class=\"keyword\">山田 太</span>" as below.

      table_create Entries TABLE_NO_KEY
      column_create Entries body COLUMN_SCALAR ShortText
      
      table_create Terms TABLE_PAT_KEY ShortText \
        --default_tokenizer 'TokenNgram("loose_blank", true, "report_source_location", true)' \
        --normalizer 'NormalizerNFKC150("report_source_offset", true)'
      column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
      
      load --table Entries
      [
      {"body": "山田 太郎"}
      ]
      
      select Entries --output_columns \
        --match_columns body --query '山田太郎' \
        --output_columns 'highlight_html(body, Terms)' --output_pretty yes
      [
        [
          0,
          1715220409.096246,
          0.0004854202270507812
        ],
        [
          [
            [
              1
            ],
            [
              [
                "highlight_html",
                null
              ]
            ],
            [
              "<span class=\"keyword\">山田 太</span>"
            ]
          ]
        ]
      ]
      
    • If white space existed in the front of highlight target characters as below.

      We expected that Groonga returned " <span class=\"keyword\">山</span>田太郎". However, Groonga returned " <span class=\"keyword\">山</span>" as below.

      table_create Entries TABLE_NO_KEY
      column_create Entries body COLUMN_SCALAR ShortText
      
      table_create Terms TABLE_PAT_KEY ShortText \
        --default_tokenizer 'TokenNgram("report_source_location", true)' \
        --normalizer 'NormalizerNFKC150("report_source_offset", true)'
      column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
      
      load --table Entries
      [
      {"body": " 山田太郎"}
      ]
      
      select Entries \
        --output_columns \
        --match_columns body \
        --query '山' \
        --output_columns 'highlight_html(body, Terms)' --output_pretty yes
      [
        [
          0,
          1715221627.002193,
          0.001977920532226562
        ],
        [
          [
            [
              1
            ],
            [
              [
                "highlight_html",
                null
              ]
            ],
            [
              " <span class=\"keyword\">山</span>"
            ]
          ]
        ]
      ]
      
    • If the second character of highlight target was full width space as below.

      We expected that Groonga returned "<span class=\"keyword\">山 田</span>太郎". However, Groonga returned "<span class=\"keyword\">山 田太</span>郎" as below.

      table_create Entries TABLE_NO_KEY
      column_create Entries body COLUMN_SCALAR ShortText
      
      table_create Terms TABLE_PAT_KEY ShortText \
        --default_tokenizer 'TokenNgram("report_source_location", true)' \
        --normalizer 'NormalizerNFKC150("report_source_offset", true)'
      column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
      
      load --table Entries
      [
      {"body": "山 田太郎"}
      ]
      
      select Entries \
        --output_columns \
        --match_columns body \
        --query '山 田' \
        --output_columns 'highlight_html(body, Terms)'
      [
        [
          0,
          1715222501.496007,
          0.0005536079406738281
        ],
        [
          [
            [
              0
            ],
            [
              [
                "highlight_html",
                "<span class=\"keyword\">山 田太</span>郎"
              ]
            ]
          ]
        ]
      ]
      

Release 14.0.2 - 2024-03-29#

Improvements#

  • Reduced a log level of a log when Groonga setting normalizers/tokenizer/token_filters against temporary table.

    For example, the target log of this modification is the following log.

    DDL:1234567890:set_normalizers NormalizerAuto
    

    PGroonga sets normalizers against temporary table on start. So, this log becomes noise. Because this log become output when PGroonga start because of PGroonga’s default log level is notice.

    Therefore, we reduce log level to debug for the log since this release. Thus, this log does not output when PGroonga start in default.

Release 14.0.1 - 2024-03-14#

Improvements#

  • [load] Stopped reporting an error when we load key that becomes an empty key by normalization.

    "-" becomes "" with NormalizerNFKC150("remove_symbol", true). So the following case reports a “empty key” error.

    table_create Values TABLE_HASH_KEY ShortText \
      --normalizers 'NormalizerNFKC150("remove_symbol", true)'
    table_create Data TABLE_NO_KEY
    column_create Data value COLUMN_SCALAR Values
    load --table Data
    [
    {"value": "-"}
    ]
    

    However, if we many load in such data, many error log are generated. Because Groonga output many “empty key” error because of Groonga can’t register empty string to index.

    No problem even if empty string can’t register to index in such case. Because we don’t match anything even if we search by empty string. So, we stop reporting an “empty key” error in such case.

Fixes#

  • Fixed a crash bug if a request is canceled between or range search.

    This bug doesn’t necessarily occur. This bug occur when we cancel a request in the specific timing. This bug occur easily when search time is long such as sequential search.

  • Fixed a bug that highlight_html may return invalid result when the following conditions are met.

    • We use multiple normalizers such as NormalizerTable and NormalizerNFKC150.

    • We highlight string include whitespace.

    For example, this bug occur such as the following case.

    table_create NormalizationsIndex TABLE_PAT_KEY ShortText --normalizer NormalizerAuto
    
    table_create Normalizations TABLE_HASH_KEY UInt64
    column_create Normalizations normalized COLUMN_SCALAR LongText
    column_create Normalizations target COLUMN_SCALAR NormalizationsIndex
    
    column_create NormalizationsIndex index COLUMN_INDEX Normalizations target
    
    
    table_create Lexicon TABLE_PAT_KEY ShortText \
      --normalizers 'NormalizerTable("normalized", \
                                     "Normalizations.normalized", \
                                     "target", \
                                     "target"), NormalizerNFKC150'
    
    table_create Names TABLE_HASH_KEY UInt64
    column_create Names name COLUMN_SCALAR Lexicon
    
    load --table Names
    [
    ["_key","name"],
    [1,"Sato Toshio"]
    ]
    
    select Names \
      --query '_key:1 OR name._key:@"Toshio"' \
      --output_columns 'highlight_html(name._key, Lexicon)
    
    [
      [
        0,
        1710401574.332274,
        0.001911401748657227
      ],
      [
        [
          [
            1
          ],
          [
            [
              "highlight_html",
              null
            ]
          ],
          [
            "sato <span class=\"keyword\">toshi</span>o"
          ]
        ]
      ]
    ]
    
  • [Ubuntu] We become able to provide package for Ubuntu again.

    We don’t provide packages for Ubuntu in Groonga version 14.0.0. Because we fail makeing Groonga package for Ubuntu by problrm of build environment for Ubuntu package.

    We fixed problrm of build environment for Ubuntu package in 14.0.1. So, we can provide packages for Ubuntu again since this release.

  • Fixed build error when we build from source by using clang. [GitHub#1738][Reported by windymelt]

Thanks#

  • windymelt

Release 14.0.0 - 2024-02-29#

This is a major version up! But It keeps backward compatibility. We can upgrade to 14.0.0 without rebuilding database.

Improvements#

  • Added a new tokenizer TokenH3Index (experimental).

    TokenH3Indextokenizes WGS84GetPoint to UInt64(H3 index).

  • Added support for offline and online index construction with non text based tokenizer (experimental).

    TokenH3Index is one of non text based tokenizers.

  • [select] Added support for searching by index with non text based tokenizer (experimental).

    TokenH3Index is one of non text based tokenizers.

  • Added new functions distance_cosine(), distance_inner_product(), distance_l2_norm_squared(), distance_l1_norm().

    We can only get records that a small distance as vector with these functions and limit N

    These functions calculate distance in the output stage.

    However, we don’t optimaize these functions yet.

    • distance_cosine(): Calculate cosine similarity.

    • distance_inner_product(): Calculate inner product.

    • distance_l2_norm_squared(): Calculate euclidean distance.

    • distance_l1_norm(): Calculate manhattan distance.

  • Added a new function number_round().

  • [load] Added support for parallel load.

    This feature only enable when the input_type of load is apache-arrow.

    This feature one thread per column. If there are many target columns, it will reduce load time.

  • [select] We can use uvector as much as possible for array literal in --filter.

    uvector is vector of elements with fix size.

    If all elements have the same type, we use uvector instead vector.

  • [status] Added n_workers to output of status.

  • Optimized a dynamic column creation.

  • [WAL] Added support for rebuilding broken indexes in parallel.

  • [select] Added support for Int64 in output_type=apache-arrow for columns that reference other table.

Fixes#

  • [Windows] Fixed path for documents of groonga-normalizer-mysql in package for Windows.

    Documents of groonga-normalizer-mysql put under the share/ in this release.

  • [select] Fixed a bug that Groonga may crash when we use bitwise operations.