News - 14 series#

Release 14.1.3 - 2025-01-29#

Improvements#

Offline index construction: Added support for parallel execution#

If you have many CPU cores, you can improve Offline index construction performance. You can use it by column_create ... COLUMN_INDEX ... --n_workers -1.

In general, Offline index construction will be 2-10x faster by this.

language_model_vectorize: Added support for Phi-4#

You can use Phi-4 as a language model. Note that Phi-4 isn’t designing for generating embeddings.

Added NormalizerNFKC #

It’s similar to existing NormalizerNFKC* such as NormalizerNFKC150. But it’s more generic than NormalizerNFKC*.

NormalizerNFKC accepts version argument to change used Unicode version. For example, NormalizerNFKC("version", "15.0.0") equals to NormalizerNFKC150.

NormalizerNFKC* are deprecated by NormalizerNFKC. You should use NormalizerNFKC for newly written applications.

NormalizerNFKC uses Unicode 16.0.0 by default.

Added TokenFilterNFKC #

It’s similar to existing TokenFilterNFKC* such as TokenFilterNFKC150. But it’s based on NormalizerNFKC. So you can specify Unicode version as version.

TokenFilterNFKC* are deprecated by TokenFilterNFKC. You should use TokenFilterNFKC for newly written applications.

TokenFilterNFKC uses Unicode 16.0.0 by default.

Amazon Linux: Added support for Amazon Linux 2023 aarch64#

See Amazon Linux how to install.

index_column_diff: Added support for changing log level for progress logs#

It’s useful to see only progress logs without other debug logs.

See progress_log_level for details.

Fixes#

Fixed a bug that deb/rpm packages can’t be used on machines that don’t have AVX#

pgroonga/pgroonga#642

Groonga uses llama.cpp to use language models. llama.cpp uses SIMD for fast processing.

We defer llama.cpp initialization that needs SIMD. So you can use Groonga on machines that don’t have AVX if you don’t use language model related features such as language_model_vectorize.

Reported by Yuki Shira.

TABLE_PAT_KEY: Fixed a bug that existing key may be broken after defrag #

If you have some garbage nodes in TABLE_PAT_KEY and defrag it, new added keys to the TABLE_PAT_KEY might override existing keys.

This bug exists since Release 14.1.1 - 2024-12-03.

TABLE_PAT_KEY: Fixed a bug that some WALs are missing for defrag #

This release added some missing WALs on defrag. If some WALs are missing, we may not be able to recover a crash in defrag.

defrag: Fixed a bug that columns of TABLE_PAT_KEY aren’t processed#

GH-2168

defrag pat should process pat and its columns. But it processes only pat since Release 14.1.1 - 2024-12-03. It’s a regression bug.

Index column: Fixed an Offline index construction bug#

It happened when:

Source is a Reference vector column
Source has a GRN_ID_NIL element (an element that refers nothing) between 1..-2 position
Index column has WITH_POSITION flag

Offline index construction always uses wrong position for elements after the GRN_ID_NIL element.

Offline index construction may use wrong index key. It depends on an uninitialized memory.

TABLE_HASH_KEY: Fixed a bug that garbage entry may not be reused#

It happened when a TABLE_HASH_KEY is rehashed with garbage. Garbage is increased by deleting records.

If rehash is done with garbage, the garbage is never reused after the rehash. A TABLE_PAT_KEY may large even when it has only a few valid records.

This bug exists since Release 10.1.1 - 2021-01-25.

Thanks#

Yuki Shira

Release 14.1.2 - 2024-12-24#

Improvements#

Relicensed to LGPL-2.1-or-later from LGPL-2.1-only#

GH-1774

It improves GPL 3 or later compatibility.

delete: Added support for reference weight vector#

If a reference column is indexed, delete can also delete the deleted reference value from the reference column. It works for a reference vector column but didn’t work for a reference “weight” vector column.

With this release, this also works for a reference “weight” vector column.

Fixes#

Vector column: Fixed a bug that needed cast may not be done#

Groonga casts a value automatically when it’s needed.

If a reference vector column uses 32-bit floating number weight and a new value is a no weight reference vector, Groonga must cast the new value. But it was not done.

Release 14.1.1 - 2024-12-03#

Improvements#

Stopped vendoring RapidJSON#

GH-1462

We added support for simdjson in 14.0.7. We don’t need to vendor RapidJSON anymore. RapidJSON has a file that uses the JSON license. It’s not a free software license / open source license. So we wanted to unvendor RapidJSON.

Note that RapidJSON is still supported. You can still use RapidJSON by installing RapidJSON separately. But we recommend that you use simdjson not RapidJSON.

Made llama.cpp initialization lazy#

GH-2078

llama.cpp requires recent SIMD features. If we always initialize llama.cpp, users who use old CPU and don’t use llama.cpp related features can’t use the official Groonga packages. So we made llama.cpp initialization lazy. Users who use llama.cpp related features still need recent CPU but users who don’t use llama.cpp related features can use old CPU as usual.

`NormalizerNFKC`: Added support for mixing `unify_middle_dot` and `remove_symbol`#

GH-1538

NormalizerNFKC family normalizers such as NormalizerNFKC150 can mix unify_middle_dot and remove_symbol options.

defrag: Added support for TABLE_PAT_KEY #

A TABLE_PAT_KEY table couldn’t reuse spaces that are used by removed keys. They’re still not be reusable automatically, but we can reuse them by executing defrag explicitly since this release.

load: Improved error handling#

If there are any critical errors in load, we will not wait for more data. Clients may stop sending more data in the case.

groonga executable file: Added `context_id` log flag#

It’s useful when you use Query log too. In Query log, the context ID is recorded. If we have the context ID in Process log and Query log, we can associate these logs easily.

groonga executable file: Improved error handling#

Request timeout uses lock internally. It may be failed in some cases. We added some error checks for the case.

TABLE_PAT_KEY: Improved key space efficiency#

GH-2104

There was a case that wastes key space. Now, we can use key space as efficiently as possible.

Fixes#

`grn_pat_size()`: Fixed a bug that `0` may not be returned on error#

GH-2097

If the given pat is NULL, it returns -22. But it should have returned 0 with error ctx->rc.

Now, Groonga uses the expected behavior.

Regular expression: Fixed a bug with an empty pattern#

GH-2121

If the given pattern is an empty string, there were some problems:

If there is an index for regular expression, Groonga was crashed.
If there is not an index for regular expression, Groonga returns all records.

Now, Groonga returns an empty result for an empty pattern.

Release 14.1.0 - 2024-11-05#

Improvements#

[status] Added `.["features"]["llama.cpp"]` entry#

The .["features"] entry shows whether each feature is enabled or not like the following:

[
  ["..."],
  {
    "...": "...",
    "features": {
      "nfkc": true,
      "mecab": true,
      "...": "..."
    },
    "...": "..."
  }
]

This release added llama.cpp to the .["features"] entry:

[
  ["..."],
  {
    "...": "...",
    "features": {
      "nfkc": true,
      "mecab": true,
      "...": "...",
      "llama.cpp": true,
      "...": "..."
    },
    "...": "..."
  }
]

You can use .["features"]["llama.cpp"] to know whether llama.cpp is enabled or not.

You need the llama.cpp support to use language_model_vectorize function described next.

[language_model_vectorize] Added (Experimental)#

This feature is still experimental and unstable.

This feature may consume a significant amount of CPU and memory.

The function returns normalized embeddings (vector representations) of the given text. An embedding represents the semantic information of the text as a vector. By converting meanings of the text into vectors, you can perform semantic searches instead of keyword based searches.

For example, “United States” and “America” look completely different but share the same meaning. Traditionally, you could search for words with different spellings but the same meaning by using synonym expansion. But this required maintaining a synonym table.

With semantic search enabled, you can find words with similar meanings without preparing or maintaining a synonym table.

[column_create] Added the generator parameter#

This release added the Generated column feature. You can use this feature to generate column values based on the values of other column automatically.

You can create a generated column by using column_create and generator. See Create a generated column for details.

Here are merits of this feature:

You no longer need to manage generation processes manually
You can reduce the amount of data to load since only the source columns are necessary

For example, you can use language_model_vectorize to generate embeddings of the values of other columns. An embedding may not be small. If it’s represented as an 1024 dimensions vector of 32-bit float, its size is 4KiB. If you generate embeddings automatically by language_model_vectorize and Generated column, you don’t need to send 4KiB embeddings to Groonga.

You can load only the text to be searched while simultaneously storing the embedding with this feature.

Fixes#

[GQTP] Fixed a bug that wrong return code is returned#

This is a client side problem. This problem was introduced in 14.0.0.

GQTP client converts Return code received from server wrongly. For example, INVALID_ARGUMENT code is -22 but it was converted to 65514.

This fix does not affect users who are not using GQTP clients.

[logical_range_filter] Fixed a bug that invalid JSON format response may be returned#

This problem happens only with Command version 3. The default command version is 1.

You can check your Groonga’s default command version using the status command:

[
  ["..."],
  {
    "...": "...",
    "command_version": 1,
    "default_command_version": 1,
    "...": "..."
  }
]

Here is an example invalid JSON format response. You can see that there is an extra "records": [] before the actual "records: [...]":

{
  "body": {
    "columns": ["..."],
    "records": [],
    "records": [
      [
        "The first post!",
        "Welcome! This is my first post!",
        1436313600.0,
        5,
        "Hello"
      ],
      "..."
    ]
  },
  "header": "..."
}

[request_cancel] Fixed a bug that invalid JSON format response may be returned#

You can stop a running command by request_cancel.

You might get an invalid JSON format response when you cancel a command. It depends on the timing of the cancellation.

This release fixed the cancellation process. You will not receive an invalid JSON format response even when you cancel a command.

[index_column_diff] Fixed a bug that false positive diff may be reported#

You can use index_column_diff to detect index corruption.

index_column_diff reports a missing record when a single record contains more than 131,072 instances of the same token. This release fixed this problem.

Release 14.0.9 - 2024-09-27#

Fixes#

Fixed build error when we build from source on Alpine Linux 3.20.

Release 14.0.8 - 2024-09-25#

Improvements#

[Amazon Linux 2023] Added support for Amazon Linux 2023. [GH-1845][Reported by takoyaki-nyokki and Watson]

However, we don’t support Amazon Linux 2023 for aarch64 currently. Because the maintenance is costly because of the time of building a package for aarch64 is too time consuming.

We don’t use TokenMecab in Groonga for Amazon Linux 2023. Because Amazon Linux 2023 doesn’t provide MeCab package.

Fixes#

Fixed a crash bug when we specified pgroonga.log_type on PGroonga for Windows.

Thanks#

takoyaki-nyokki
Watson

Release 14.0.7 - 2024-09-03#

Improvements#

[NormalizerNFKC] Added a new option unify_latin_alphabet_with.[pgroonga/pgroonga#418][Reported by Man Hoang]

This option enables that alphabets with diacritical mark and alphabets without diacritical mark regarded as the same character as below.

However, this feature focus on only LATIN (SMALL|CAPITAL) LETTER X WITH XXX. It doesn’t support LATIN (SMALL|CAPITAL) LETTER X + COMBINING XXX characters.

We can confirm that can get ngoằn with ngoan in the following example.

table_create --name Contents --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Contents --name content --type ShortText
load --table Contents
[
  {"_key":"1", "content":"con đường ngoằn nghoèo"},
]

table_create \
  --name Terms \
  --flags TABLE_PAT_KEY \
  --key_type ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC150("unify_latin_alphabet_with", true)'
column_create \
  --table Terms \
  --name index_content \
  --flags COLUMN_INDEX|WITH_POSITION \
  --type Contents \
  --source content

select --table Contents --query content:@ngoan
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        1
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "content",
          "ShortText"
        ]
      ],
      [
        1,
        "1",
        "con đường ngoằn nghoèo"
      ]
    ]
  ]
]

Note

Current normalizers has the following bug.

The lowercase of U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE must be U+0069 LATIN SMALL LETTER I. But the current implementation uses U+0069 LATIN SMALL LETTER I + U+0307 COMBINING DOT ABOVE.

We will fix the bug from NormalizerNFKC160.

Added support for simdjson.

Currently, we keeps RapidJSON support but we’ll remove RapidJSON support later.
[Ubuntu] Dropped support for Ubuntu 23.10 (Mantic Minotaur)

Because Ubuntu 23.10 reached EOL at July 11, 2024.
Distributed the install packages which enabled libedit.

Now in Groonga REPL, we can refer to the command history using the up arrow key.

Fixes#

[status] Fixed a bug that “h3” is never shown in groonga --version/status.
Fixed a crash bug when log rotate is enabled.

This bug occurred in the following conditions.
1. When logging to a file by specification the following options.
 - --log-path <path>
 - --query-log-path <path>
2. When log rotate is enabled by specification the following options.
 - --log-rotate-threshold-size <threshold>
 - --query-log-rotate-threshold-size <threshold>
3. When process ID log output is enabled by specification the following options.
 - --log-flags process_id
Fixed a bug that Groonga use bundled Onigmo even if we specify GRN_WITH_BUNDLED_ONIGMO=OFF. [termux/termux-packages#21252][Patch by Biswapriyo Nath]

Thanks#

Man Hoang
Biswapriyo Nath

Release 14.0.6 - 2024-07-29#

Improvements#

Added support for prefix search without index against vector column.

We can already prefix search with index against vector column. However, we can’t prefix search without index against vector column until now.

We can also prefix search without index against vector column as below since this release.

table_create Users TABLE_NO_KEY
column_create Users name COLUMN_VECTOR ShortText

load --table Users
[
{"name": ["alice"]},
{"name": ["bob"]},
{"name": ["callum"]},
{"name": ["elly", "alice"]},
{"name": ["marshal"]}
]

select Users --query 'name:^al'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        2
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "name",
          "ShortText"
        ]
      ],
      [
        1,
        [
          "alice"
        ]
      ],
      [
        4,
        [
          "elly",
          "alice"
        ]
      ]
    ]
  ]
]

[Debian GNU/Linux] Dropped support for Debian 11 (bullseye).

Release 14.0.5 - 2024-07-04#

Improvements#

Dropped support for CentOS 7.

Because CentOS 7 reached EOL.
Added a new feature that objects(table or column) as remove as possible.

The crash safe feature of PGroonga will use this feature mainly.

PGroonga will apply PGroonga’s WAL to standby database automatically by using Custom WAL Resource Managers. However, when PGroonga use Custom WAL Resource Managers, all replications are stop if PGroonga fail application of PGroonga’s WAL due to break Groonga’s object. So, if broken objects exist in database, Groonga will try as remove as possible objects by using this feature.

Fixes#

[query] Fixed a bug that the order of evaluation of 'A || query("...", "B C")' is wrong

We expect that {"name": "Alice", "memo": "Groonga user"} is hit in the following example. However, if this problem occurred, the following query had not been hit.

table_create Users TABLE_NO_KEY
column_create Users name COLUMN_SCALAR ShortText
column_create Users memo COLUMN_SCALAR ShortText

load --table Users
[
{"name": "Alice", "memo": "Groonga user"},
{"name": "Bob",   "memo": "Rroonga user"}
]
select Users \
  --output_columns 'name, memo, _score' \
  --filter 'memo @ "Groonga" || query("name", "Bob Rroonga")'
[[0,0.0,0.0],[[[0],[["name","ShortText"],["memo","ShortText"],["_score","Int32"]]]]]

After the fix, {"name": "Alice", "memo": "Groonga user"} is hit such as the following example.

select Users \
  --output_columns 'name, memo, _score' \
  --filter 'memo @ "Groonga" || query("name", "Bob Rroonga")'
[
  [
    0,
    1719376617.537505,
    0.002481460571289062
  ],
  [
    [
      [
        1
      ],
      [
        [
          "name",
          "ShortText"
        ],
        [
          "memo",
          "ShortText"
        ],
        [
          "_score",
          "Int32"
        ]
      ],
      [
        "Alice",
        "Groonga user",
        1
      ]
    ]
  ]
]

Here is occurrence condition of this problem.

We use OR search and query().
We use AND search in query().
The order of condition expression is 'A || query("...", "B C")'.

So, this problem doesn’t occur if we use only query() or we don’t use AND search in query().

[select] Fixed a bug that a condition that evaluate prefix search in advance such as --query "A* OR B" returned wrong search result

This problem may occur when prefix search evaluate in advance. This problem doesn’t occur a condition that evaluate prefix search in the end such as --query A OR B*.

If this problem occur, the Bo and the li of --query "Bo* OR li" are evaluated as a prefix search. As a result, The following query does not hit. Because li is evaluated as a prefix search as mentioned above.

table_create Users TABLE_NO_KEY
column_create Users name COLUMN_SCALAR ShortText

load --table Users
[
["name"],
["Alice"]
]

select Users \
  --match_columns name \
  --query "Bo* OR li"
[
  [
    0,
    1719377505.628048,
    0.0007376670837402344
  ],
  [
    [
      [
        0
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "name",
          "ShortText"
        ]
      ]
    ]
  ]
]

Release 14.0.4 - 2024-05-29#

Fixes#

[query_parallel_or] Fixed a bug that the match_escalation_threshold or force_match_escalation options were ignored when using query_parallel_or().

Before the fix, even when match_escalation_threshold was set to disable match escalation, the results still escalated when we use query_parallel_or(). This problem occurred only query_parallel_or(). query don’t occur this problem.

Generally, we don’t disable match escalation. Because we want to get something search results. The number of hits is 0 is the unwelcome result of us. Therefore, this problem has no effect by many users. However, it has effect by user who use stop word as below.
```
plugin_register token_filters/stop_word

table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText

table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters TokenFilterStopWord
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
column_create Terms is_stop_word COLUMN_SCALAR Bool
load --table Terms
[
{"_key": "and", "is_stop_word": true}
]

load --table Memos
[
{"content": "Hello"},
{"content": "Hello and Good-bye"},
{"content": "Good-bye"}
]

select Memos \
  --filter 'query_parallel_or(["content", "content", "content", "content"], \
            "and", \
            {"options": {"TokenFilterStopWord.enable": true}})' \
  --match_escalation_threshold -1 \
  --sort_keys -_score
```
We don’t want to match a keyword that is registered as stopword. Therefore, we set -1 to match_escalation_threshold in the above example. We expect that Groonga doesn’t return records in the above example because of escalation disable and search keyword(and) is registered as stopword. However, If this problem occur, Groonga returns match record. Because if we use query_parallel_or(), match_escalation_threshold doesn’t work.

Fixed a bug that full-text search against a reference column of a vector didn’t work.

This problem has occured Groonga v14.0.0 or later. This problem has effect if we execute full-text search against a reference column of a vector.

We expected that Groonga returns [1, "Linux MySQL"] and [2, "MySQL Groonga"] as below example. However, before the fix, Groonga always returned 0 hits as below because of we executed full-text search on a reference column of a vector.

table_create bugs TABLE_PAT_KEY UInt32

table_create tags TABLE_PAT_KEY ShortText --default_tokenizer TokenDelimit
column_create tags name COLUMN_SCALAR ShortText

column_create bugs tags COLUMN_VECTOR tags

load --table bugs
[
["_key", "tags"],
[1, "Linux MySQL"],
[2, "MySQL Groonga"],
[3, "Mroonga"]
]

column_create tags bugs_tags_index COLUMN_INDEX bugs tags

select --table bugs --filter 'tags @ "MySQL"'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        0
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "UInt32"
        ],
        [
          "tags",
          "tags"
        ]
      ]
    ]
  ]
]

Release 14.0.3 - 2024-05-09#

Improvements#

We optimized performance as below.
- We optimized performance of OR and AND search when the number of hits were many.
- We optimized performance of prefix search(@^).
- We optimized performance of AND search when the number of records of A more than B in condition of A AND B.
- We optimized performance of search when we used many dynamic columns.

[TokenNgram] Added new option ignore_blank.

We can replace TokenBigramIgnoreBlank with TokenNgram("ignore_blank", true) as below.

Here is example of use TokenBigram.

tokenize TokenBigram "! ! !" NormalizerAuto
[
  [
    0,
    1715155644.64263,
    0.001013517379760742
  ],
  [
    {
      "value": "!",
      "position": 0,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "!",
      "position": 1,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "!",
      "position": 2,
      "force_prefix": false,
      "force_prefix_search": false
    }
  ]
]

Here is example of use TokenBigramIgnoreBlank.

tokenize TokenBigramIgnoreBlank "! ! !" NormalizerAuto
[
  [
    0,
    1715155680.323451,
    0.0009913444519042969
  ],
  [
    {
      "value": "!!!",
      "position": 0,
      "force_prefix": false,
      "force_prefix_search": false
    }
  ]
]

Here is example of use TokenNgram("ignore_blank", true).

tokenize 'TokenNgram("ignore_blank", true)' "! ! !" NormalizerAuto
[
  [
    0,
    1715155762.340685,
    0.001041412353515625
  ],
  [
    {
      "value": "!!!",
      "position": 0,
      "force_prefix": false,
      "force_prefix_search": false
    }
  ]
]

[Ubuntu] Add support for Ubuntu 24.04 LTS (Noble Numbat).

Fixes#

[request_cancel] Fix a bug that Groonga may crash when we execute request_cancel command while we execute the other query.

Fixed the unexpected error when using --post_filter with --offset greater than the post-filtered result

In the same situation, using --filter with --offset doesn’t raise the error. This inconsistency in behavior between --filter and --post-filter has now been resolved.

table_create Users TABLE_PAT_KEY ShortText
column_create Users age COLUMN_SCALAR UInt32
load --table Users
[
  ["_key", "age"],
  ["Alice", 21],
  ["Bob", 22],
  ["Chris", 23],
  ["Diana", 24],
  ["Emily", 25]
]
select Users \
  --filter 'age >= 22' \
  --post_filter 'age <= 24' \
  --offset 3 \
  --sort_keys -age --output_pretty yes
[
  [
    -68,
    1715224057.317582,
    0.001833438873291016,
    "[table][sort] grn_output_range_normalize failed",
    [
      [
        "grn_table_sort",
        "/home/horimoto/Work/free-software/groonga.tag/lib/sort.c",
        1052
      ]
    ]
  ]
]

Fixed a bug where incorrect search result could be returned when not all phrases within (...) matched using near phrase product.

For example, there is no record which matched (2) condition using --query '*NPP1"(a) (2)"'. In this case, the expected behavior would be return no record. However, the actual behavior was equal to the query --query '*NPP and "(a)" as below. This means that despite no records matched (2), records like ax1 and axx1 were incorrectly returned.

table_create Entries TABLE_NO_KEY
column_create Entries content COLUMN_SCALAR Text

table_create Terms TABLE_PAT_KEY ShortText   --default_tokenizer TokenNgram
column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content
load --table Entries
[
{"content": "ax1"},
{"content": "axx1"}
]

select Entries \
  --match_columns content \
  --query '*NPP1"(a) (2)"' \
  --output_columns 'content'
[
  [
    0,
    1715224211.050228,
    0.001366376876831055
  ],
  [
    [
      [
        2
      ],
      [
        [
          "content",
          "Text"
        ]
      ],
      [
        "ax1"
      ],
      [
        "axx1"
      ]
    ]
  ]
]

Fixed a bug that rehash failed or data in a table broke when rehash occurred that the table with TABLE_HASH_KEY has 2^28 or more records.

Fixed a bug that highlight position slipped out of place in the following cases.

If full width space existed before highlight target characters as below.

We expected that Groonga returned "Groonga　高速！". However, Groonga returned "Groonga　高速！" as below.

table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText

table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer 'TokenNgram("report_source_location", true)' \
  --normalizer 'NormalizerNFKC150("report_source_offset", true)'
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body

load --table Entries
[
{"body": "Groonga　高速！"}
]
select Entries \
  --output_columns \
  --match_columns body \
  --query '高' \
  --output_columns 'highlight_html(body, Terms)'
[
  [
    0,
    1715215640.979517,
    0.001608610153198242
  ],
  [
    [
      [
        1
      ],
      [
        [
          "highlight_html",
          null
        ]
      ],
      [
        "Groonga　<span class=\"keyword\">高速</span>！"
      ]
    ]
  ]
]

If we used TokenNgram("loose_blank", true) and if highlight target characters included full width space as below.

We expected that Groonga returned "山田太郎". However, Groonga returned "山田太" as below.

table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText

table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer 'TokenNgram("loose_blank", true, "report_source_location", true)' \
  --normalizer 'NormalizerNFKC150("report_source_offset", true)'
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body

load --table Entries
[
{"body": "山田 太郎"}
]

select Entries --output_columns \
  --match_columns body --query '山田太郎' \
  --output_columns 'highlight_html(body, Terms)' --output_pretty yes
[
  [
    0,
    1715220409.096246,
    0.0004854202270507812
  ],
  [
    [
      [
        1
      ],
      [
        [
          "highlight_html",
          null
        ]
      ],
      [
        "<span class=\"keyword\">山田 太</span>"
      ]
    ]
  ]
]

If white space existed in the front of highlight target characters as below.

We expected that Groonga returned " 山田太郎". However, Groonga returned " 山" as below.

table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText

table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer 'TokenNgram("report_source_location", true)' \
  --normalizer 'NormalizerNFKC150("report_source_offset", true)'
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body

load --table Entries
[
{"body": " 山田太郎"}
]

select Entries \
  --output_columns \
  --match_columns body \
  --query '山' \
  --output_columns 'highlight_html(body, Terms)' --output_pretty yes
[
  [
    0,
    1715221627.002193,
    0.001977920532226562
  ],
  [
    [
      [
        1
      ],
      [
        [
          "highlight_html",
          null
        ]
      ],
      [
        " <span class=\"keyword\">山</span>"
      ]
    ]
  ]
]

If the second character of highlight target was full width space as below.

We expected that Groonga returned "山　田太郎". However, Groonga returned "山　田太郎" as below.

table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText

table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer 'TokenNgram("report_source_location", true)' \
  --normalizer 'NormalizerNFKC150("report_source_offset", true)'
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body

load --table Entries
[
{"body": "山　田太郎"}
]

select Entries \
  --output_columns \
  --match_columns body \
  --query '山　田' \
  --output_columns 'highlight_html(body, Terms)'
[
  [
    0,
    1715222501.496007,
    0.0005536079406738281
  ],
  [
    [
      [
        0
      ],
      [
        [
          "highlight_html",
          "<span class=\"keyword\">山　田太</span>郎"
        ]
      ]
    ]
  ]
]

Release 14.0.2 - 2024-03-29#

Improvements#

Reduced a log level of a log when Groonga setting normalizers/tokenizer/token_filters against temporary table.

For example, the target log of this modification is the following log.
```
DDL:1234567890:set_normalizers NormalizerAuto
```
PGroonga sets normalizers against temporary table on start. So, this log becomes noise. Because this log become output when PGroonga start because of PGroonga’s default log level is notice.

Therefore, we reduce log level to debug for the log since this release. Thus, this log does not output when PGroonga start in default.

Release 14.0.1 - 2024-03-14#

Improvements#

[load] Stopped reporting an error when we load key that becomes an empty key by normalization.

"-" becomes "" with NormalizerNFKC150("remove_symbol", true). So the following case reports a “empty key” error.
```
table_create Values TABLE_HASH_KEY ShortText \
  --normalizers 'NormalizerNFKC150("remove_symbol", true)'
table_create Data TABLE_NO_KEY
column_create Data value COLUMN_SCALAR Values
load --table Data
[
{"value": "-"}
]
```
However, if we many load in such data, many error log are generated. Because Groonga output many “empty key” error because of Groonga can’t register empty string to index.

No problem even if empty string can’t register to index in such case. Because we don’t match anything even if we search by empty string. So, we stop reporting an “empty key” error in such case.

Fixes#

Fixed a crash bug if a request is canceled between or range search.

This bug doesn’t necessarily occur. This bug occur when we cancel a request in the specific timing. This bug occur easily when search time is long such as sequential search.

Fixed a bug that highlight_html may return invalid result when the following conditions are met.

We use multiple normalizers such as NormalizerTable and NormalizerNFKC150.
We highlight string include whitespace.

For example, this bug occur such as the following case.

table_create NormalizationsIndex TABLE_PAT_KEY ShortText --normalizer NormalizerAuto

table_create Normalizations TABLE_HASH_KEY UInt64
column_create Normalizations normalized COLUMN_SCALAR LongText
column_create Normalizations target COLUMN_SCALAR NormalizationsIndex

column_create NormalizationsIndex index COLUMN_INDEX Normalizations target


table_create Lexicon TABLE_PAT_KEY ShortText \
  --normalizers 'NormalizerTable("normalized", \
                                 "Normalizations.normalized", \
                                 "target", \
                                 "target"), NormalizerNFKC150'

table_create Names TABLE_HASH_KEY UInt64
column_create Names name COLUMN_SCALAR Lexicon

load --table Names
[
["_key","name"],
[1,"Sato Toshio"]
]

select Names \
  --query '_key:1 OR name._key:@"Toshio"' \
  --output_columns 'highlight_html(name._key, Lexicon)

[
  [
    0,
    1710401574.332274,
    0.001911401748657227
  ],
  [
    [
      [
        1
      ],
      [
        [
          "highlight_html",
          null
        ]
      ],
      [
        "sato <span class=\"keyword\">toshi</span>o"
      ]
    ]
  ]
]

[Ubuntu] We become able to provide package for Ubuntu again.

We don’t provide packages for Ubuntu in Groonga version 14.0.0. Because we fail making Groonga package for Ubuntu by problrm of build environment for Ubuntu package.

We fixed problrm of build environment for Ubuntu package in 14.0.1. So, we can provide packages for Ubuntu again since this release.
Fixed build error when we build from source by using clang. [GitHub#1738][Reported by windymelt]

Thanks#

windymelt

Release 14.0.0 - 2024-02-29#

This is a major version up! But It keeps backward compatibility. We can upgrade to 14.0.0 without rebuilding database.

Improvements#

Added a new tokenizer TokenH3Index (experimental).

TokenH3Indextokenizes WGS84GetPoint to UInt64(H3 index).
Added support for offline and online index construction with non text based tokenizer (experimental).

TokenH3Index is one of non text based tokenizers.
[select] Added support for searching by index with non text based tokenizer (experimental).

TokenH3Index is one of non text based tokenizers.
Added new functions distance_cosine(), distance_inner_product(), distance_l2_norm_squared(), distance_l1_norm().

We can only get records that a small distance as vector with these functions and limit N

These functions calculate distance in the output stage.

However, we don’t optimize these functions yet.
- distance_cosine(): Calculate cosine similarity.
- distance_inner_product(): Calculate inner product.
- distance_l2_norm_squared(): Calculate euclidean distance.
- distance_l1_norm(): Calculate manhattan distance.
Added a new function number_round().
[load] Added support for parallel load.

This feature only enable when the input_type of load is apache-arrow.

This feature one thread per column. If there are many target columns, it will reduce load time.
[select] We can use uvector as much as possible for array literal in --filter.

uvector is vector of elements with fix size.

If all elements have the same type, we use uvector instead vector.
[status] Added n_workers to output of status.
Optimized a dynamic column creation.
[WAL] Added support for rebuilding broken indexes in parallel.
[select] Added support for Int64 in output_type=apache-arrow for columns that reference other table.

Fixes#

[Windows] Fixed path for documents of groonga-normalizer-mysql in package for Windows.

Documents of groonga-normalizer-mysql put under the share/ in this release.
[select] Fixed a bug that Groonga may crash when we use bitwise operations.

News - 14 series#

Release 14.1.3 - 2025-01-29#

Improvements#

Offline index construction: Added support for parallel execution#

language_model_vectorize: Added support for Phi-4#

Added NormalizerNFKC#

Added TokenFilterNFKC#

Amazon Linux: Added support for Amazon Linux 2023 aarch64#

index_column_diff: Added support for changing log level for progress logs#

Fixes#

Fixed a bug that deb/rpm packages can’t be used on machines that don’t have AVX#

TABLE_PAT_KEY: Fixed a bug that existing key may be broken after defrag#

TABLE_PAT_KEY: Fixed a bug that some WALs are missing for defrag#

defrag: Fixed a bug that columns of TABLE_PAT_KEY aren’t processed#

Index column: Fixed an Offline index construction bug#

TABLE_HASH_KEY: Fixed a bug that garbage entry may not be reused#

Thanks#

Release 14.1.2 - 2024-12-24#

Improvements#

Relicensed to LGPL-2.1-or-later from LGPL-2.1-only#

delete: Added support for reference weight vector#

Fixes#

Vector column: Fixed a bug that needed cast may not be done#

Release 14.1.1 - 2024-12-03#

Improvements#

Stopped vendoring RapidJSON#

Made llama.cpp initialization lazy#

NormalizerNFKC: Added support for mixing unify_middle_dot and remove_symbol#

defrag: Added support for TABLE_PAT_KEY#

load: Improved error handling#

groonga executable file: Added context_id log flag#

groonga executable file: Improved error handling#

TABLE_PAT_KEY: Improved key space efficiency#

Fixes#

grn_pat_size(): Fixed a bug that 0 may not be returned on error#

Regular expression: Fixed a bug with an empty pattern#

Release 14.1.0 - 2024-11-05#

Improvements#

[status] Added .["features"]["llama.cpp"] entry#

[language_model_vectorize] Added (Experimental)#

[column_create] Added the generator parameter#

Fixes#

[GQTP] Fixed a bug that wrong return code is returned#

[logical_range_filter] Fixed a bug that invalid JSON format response may be returned#

[request_cancel] Fixed a bug that invalid JSON format response may be returned#

[index_column_diff] Fixed a bug that false positive diff may be reported#

Release 14.0.9 - 2024-09-27#

Fixes#

Release 14.0.8 - 2024-09-25#

Improvements#

Fixes#

Thanks#

Release 14.0.7 - 2024-09-03#

Improvements#

Fixes#

Thanks#

Release 14.0.6 - 2024-07-29#

Improvements#

Release 14.0.5 - 2024-07-04#

Improvements#

Fixes#

Release 14.0.4 - 2024-05-29#

Fixes#

Release 14.0.3 - 2024-05-09#

Improvements#

Fixes#

Release 14.0.2 - 2024-03-29#

Improvements#

Release 14.0.1 - 2024-03-14#

Improvements#

Fixes#

Thanks#

Release 14.0.0 - 2024-02-29#

Improvements#

Fixes#

Added NormalizerNFKC #

Added TokenFilterNFKC #

TABLE_PAT_KEY: Fixed a bug that existing key may be broken after defrag #

TABLE_PAT_KEY: Fixed a bug that some WALs are missing for defrag #

`NormalizerNFKC`: Added support for mixing `unify_middle_dot` and `remove_symbol`#

defrag: Added support for TABLE_PAT_KEY #

groonga executable file: Added `context_id` log flag#

`grn_pat_size()`: Fixed a bug that `0` may not be returned on error#

[status] Added `.["features"]["llama.cpp"]` entry#