Groonga 11.0.5リリース

BloGroonga

2021-07-29

Groonga 11.0.5リリース

Groonga 11.0.5をリリースしました！

それぞれの環境毎のインストール方法: インストール

変更内容

主な変更点は以下の通りです。

改良

Normalizers ノーマライザーを複数指定できるようになりました。

テーブル作成時に --normalizers オプションを使うことで、複数のノーマライザーを指定できます。互換性のため、既存の --normalizer を使っても複数のノーマライザーを指定できます。

Groonga 11.0.4 でノーマライザーをカスタマイズするための NormalizerTable を追加しました。この NormalizerTable と既存のノーマライザーを組み合わせることでより柔軟な制御ができます。

例えば、この機能は、以下のようなケースで有用です。
- 電話番号を検索することを考えます。ただし、データはOCRで手書きのデータを取り込みます。 OCRは数字と文字列をご認識することがあります。(例えば、 5とS など)
具体的には以下の通りです。
```
  table_create Normalizations TABLE_PAT_KEY ShortText
  column_create Normalizations normalized COLUMN_SCALAR ShortText
  load --table Normalizations
  [
  {"_key": "s", "normalized": "5"}
  ]

  table_create Tels TABLE_NO_KEY
  column_create Tels tel COLUMN_SCALAR ShortText

  table_create TelsIndex TABLE_PAT_KEY ShortText \
    --normalizers 'NormalizerNFKC130("unify_hyphen_and_prolonged_sound_mark", true), \
                   NormalizerTable("column", "Normalizations.normalized")' \
    --default_tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)'
  column_create TelsIndex tel_index COLUMN_INDEX|WITH_SECTION Tels tel

  load --table Tels
  [
  {"tel": "03-4S-1234"}
  {"tel": "03-45-9876"}
  ]

  select --table Tels \
    --filter 'tel @ "03-45-1234"'
  [
    [
      0,
      1625227424.560146,
      0.0001730918884277344
    ],
    [
      [
        [
          1
        ],
        [
          [
            "_id",
            "UInt32"
          ],
          [
            "tel",
            "ShortText"
          ]
        ],
        [
          1,
          "03-4S-1234"
        ]
      ]
    ]
  ]
```
既存のノーマライザーでは、このようなケースには対応できませんでしたが、このリリースから既存のノーマライザーと NormalizerTable を組み合わせることで、このようなケースにも対応できます。
query_parallel_or、query シーケンシャルサーチのしきい値をカスタマイズできるようにしました。

以下のオプションを使うことで、シーケンシャルサーチを使うかどうかのしきい値をクエリーごとにカスタマイズできます。
- {"max_n_enough_filtered_records": xx}
  
  max_n_enough_filtered_records は、レコード数を指定します。 query または、 query_parallel_or で、この値以下までレコードが絞り込めそうな場合、シーケンシャルサーチを使います。
- {"enough_filtered_ratio": x.x}
  
  enough_filtered_ratio は、全体に占める割合を指定します。 query または、 query_parallel_or で、この割合以下までレコードが絞り込めそうな場合、シーケンシャルサーチを使います。例えば、　{"enough_filtered_ratio": 0.5} とした場合、 query または query_parallel_or で全体の半分まで絞り込めそうな場合はシーケンシャルサーチを使います。
具体的には以下の通りです。
```
  table_create Products TABLE_NO_KEY
  column_create Products name COLUMN_SCALAR ShortText

  table_create Terms TABLE_PAT_KEY ShortText --normalizer NormalizerAuto
  column_create Terms products_name COLUMN_INDEX Products name

  load --table Products
  [
  ["name"],
  ["Groonga"],
  ["Mroonga"],
  ["Rroonga"],
  ["PGroonga"],
  ["Ruby"],
  ["PostgreSQL"]
  ]

  select \
    --table Products \
    --filter 'query("name", "r name:Ruby", {"enough_filtered_ratio": 0.5})'
```
```
  table_create Products TABLE_NO_KEY
  column_create Products name COLUMN_SCALAR ShortText

  table_create Terms TABLE_PAT_KEY ShortText --normalizer NormalizerAuto
  column_create Terms products_name COLUMN_INDEX Products name

  load --table Products
  [
  ["name"],
  ["Groonga"],
  ["Mroonga"],
  ["Rroonga"],
  ["PGroonga"],
  ["Ruby"],
  ["PostgreSQL"]
  ]

  select \
    --table Products \
    --filter 'query("name", "r name:Ruby", {"max_n_enough_filtered_records": 10})'
```

between、in_values シーケンシャルサーチのしきい値をカスタマイズできるようにしました。

between と in_values は、検索対象のレコードが十分に絞り込まれている時に、シーケンシャルサーチに切り替える機能があります。

今までも、以下のように環境変数でこのしきい値はカスタマイズできました。

GRN_IN_VALUES_TOO_MANY_INDEX_MATCH_RATIO
GRN_BETWEEN_TOO_MANY_INDEX_MATCH_RATIO

in_values():

  # Don't use auto sequential search
  GRN_IN_VALUES_TOO_MANY_INDEX_MATCH_RATIO=-1
  # Set threshold to 0.02
  GRN_IN_VALUES_TOO_MANY_INDEX_MATCH_RATIO=0.02

between():

  # Don't use auto sequential search
  GRN_BETWEEN_TOO_MANY_INDEX_MATCH_RATIO=-1
  # Set threshold to 0.02
  GRN_BETWEEN_TOO_MANY_INDEX_MATCH_RATIO=0.02

環境変数によるカスタマイズは、すべてのクエリーに対して適用されますが、この機能は、クエリーごとにしきい値を指定できます。

具体的には以下の通りです。 {"too_many_index_match_ratio": x.xx} オプションでしきい値を指定できます。このオプションの値の型は double 型です。

  table_create Memos TABLE_HASH_KEY ShortText
  column_create Memos timestamp COLUMN_SCALAR Time

  table_create Times TABLE_PAT_KEY Time
  column_create Times memos_timestamp COLUMN_INDEX Memos timestamp

  load --table Memos
  [
  {"_key": "001", "timestamp": "2014-11-10 07:25:23"},
  {"_key": "002", "timestamp": "2014-11-10 07:25:24"},
  {"_key": "003", "timestamp": "2014-11-10 07:25:25"},
  {"_key": "004", "timestamp": "2014-11-10 07:25:26"},
  {"_key": "005", "timestamp": "2014-11-10 07:25:27"},
  {"_key": "006", "timestamp": "2014-11-10 07:25:28"},
  {"_key": "007", "timestamp": "2014-11-10 07:25:29"},
  {"_key": "008", "timestamp": "2014-11-10 07:25:30"},
  {"_key": "009", "timestamp": "2014-11-10 07:25:31"},
  {"_key": "010", "timestamp": "2014-11-10 07:25:32"},
  {"_key": "011", "timestamp": "2014-11-10 07:25:33"},
  {"_key": "012", "timestamp": "2014-11-10 07:25:34"},
  {"_key": "013", "timestamp": "2014-11-10 07:25:35"},
  {"_key": "014", "timestamp": "2014-11-10 07:25:36"},
  {"_key": "015", "timestamp": "2014-11-10 07:25:37"},
  {"_key": "016", "timestamp": "2014-11-10 07:25:38"},
  {"_key": "017", "timestamp": "2014-11-10 07:25:39"},
  {"_key": "018", "timestamp": "2014-11-10 07:25:40"},
  {"_key": "019", "timestamp": "2014-11-10 07:25:41"},
  {"_key": "020", "timestamp": "2014-11-10 07:25:42"},
  {"_key": "021", "timestamp": "2014-11-10 07:25:43"},
  {"_key": "022", "timestamp": "2014-11-10 07:25:44"},
  {"_key": "023", "timestamp": "2014-11-10 07:25:45"},
  {"_key": "024", "timestamp": "2014-11-10 07:25:46"},
  {"_key": "025", "timestamp": "2014-11-10 07:25:47"},
  {"_key": "026", "timestamp": "2014-11-10 07:25:48"},
  {"_key": "027", "timestamp": "2014-11-10 07:25:49"},
  {"_key": "028", "timestamp": "2014-11-10 07:25:50"},
  {"_key": "029", "timestamp": "2014-11-10 07:25:51"},
  {"_key": "030", "timestamp": "2014-11-10 07:25:52"},
  {"_key": "031", "timestamp": "2014-11-10 07:25:53"},
  {"_key": "032", "timestamp": "2014-11-10 07:25:54"},
  {"_key": "033", "timestamp": "2014-11-10 07:25:55"},
  {"_key": "034", "timestamp": "2014-11-10 07:25:56"},
  {"_key": "035", "timestamp": "2014-11-10 07:25:57"},
  {"_key": "036", "timestamp": "2014-11-10 07:25:58"},
  {"_key": "037", "timestamp": "2014-11-10 07:25:59"},
  {"_key": "038", "timestamp": "2014-11-10 07:26:00"},
  {"_key": "039", "timestamp": "2014-11-10 07:26:01"},
  {"_key": "040", "timestamp": "2014-11-10 07:26:02"},
  {"_key": "041", "timestamp": "2014-11-10 07:26:03"},
  {"_key": "042", "timestamp": "2014-11-10 07:26:04"},
  {"_key": "043", "timestamp": "2014-11-10 07:26:05"},
  {"_key": "044", "timestamp": "2014-11-10 07:26:06"},
  {"_key": "045", "timestamp": "2014-11-10 07:26:07"},
  {"_key": "046", "timestamp": "2014-11-10 07:26:08"},
  {"_key": "047", "timestamp": "2014-11-10 07:26:09"},
  {"_key": "048", "timestamp": "2014-11-10 07:26:10"},
  {"_key": "049", "timestamp": "2014-11-10 07:26:11"},
  {"_key": "050", "timestamp": "2014-11-10 07:26:12"}
  ]

  select Memos \
    --filter '_key == "003" && \
              between(timestamp, \
                      "2014-11-10 07:25:24", \
                      "include", \
                      "2014-11-10 07:27:26", \
                      "exclude", \
                      {"too_many_index_match_ratio": 0.03})'

  table_create Tags TABLE_HASH_KEY ShortText

  table_create Memos TABLE_HASH_KEY ShortText
  column_create Memos tag COLUMN_SCALAR Tags

  load --table Memos
  [
  {"_key": "Rroonga is fast!", "tag": "Rroonga"},
  {"_key": "Groonga is fast!", "tag": "Groonga"},
  {"_key": "Mroonga is fast!", "tag": "Mroonga"},
  {"_key": "Groonga sticker!", "tag": "Groonga"},
  {"_key": "Groonga is good!", "tag": "Groonga"}
  ]

  column_create Tags memos_tag COLUMN_INDEX Memos tag

  select \
    Memos \
    --filter '_id >= 3 && \
              in_values(tag, \
                       "Groonga", \
                       {"too_many_index_match_ratio": 0.7})' \
    --output_columns _id,_score,_key,tag

between GRN_EXPR_OPTIMIZE=yes をサポートしました。

between() で、条件式の評価順序の最適化をサポートしました。

query_parallel_or、query match_columns を vector で指定できるようにしました。

以下のように、 query と query_parallel_or の match_columns に vector を使えます。

table_create Users TABLE_NO_KEY
column_create Users name COLUMN_SCALAR ShortText
column_create Users memo COLUMN_SCALAR ShortText
column_create Users tag COLUMN_SCALAR ShortText

table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenNgram \
  --normalizer NormalizerNFKC130
column_create Terms name COLUMN_INDEX|WITH_POSITION Users name
column_create Terms memo COLUMN_INDEX|WITH_POSITION Users memo
column_create Terms tag COLUMN_INDEX|WITH_POSITION Users tag

load --table Users
[
{"name": "Alice", "memo": "Groonga user", "tag": "Groonga"},
{"name": "Bob",   "memo": "Rroonga user", "tag": "Rroonga"}
]

select Users \
  --output_columns _score,name \
  --filter 'query(["name * 100", "memo", "tag * 10"], \
                  "Alice OR Groonga")'

select 前方一致検索でセクションと重みをサポートしました。

前方一致検索でマルチカラムインデックスと、スコアーの調整ができます。

  table_create Memos TABLE_NO_KEY
  column_create Memos title COLUMN_SCALAR ShortText
  column_create Memos tags COLUMN_VECTOR ShortText

  table_create Terms TABLE_PAT_KEY ShortText
  column_create Terms index COLUMN_INDEX|WITH_SECTION Memos title,tags

  load --table Memos
  [
  {"title": "Groonga", "tags": ["Groonga"]},
  {"title": "Rroonga", "tags": ["Groonga", "Rroonga", "Ruby"]},
  {"title": "Mroonga", "tags": ["Groonga", "Mroonga", "MySQL"]}
  ]

  select Memos \
    --match_columns "Terms.index.title * 2" \
    --query 'G*' \
    --output_columns title,tags,_score
  [
    [
      0,
      0.0,
      0.0
    ],
    [
      [
        [
          1
        ],
        [
          [
            "title",
            "ShortText"
          ],
          [
            "tags",
            "ShortText"
          ],
          [
            "_score",
            "Int32"
          ]
        ],
        [
          "Groonga",
          [
            "Groonga"
          ],
          2
        ]
      ]
    ]
  ]

grndb grndb recover で使用したオブジェクトを即時閉じるようにしました。

これにより、メモリ消費を抑制できます。おそらくパフォーマンスは低下していますが、許容可能な低下です。 grndb check はまだ、使用したオブジェクトを即時閉じないので注意してください。

query_parallel_or、query 以下のように、 match_columns に scorer_tf_idf を指定できるようにしました。

  table_create Tags TABLE_HASH_KEY ShortText

  table_create Users TABLE_HASH_KEY ShortText
  column_create Users tags COLUMN_VECTOR Tags

  load --table Users
  [
  {"_key": "Alice",
   "tags": ["beginner", "active"]},
  {"_key": "Bob",
   "tags": ["expert", "passive"]},
  {"_key": "Chris",
   "tags": ["beginner", "passive"]}
  ]

  column_create Tags users COLUMN_INDEX Users tags

  select Users \
    --output_columns _key,_score \
    --sort_keys _id \
    --command_version 3 \
    --filter 'query_parallel_or("scorer_tf_idf(tags)", \
                                "beginner active")'
  {
    "header": {
      "return_code": 0,
      "start_time": 0.0,
      "elapsed_time": 0.0
    },
    "body": {
      "n_hits": 1,
      "columns": [
        {
          "name": "_key",
          "type": "ShortText"
        },
        {
          "name": "_score",
          "type": "Float"
        }
      ],
      "records": [
        [
          "Alice",
          2.098612308502197
        ]
      ]
    }
  }

query_expand 展開後の語の重みを操作できるようにしました。

展開後の語に対して、重みを指定できます。スコアーを増やしたい場合は、 > を使います。スコアーを減らしたい場合は、 < を指定します。数字でスコアーの量を指定できます。負の数も指定できます。

table_create TermExpansions TABLE_NO_KEY
column_create TermExpansions term COLUMN_SCALAR ShortText
column_create TermExpansions expansions COLUMN_VECTOR ShortText

load --table TermExpansions
[
{"term": "Rroonga", "expansions": ["Rroonga", "Ruby Groonga"]}
]

query_expand TermExpansions "Groonga <-0.2Rroonga Mroonga" \
  --term_column term \
  --expanded_term_column expansions
[[0,0.0,0.0],"Groonga <-0.2((Rroonga) OR (Ruby Groonga)) Mroonga"]

[httpd] バンドルしているnginxのバージョンを1.21.1に更新しました。
バンドルしているApache Arrowを5.0.0に更新しました。
Ubuntu Ubuntu 20.10 (Groovy Gorilla)サポートをやめました。
- 2021年7月22日でEOLとなったためです。

修正

query_parallel_or、query query_options とその他のオプションを指定すると、その他のオプションが無視される問題を修正しました。

例えば、以下のケースでは、 "default_operator": "OR" オプションは無視されていました。

plugin_register token_filters/stop_word

table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText

table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters TokenFilterStopWord
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
column_create Terms is_stop_word COLUMN_SCALAR Bool
  
load --table Terms
[
{"_key": "and", "is_stop_word": true}
]
  
load --table Memos
[
{"content": "Hello"},
{"content": "Hello and Good-bye"},
{"content": "and"},
{"content": "Good-bye"}
]
  
select Memos \
  --filter 'query_parallel_or( \
              "content", \
              "Hello and", \
              {"default_operator": "OR", \
               "options": {"TokenFilterStopWord.enable": false}})' \
  --match_escalation_threshold -1 \
  --sort_keys -_score
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        1
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "content",
          "ShortText"
        ]
      ],
      [
        2,
        "Hello and Good-bye"
      ]
    ]
  ]
]

既知の問題

現在Groongaには、ベクターカラムに対してデータを大量に追加、削除、更新した際にデータが破損することがある問題があります。
[ブラウザーベースの管理ツール] 現在Groongaには、レコード一覧の管理モードのチェックボックスにチェックを入れても、非管理モードに入力された検索クエリーが送信されるという問題があります。
*< と *> は、filter条件の右辺に query() を使う時のみ有効です。もし、以下のように指定した場合、 *< と *> は && として機能します。
- 'content @ "Groonga" *< content @ "Mroonga"'
多くのデータを削除し、同じデータを再度loadすることを繰り返すと、Groongaがヒットすべきレコードを返さないことがあります。

さいごに

詳細については、以下のお知らせも参照してください。

お知らせ 11.0.5リリース

それでは、Groongaでガンガン検索してください！

BloGroonga