7.8.12. TokenLanguageModelKNN#

Added in version 15.1.8.

Note

This is an experimental feature. It is not yet stable and may change in future versions.

7.8.12.1. Summary#

TokenLanguageModelKNN is a tokenizer that supports semantic search.

Semantic search uses the k-Nearest Neighbors (k-NN) algorithm.

To enable this tokenizer, register the language_model/knn plugin with the following command:

plugin_register language_model/knn

7.8.12.2. Syntax#

TokenLanguageModelKNN requires two parameters, model and code_column:

TokenLanguageModelKNN("model", "hf:///path/to", "code_column", "column_name")

TokenLanguageModelKNN also accepts optional parameters:

TokenLanguageModelKNN("model", "hf:///path/to", \
                      "code_column", "column_name", \
                      "n_clusters", N_CLUSTERS)

TokenLanguageModelKNN("model", "hf:///path/to", \
                      "code_column", "column_name", \
                      "passage_prefix", "passage: ", \
                      "query_prefix", "query: ")

TokenLanguageModelKNN("model", "hf:///path/to", \
                      "code_column", "column_name", \
                      "centroid_column", "centroid_column_name")

Added in version 15.1.9: passage_prefix and query_prefix are added.

Added in version 15.2.1: centroid_column is added.

7.8.12.3. Usage#

Note

This tokenizer can’t be run with tokenize command.

This usage example shows how to set TokenLanguageModelKNN as default_tokenizer.

First, you need to register the language_model/knn plugin:

Execution example:

plugin_register language_model/knn
# [[0,1337566253.89858,0.000355720520019531],true]

Here is a schema definition and sample data.

Sample schema:

Execution example:

table_create --name Memos --flags TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create \
  --table Memos \
  --name content \
  --flags COLUMN_SCALAR \
  --type ShortText
# [[0,1337566253.89858,0.000355720520019531],true]

Sample data:

Execution example:

load --table Memos
[
{"content": "I am a boy."},
{"content": "This is an apple."},
{"content": "Groonga is a full text search engine."}
]
# [[0,1337566253.89858,0.000355720520019531],3]

You need to store embedding information for each record. Here is how to create that column.

Execution example:

column_create Memos embedding_code COLUMN_SCALAR ShortBinary
# [[0,1337566253.89858,0.000355720520019531],true]

Create an index for semantic search.

Specify TokenLanguageModelKNN as the tokenizer.

Execution example:

table_create Centroids TABLE_HASH_KEY ShortBinary \
  --default_tokenizer \
    'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                           "code_column", "embedding_code")'
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Centroids data_content COLUMN_INDEX Memos content
# [[0,1337566253.89858,0.000355720520019531],true]

You can confirm that the embeddings have been generated by selecting the Memos table. The generated embedding code is stored in the embedding_code column.

Execution example:

select Memos
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ],
#         [
#           "embedding_code",
#           "ShortBinary"
#         ]
#       ],
#       [
#         1,
#         "I am a boy.",
#         "tW+IbgJ0dH9MM0nTOzt9ojKuwbfFnkUSCQHEh2X4l4ijXm03SpYrJHcLT+EbYFFUJPfyvim6bT8="
#       ],
#       [
#         2,
#         "This is an apple.",
#         "utwX20mQL4h7XU/4OiMXRJnxJgg6QLP1lT3WehOKMZVlgSzowGRgeqd0GN7Y5E4G6mPzvoXhZD8="
#       ],
#       [
#         3,
#         "Groonga is a full text search engine.",
#         "RLLyELYP2GC1grQNxMSK/+8OXE0W+3qJ+sY7wO4Hzmacv9NNv5vXjYjTsRBnH6b5Dpqrvu0hgT8="
#       ]
#     ]
#   ]
# ]

Users do not operate on embedding_code directly; Groonga uses it internally for semantic search.
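After the index is created, searching is done with the usual select command. The following is only a sketch, not taken from this section: it assumes that the standard --match_columns/--query syntax drives the semantic search through the Centroids index, and the query text "fruit" is an arbitrary illustrative value. See the select command documentation for the exact query syntax.

select Memos \
  --match_columns content \
  --query "fruit"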

7.8.12.4. Parameters#

7.8.12.4.1. Required parameters#

7.8.12.4.1.1. model#

Specify the language model to use. You can specify a Hugging Face URI for model.

When the index is first created, Groonga automatically downloads the model and places it in the directory of Groonga's database. After that, it uses the locally stored model.

Example of URI: hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF for https://huggingface.co/groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF.

See also Language model for details about language models.

7.8.12.4.1.2. code_column#

Specify the column that stores the embeddings.

Create this column in the table that stores the searchable text, and specify its name as code_column.
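For example, the Usage section above creates the embedding column in the Memos table, which stores the searchable content, like this:

column_create Memos embedding_code COLUMN_SCALAR ShortBinary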

7.8.12.4.2. Optional parameters#

7.8.12.4.2.1. passage_prefix#

Added in version 15.1.9.

Some models, such as multilingual-e5, require prefixes for search-target texts and query texts.

passage_prefix specifies the prefix for search-target texts.

For example, you can set the passage: prefix for search-target texts and the query: prefix for query texts as follows:

TokenLanguageModelKNN("model", "hf:///groonga/multilingual-e5-base-Q4_K_M-GGUF", \
                      "code_column", "embedding_code", \
                      "passage_prefix", "passage: ", \
                      "query_prefix", "query: ")

7.8.12.4.2.2. query_prefix#

Added in version 15.1.9.

Some models, such as multilingual-e5, require prefixes for search-target texts and query texts.

query_prefix specifies the prefix for query texts.
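As in the passage_prefix example above, query_prefix is typically specified together with passage_prefix:

TokenLanguageModelKNN("model", "hf:///groonga/multilingual-e5-base-Q4_K_M-GGUF", \
                      "code_column", "embedding_code", \
                      "passage_prefix", "passage: ", \
                      "query_prefix", "query: ")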

7.8.12.4.2.3. centroid_column#

Added in version 15.2.1.

This option is for large embeddings with 1025 or more dimensions (4100 bytes or more as Float32). Groonga table keys must be 4 KiB (4096 bytes) or smaller, so embeddings larger than 4 KiB cannot be stored as keys.

You can use this option to store large embeddings in a column instead of in the table key.

Execution example:

table_create LargeCentroids TABLE_HASH_KEY UInt32 \
  --default_tokenizer \
      'TokenLanguageModelKNN("model", "hf:///groonga/multilingual-e5-base-Q4_K_M-GGUF", \
                             "centroid_column", "centroid", \
                             "code_column", "embedding_code", \
                             "passage_prefix", "passage: ", \
                             "query_prefix", "query: ")'

column_create LargeCentroids centroid COLUMN_VECTOR Float32
column_create LargeCentroids data_content COLUMN_INDEX Memos content

Create the column that stores the embeddings in the index table, and specify its name as centroid_column.

7.8.12.4.2.4. n_clusters#

Specify the number of clusters to use as indexes. If not specified, an appropriate value will be set automatically.

In most cases, you don’t need to set this option explicitly.
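For example, the following sets the number of clusters to 128. The value 128 here is purely illustrative, not a recommendation:

TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                      "code_column", "embedding_code", \
                      "n_clusters", 128)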

7.8.12.4.2.5. n_gpu_layers#

Added in version 15.2.1.

Specify the number of GPU layers to use for the language model. If not specified, Groonga uses the GPU as much as possible.

In most cases, you don’t need to set this option explicitly.

To disable GPU usage, set n_gpu_layers to 0.
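For example, the following disables GPU usage by setting n_gpu_layers to 0. This assumes n_gpu_layers is passed in the same key/value style as the other optional parameters:

TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                      "code_column", "embedding_code", \
                      "n_gpu_layers", 0)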

7.8.12.5. See also#