7.8.12. TokenLanguageModelKNN

Added in version 15.1.8.

Note

This is an experimental feature and is not yet stable.

7.8.12.1. Summary

TokenLanguageModelKNN is a tokenizer that supports semantic search.

Semantic search uses the k-Nearest Neighbors (k-NN) algorithm.

To enable this tokenizer, register the language_model/knn plugin with the following command:

plugin_register language_model/knn

7.8.12.2. Syntax

TokenLanguageModelKNN requires two parameters, model and code_column:

TokenLanguageModelKNN("model", "hf:///path/to", "code_column", "column_name")

TokenLanguageModelKNN also accepts one optional parameter, n_clusters:

TokenLanguageModelKNN("model", "hf:///path/to", "code_column", "column_name", "n_clusters", N_CLUSTERS)

7.8.12.3. Usage

Note

This tokenizer can’t be run with the tokenize command.

This usage example shows how to set TokenLanguageModelKNN as the default_tokenizer of a table.

First, register the language_model/knn plugin:

Execution example:

plugin_register language_model/knn
# [[0,1337566253.89858,0.000355720520019531],true]

Here are a schema definition and sample data.

Sample schema:

Execution example:

table_create --name Memos --flags TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create \
  --table Memos \
  --name content \
  --flags COLUMN_SCALAR \
  --type ShortText
# [[0,1337566253.89858,0.000355720520019531],true]

Sample data:

Execution example:

load --table Memos
[
{"content": "I am a boy."},
{"content": "This is an apple."},
{"content": "Groonga is a full text search engine."}
]
# [[0,1337566253.89858,0.000355720520019531],3]

Each record needs a column that stores its embedding information. Create that column as follows:

Execution example:

column_create Memos embedding_code COLUMN_SCALAR ShortBinary
# [[0,1337566253.89858,0.000355720520019531],true]

Create an index for semantic search.

Specify TokenLanguageModelKNN as the tokenizer of the index table, and pass the embedding_code column as code_column.

Execution example:

table_create Centroids TABLE_HASH_KEY ShortBinary \
  --default_tokenizer \
    'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                           "code_column", "embedding_code")'
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Centroids data_content COLUMN_INDEX Memos content
# [[0,1337566253.89858,0.000355720520019531],true]

Selecting from the Memos table shows that an embedding has been generated for each record. The generated code is stored in the embedding_code column as binary data.

Execution example:

select Memos
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ],
#         [
#           "embedding_code",
#           "ShortBinary"
#         ]
#       ],
#       [
#         1,
#         "I am a boy.",
#         "tW+IbgJ0dH9MM0nTOzt9ojKuwbfFnkUSCQHEh2X4l4ijXm03SpYrJHcLT+EbYFFUJPfyvim6bT8="
#       ],
#       [
#         2,
#         "This is an apple.",
#         "utwX20mQL4h7XU/4OiMXRJnxJgg6QLP1lT3WehOKMZVlgSzowGRgeqd0GN7Y5E4G6mPzvoXhZD8="
#       ],
#       [
#         3,
#         "Groonga is a full text search engine.",
#         "RLLyELYP2GC1grQNxMSK/+8OXE0W+3qJ+sY7wO4Hzmacv9NNv5vXjYjTsRBnH6b5Dpqrvu0hgT8="
#       ]
#     ]
#   ]
# ]

You don’t operate on embedding_code directly. Groonga uses it internally for semantic search.
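
With the index in place, a semantic search over Memos.content might look like the sketch below. It assumes that the language_model_knn function is available from the language_model/knn plugin and that it takes the indexed column and the query text in this order. Neither detail is stated on this page, so check the function's own reference before relying on it. The query text "male child" is only an illustration.

```
select Memos \
  --filter 'language_model_knn(content, "male child")' \
  --output_columns content
```

No output is shown because the returned records and their order depend on the model and the generated clusters.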

7.8.12.4. Parameters

7.8.12.4.1. Required parameters

7.8.12.4.1.1. model

Specify the language model to use. You can specify a Hugging Face URI as the model.

When an index is created for the first time, Groonga automatically downloads the model and places it in the directory of Groonga’s database. After that, Groonga uses the local copy of the model.

Example URI: hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF refers to https://huggingface.co/groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF.

See also Language model for details about language models.

7.8.12.4.1.2. code_column

Specify the name of the column that stores the embedding code.

Create this column in the table that holds the text to be searched, then specify its name here.

7.8.12.4.2. Optional parameter

7.8.12.4.2.1. n_clusters

Specify the number of clusters to use as indexes. If it is omitted, Groonga sets an appropriate value automatically.

Usually, you don’t need to specify this parameter.
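
If you do want to fix the number of clusters, the following sketch shows a variant of the Centroids definition from the usage example with n_clusters added, following the syntax shown earlier. The value 2 is an arbitrary illustration, not a recommendation. This definition is meant to replace the original Centroids table, not to be created alongside it.

```
table_create Centroids TABLE_HASH_KEY ShortBinary \
  --default_tokenizer \
    'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                           "code_column", "embedding_code", \
                           "n_clusters", 2)'
column_create Centroids data_content COLUMN_INDEX Memos content
```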

7.8.12.5. See also