7.15.15. language_model_knn#

Added in version 15.1.8.

Note

This is an experimental feature. It is not stable yet.

7.15.15.1. Summary#

language_model_knn is a function for semantic search.

It performs semantic search with the k-Nearest Neighbors (k-NN) algorithm.

You must use it with TokenLanguageModelKNN.

It can be used as a condition for --filter and as a sort key for --sort_keys.

To enable this function, register the language_model/knn plugin with the following command:

plugin_register language_model/knn
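
As a rough illustration of the k-NN idea mentioned in this summary, the following Python sketch ranks a few vectors by similarity to a query vector and keeps the top k. It is only a conceptual sketch: the brute-force scan, the choice of cosine similarity, the names cosine_similarity and knn, and the 3-dimensional toy vectors are all assumptions made for illustration, not a description of how Groonga or Faiss compute results.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query_vector, vectors, k):
    """Return the ids of the k vectors most similar to query_vector."""
    ranked = sorted(
        vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,  # most similar first
    )
    return [record_id for record_id, _ in ranked[:k]]


# Made-up toy "embeddings" keyed by record id (illustration only).
toy_embeddings = {
    1: [0.9, 0.1, 0.0],
    2: [0.1, 0.9, 0.1],
    3: [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]

nearest = knn(toy_query, toy_embeddings, k=2)
print(nearest)
```

With language_model_knn you do not write any of this yourself. The sketch only shows what "nearest neighbors" means for embedding vectors.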

7.15.15.2. Syntax#

language_model_knn requires two parameters:

language_model_knn(column, query)

column is the search target column. It must have an index whose tokenizer is TokenLanguageModelKNN.

query is a search query.
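
When the query text comes from user input, it has to be embedded safely in the string literal passed to language_model_knn. The following Python sketch builds such an expression. The helper build_knn_filter is hypothetical, and the escaping rule (a backslash before each double quote and backslash, as in ECMAScript-style string literals) is an assumption that should be checked against the script syntax reference.

```python
def build_knn_filter(column, query):
    """Build a language_model_knn(...) expression for --filter.

    Hypothetical helper: escapes backslashes and double quotes so that
    the query stays inside one double-quoted string literal.
    """
    # Escape backslashes first so the escapes added for quotes survive.
    escaped = query.replace("\\", "\\\\").replace('"', '\\"')
    return f'language_model_knn({column}, "{escaped}")'


print(build_knn_filter("content", "male child"))
print(build_knn_filter("content", 'the "best" engine'))
```

The helper only produces the expression itself. Passing it on a command line may need a further layer of quoting.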

7.15.15.3. Requirements#

You need Groonga built with Faiss support. The official packages enable it.

7.15.15.4. Usage#

First, register the language_model/knn plugin:

Execution example:

plugin_register language_model/knn
# [[0,1337566253.89858,0.000355720520019531],true]

Here is a schema definition and sample data.

Sample schema:

Execution example:

table_create --name Memos --flags TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create \
  --table Memos \
  --name content \
  --flags COLUMN_SCALAR \
  --type ShortText
# [[0,1337566253.89858,0.000355720520019531],true]

Sample data:

Execution example:

load --table Memos
[
{"content": "I am a boy."},
{"content": "This is an apple."},
{"content": "Groonga is a full text search engine."}
]
# [[0,1337566253.89858,0.000355720520019531],3]

Each record needs a column that stores its embedding information. Create that column as follows:

Execution example:

column_create Memos embedding_code COLUMN_SCALAR ShortBinary
# [[0,1337566253.89858,0.000355720520019531],true]

Create an index for semantic search.

Specify TokenLanguageModelKNN as the tokenizer. It takes two arguments: model is the model to use, and code_column is the column that stores the generated embedding information.

Execution example:

table_create Centroids TABLE_HASH_KEY ShortBinary \
  --default_tokenizer \
    'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                           "code_column", "embedding_code")'
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Centroids data_content COLUMN_INDEX Memos content
# [[0,1337566253.89858,0.000355720520019531],true]

Semantic search is now available. When you load data into Memos.content, Groonga generates the embeddings automatically, so you do not need to generate them yourself.

Here is an example of semantic search:

Execution example:

select Memos \
  --filter 'language_model_knn(content, "male child")' \
  --output_columns content
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         "I am a boy."
#       ],
#       [
#         "This is an apple."
#       ],
#       [
#         "Groonga is a full text search engine."
#       ]
#     ]
#   ]
# ]
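
The response shown above is a JSON array: a header followed by a body whose first result set contains the hit count, the column definitions, and then one array per record. The following Python sketch pulls the records out of that structure. It assumes the response has been captured as a string without the leading "# " markers, and extract_records is a hypothetical helper name.

```python
import json

# Response copied from the execution example (without the "# " prefixes).
response_text = """
[[0,1337566253.89858,0.000355720520019531],
 [[[3],
   [["content","ShortText"]],
   ["I am a boy."],
   ["This is an apple."],
   ["Groonga is a full text search engine."]]]]
"""


def extract_records(text):
    """Return (n_hits, rows), where rows are dicts keyed by column name."""
    _header, body = json.loads(text)
    result_set = body[0]
    n_hits = result_set[0][0]  # first element holds the hit count
    column_names = [name for name, _type in result_set[1]]
    rows = [dict(zip(column_names, values)) for values in result_set[2:]]
    return n_hits, rows


n_hits, rows = extract_records(response_text)
print(n_hits)
for row in rows:
    print(row["content"])
```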

language_model_knn can also be used as a sort key by specifying it in --sort_keys. You usually want results in descending order of similarity, so add a - prefix to the sort key.

Here is an example of filtering by _id and then sorting by similarity:

Execution example:

select Memos \
  --filter '_id < 3' \
  --sort_keys '-language_model_knn(content, "male child")' \
  --output_columns content
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         "I am a boy."
#       ],
#       [
#         "This is an apple."
#       ]
#     ]
#   ]
# ]

7.15.15.5. Parameters#

There are two required parameters.

7.15.15.5.1. column#

column is the search target column. It must have an index whose tokenizer is TokenLanguageModelKNN.

7.15.15.5.2. query#

query is a search query.

7.15.15.6. Return value#

This function works as a selector, which means that it can be executed efficiently.

7.15.15.7. See also#