7.15.15. language_model_knn#
Added in version 15.1.8.
Note
This is an experimental feature. It is not stable yet.
7.15.15.1. Summary#
language_model_knn is a function for semantic search.
Semantic search uses the k-Nearest Neighbors (k-NN) algorithm.
You must use it with TokenLanguageModelKNN.
It can be used as a condition for --filter and as a sort key for --sort_keys.
To enable this function, register the language_model/knn plugin with the following command:
plugin_register language_model/knn
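To make the k-NN idea behind semantic search concrete, here is a minimal Python sketch of a brute-force nearest-neighbor lookup. It is not how Groonga implements language_model_knn. The 3-dimensional vectors are invented for illustration (real embeddings come from the language model and have many more dimensions), and cosine similarity is an assumed similarity measure chosen only for this sketch.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query_vector, records, k):
    """Return the texts of the k records most similar to query_vector, best first."""
    scored = [(cosine_similarity(query_vector, vector), text)
              for text, vector in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical embeddings, invented for illustration only.
records = [
    ("I am a boy.", [0.9, 0.1, 0.0]),
    ("This is an apple.", [0.1, 0.9, 0.1]),
    ("Groonga is a full text search engine.", [0.0, 0.2, 0.9]),
]
# Stands in for the embedding of the query "male child".
query_vector = [0.8, 0.2, 0.1]

nearest = knn(query_vector, records, k=2)
print(nearest)
```

The record whose vector points in nearly the same direction as the query vector ranks first, even though the query and the record share no words.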
7.15.15.2. Syntax#
language_model_knn requires two parameters:
language_model_knn(column, query)
column is the search target column. It must have an index that uses TokenLanguageModelKNN as its tokenizer.
query is the search query, given as a string.
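When the query text comes from user input, the expression has to be assembled carefully. The following Python sketch builds a language_model_knn(column, query) expression and URL-encodes it as select parameters. The helper names are hypothetical. The escaping rule (a backslash before double quotes and backslashes) is an assumption based on script syntax string literals. The idea of appending the parameters to /d/select on a Groonga HTTP server is likewise an assumption about the deployment.

```python
from urllib.parse import urlencode, parse_qs


def quote_script_string(text):
    """Wrap text in double quotes, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_knn_filter(column, query):
    """Build a language_model_knn(column, query) expression (hypothetical helper)."""
    return f"language_model_knn({column}, {quote_script_string(query)})"


def build_select_params(table, column, query, output_columns):
    """Return URL-encoded select parameters (hypothetical helper)."""
    return urlencode({
        "table": table,
        "filter": build_knn_filter(column, query),
        "output_columns": output_columns,
    })


expression = build_knn_filter("content", 'a "male" child')
print(expression)

params = build_select_params("Memos", "content", "male child", "content")
print(params)
```

Building the request through urlencode avoids shell quoting issues, because the expression is never passed through a command line.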
7.15.15.3. Requirements#
You need a Faiss-enabled Groonga. The official packages enable Faiss.
7.15.15.4. Usage#
First, register the language_model/knn plugin:
Execution example:
plugin_register language_model/knn
# [[0,1337566253.89858,0.000355720520019531],true]
Here is a schema definition and sample data.
Sample schema:
Execution example:
table_create --name Memos --flags TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create \
  --table Memos \
  --name content \
  --flags COLUMN_SCALAR \
  --type ShortText
# [[0,1337566253.89858,0.000355720520019531],true]
Sample data:
Execution example:
load --table Memos
[
{"content": "I am a boy."},
{"content": "This is an apple."},
{"content": "Groonga is a full text search engine."}
]
# [[0,1337566253.89858,0.000355720520019531],3]
Each record needs a column that stores its embedding information. Create that column as follows:
Execution example:
column_create Memos embedding_code COLUMN_SCALAR ShortBinary
# [[0,1337566253.89858,0.000355720520019531],true]
Create an index for semantic search.
Specify TokenLanguageModelKNN as the tokenizer.
The tokenizer takes two arguments: model and code_column.
For model, specify the language model to use. For code_column, specify the column that stores the generated embedding information.
Execution example:
table_create Centroids TABLE_HASH_KEY ShortBinary \
  --default_tokenizer \
    'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                           "code_column", "embedding_code")'
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Centroids data_content COLUMN_INDEX Memos content
# [[0,1337566253.89858,0.000355720520019531],true]
Semantic search is now available.
When you load data into Memos.content, Groonga generates the embeddings automatically.
You do not need to generate embeddings yourself.
Here is an example of semantic search:
Execution example:
select Memos \
  --filter 'language_model_knn(content, "male child")' \
  --output_columns content
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         "I am a boy."
#       ],
#       [
#         "This is an apple."
#       ],
#       [
#         "Groonga is a full text search engine."
#       ]
#     ]
#   ]
# ]
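A client usually needs to unpack this nested response. The following Python sketch, using only the standard json module, parses the response shown above and turns each hit into a dictionary. The function name extract_records is hypothetical. The layout it assumes (a header, then a body whose first result set holds the hit count, the column definitions and the rows) is taken from the output above.

```python
import json

# The response from the example above, written compactly.
response_text = """
[
  [0, 1337566253.89858, 0.000355720520019531],
  [
    [
      [3],
      [["content", "ShortText"]],
      ["I am a boy."],
      ["This is an apple."],
      ["Groonga is a full text search engine."]
    ]
  ]
]
"""


def extract_records(text):
    """Return (n_hits, records) from a select response (hypothetical helper)."""
    response = json.loads(text)
    header = response[0]
    return_code = header[0]
    if return_code != 0:
        # A non-zero return code is treated as a failure in this sketch.
        raise RuntimeError(f"Groonga returned code {return_code}")
    result_set = response[1][0]
    n_hits = result_set[0][0]
    column_names = [name for name, _type in result_set[1]]
    rows = result_set[2:]
    return n_hits, [dict(zip(column_names, row)) for row in rows]


n_hits, records = extract_records(response_text)
print(n_hits)
for record in records:
    print(record["content"])
```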
The language_model_knn function can also be used as a sort key.
To do so, specify language_model_knn in --sort_keys.
You usually want results in descending order of similarity, so add a - prefix to the sort key.
Here is an example of filtering by _id and then sorting by similarity:
Execution example:
select Memos \
  --filter '_id < 3' \
  --sort_keys '-language_model_knn(content, "male child")' \
  --output_columns content
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         "I am a boy."
#       ],
#       [
#         "This is an apple."
#       ]
#     ]
#   ]
# ]
7.15.15.5. Parameters#
There are two required parameters.
7.15.15.5.1. column#
column is the search target column. It must have an index that uses TokenLanguageModelKNN as its tokenizer.
7.15.15.5.2. query#
query is the search query, given as a string.
7.15.15.6. Return value#
This function works as a selector, which means that it can be executed efficiently.