7.15.15. language_model_knn
Added in version 15.1.8.
Note
This is an experimental feature and is not yet stable.
7.15.15.1. Summary
language_model_knn is a function for semantic search.
Semantic search uses the k-Nearest Neighbors (k-NN) algorithm.
You must use it with TokenLanguageModelKNN.
It can be used as a condition for --filter and as a sort key for --sort_keys.
To enable this function, register the language_model/knn plugin with the following command:
plugin_register language_model/knn
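For intuition about the k-NN search mentioned above, here is a minimal Python sketch. It is not Groonga's implementation (Groonga relies on Faiss). The toy two-dimensional vectors, the cosine similarity measure, and the names cosine_similarity and knn are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn(query_vector, records, k):
    """Return the k records most similar to query_vector, best first.

    records is a list of (label, vector) pairs.
    """
    ranked = sorted(
        records,
        key=lambda record: cosine_similarity(query_vector, record[1]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" invented for this sketch; real embeddings come from a
# language model and have hundreds of dimensions.
records = [
    ("boy", [0.9, 0.1]),
    ("apple", [0.1, 0.9]),
    ("search engine", [0.5, 0.5]),
]
nearest = knn([1.0, 0.0], records, k=2)
print([label for label, _ in nearest])  # ['boy', 'search engine']
```

In a real deployment the query text is first converted to an embedding by the same model as the stored records, and the nearest-neighbor lookup uses an index instead of a full sort.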
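7.15.15.2. Syntax
language_model_knn requires two parameters, column and query. An optional third parameter specifies options such as k:
language_model_knn(column, query)
language_model_knn(column, query, { "k" : k })
column is the search target column. It must have an index that uses TokenLanguageModelKNN as its tokenizer.
query is a search query.
k is the maximum number of records to return. See Parameters for details.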
7.15.15.3. Requirements
You need a Faiss-enabled Groonga build. The official packages enable Faiss.
7.15.15.4. Usage
You need to register the language_model/knn plugin first:
Execution example:
plugin_register language_model/knn
# [[0,1337566253.89858,0.000355720520019531],true]
Here is a schema definition and sample data.
Sample schema:
Execution example:
table_create --name Memos --flags TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create \
--table Memos \
--name content \
--flags COLUMN_SCALAR \
--type ShortText
# [[0,1337566253.89858,0.000355720520019531],true]
Sample data:
Execution example:
load --table Memos
[
{"content": "I am a boy."},
{"content": "This is an apple."},
{"content": "Groonga is a full text search engine."}
]
# [[0,1337566253.89858,0.000355720520019531],3]
Each record needs a column that stores its embedding information. Create that column as follows:
Execution example:
column_create Memos embedding_code COLUMN_SCALAR ShortBinary
# [[0,1337566253.89858,0.000355720520019531],true]
Next, create an index for semantic search.
Specify TokenLanguageModelKNN as the tokenizer.
The tokenizer takes two arguments: model is the model to use, and code_column is the column that stores the generated embedding information.
Execution example:
table_create Centroids TABLE_HASH_KEY ShortBinary \
--default_tokenizer \
'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
"code_column", "embedding_code")'
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Centroids data_content COLUMN_INDEX Memos content
# [[0,1337566253.89858,0.000355720520019531],true]
Semantic search is now available.
When you load data into Memos.content, Groonga generates the embeddings automatically, so you do not need to generate them yourself.
Here is an example of semantic search:
Execution example:
select Memos \
--filter 'language_model_knn(content, "male child")' \
--output_columns content
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 3
# ],
# [
# [
# "content",
# "ShortText"
# ]
# ],
# [
# "I am a boy."
# ],
# [
# "This is an apple."
# ],
# [
# "Groonga is a full text search engine."
# ]
# ]
# ]
# ]
The language_model_knn function can also be used as a sort key.
To do so, specify language_model_knn in --sort_keys.
You usually want results in descending order of similarity, so add a - prefix to the sort key.
Here is an example of filtering by _id and then sorting by similarity:
Execution example:
select Memos \
--filter '_id < 3' \
--sort_keys '-language_model_knn(content, "male child")' \
--output_columns content
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 2
# ],
# [
# [
# "content",
# "ShortText"
# ]
# ],
# [
# "I am a boy."
# ],
# [
# "This is an apple."
# ]
# ]
# ]
# ]
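The same select command can also be sent to Groonga over HTTP. The following Python sketch only builds the request URL. It assumes a Groonga HTTP server is listening on localhost at port 10041 and that the Memos table from this page exists. The helper name build_select_url is hypothetical.

```python
from urllib.parse import urlencode


def build_select_url(base_url, table, **parameters):
    """Build a URL for Groonga's select command.

    urlencode escapes the spaces, quotes, and parentheses in the
    filter and sort key expressions.
    """
    query = urlencode({"table": table, **parameters})
    return f"{base_url}/d/select?{query}"


# Assumed server address; adjust it to your environment.
url = build_select_url(
    "http://localhost:10041",
    "Memos",
    filter="_id < 3",
    sort_keys='-language_model_knn(content, "male child")',
    output_columns="content",
)
print(url)
```

Fetching this URL, for example with urllib.request.urlopen(url), should return the same kind of JSON response as the command-line examples above, provided the server is running and the plugin is registered.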
7.15.15.5. Parameters
There are two required parameters, column and query, and one optional parameter, k.
7.15.15.5.1. column
column is the search target column. It must be a column with an index.
7.15.15.5.2. query
query is a search query.
7.15.15.5.3. k
Added in version 15.2.1.
k specifies the maximum number of records to return. It is optional and is passed as an option in the third argument, for example { "k" : -1 }.
If you omit this option, at most 10 records are returned.
You can specify a negative value. In that case, the limit is the number of matched records + k + 1.
For example, { "k" : -1 } returns all records, which is useful when you want to show every record.
Here is a simple example that uses a negative k value.
Sample data:
Execution example:
load --table Memos
[
{"content": "I am a boy."},
{"content": "This is an apple."},
{"content": "Groonga is a full text search engine."},
{"content": "This is an orange."},
{"content": "This is a banana."},
{"content": "This is a tomato."},
{"content": "This is a carrot."},
{"content": "This is a cucumber."},
{"content": "This is a pepper."},
{"content": "This is a potato."},
{"content": "This is an onion."}
]
# [[0,1337566253.89858,0.000355720520019531],11]
Execution example:
select Memos \
--filter 'language_model_knn(content, "male child", { "k" : -1 })' \
--output_columns content
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 11
# ],
# [
# [
# "content",
# "ShortText"
# ]
# ],
# [
# "I am a boy."
# ],
# [
# "This is a pepper."
# ],
# [
# "This is a cucumber."
# ],
# [
# "This is a carrot."
# ],
# [
# "This is a potato."
# ],
# [
# "This is an apple."
# ],
# [
# "This is a banana."
# ],
# [
# "This is a tomato."
# ],
# [
# "This is an orange."
# ],
# [
# "This is an onion."
# ]
# ]
# ]
# ]
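The negative k rule described above can be sketched in Python as follows. This only illustrates the documented arithmetic (number of matched records + k + 1). The function name effective_k is hypothetical and is not part of Groonga.

```python
def effective_k(k, n_matched):
    """Resolve the k option into the maximum number of records to return.

    Non-negative values are used as is. Negative values are counted from
    the number of matched records, following the rule in this section.
    """
    if k < 0:
        return n_matched + k + 1
    return k


print(effective_k(-1, 11))  # 11: every matched record
print(effective_k(-2, 11))  # 10: all but one
print(effective_k(10, 11))  # 10: same as the default upper limit
```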
7.15.15.6. Return value
This function works as a selector, which means that it can be executed efficiently.