7.15.15. language_model_vectorize#
Added in version 14.1.0.
Note
This is an experimental feature and is not yet stable.
7.15.15.1. Summary#
language_model_vectorize generates a normalized embedding from the given text.
See Language model for how to prepare a language model.
You can use Generated column to automate embedding generation.
To enable this function, register the functions/language_model plugin with the following command:
plugin_register functions/language_model
7.15.15.2. Syntax#
language_model_vectorize requires two parameters:
language_model_vectorize(model_name, text)
model_name is the name of the language model to be used. It's associated with a file name: if ${PREFIX}/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf exists, you can refer to it as mistral-7b-v0.1.Q4_K_M. The name is computed by removing the directory and the .gguf extension.
text is the input text.
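The directory-and-extension stripping described above can be sketched in a few lines of Python. The helper name is hypothetical, for illustration only, and is not part of Groonga:

```python
import os

# Hypothetical helper: derive a model name from a .gguf file path by
# dropping the directory part and the ".gguf" extension, as the manual
# describes. The path below is illustrative, not a required location.
def model_name_from_path(path):
    base = os.path.basename(path)
    if base.endswith(".gguf"):
        base = base[: -len(".gguf")]
    return base

name = model_name_from_path(
    "/usr/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf")
print(name)  # mistral-7b-v0.1.Q4_K_M
```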
7.15.15.3. Requirements#
You need a llama.cpp-enabled Groonga. The official packages enable it.
You need enough CPU/memory resources to use this feature. Language model related features require more resources than other features.
You can use a GPU with this feature.
7.15.15.4. Usage#
You need to register the functions/language_model plugin first:
Execution example:
plugin_register functions/language_model
# [[0,1337566253.89858,0.000355720520019531],true]
Here is a schema definition and sample data.
Sample schema:
Execution example:
table_create --name Memos --flags TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create \
--table Memos \
--name content \
--flags COLUMN_SCALAR \
--type ShortText
# [[0,1337566253.89858,0.000355720520019531],true]
Sample data:
Execution example:
load --table Memos
[
{"content": "Groonga is fast and embeddable full text search engine."},
{"content": "PGroonga is a PostgreSQL extension that uses Groonga."},
{"content": "PostgreSQL is a RDBMS."}
]
# [[0,1337566253.89858,0.000355720520019531],3]
Here is a schema that creates a Generated column that
generates embeddings of Memos.content
automatically:
Execution example:
column_create \
--table Memos \
--name content_embedding \
--flags COLUMN_VECTOR \
--type Float32 \
--source content \
--generator 'language_model_vectorize("mistral-7b-v0.1.Q4_K_M", content)'
# [[0,1337566253.89858,0.000355720520019531],true]
You can re-rank matched records with distance_inner_product() instead of distance_cosine() because language_model_vectorize() returns a normalized embedding; for unit-length vectors, the inner product equals the cosine similarity. The following example uses all records instead of filtered records to keep the usage simple:
Execution example:
select \
--table Memos \
--columns[similarity].stage filtered \
--columns[similarity].flags COLUMN_SCALAR \
--columns[similarity].types Float32 \
--columns[similarity].value 'distance_inner_product(content_embedding, language_model_vectorize("mistral-7b-v0.1.Q4_K_M", "high performance FTS"))' \
--output_columns content,similarity \
--sort_keys -similarity
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 3
# ],
# [
# [
# "content",
# "ShortText"
# ],
# [
# "similarity",
# "Text"
# ]
# ],
# [
# "Groonga is fast and embeddable full text search engine.",
# "0.6581704020500183"
# ],
# [
# "PGroonga is a PostgreSQL extension that uses Groonga.",
# "0.6540993452072144"
# ],
# [
# "PostgreSQL is a RDBMS.",
# "0.6449499130249023"
# ]
# ]
# ]
# ]
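Why the inner product suffices here: cosine similarity divides the inner product by the two vectors' norms, and for normalized (unit-length) vectors those norms are 1, so the two measures coincide. A minimal pure-Python check of this property (the vectors are made up for the example, not actual embeddings):

```python
import math

def normalize(v):
    """Scale v to unit L2 length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: inner product divided by both norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = normalize([1.0, 2.0, 3.0])
b = normalize([2.0, 1.0, 0.5])

# For unit-length vectors the two measures agree (up to float error).
assert abs(dot(a, b) - cosine(a, b)) < 1e-12
```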
7.15.15.5. Parameters#
There are two required parameters.
7.15.15.5.1. model_name#
model_name is the name of the language model to be used. It's associated with a file name: if ${PREFIX}/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf exists, you can refer to it as mistral-7b-v0.1.Q4_K_M. The name is computed by removing the directory and the .gguf extension.
7.15.15.5.2. text#
text is the input text.
7.15.15.6. Return value#
language_model_vectorize returns a Float32 vector which is a normalized embedding.
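"Normalized" means the vector has L2 norm 1. A quick pure-Python check of that property (the vector below is an illustrative stand-in, not actual Groonga output):

```python
import math

# Illustrative stand-in for a Float32 embedding returned by
# language_model_vectorize (values are made up for the example).
embedding = [0.6, 0.8]

# A normalized embedding has unit L2 length.
l2_norm = math.sqrt(sum(x * x for x in embedding))
assert abs(l2_norm - 1.0) < 1e-9
```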