7.15.15. language_model_vectorize

Added in version 14.1.0.

Note

This is an experimental feature. It is not yet stable.

7.15.15.1. Summary

language_model_vectorize generates a normalized embedding from the given text.

See also Language model for how to prepare a language model.

You can use a Generated column to generate embeddings automatically, as shown in the Usage section below.

To enable this function, register the functions/language_model plugin with the following command:

plugin_register functions/language_model

7.15.15.2. Syntax

language_model_vectorize requires two parameters:

language_model_vectorize(model_name, text)

model_name is the name of the language model to be used. It's derived from the model's file name: the name is the file name with its directory and .gguf extension removed. For example, if ${PREFIX}/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf exists, you can refer to it as mistral-7b-v0.1.Q4_K_M.

text is the input text.
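
For example, with the model file above installed, the following expression returns the embedding of the given text (a minimal sketch; the input text is arbitrary). You normally use such an expression inside a select, for example as a dynamic column value:

language_model_vectorize("mistral-7b-v0.1.Q4_K_M", "Groonga is fast.")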

7.15.15.3. Requirements

You need a llama.cpp-enabled Groonga. The official packages enable it.

You need enough CPU and memory to use this feature. Language model features require more resources than other Groonga features.

You can use a GPU with this feature.

7.15.15.4. Usage

You need to register the functions/language_model plugin first:

Execution example:

plugin_register functions/language_model
# [[0,1337566253.89858,0.000355720520019531],true]

Here is a schema definition and sample data.

Sample schema:

Execution example:

table_create --name Memos --flags TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create \
  --table Memos \
  --name content \
  --flags COLUMN_SCALAR \
  --type ShortText
# [[0,1337566253.89858,0.000355720520019531],true]

Sample data:

Execution example:

load --table Memos
[
{"content": "Groonga is fast and embeddable full text search engine."},
{"content": "PGroonga is a PostgreSQL extension that uses Groonga."},
{"content": "PostgreSQL is a RDBMS."}
]
# [[0,1337566253.89858,0.000355720520019531],3]

Here is a command that creates a Generated column that generates embeddings of Memos.content automatically:

Execution example:

column_create \
  --table Memos \
  --name content_embedding \
  --flags COLUMN_VECTOR \
  --type Float32 \
  --source content \
  --generator 'language_model_vectorize("mistral-7b-v0.1.Q4_K_M", content)'
# [[0,1337566253.89858,0.000355720520019531],true]
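
With this Generated column in place, embeddings are generated automatically for records loaded afterwards. A minimal sketch with a hypothetical extra record:

Execution example:

load --table Memos
[
{"content": "Mroonga is a MySQL storage engine that uses Groonga."}
]
# [[0,1337566253.89858,0.000355720520019531],1]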

You can re-rank matched records with distance_inner_product() instead of distance_cosine() because language_model_vectorize() returns a normalized embedding.
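
For unit-length vectors, the inner product equals the cosine similarity, so the cheaper computation produces the same ranking:

\cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} = \mathbf{a} \cdot \mathbf{b} \quad (\|\mathbf{a}\| = \|\mathbf{b}\| = 1)

The following example uses all records instead of filtered records to show this usage simply: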

Execution example:

select \
  --table Memos \
  --columns[similarity].stage filtered \
  --columns[similarity].flags COLUMN_SCALAR \
  --columns[similarity].type Float32 \
  --columns[similarity].value 'distance_inner_product(content_embedding, language_model_vectorize("mistral-7b-v0.1.Q4_K_M", "high performance FTS"))' \
  --output_columns content,similarity \
  --sort_keys -similarity
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "content",
#           "ShortText"
#         ],
#         [
#           "similarity",
#           "Text"
#         ]
#       ],
#       [
#         "Groonga is fast and embeddable full text search engine.",
#         "0.6581704020500183"
#       ],
#       [
#         "PGroonga is a PostgreSQL extension that uses Groonga.",
#         "0.6540993452072144"
#       ],
#       [
#         "PostgreSQL is a RDBMS.",
#         "0.6449499130249023"
#       ]
#     ]
#   ]
# ]

7.15.15.5. Parameters

There are two required parameters.

7.15.15.5.1. model_name

model_name is the name of the language model to be used. It's derived from the model's file name: the name is the file name with its directory and .gguf extension removed. For example, if ${PREFIX}/share/groonga/language_models/mistral-7b-v0.1.Q4_K_M.gguf exists, you can refer to it as mistral-7b-v0.1.Q4_K_M.

7.15.15.5.2. text

text is the input text.

7.15.15.6. Return value

language_model_vectorize returns a Float32 vector, which is a normalized embedding of the input text.
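
Normalized here means that the returned vector \mathbf{v} has unit L2 norm:

\|\mathbf{v}\|_2 = \sqrt{\sum_i v_i^2} = 1

This is why distance_inner_product() can be used as a cosine similarity against these embeddings, as shown in the Usage section above.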