7.3.63. table_tokenize

Summary

table_tokenize command tokenizes text with the tokenizer, normalizer and token filters that are set to the specified table.

Syntax

This command takes many parameters.

table and string are required parameters. Others are optional:

table_tokenize table
               string
               [flags=NONE]
               [mode=GET]
               [index_column=null]

Usage

Here is a simple example.

Execution example:

plugin_register token_filters/stop_word
# [[0,0.0,0.0],true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters TokenFilterStopWord
# [[0,0.0,0.0],true]
column_create Terms is_stop_word COLUMN_SCALAR Bool
# [[0,0.0,0.0],true]
load --table Terms
{"_key": "and", "is_stop_word": true}
# [[0,0.0,0.0],1]
table_tokenize Terms "Hello and Good-bye" --mode GET
# [
#  [
#    0,
#    0.0,
#    0.0
#  ],
#  [
#    {
#      "value": "hello",
#      "position": 0
#    },
#    {
#      "value": "good",
#      "position": 2
#    },
#    {
#      "value": "-",
#      "position": 3
#    },
#    {
#      "value": "bye",
#      "position": 4
#    }
#  ]
# ]

The Terms table has the TokenBigram tokenizer, the NormalizerAuto normalizer and the TokenFilterStopWord token filter. The command returns the tokens that are generated by tokenizing "Hello and Good-bye" with the TokenBigram tokenizer. They are normalized by the NormalizerAuto normalizer, and the and token is removed by the TokenFilterStopWord token filter because is_stop_word is true for it.

Parameters

This section describes all parameters. Parameters are categorized.

Required parameters

There are two required parameters, table and string.

table

Specifies the lexicon table. table_tokenize command uses the tokenizer, the normalizer and the token filters that are set to the lexicon table.

string

Specifies any string which you want to tokenize.

See string option in tokenize about details.

Optional parameters

There are optional parameters.

flags

Specifies tokenization customization options. You can specify multiple options separated by "|".

The default value is NONE.

See flags option in tokenize about details.

mode

Specifies a tokenize mode.

The default value is GET.
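For example, the text from the Usage section can be tokenized in ADD mode instead. This is a sketch reusing the Terms table from above; the expected difference follows from TokenFilterStopWord, which removes stop words only in GET mode:

table_tokenize Terms "Hello and Good-bye" --mode ADD
# The "and" token is expected to appear in the output,
# because TokenFilterStopWord removes stop words only in GET mode.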

See mode option in tokenize about details.

index_column

Specifies an index column.

Return value includes estimated_size of the index.
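For example, an index column on the Terms table can be passed by name. This is a sketch building on the Usage section; the Memos table, its content column and the memos_content index column are hypothetical names introduced here, not part of this section:

table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
load --table Memos
[
{"content": "Good-bye and hello"}
]
table_tokenize Terms "Good-bye" --index_column memos_content --mode GET
# Each returned token is expected to include an estimated_size entry
# in addition to value and position.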

The estimated_size is useful for checking estimated frequency of tokens.

Return value

table_tokenize command returns tokenized tokens.

See Return value in tokenize about details.

See also

tokenize