7.3.69. tokenize
7.3.69.1. Summary
The tokenize command tokenizes text with the specified tokenizer.
It is useful for debugging tokenization.
7.3.69.2. Syntax
This command takes many parameters. tokenizer and string are required
parameters. The others are optional:
tokenize tokenizer
         string
         [normalizer=null]
         [flags=NONE]
         [mode=ADD]
         [token_filters=NONE]
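The parameters can also be passed by name, as the later --mode examples
in this section do. A minimal sketch of an invocation that names the
parameters (the values are only illustrative; output omitted):
tokenize --tokenizer TokenBigram --string "Fulltext Search" --normalizer NormalizerAuto --mode GET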
7.3.69.3. Usage
Here is a simple example.
Execution example:
tokenize TokenBigram "Fulltext Search"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " S",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Se",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
It uses only the required parameters. tokenizer is TokenBigram and
string is "Fulltext Search". It returns the tokens generated by
tokenizing "Fulltext Search" with the TokenBigram tokenizer. It doesn't
normalize "Fulltext Search".
7.3.69.4. Parameters
This section describes all parameters. Parameters are categorized.
7.3.69.4.1. Required parameters
There are two required parameters, tokenizer and string.
7.3.69.4.1.1. tokenizer
Specifies the tokenizer name. The tokenize command uses the tokenizer
named by tokenizer.
See Tokenizers about built-in tokenizers.
Here is an example that uses the built-in TokenTrigram tokenizer.
Execution example:
tokenize TokenTrigram "Fulltext Search"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Ful",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ull",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "llt",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lte",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "tex",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ext",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt ",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t S",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " Se",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Sea",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ear",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "arc",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rch",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
If you want to use other tokenizers, you need to register an additional
tokenizer plugin with the register command. For example, you can use a
KyTea based tokenizer by registering tokenizers/kytea.
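Here is a minimal sketch of that flow. It assumes the tokenizers/kytea
plugin is installed and that it provides a tokenizer named TokenKytea
(the actual tokenizer name depends on the plugin); output is omitted:
register tokenizers/kytea
tokenize TokenKytea "これはペンです"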
7.3.69.4.1.2. string
Specifies the string that you want to tokenize.
If you want to include spaces in string, you need to quote string with
single quotes (') or double quotes (").
Here is an example that uses spaces in string.
Execution example:
tokenize TokenBigram "Groonga is a fast fulltext earch engine!"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Gr",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ro",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "oo",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "on",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ng",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ga",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "a ",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " i",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "is",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "s ",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " a",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "a ",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " f",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "fa",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "as",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "st",
# "position": 15,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 16,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " f",
# "position": 17,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "fu",
# "position": 18,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 19,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 20,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 21,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 22,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 23,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 24,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 25,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " e",
# "position": 26,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 27,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 28,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 29,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 30,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h ",
# "position": 31,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " e",
# "position": 32,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "en",
# "position": 33,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ng",
# "position": 34,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "gi",
# "position": 35,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "in",
# "position": 36,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ne",
# "position": 37,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "e!",
# "position": 38,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "!",
# "position": 39,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
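Single quotes work the same way. For example, the following invocation
is also valid (illustrative; output omitted):
tokenize TokenBigram 'Fulltext Search'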
7.3.69.4.2. Optional parameters
There are some optional parameters.
7.3.69.4.2.1. normalizer
Specifies the normalizer name. The tokenize command uses the normalizer
named by normalizer. A normalizer is important for N-gram family
tokenizers such as TokenBigram.
The normalizer detects the character type of each character while
normalizing. N-gram family tokenizers use those character types while
tokenizing.
Here is an example that doesn't use a normalizer.
Execution example:
tokenize TokenBigram "Fulltext Search"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " S",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Se",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
All alphabetic characters are tokenized two characters at a time. For
example, Fu is a token.
Here is an example that uses a normalizer.
Execution example:
tokenize TokenBigram "Fulltext Search" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "fulltext",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "search",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
Continuous alphabetic characters are tokenized as one token. For
example, fulltext is a token.
If you want to tokenize two characters at a time even with a
normalizer, use TokenBigramSplitSymbolAlpha.
Execution example:
tokenize TokenBigramSplitSymbolAlpha "Fulltext Search" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "se",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
All alphabetic characters are tokenized two characters at a time, and
they are normalized to lower case. For example, fu is a token.
7.3.69.4.2.2. flags
Specifies tokenization customization options. You can specify multiple
options separated by "|". For example, NONE|ENABLE_TOKENIZED_DELIMITER.
Here are the available flags.
Flag | Description
---|---
NONE | Just ignored.
ENABLE_TOKENIZED_DELIMITER | Enables the tokenized delimiter. See Tokenizers about tokenized delimiter details.
Here is an example that uses ENABLE_TOKENIZED_DELIMITER
.
Execution example:
tokenize TokenDelimit "Fulltext Seacrch" NormalizerAuto ENABLE_TOKENIZED_DELIMITER
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "full",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "text sea",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "crch",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The TokenDelimit tokenizer is one of the tokenizers that support the
tokenized delimiter. ENABLE_TOKENIZED_DELIMITER enables the tokenized
delimiter.
The tokenized delimiter is a special character that indicates a token
border. It is U+FFFE. No character is assigned to that code point, so
it does not appear in normal text, which makes it a good character for
this purpose. If ENABLE_TOKENIZED_DELIMITER is enabled, the target
string is treated as an already tokenized string and the tokenizer just
splits it at the tokenized delimiters. In the example above, the string
contains U+FFFE after "Full" and after "Sea" (the delimiters are not
visible here), which is why the output tokens are "full", "text sea"
and "crch".
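As a minimal sketch, when the command is sent through the HTTP
interface you can embed the delimiter by URL-encoding U+FFFE (its UTF-8
byte sequence is EF BF BE). The request path below shows only the query
part, with an illustrative string:
/d/tokenize?tokenizer=TokenDelimit&string=Full%EF%BF%BEtext%20Sea%EF%BF%BEcrch&normalizer=NormalizerAuto&flags=ENABLE_TOKENIZED_DELIMITER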
7.3.69.4.2.3. mode
Specifies the tokenize mode. If the mode is ADD, the text is tokenized
by the rule used when adding a document. If the mode is GET, the text
is tokenized by the rule used when searching for a document. If the
mode is omitted, the text is tokenized in ADD mode.
The default mode is ADD.
Here is an example of the ADD mode.
Execution example:
tokenize TokenBigram "Fulltext Search" --mode ADD
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " S",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Se",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "h",
# "position": 14,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The last alphabetic character is tokenized as a one-character token.
Here is an example of the GET mode.
Execution example:
tokenize TokenBigram "Fulltext Search" --mode GET
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Fu",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ul",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lt",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "te",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ex",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "xt",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "t ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": " S",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "Se",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ea",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ar",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "rc",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ch",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The last token has two characters. In GET mode the trailing
one-character token is not generated because it is not needed for
searching.
7.3.69.4.2.4. token_filters
Specifies the token filter names. The tokenize command uses the token
filters named by token_filters.
See Token filters about token filters.
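Here is a minimal sketch of using a token filter. It assumes the
token_filters/stem plugin is available; registering it provides the
TokenFilterStem token filter (output omitted):
register token_filters/stem
tokenize TokenBigram "I developed Groonga" NormalizerAuto NONE ADD TokenFilterStem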
7.3.69.5. Return value
The tokenize command returns the tokenized tokens. Each token has some
attributes besides the token itself. More attributes may be added in
the future:
[HEADER, tokens]
HEADER
See Output format about HEADER.
tokens
tokens is an array of tokens. Each token is an object that has the
following attributes.
Name | Description
---|---
value | Token itself.
position | The N-th token.