7.8.10. TokenDelimit
7.8.10.1. Summary
TokenDelimit extracts tokens by splitting text on one or more space characters (U+0020). For example, Hello World is tokenized to Hello and World.
TokenDelimit is suitable for tag text. For example, you can extract groonga, full-text-search, and http as tags from groonga full-text-search http.
7.8.10.2. Syntax
TokenDelimit has optional parameters.
No options (extracts tokens by splitting on one or more space characters (U+0020)):
TokenDelimit
Specify delimiter:
TokenDelimit("delimiter", "delimiter1", "delimiter", "delimiter2", ...)
Specify delimiter with regular expression:
TokenDelimit("pattern", pattern)
The delimiter option and the pattern option cannot be used at the same time.
7.8.10.3. Usage
7.8.10.4. Simple usage
Here is an example of TokenDelimit:
Execution example:
tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "groonga",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "full-text-search",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "http",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
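The default behavior shown above can be approximated outside Groonga with a few lines of Python. This is only an illustrative sketch, not Groonga's implementation: the function name tokenize_on_spaces is hypothetical, and lowercasing is used as a rough stand-in for NormalizerAuto, which performs more normalization than this.

```python
import re


def tokenize_on_spaces(text: str, lowercase: bool = True) -> list[str]:
    """Split text on runs of U+0020 and drop empty tokens."""
    if lowercase:
        # Assumption: lowercasing only approximates NormalizerAuto.
        text = text.lower()
    # " +" treats one or more consecutive spaces as a single delimiter.
    return [token for token in re.split(" +", text) if token]


print(tokenize_on_spaces("Groonga full-text-search HTTP"))
# ['groonga', 'full-text-search', 'http']
```

Hyphens are not delimiters here, so full-text-search stays one token, which is what makes this style of splitting convenient for tag text.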
TokenDelimit also accepts options. It has a delimiter option and a pattern option.
The delimiter option splits tokens at a specified character.
For example, Hello,World is tokenized to Hello and World with the delimiter option as below.
Execution example:
tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The pattern option splits tokens with a regular expression. You can use it to exclude needless spaces.
For example, This is a pen. This is an apple. is tokenized to This is a pen and This is an apple with the pattern option as below.
Normally, when This is a pen. This is an apple. is split on the . character, needless spaces remain at the beginning of “This is an apple.”. The pattern option in the example below excludes those spaces.
Execution example:
tokenize 'TokenDelimit("pattern", "\\\\.\\\\s*")' "This is a pen. This is an apple."
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "This is a pen",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "This is an apple",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
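To see why the trailing \s* matters, the following Python sketch compares splitting on the period alone with splitting on the period plus any whitespace after it. It is an approximation under stated assumptions: split_by_pattern is a hypothetical helper, Python's re module stands in for the regular expression engine Groonga uses, and dropping empty tokens is inferred from the output above rather than from Groonga's source.

```python
import re


def split_by_pattern(text: str, pattern: str) -> list[str]:
    """Cut text at every regex match and keep the non-empty pieces in between."""
    tokens = []
    start = 0
    for match in re.finditer(pattern, text):
        if match.end() == match.start():
            continue  # ignore empty matches, they delimit nothing
        piece = text[start:match.start()]
        if piece:
            tokens.append(piece)
        start = match.end()
    tail = text[start:]
    if tail:
        tokens.append(tail)
    return tokens


text = "This is a pen. This is an apple."
print(split_by_pattern(text, r"\."))     # ['This is a pen', ' This is an apple']
print(split_by_pattern(text, r"\.\s*"))  # ['This is a pen', 'This is an apple']
```

With the period alone, the second piece keeps its leading space. Letting the delimiter consume the whitespace as well removes it.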
7.8.10.5. Advanced usage
The delimiter option can also specify multiple delimiters.
For example, Hello, World is tokenized to Hello and World. In the example below, , and a space (U+0020) are the delimiters.
Execution example:
tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "Hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "World",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
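Splitting on several literal delimiters can be sketched in Python as below. This is an illustration rather than Groonga's algorithm: split_by_delimiters is a hypothetical name, and the choice to drop the empty token between the comma and the space is an assumption based on the output shown above.

```python
import re


def split_by_delimiters(text: str, delimiters: list[str]) -> list[str]:
    """Split text at any of the given literal delimiters, dropping empty tokens."""
    if not delimiters or any(d == "" for d in delimiters):
        raise ValueError("at least one non-empty delimiter is required")
    # Try longer delimiters first so a short one cannot shadow a longer one.
    ordered = sorted(delimiters, key=len, reverse=True)
    # re.escape makes each delimiter match literally, not as a regex.
    splitter = re.compile("|".join(re.escape(d) for d in ordered))
    return [token for token in splitter.split(text) if token]


print(split_by_delimiters("Hello,World", [","]))        # ['Hello', 'World']
print(split_by_delimiters("Hello, World", [",", " "]))  # ['Hello', 'World']
```

Because each delimiter is escaped, characters such as . or | can be used as delimiters without being interpreted as regular expression syntax.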
You can extract tokens under complex conditions with the pattern option.
For example, これはペンですか!?リンゴですか?「リンゴです。」 is tokenized to これはペンですか, リンゴですか, and 「リンゴです。」 with the pattern option as below.
Execution example:
tokenize 'TokenDelimit("pattern", "([。!?]+(?![)」])|[\\r\\n]+)\\s*")' "これはペンですか!?リンゴですか?「リンゴです。」"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "これはペンですか",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "リンゴですか",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "「リンゴです。」",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The \\s* at the end of the regular expression above matches zero or more whitespace characters after a delimiter.
[。!?]+ matches one or more 。, !, or ? characters. For example, [。!?]+ matches the !? in これはペンですか!?.
(?![)」]) is a negative lookahead. It succeeds only if the next character is neither ) nor 」.
A negative lookahead is interpreted in combination with the expression immediately before it, so here it is read as [。!?]+(?![)」]).
[。!?]+(?![)」]) matches 。, !, or ? only when no ) or 」 follows. In other words, it matches the 。 in これはペンですか。 but does not match the 。 in 「リンゴです。」, because 」 follows that 。.
[\\r\\n]+ matches one or more newline characters.
In conclusion, ([。!?]+(?![)」])|[\\r\\n]+)\\s* uses 。, !, ?, and newline characters as delimiters. However, 。, !, and ? are not delimiters when they are followed by ) or 」.
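The effect of the negative lookahead can be checked with a small Python sketch. It assumes Python's re module treats this pattern the same way as the engine Groonga uses, which holds for the constructs used here but is not guaranteed in general. The names SENTENCE_DELIMITER and split_sentences are hypothetical.

```python
import re

# Same pattern as in the tokenize command, written with single backslashes
# because a Python raw string needs no extra escaping.
SENTENCE_DELIMITER = re.compile(r"([。!?]+(?![)」])|[\r\n]+)\s*")


def split_sentences(text: str) -> list[str]:
    """Return the non-empty pieces of text between delimiter matches."""
    tokens = []
    start = 0
    # finditer is used instead of re.split because re.split would also
    # return the text captured by the parenthesized group.
    for match in SENTENCE_DELIMITER.finditer(text):
        if match.start() > start:
            tokens.append(text[start:match.start()])
        start = match.end()
    if start < len(text):
        tokens.append(text[start:])
    return tokens


print(split_sentences("これはペンですか!?リンゴですか?「リンゴです。」"))
# ['これはペンですか', 'リンゴですか', '「リンゴです。」']
```

The final 。 is followed by 」, so the lookahead rejects it and the quoted sentence stays in one token.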
7.8.10.6. Parameters
7.8.10.6.1. Optional parameter
There are two optional parameters, delimiter and pattern.
7.8.10.6.1.1. delimiter
Splits tokens at the specified delimiter. A delimiter can consist of one or more characters.
7.8.10.6.1.2. pattern
Splits tokens with a regular expression.