7.8.10. `TokenDelimit`#

7.8.10.1. Summary#

TokenDelimit extracts token by splitting one or more space characters (U+0020). For example, Hello World is tokenized to Hello and World.

TokenDelimit is suitable for tag text. You can extract groonga and full-text-search and http as tags from groonga full-text-search http.

7.8.10.2. Syntax#

TokenDelimit has optional parameter.

No options(Extracts token by splitting one or more space characters (U+0020)):

TokenDelimit

Specify delimiter:

TokenDelimit("delimiter",  "delimiter1", delimiter", "delimiter2", ...)

Specify delimiter with regular expression:

TokenDelimit("pattern", pattern)

The delimiter option and a pattern option are not use at the same time.

7.8.10.3. Usage#

7.8.10.4. Simple usage#

Here is an example of TokenDelimit:

Execution example:

tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "groonga",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "full-text-search",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "http",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

TokenDelimit can also specify options. TokenDelimit has delimiter option and pattern option.

delimiter option can split token with a specified character.

For example, Hello,World is tokenized to Hello and World with delimiter option as below.

Execution example:

tokenize 'TokenDelimit("delimiter", ",")' "Hello,World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "Hello",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "World",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

pattern option can split token with a regular expression. You can except needless space by pattern option.

For example, This is a pen. This is an apple is tokenized to This is a pen and This is an apple with pattern option as below.

Normally, when This is a pen. This is an apple. is splitted by ., needless spaces are included at the beginning of “This is an apple.”.

You can except the needless spaces by a pattern option as below example.

Execution example:

tokenize 'TokenDelimit("pattern", "\\\\.\\\\s*")' "This is a pen. This is an apple."
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "This is a pen",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "This is an apple",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

7.8.10.5. Advanced usage#

delimiter option can also specify multiple delimiters.

For example, Hello, World is tokenized to Hello and World. , and `` `` are delimiters in below example.

Execution example:

tokenize 'TokenDelimit("delimiter", ",", "delimiter", " ")' "Hello, World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "Hello",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "World",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

You can extract token in complex conditions by pattern option.

For example, これはペンですか！？リンゴですか？「リンゴです。」 is tokenize to これはペンですか and リンゴですか, 「リンゴです。」 with delimiter option as below.

Execution example:

tokenize 'TokenDelimit("pattern", "([。！？]+(?![）」])|[\\r\\n]+)\\s*")' "これはペンですか！？リンゴですか？「リンゴです。」"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "これはペンですか",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "リンゴですか",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "「リンゴです。」",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

\\s* of the end of above regular expression match 0 or more spaces after a delimiter.

[。！？]+ matches 1 or more 。 or ！, ？. For example, [。！？]+ matches ！？ of これはペンですか！？.

(?![）」]) is negative lookahead. (?![）」]) matches if a character is not matched ） or 」. negative lookahead interprets in combination regular expression of just before.

Therefore it interprets [。！？]+(?![）」]).

[。！？]+(?![）」]) matches if there are not ） or 」 after 。 or ！, ？.

In other words, [。！？]+(?![）」]) matches 。 of これはペンですか。. But [。！？]+(?![）」]) doesn’t match 。 of 「リンゴです。」. Because there is 」 after 。.

[\\r\\n]+ match 1 or more newline character.

In conclusion, ([。！？]+(?![）」])|[\\r\\n]+)\\s* uses 。 and ！ and ？, newline character as delimiter. However, 。 and !, ？ are not delimiters if there is ） or 」 after 。 or ！, ？.

7.8.10.6. Parameters#

7.8.10.6.1. Optional parameter#

There are two optional parameters delimiter and pattern.

7.8.10.6.1.1. `delimiter`#

Split token with a specified one or more characters.

You can use one or more characters for a delimiter.

7.8.10.6.1.2. `pattern`#

Split token with a regular expression.

7.8.10.7. See also#

tokenize

7.8.10. TokenDelimit#

7.8.10.1. Summary#

7.8.10.2. Syntax#

7.8.10.3. Usage#

7.8.10.4. Simple usage#

7.8.10.5. Advanced usage#

7.8.10.6. Parameters#

7.8.10.6.1. Optional parameter#

7.8.10.6.1.1. delimiter#

7.8.10.6.1.2. pattern#

7.8.10.7. See also#

7.8.10. `TokenDelimit`#

7.8.10.6.1.1. `delimiter`#

7.8.10.6.1.2. `pattern`#