7.8.6. TokenBigramIgnoreBlankSplitSymbolAlphaDigit
7.8.6.1. Summary
TokenBigramIgnoreBlankSplitSymbolAlphaDigit is similar to TokenBigram. They differ in the following points:
Blank handling
Symbol, alphabet and digit handling
7.8.6.2. Syntax
TokenBigramIgnoreBlankSplitSymbolAlphaDigit has no parameters:
TokenBigramIgnoreBlankSplitSymbolAlphaDigit
7.8.6.3. Usage
TokenBigramIgnoreBlankSplitSymbolAlphaDigit ignores white spaces within consecutive symbols and non-ASCII characters.
TokenBigramIgnoreBlankSplitSymbolAlphaDigit tokenizes symbols, alphabetic characters and digits with the bigram method. In other words, every character is tokenized with the bigram method.
The text Hello 日 本 語 ! ! ! 777 shows the difference between the two tokenizers, because it contains symbols and non-ASCII characters separated by white spaces, as well as alphabetic characters and digits.
Here is the result with TokenBigram:
Execution example:
tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "hello",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "日",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "本",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "語",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "!",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "!",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "!",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "777",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
Here is the result with TokenBigramIgnoreBlankSplitSymbolAlphaDigit:
Execution example:
tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "he",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "el",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ll",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "lo",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "o日",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "日本",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "本語",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "語!",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "!!",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "!!",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "!7",
# "position": 10,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "77",
# "position": 11,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "77",
# "position": 12,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "7",
# "position": 13,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
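The token values and positions in the second result can be approximated with a short Python sketch. This is not Groonga code and does not reflect how the tokenizer is implemented: the function name bigram_ignore_blank is hypothetical, and NormalizerAuto is modeled only as lowercasing, which is an assumption that happens to be enough for this sample text.

```python
def bigram_ignore_blank(text):
    """Approximate the tokenizer's output for simple inputs (illustration only)."""
    # Assumption: NormalizerAuto is reduced to lowercasing for this example.
    normalized = text.lower()
    # Ignore blanks: drop every whitespace character before tokenizing.
    chars = [c for c in normalized if not c.isspace()]
    # Emit overlapping two-character windows, one starting at each character.
    # The window starting at the last character contains that character alone.
    return [
        {"value": "".join(chars[i:i + 2]), "position": i}
        for i in range(len(chars))
    ]


tokens = bigram_ignore_blank("Hello 日 本 語 ! ! ! 777")
values = [token["value"] for token in tokens]
print(values)
```

Because the blanks are dropped before the windows are formed, bigrams such as o日 and !7 span what were separate, space-delimited pieces of the input, and the final token is the single character 7.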