7.8.6. TokenBigramIgnoreBlankSplitSymbolAlphaDigit

7.8.6.1. Summary

TokenBigramIgnoreBlankSplitSymbolAlphaDigit is similar to TokenBigram. They differ in the following points:

  • Blank handling
  • Symbol, alphabet and digit handling

7.8.6.2. Syntax

TokenBigramIgnoreBlankSplitSymbolAlphaDigit has no parameters:

TokenBigramIgnoreBlankSplitSymbolAlphaDigit

7.8.6.3. Usage

TokenBigramIgnoreBlankSplitSymbolAlphaDigit ignores whitespace within runs of consecutive symbols and non-ASCII characters.

TokenBigramIgnoreBlankSplitSymbolAlphaDigit tokenizes symbols, alphabetic characters and digits with the bigram tokenize method. This means that all characters are tokenized with the bigram tokenize method.

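You can see the difference between TokenBigram and TokenBigramIgnoreBlankSplitSymbolAlphaDigit with the text Hello 日 本 語 ! ! ! 777, because it contains symbols and non-ASCII characters separated by whitespace, as well as alphabetic characters and digits.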

Here is the result with TokenBigram:

Execution example:

tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "hello"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "日"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "本"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "777"
#     }
#   ]
# ]

Here is the result with TokenBigramIgnoreBlankSplitSymbolAlphaDigit:

Execution example:

tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "he"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "el"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ll"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "lo"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "o日"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "語!"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "!7"
#     },
#     {
#       "position": 11,
#       "force_prefix": false,
#       "value": "77"
#     },
#     {
#       "position": 12,
#       "force_prefix": false,
#       "value": "77"
#     },
#     {
#       "position": 13,
#       "force_prefix": false,
#       "value": "7"
#     }
#   ]
# ]
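
The token values above can be approximated outside Groonga with a short Python sketch. It is only an illustration of the behavior described in this section, not Groonga's implementation: the function name tokenize_sketch is hypothetical, lowercasing stands in for NormalizerAuto (which does more than that in practice), and the force_prefix field is left out.

```python
def tokenize_sketch(text):
    """Hypothetical helper: approximate the token values shown above."""
    # Assumption: lowercasing stands in for NormalizerAuto,
    # which performs further normalization in a real Groonga run.
    normalized = text.lower()
    # Ignore blanks: drop every whitespace character before forming bigrams.
    chars = [c for c in normalized if not c.isspace()]
    # Slide a two-character window over all characters, whatever their type
    # (symbol, alphabetic character, digit or non-ASCII character).
    # The last window holds a single character, which gives the trailing token.
    return [
        {"position": i, "value": "".join(chars[i:i + 2])}
        for i in range(len(chars))
    ]


tokens = tokenize_sketch("Hello 日 本 語 ! ! ! 777")
for token in tokens:
    print(token["position"], token["value"])
```

Because whitespace is removed before the window slides, bigrams such as o日 and !7 span positions where the original text had a blank, which is the main visible difference from the TokenBigram result.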