7.8. Tokenizers

7.8.1. Summary

Groonga has a tokenizer module that tokenizes text. It is used in the following cases:

  • Indexing text

    The tokenizer is used when indexing text.

  • Searching by query

    The tokenizer is used when searching by query.

The tokenizer is an important module for full-text search. You can change the trade-off between precision and recall by changing the tokenizer.

Normally, TokenBigram is a suitable tokenizer. If you don't know much about tokenizers, TokenBigram is the recommended choice.

You can try a tokenizer with the tokenize and table_tokenize commands. Here is an example that tries the TokenBigram tokenizer with tokenize:

Execution example:

tokenize TokenBigram "Hello World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "He"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "el"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ll"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "lo"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "o "
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": " W"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "Wo"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "or"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "rl"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "ld"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "d"
#     }
#   ]
# ]

7.8.2. What is "tokenize"?

"tokenize" is the process that extracts zero or more tokens from a text. There are some "tokenize" methods.

For example, Hello World is tokenized to the following tokens by the bigram tokenize method:

  • He
  • el
  • ll
  • lo
  • o_ (_ means a white-space)
  • _W (_ means a white-space)
  • Wo
  • or
  • rl
  • ld

In the above example, 10 tokens are extracted from the single text Hello World.

As another example, Hello World is tokenized to the following tokens by the white-space-separate tokenize method:

  • Hello
  • World

In the above example, 2 tokens are extracted from the single text Hello World.

A token is used as a search key. You can find indexed documents only by tokens that the tokenize method in use extracts. For example, you can find Hello World by ll with the bigram tokenize method, but you can't find Hello World by ll with the white-space-separate tokenize method, because that method doesn't extract an ll token. It extracts only the Hello and World tokens.

In general, a tokenize method that generates small tokens increases recall but decreases precision. A tokenize method that generates large tokens increases precision but decreases recall.

For example, we can find both Hello World and A or B by or with the bigram tokenize method. Hello World is noise for people who want to search for "logical or". It means that precision is decreased, but recall is increased.

We can find only A or B by or with the white-space-separate tokenize method, because World is tokenized to the single token World. It means that precision is increased for people who want to search for "logical or". But recall is decreased because Hello World, which contains or, isn't found.
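
The trade-off above can be sketched in Python. This is a minimal illustration, not Groonga's implementation: bigram_tokens, whitespace_tokens, build_index and search are hypothetical helper names, and the index is a plain dictionary that maps each token to the documents that contain it.

```python
from collections import defaultdict


def bigram_tokens(text):
    """Return every pair of adjacent characters (pure bigram, illustration only)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def whitespace_tokens(text):
    """Return the white-space-separated words."""
    return text.split()


def build_index(documents, tokenize):
    """Map each token to the set of documents that contain it."""
    index = defaultdict(set)
    for document in documents:
        for token in tokenize(document):
            index[token].add(document)
    return index


def search(index, query):
    """Return the documents indexed under the query token, sorted for display."""
    return sorted(index.get(query, set()))


documents = ["Hello World", "A or B"]
bigram_index = build_index(documents, bigram_tokens)
word_index = build_index(documents, whitespace_tokens)

print(search(bigram_index, "ll"))  # ['Hello World']
print(search(word_index, "ll"))    # []
print(search(bigram_index, "or"))  # ['A or B', 'Hello World']
print(search(word_index, "or"))    # ['A or B']
```

The bigram index returns more documents for or (higher recall, lower precision), while the word index returns only the document that contains or as a word.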

7.8.3. Built-in tokenizers

Here is a list of built-in tokenizers:

  • TokenBigram
  • TokenBigramSplitSymbol
  • TokenBigramSplitSymbolAlpha
  • TokenBigramSplitSymbolAlphaDigit
  • TokenBigramIgnoreBlank
  • TokenBigramIgnoreBlankSplitSymbol
  • TokenBigramIgnoreBlankSplitSymbolAlpha
  • TokenBigramIgnoreBlankSplitSymbolAlphaDigit
  • TokenUnigram
  • TokenTrigram
  • TokenDelimit
  • TokenDelimitNull
  • TokenMecab
  • TokenRegexp

7.8.3.1. TokenBigram

TokenBigram is a bigram-based tokenizer. It's recommended to use this tokenizer in most cases.

The bigram tokenize method tokenizes a text into tokens of two adjacent characters. For example, Hello is tokenized to the following tokens:

  • He
  • el
  • ll
  • lo

The bigram tokenize method is good for recall because you can find all texts by a query that consists of two or more characters.

In general, you can't find all texts by a query that consists of one character, because no one-character token exists. But in Groonga you can, because Groonga finds tokens that start with the query by predictive search. For example, Groonga can find the ll and lo tokens by the l query.
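
The idea of finding tokens that start with the query can be sketched in Python as follows. This is only an illustration under assumptions: predictive_search is a hypothetical name, and a sorted list stands in for the token table. Groonga's own data structure is not shown here.

```python
import bisect


def predictive_search(sorted_tokens, query):
    """Return every token in sorted_tokens that starts with query."""
    # In a sorted list, all tokens sharing a prefix are adjacent,
    # and the first of them is at the insertion point of the prefix.
    start = bisect.bisect_left(sorted_tokens, query)
    matches = []
    for token in sorted_tokens[start:]:
        if not token.startswith(query):
            break
        matches.append(token)
    return matches


# Bigram tokens of "Hello", kept sorted.
lexicon = sorted(["He", "el", "ll", "lo"])

print(predictive_search(lexicon, "l"))   # ['ll', 'lo']
print(predictive_search(lexicon, "ll"))  # ['ll']
```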

The bigram tokenize method isn't good for precision because you can also find texts that include the query inside a word. For example, you can find world by or. This matters more for ASCII-only languages than for non-ASCII languages. TokenBigram has a solution for this problem, described below.

TokenBigram behavior differs depending on whether it is used with one of the Normalizers.

If no normalizer is used, TokenBigram uses the pure bigram tokenize method (all tokens except the last one have two characters):

Execution example:

tokenize TokenBigram "Hello World"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "He"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "el"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ll"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "lo"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "o "
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": " W"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "Wo"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "or"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "rl"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "ld"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "d"
#     }
#   ]
# ]

If a normalizer is used, TokenBigram uses a white-space-separate-like tokenize method for ASCII characters and the bigram tokenize method for non-ASCII characters.

This combined behavior may be confusing, but it's reasonable for most use cases, such as English text (only ASCII characters) and Japanese text (ASCII and non-ASCII characters mixed).

Most languages that consist of only ASCII characters use white-space as the word separator. The white-space-separate tokenize method is suitable for this case.

Many languages that consist of non-ASCII characters, such as Japanese, don't use white-space as the word separator. The bigram tokenize method is suitable for this case.

The mixed tokenize method is suitable for the mixed-language case.

If you want to use the bigram tokenize method for ASCII characters, see the TokenBigramSplitXXX type tokenizers such as TokenBigramSplitSymbolAlpha.

Let's confirm TokenBigram behavior with examples.

TokenBigram uses one or more white-spaces as a token delimiter for ASCII characters:

Execution example:

tokenize TokenBigram "Hello World" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "hello"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "world"
#     }
#   ]
# ]

TokenBigram also uses a character type change as a token delimiter for ASCII characters. The character type is one of the following:

  • Alphabet
  • Digit
  • Symbol (such as (, ) and !)
  • Hiragana
  • Katakana
  • Kanji
  • Others

The following example shows two token delimiters:

  • between 100 (digits) and cents (alphabets)
  • between cents (alphabets) and !!! (symbols)

Execution example:

tokenize TokenBigram "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "100"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "cents"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "!!!"
#     }
#   ]
# ]
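
The two delimiters described above (white-space and character type change) can be sketched in Python for ASCII text. This is a simplified model and not Groonga's code: char_type and split_on_type_change are hypothetical names, str.lower() is only a stand-in for the normalizer, and non-ASCII characters are not handled the way TokenBigram handles them.

```python
from itertools import groupby


def char_type(char):
    """Classify a character as blank, alphabet, digit, symbol or other."""
    if char.isspace():
        return "blank"
    if char.isascii() and char.isalpha():
        return "alphabet"
    if char.isascii() and char.isdigit():
        return "digit"
    if char.isascii():
        return "symbol"
    return "other"


def split_on_type_change(text):
    """Split text where the character type changes and drop white-space runs."""
    runs = groupby(text.lower(), key=char_type)
    return ["".join(run) for kind, run in runs if kind != "blank"]


print(split_on_type_change("Hello World"))  # ['hello', 'world']
print(split_on_type_change("100cents!!!"))  # ['100', 'cents', '!!!']
```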

Here is an example in which TokenBigram uses the bigram tokenize method for non-ASCII characters:

Execution example:

tokenize TokenBigram "日本語の勉強" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語の"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "の勉"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "勉強"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "強"
#     }
#   ]
# ]

7.8.3.2. TokenBigramSplitSymbol

TokenBigramSplitSymbol is similar to TokenBigram. The difference between them is symbol handling. TokenBigramSplitSymbol tokenizes symbols by the bigram tokenize method:

Execution example:

tokenize TokenBigramSplitSymbol "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "100"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "cents"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

7.8.3.3. TokenBigramSplitSymbolAlpha

TokenBigramSplitSymbolAlpha is similar to TokenBigram. The difference between them is symbol and alphabet handling. TokenBigramSplitSymbolAlpha tokenizes symbols and alphabets by the bigram tokenize method:

Execution example:

tokenize TokenBigramSplitSymbolAlpha "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "100"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "ce"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "en"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "nt"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "ts"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "s!"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

7.8.3.4. TokenBigramSplitSymbolAlphaDigit

TokenBigramSplitSymbolAlphaDigit is similar to TokenBigram. The difference between them is symbol, alphabet and digit handling. TokenBigramSplitSymbolAlphaDigit tokenizes symbols, alphabets and digits by the bigram tokenize method. It means that all characters are tokenized by the bigram tokenize method:

Execution example:

tokenize TokenBigramSplitSymbolAlphaDigit "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "10"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "00"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "0c"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "ce"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "en"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "nt"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "ts"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "s!"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]
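
The TokenBigramSplitXXX variants above differ only in which ASCII character types are tokenized by bigram. The following Python sketch models that idea under stated assumptions: tokenize and its split parameter are hypothetical names, the input is already normalized and contains no white-space, and the rules are inferred from the execution examples in this section rather than taken from Groonga's source.

```python
def char_type(char):
    """Classify a character as alphabet, digit, symbol (ASCII only) or other."""
    if char.isascii() and char.isalpha():
        return "alphabet"
    if char.isascii() and char.isdigit():
        return "digit"
    if char.isascii():
        return "symbol"
    return "other"


def tokenize(text, split=()):
    """Bigram the types listed in split; keep other ASCII types as whole runs."""
    grouped = {"alphabet", "digit", "symbol"} - set(split)
    tokens = []
    i = 0
    while i < len(text):
        kind = char_type(text[i])
        if kind in grouped:
            # Consume the whole run of this character type as one token.
            j = i + 1
            while j < len(text) and char_type(text[j]) == kind:
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            # Emit a bigram unless the next character starts a grouped run.
            j = i + 1
            if j < len(text) and char_type(text[j]) not in grouped:
                j += 1
            tokens.append(text[i:j])
            i += 1
    return tokens


text = "100cents!!!"
print(tokenize(text))                                 # like TokenBigram
print(tokenize(text, split=("symbol",)))              # like TokenBigramSplitSymbol
print(tokenize(text, split=("symbol", "alphabet")))   # like TokenBigramSplitSymbolAlpha
print(tokenize(text, split=("symbol", "alphabet", "digit")))
```

For 100cents!!!, the four calls reproduce the token values shown in the execution examples for TokenBigram, TokenBigramSplitSymbol, TokenBigramSplitSymbolAlpha and TokenBigramSplitSymbolAlphaDigit.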

7.8.3.5. TokenBigramIgnoreBlank

TokenBigramIgnoreBlank is similar to TokenBigram. The difference between them is blank handling. TokenBigramIgnoreBlank ignores white-spaces in continuous symbols and non-ASCII characters.

You can see the difference between them with the 日 本 語 ! ! ! text because it has symbols and non-ASCII characters.

Here is a result by TokenBigram:

Execution example:

tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

Here is a result by TokenBigramIgnoreBlank:

Execution example:

tokenize TokenBigramIgnoreBlank "日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!!!"
#     }
#   ]
# ]
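
The blank handling can be sketched in Python as follows. This is a simplified model inferred from the two execution examples above, not Groonga's implementation: tokenize and ignore_blank are hypothetical names, and the sketch covers only ASCII symbols and non-ASCII characters (alphabets and digits are out of scope).

```python
def is_ascii_symbol(char):
    """True for ASCII characters that are neither letters nor digits."""
    return char.isascii() and not char.isalnum()


def tokenize(text, ignore_blank=False):
    # Drop spaces but remember which characters were followed by one.
    chars, blank_after = [], []
    for char in text:
        if char == " ":
            if blank_after:
                blank_after[-1] = True
        else:
            chars.append(char)
            blank_after.append(False)

    def is_break(index):
        """True when a blank after chars[index] ends the current token."""
        return blank_after[index] and not ignore_blank

    tokens = []
    i = 0
    while i < len(chars):
        j = i + 1
        if is_ascii_symbol(chars[i]):
            # Symbols: take the whole run as one token.
            while j < len(chars) and not is_break(j - 1) and is_ascii_symbol(chars[j]):
                j += 1
            tokens.append("".join(chars[i:j]))
            i = j
        else:
            # Other characters: bigram, unless a symbol run starts next.
            if j < len(chars) and not is_break(i) and not is_ascii_symbol(chars[j]):
                j += 1
            tokens.append("".join(chars[i:j]))
            i += 1
    return tokens


print(tokenize("日 本 語 ! ! !"))                     # ['日', '本', '語', '!', '!', '!']
print(tokenize("日 本 語 ! ! !", ignore_blank=True))  # ['日本', '本語', '語', '!!!']
```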

7.8.3.6. TokenBigramIgnoreBlankSplitSymbol

TokenBigramIgnoreBlankSplitSymbol is similar to TokenBigram. The differences between them are the following:

  • Blank handling
  • Symbol handling

TokenBigramIgnoreBlankSplitSymbol ignores white-spaces in continuous symbols and non-ASCII characters.

TokenBigramIgnoreBlankSplitSymbol tokenizes symbols by the bigram tokenize method.

You can see the difference between them with the 日 本 語 ! ! ! text because it has symbols and non-ASCII characters.

Here is a result by TokenBigram:

Execution example:

tokenize TokenBigram "日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

Here is a result by TokenBigramIgnoreBlankSplitSymbol:

Execution example:

tokenize TokenBigramIgnoreBlankSplitSymbol "日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "語!"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

7.8.3.7. TokenBigramIgnoreBlankSplitSymbolAlpha

TokenBigramIgnoreBlankSplitSymbolAlpha is similar to TokenBigram. The differences between them are the following:

  • Blank handling
  • Symbol and alphabet handling

TokenBigramIgnoreBlankSplitSymbolAlpha ignores white-spaces in continuous symbols and non-ASCII characters.

TokenBigramIgnoreBlankSplitSymbolAlpha tokenizes symbols and alphabets by the bigram tokenize method.

You can see the difference between them with the Hello 日 本 語 ! ! ! text because it has alphabets as well as symbols and non-ASCII characters separated by white-spaces.

Here is a result by TokenBigram:

Execution example:

tokenize TokenBigram "Hello 日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "hello"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "日"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "本"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

Here is a result by TokenBigramIgnoreBlankSplitSymbolAlpha:

Execution example:

tokenize TokenBigramIgnoreBlankSplitSymbolAlpha "Hello 日 本 語 ! ! !" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "he"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "el"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ll"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "lo"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "o日"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "語!"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "!"
#     }
#   ]
# ]

7.8.3.8. TokenBigramIgnoreBlankSplitSymbolAlphaDigit

TokenBigramIgnoreBlankSplitSymbolAlphaDigit is similar to TokenBigram. The differences between them are the following:

  • Blank handling
  • Symbol, alphabet and digit handling

TokenBigramIgnoreBlankSplitSymbolAlphaDigit ignores white-spaces in continuous symbols and non-ASCII characters.

TokenBigramIgnoreBlankSplitSymbolAlphaDigit tokenizes symbols, alphabets and digits by the bigram tokenize method. It means that all characters are tokenized by the bigram tokenize method.

You can see the difference between them with the Hello 日 本 語 ! ! ! 777 text because it has alphabets and digits as well as symbols and non-ASCII characters separated by white-spaces.

Here is a result by TokenBigram:

Execution example:

tokenize TokenBigram "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "hello"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "日"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "本"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "語"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "!"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "777"
#     }
#   ]
# ]

Here is a result by TokenBigramIgnoreBlankSplitSymbolAlphaDigit:

Execution example:

tokenize TokenBigramIgnoreBlankSplitSymbolAlphaDigit "Hello 日 本 語 ! ! ! 777" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "he"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "el"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ll"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "lo"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "o日"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "日本"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "本語"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "語!"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "!!"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "!7"
#     },
#     {
#       "position": 11,
#       "force_prefix": false,
#       "value": "77"
#     },
#     {
#       "position": 12,
#       "force_prefix": false,
#       "value": "77"
#     },
#     {
#       "position": 13,
#       "force_prefix": false,
#       "value": "7"
#     }
#   ]
# ]

7.8.3.9. TokenUnigram

TokenUnigram is similar to TokenBigram. The difference between them is the token unit. TokenBigram uses 2 characters per token. TokenUnigram uses 1 character per token.

Execution example:

tokenize TokenUnigram "100cents!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "100"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "cents"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "!!!"
#     }
#   ]
# ]

7.8.3.10. TokenTrigram

TokenTrigram is similar to TokenBigram. The difference between them is the token unit. TokenBigram uses 2 characters per token. TokenTrigram uses 3 characters per token.

Execution example:

tokenize TokenTrigram "10000cents!!!!!" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "10000"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "cents"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "!!!!!"
#     }
#   ]
# ]
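
The execution examples for TokenUnigram and TokenTrigram use ASCII-only text with a normalizer, so each character type run is kept as one token and the token unit itself is not visible. The following Python sketch shows what 1, 2 and 3 characters per token mean for text that is split character by character. It is an illustration only: ngram_tokens is a hypothetical name, and the shorter trailing tokens that appear in the tokenize output elsewhere in this section are omitted.

```python
def ngram_tokens(text, unit):
    """Return every run of unit adjacent characters in text."""
    return [text[i:i + unit] for i in range(len(text) - unit + 1)]


text = "日本語の勉強"
print(ngram_tokens(text, 1))  # unigram: ['日', '本', '語', 'の', '勉', '強']
print(ngram_tokens(text, 2))  # bigram:  ['日本', '本語', '語の', 'の勉', '勉強']
print(ngram_tokens(text, 3))  # trigram: ['日本語', '本語の', '語の勉', 'の勉強']
```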

7.8.3.11. TokenDelimit

TokenDelimit extracts tokens by splitting text at one or more space characters (U+0020). For example, Hello World is tokenized to Hello and World.

TokenDelimit is suitable for tag text. You can extract groonga, full-text-search and http as tags from groonga full-text-search http.

Here is an example of TokenDelimit:

Execution example:

tokenize TokenDelimit "Groonga full-text-search HTTP" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "groonga"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "full-text-search"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "http"
#     }
#   ]
# ]

7.8.3.12. TokenDelimitNull

TokenDelimitNull is similar to TokenDelimit. The difference between them is the separator character. TokenDelimit uses the space character (U+0020) but TokenDelimitNull uses the NUL character (U+0000).

TokenDelimitNull is also suitable for tag text.
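
The difference in separator character can be sketched in Python. This is only an illustration: delimit_tokens is a hypothetical name, normalization is not modeled, and dropping empty tokens is an assumption that models "one or more" separators for the space case.

```python
def delimit_tokens(text, delimiter=" "):
    """Split text at the delimiter character and drop empty tokens."""
    return [token for token in text.split(delimiter) if token]


# Space-separated tags, as with TokenDelimit.
print(delimit_tokens("Groonga full-text-search HTTP"))

# NUL-separated tags, as with TokenDelimitNull. "\0" is a real NUL character here.
print(delimit_tokens("Groonga\0full-text-search\0HTTP", delimiter="\0"))
```

Both calls print ['Groonga', 'full-text-search', 'HTTP']. A NUL separator allows tags that contain spaces.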

Here is an example of TokenDelimitNull:

Execution example:

tokenize TokenDelimitNull "Groonga\u0000full-text-search\u0000HTTP" NormalizerAuto
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "groongau0000full-text-searchu0000http"
#     }
#   ]
# ]

7.8.3.13. TokenMecab

TokenMecab is a tokenizer based on the MeCab part-of-speech and morphological analyzer.

MeCab doesn't depend on Japanese. You can use MeCab for other languages by creating a dictionary for those languages. You can use the NAIST Japanese Dictionary for Japanese.

TokenMecab is good for precision rather than recall. With TokenBigram, the 京都 query finds both the 東京都 and 京都 texts, but 東京都 isn't expected. With TokenMecab, the 京都 query finds only the 京都 text.

If you want to support neologisms, you need to keep updating your MeCab dictionary, which has a maintenance cost. (TokenBigram doesn't require dictionary maintenance because it doesn't use a dictionary.) mecab-ipadic-NEologd, a neologism dictionary for MeCab, may help you.

Here is an example of TokenMecab. 東京都 is tokenized to 東京 and 都. They don't include 京都:

Execution example:

tokenize TokenMecab "東京都"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "東京"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "都"
#     }
#   ]
# ]
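
The precision difference can be checked with a short Python sketch. It does not call MeCab: the mecab_tokens list is copied from the execution example above, and bigram_tokens is a hypothetical helper that stands in for the bigram tokenize method.

```python
def bigram_tokens(text):
    """Return every pair of adjacent characters in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Tokens copied from the TokenMecab execution example, not computed here.
mecab_tokens = ["東京", "都"]

query = "京都"
print(bigram_tokens("東京都"))           # ['東京', '京都']
print(query in bigram_tokens("東京都"))  # True: 東京都 is found by 京都
print(query in mecab_tokens)             # False: 東京都 is not found by 京都
```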

7.8.3.14. TokenRegexp

New in version 5.0.1.

Caution

This tokenizer is experimental. Specification may be changed.

Caution

This tokenizer can be used only with UTF-8. You can't use this tokenizer with EUC-JP, Shift_JIS and so on.

TokenRegexp is a tokenizer that supports regular expression search using an index.

In general, regular expression search is evaluated as sequential search. But the following cases can be evaluated as index search:

  • Literal only case such as hello
  • The beginning of text and literal case such as \A/home/alice
  • The end of text and literal case such as \.txt\z

In most cases, index search is faster than sequential search.

TokenRegexp is based on the bigram tokenize method. When you index text, TokenRegexp adds the beginning-of-text mark (U+FFEF) at the beginning of the text and the end-of-text mark (U+FFF0) at the end of the text:

Execution example:

tokenize TokenRegexp "/home/alice/test.txt" NormalizerAuto --mode ADD
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "position": 0,
#       "force_prefix": false,
#       "value": "￯"
#     },
#     {
#       "position": 1,
#       "force_prefix": false,
#       "value": "/h"
#     },
#     {
#       "position": 2,
#       "force_prefix": false,
#       "value": "ho"
#     },
#     {
#       "position": 3,
#       "force_prefix": false,
#       "value": "om"
#     },
#     {
#       "position": 4,
#       "force_prefix": false,
#       "value": "me"
#     },
#     {
#       "position": 5,
#       "force_prefix": false,
#       "value": "e/"
#     },
#     {
#       "position": 6,
#       "force_prefix": false,
#       "value": "/a"
#     },
#     {
#       "position": 7,
#       "force_prefix": false,
#       "value": "al"
#     },
#     {
#       "position": 8,
#       "force_prefix": false,
#       "value": "li"
#     },
#     {
#       "position": 9,
#       "force_prefix": false,
#       "value": "ic"
#     },
#     {
#       "position": 10,
#       "force_prefix": false,
#       "value": "ce"
#     },
#     {
#       "position": 11,
#       "force_prefix": false,
#       "value": "e/"
#     },
#     {
#       "position": 12,
#       "force_prefix": false,
#       "value": "/t"
#     },
#     {
#       "position": 13,
#       "force_prefix": false,
#       "value": "te"
#     },
#     {
#       "position": 14,
#       "force_prefix": false,
#       "value": "es"
#     },
#     {
#       "position": 15,
#       "force_prefix": false,
#       "value": "st"
#     },
#     {
#       "position": 16,
#       "force_prefix": false,
#       "value": "t."
#     },
#     {
#       "position": 17,
#       "force_prefix": false,
#       "value": ".t"
#     },
#     {
#       "position": 18,
#       "force_prefix": false,
#       "value": "tx"
#     },
#     {
#       "position": 19,
#       "force_prefix": false,
#       "value": "xt"
#     },
#     {
#       "position": 20,
#       "force_prefix": false,
#       "value": "t"
#     },
#     {
#       "position": 21,
#       "force_prefix": false,
#       "value": "￰"
#     }
#   ]
# ]
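
The token sequence above can be reproduced with a short Python sketch. It is inferred from this execution example only and is not Groonga's implementation: regexp_index_tokens is a hypothetical name, normalization is not modeled, and only the indexing case is covered.

```python
BEGIN_MARK = "\uFFEF"  # the beginning-of-text mark
END_MARK = "\uFFF0"    # the end-of-text mark


def regexp_index_tokens(text):
    """Return the marks and bigrams for text, as in the indexing example."""
    tokens = [BEGIN_MARK]
    for i in range(len(text)):
        # The last slice has one character, like the final "t" token.
        tokens.append(text[i:i + 2])
    tokens.append(END_MARK)
    return tokens


tokens = regexp_index_tokens("/home/alice/test.txt")
print(len(tokens))   # 22
print(tokens[1:4])   # ['/h', 'ho', 'om']
print(tokens[-3:-1]) # ['xt', 't']
```

Because the marks are ordinary tokens, a pattern anchored at the beginning or the end of text can be matched against the index in the same way as a literal.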