7.9.2. TokenFilterNFKC#

7.9.2.1. Summary#

Added in version 14.1.3.

This token filter accepts the same options as NormalizerNFKC. Use it to normalize text after tokenizing, because if you normalize before tokenizing with TokenMecab, the meaning of a token may be lost.

7.9.2.2. Syntax#

TokenFilterNFKC has optional parameters.

No options:

TokenFilterNFKC

TokenFilterNFKC normalizes text using Unicode NFKC (Normalization Form Compatibility Composition).

Example of option specification:

TokenFilterNFKC("version", "16.0.0")

TokenFilterNFKC("unify_kana", true)

TokenFilterNFKC("unify_hyphen", true)

TokenFilterNFKC("unify_to_romaji", true)

The other available options are the same as those of NormalizerNFKC.

7.9.2.3. Usage#

7.9.2.3.1. Simple usage#

Normalization is the same as in NormalizerNFKC, so here are a few examples of how to use the options.

Here is an example of TokenFilterNFKC with no options. TokenFilterNFKC normalizes text using Unicode NFKC (Normalization Form Compatibility Composition).

Execution example:

tokenize TokenDelimit "©" --token_filters TokenFilterNFKC
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "©",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
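
The token comes back unchanged because "©" (U+00A9) has no compatibility decomposition, so NFKC leaves it as it is. As a rough illustration of what standard NFKC does to other characters, the following sketch uses Python's standard unicodedata module. It is not Groonga's implementation, and Groonga's output may differ from it. The sample characters and the helper name nfkc are chosen here for illustration only.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply standard Unicode NFKC normalization (not Groonga's own code)."""
    return unicodedata.normalize("NFKC", text)


# Illustrative samples, written as escapes so the code points are unambiguous:
#   U+00A9 COPYRIGHT SIGN (no compatibility decomposition)
#   U+FF71 HALFWIDTH KATAKANA LETTER A
#   U+2460 CIRCLED DIGIT ONE
#   U+FB01 LATIN SMALL LIGATURE FI
samples = ["\u00a9", "\uff71", "\u2460", "\ufb01"]

for s in samples:
    print(f"U+{ord(s):04X} {s!r} -> {nfkc(s)!r}")
```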

Here is an example of the version option. This option specifies the Unicode version to use for normalization.

Execution example:

tokenize TokenDelimit "©" --token_filters 'TokenFilterNFKC("version", "16.0.0")'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "©",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Here is an example of the unify_kana option.

With this option, characters that have the same pronunciation in full-width Hiragana, full-width Katakana, and half-width Katakana are treated as the same character, as shown below.

Execution example:

tokenize TokenDelimit "あイウェおヽヾ" --token_filters 'TokenFilterNFKC("unify_kana", true)'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "あいうぇおゝゞ",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
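
The folding shown above can be approximated outside Groonga. The following is a minimal sketch in Python, assuming that it is enough to run NFKC (which turns half-width Katakana into full-width Katakana) and then shift Katakana code points down to the Hiragana block. The function name unify_kana and the code point ranges are assumptions made for this illustration, not Groonga's actual mapping table.

```python
import unicodedata

# Distance between a Katakana code point and its Hiragana counterpart.
KANA_OFFSET = 0x60


def unify_kana(text: str) -> str:
    """Fold half-width and full-width Katakana into Hiragana (illustrative only)."""
    # NFKC converts half-width Katakana to full-width Katakana.
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        cp = ord(ch)
        # U+30A1..U+30F6 are Katakana letters with a Hiragana counterpart;
        # U+30FD and U+30FE are the Katakana iteration marks.
        if 0x30A1 <= cp <= 0x30F6 or cp in (0x30FD, 0x30FE):
            ch = chr(cp - KANA_OFFSET)
        out.append(ch)
    return "".join(out)


# Same input as the execution example: Hiragana, full-width Katakana,
# half-width Katakana (U+FF73), and the two Katakana iteration marks.
print(unify_kana("\u3042\u30a4\uff73\u30a7\u304a\u30fd\u30fe"))
```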

Here is an example of the unify_hyphen option. This option normalizes hyphen-like characters to “-” (U+002D HYPHEN-MINUS), as shown below.

Execution example:

tokenize TokenDelimit "-˗֊‐‑‒–⁃⁻₋−" --token_filters 'TokenFilterNFKC("unify_hyphen", true)'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "-----------",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
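
To make the mapping concrete, here is a small Python sketch that replaces each hyphen-like character from the example above with U+002D. The character list is copied from that example by code point. Whether Groonga treats exactly this set as hyphens is an assumption, and the names HYPHEN_LIKE and unify_hyphen are hypothetical.

```python
# Hyphen-like characters taken from the execution example, by code point.
# Assumption: this list mirrors the example only, not Groonga's full table.
HYPHEN_LIKE = (
    "\u002d\u02d7\u058a\u2010\u2011\u2012"
    "\u2013\u2043\u207b\u208b\u2212"
)

# Translation table that maps every listed character to U+002D HYPHEN-MINUS.
HYPHEN_TABLE = str.maketrans({ch: "-" for ch in HYPHEN_LIKE})


def unify_hyphen(text: str) -> str:
    """Replace hyphen-like characters with U+002D (illustrative only)."""
    return text.translate(HYPHEN_TABLE)


# U+2013 EN DASH between two years becomes a plain hyphen-minus.
print(unify_hyphen("2024\u20132025"))
```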

Here is an example of the unify_to_romaji option. This option normalizes Hiragana and Katakana to romaji, as shown below.

Execution example:

tokenize TokenDelimit "アァイィウゥエェオォ" --token_filters 'TokenFilterNFKC("unify_to_romaji", true)'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "axaixiuxuexeoxo",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
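
In the output above, small kana appear with an "x" prefix (for example "ァ" becomes "xa"). The following Python sketch reproduces only that example with a lookup table. The table covers just the ten characters shown there, and the names ROMAJI and to_romaji are hypothetical. Groonga's real conversion handles far more characters than this.

```python
# Lookup table limited to the ten Katakana characters in the example above.
# Assumption: the values are read off the example output, nothing more.
ROMAJI = {
    "\u30a2": "a", "\u30a1": "xa",  # A, small A
    "\u30a4": "i", "\u30a3": "xi",  # I, small I
    "\u30a6": "u", "\u30a5": "xu",  # U, small U
    "\u30a8": "e", "\u30a7": "xe",  # E, small E
    "\u30aa": "o", "\u30a9": "xo",  # O, small O
}


def to_romaji(text: str) -> str:
    """Convert known kana to romaji and pass other characters through."""
    return "".join(ROMAJI.get(ch, ch) for ch in text)


# Same input as the execution example.
print(to_romaji("\u30a2\u30a1\u30a4\u30a3\u30a6\u30a5\u30a8\u30a7\u30aa\u30a9"))
```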

7.9.2.3.2. Advanced usage#

By combining TokenFilterNFKC with the use_reading option of TokenMecab, you can output the entire input string as hiragana, as shown below.

Execution example:

tokenize 'TokenMecab("use_reading", true)' "私は林檎を食べます。" --token_filters 'TokenFilterNFKC("unify_kana", true)'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     {
#       "value": "わたし",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "は",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "りんご",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "を",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "たべ",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ます",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "。",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

7.9.2.4. Parameters#

See Parameters in NormalizerNFKC for details.