7.9.2. TokenFilterNFKC#
7.9.2.1. Summary#
Added in version 14.1.3.
This token filter accepts the same options as NormalizerNFKC.
This token filter is used to normalize after tokenizing, because if you normalize before tokenizing with TokenMecab, the meaning of a token may be lost.
7.9.2.2. Syntax#
TokenFilterNFKC has optional parameters.
No options:
TokenFilterNFKC
TokenFilterNFKC normalizes text by Unicode NFKC (Normalization Form Compatibility Composition).
Example of option specification:
TokenFilterNFKC("version", "16.0.0")
TokenFilterNFKC("unify_kana", true)
TokenFilterNFKC("unify_hyphen", true)
TokenFilterNFKC("unify_to_romaji", true)
All other options of NormalizerNFKC are also available.
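Unicode NFKC normalization itself is not specific to Groonga. Python's standard unicodedata module can illustrate what the default normalization does (a minimal sketch of NFKC itself, not of Groonga's tokenizer pipeline):

```python
import unicodedata

# NFKC folds compatibility characters into their canonical forms:
# the circled digit "①" becomes "1", the ligature "ﬁ" becomes "fi",
# and half-width katakana "ｶﾞ" is composed into full-width "ガ".
print(unicodedata.normalize("NFKC", "①"))   # -> 1
print(unicodedata.normalize("NFKC", "ﬁ"))   # -> fi
print(unicodedata.normalize("NFKC", "ｶﾞ"))  # -> ガ
```

Note that Python normalizes with the Unicode tables of its own build, while Groonga lets you pick the table via the version option shown above.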
7.9.2.3. Usage#
7.9.2.3.1. Simple usage#
Normalization is the same as in NormalizerNFKC, so here are a few examples of how to use the options.
Here is an example of TokenFilterNFKC. TokenFilterNFKC normalizes text by Unicode NFKC (Normalization Form Compatibility Composition).
Execution example:
tokenize TokenDelimit "©" --token_filters TokenFilterNFKC
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "©",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
Here is an example of the version option. You can specify the Unicode version with this option.
Execution example:
tokenize TokenDelimit "©" --token_filters 'TokenFilterNFKC("version", "16.0.0")'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "©",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
Here is an example of the unify_kana option.
This option makes characters with the same pronunciation in full-width hiragana, full-width katakana, and half-width katakana be treated as the same character, as below.
Execution example:
tokenize TokenDelimit "あイウェおヽヾ" --token_filters 'TokenFilterNFKC("unify_kana", true)'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "あいうぇおゝゞ",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
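Groonga implements unify_kana internally, but the idea can be sketched in Python: full-width katakana letters sit exactly 0x60 code points above their hiragana counterparts, and NFKC first folds half-width katakana into full-width forms. This is a simplified sketch under that assumption, not Groonga's actual implementation:

```python
import unicodedata

def unify_kana(text: str) -> str:
    """Map katakana to hiragana after NFKC folding (simplified sketch)."""
    # NFKC converts half-width katakana to full-width first.
    text = unicodedata.normalize("NFKC", text)
    out = []
    for ch in text:
        cp = ord(ch)
        # Katakana ァ..ヶ (U+30A1..U+30F6) and the iteration marks ヽヾ
        # (U+30FD..U+30FE) are 0x60 above the matching hiragana code points.
        if 0x30A1 <= cp <= 0x30F6 or 0x30FD <= cp <= 0x30FE:
            cp -= 0x60
        out.append(chr(cp))
    return "".join(out)

print(unify_kana("あイウェおヽヾ"))  # -> あいうぇおゝゞ
```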
Here is an example of the unify_hyphen option. This option normalizes hyphen-like characters to “-” (U+002D HYPHEN-MINUS), as below.
Execution example:
tokenize TokenDelimit "-˗֊‐‑‒–⁃⁻₋−" --token_filters 'TokenFilterNFKC("unify_hyphen", true)'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "-----------",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
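The unify_hyphen behavior can likewise be sketched with a plain translation table that maps the hyphen-like characters from the example above to U+002D (a simplified sketch, not Groonga's implementation, which may cover additional characters):

```python
# Hyphen-like characters from the example above, all mapped to U+002D.
HYPHENS = "˗֊‐‑‒–⁃⁻₋−"
UNIFY_HYPHEN = str.maketrans({ch: "-" for ch in HYPHENS})

def unify_hyphen(text: str) -> str:
    return text.translate(UNIFY_HYPHEN)

print(unify_hyphen("-˗֊‐‑‒–⁃⁻₋−"))  # -> -----------
```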
Here is an example of the unify_to_romaji option. This option normalizes hiragana and katakana to romaji, as below.
Execution example:
tokenize TokenDelimit "アァイィウゥエェオォ" --token_filters 'TokenFilterNFKC("unify_to_romaji", true)'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "axaixiuxuexeoxo",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
7.9.2.3.2. Advanced usage#
You can output the whole input string as hiragana by combining TokenFilterNFKC with the use_reading option of TokenMecab, as below.
Execution example:
tokenize 'TokenMecab("use_reading", true)' "私は林檎を食べます。" --token_filters 'TokenFilterNFKC("unify_kana", true)'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "わたし",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "は",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "りんご",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "を",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "たべ",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ます",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "。",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
7.9.2.4. Parameters#
See Parameters in NormalizerNFKC for details.