7.9.2. TokenFilterNFKC100

7.9.2.1. Summary

New in version 8.0.9.

This token filter accepts the same options as NormalizerNFKC100. Use it to normalize text after tokenizing: if you normalize before tokenizing with TokenMecab, the meaning of a token may be lost.

7.9.2.2. Syntax

TokenFilterNFKC100 has optional parameters.

No options:

TokenFilterNFKC100

TokenFilterNFKC100 normalizes text by Unicode NFKC (Normalization Form Compatibility Composition) for Unicode version 10.0.

Specify option:

TokenFilterNFKC100("unify_kana", true)

TokenFilterNFKC100("unify_kana_case", true)

TokenFilterNFKC100("unify_kana_voiced_sound_mark", true)

TokenFilterNFKC100("unify_hyphen", true)

TokenFilterNFKC100("unify_prolonged_sound_mark", true)

TokenFilterNFKC100("unify_hyphen_and_prolonged_sound_mark", true)

TokenFilterNFKC100("unify_middle_dot", true)

TokenFilterNFKC100("unify_katakana_v_sounds", true)

TokenFilterNFKC100("unify_katakana_bu_sound", true)

TokenFilterNFKC100("unify_to_romaji", true)

7.9.2.3. Usage

7.9.2.4. Simple usage

Here is an example of TokenFilterNFKC100. TokenFilterNFKC100 normalizes text by Unicode NFKC (Normalization Form Compatibility Composition) for Unicode version 10.0.

Execution example:

tokenize TokenDelimit "㎡" --token_filters 'TokenFilterNFKC100'
# [
#   [
#     0,
#     1546906509.304568,
#     0.0002825260162353516
#   ],
#   [
#     {
#       "value": "m2",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
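
The same NFKC mapping can be checked outside Groonga. The following is a minimal sketch in Python, assuming only the standard unicodedata module. It uses the Unicode version bundled with the Python interpreter, which is not necessarily 10.0, so it illustrates the normalization form rather than Groonga's implementation. The function name nfkc is hypothetical.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization with Python's bundled Unicode data."""
    return unicodedata.normalize("NFKC", text)


# U+33A1 SQUARE M SQUARED decomposes to "m" followed by "2".
print(nfkc("\u33a1"))  # m2
```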

Here is an example of the unify_kana option.

This option treats characters with the same pronunciation in full-width Hiragana, full-width Katakana and half-width Katakana as the same character, as shown below.

Execution example:

tokenize TokenDelimit "あイウェおヽヾ" --token_filters 'TokenFilterNFKC100("unify_kana", true)'
# [
#   [
#     0,
#     1546906576.590515,
#     0.0003581047058105469
#   ],
#   [
#     {
#       "value": "あいうぇおゝゞ",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
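
The result above is consistent with folding Katakana into Hiragana. The following Python sketch reproduces that mapping under two assumptions that are not taken from Groonga itself: half-width Katakana is first widened with NFKC, and full-width Katakana is then shifted to Hiragana by the fixed code point offset between the two Unicode blocks. The function name unify_kana is hypothetical.

```python
import unicodedata

# In Unicode, each Katakana letter in U+30A1..U+30F6 and the iteration marks
# U+30FD..U+30FE have a Hiragana counterpart exactly 0x60 code points lower.
KATAKANA_TO_HIRAGANA_OFFSET = 0x60


def unify_kana(text: str) -> str:
    """Fold half-width and full-width Katakana into Hiragana (illustrative)."""
    # NFKC turns half-width Katakana into full-width Katakana.
    text = unicodedata.normalize("NFKC", text)
    folded = []
    for ch in text:
        cp = ord(ch)
        if 0x30A1 <= cp <= 0x30F6 or 0x30FD <= cp <= 0x30FE:
            ch = chr(cp - KATAKANA_TO_HIRAGANA_OFFSET)
        folded.append(ch)
    return "".join(folded)


print(unify_kana("あイウェおヽヾ"))  # あいうぇおゝゞ
```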

Here is an example of the unify_kana_case option.

This option treats the large and small versions of the same letter in full-width Hiragana, full-width Katakana and half-width Katakana as the same character, as shown below.

Execution example:

tokenize TokenDelimit "ぁあぃいぅうぇえぉおゃやゅゆょよゎわゕかゖけ" --token_filters 'TokenFilterNFKC100("unify_kana_case", true)'
# [
#   [
#     0,
#     1546906658.116119,
#     0.0003299713134765625
#   ],
#   [
#     {
#       "value": "ああいいううええおおややゆゆよよわわかかけけ",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Execution example:

tokenize TokenDelimit "ァアィイゥウェエォオャヤュユョヨヮワヵカヶケ" --token_filters 'TokenFilterNFKC100("unify_kana_case", true)'
# [
#   [
#     0,
#     1546906730.305962,
#     0.0003023147583007812
#   ],
#   [
#     {
#       "value": "アアイイウウエエオオヤヤユユヨヨワワカカケケ",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
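
The two examples above amount to a lookup from each small letter to its large counterpart. The following Python sketch builds that lookup from the same small/large pairs used in the examples. It covers only those pairs and is an illustration, not Groonga's implementation. The names SMALL_LARGE_PAIRS, SMALL_TO_LARGE and unify_kana_case are hypothetical.

```python
# Each small letter is immediately followed by its large counterpart,
# exactly as in the input strings of the examples above.
SMALL_LARGE_PAIRS = (
    "ぁあぃいぅうぇえぉおゃやゅゆょよゎわゕかゖけ"
    "ァアィイゥウェエォオャヤュユョヨヮワヵカヶケ"
)

# Map the code point of every small letter to the large letter after it.
SMALL_TO_LARGE = {
    ord(SMALL_LARGE_PAIRS[i]): SMALL_LARGE_PAIRS[i + 1]
    for i in range(0, len(SMALL_LARGE_PAIRS), 2)
}


def unify_kana_case(text: str) -> str:
    """Replace small kana with the corresponding large kana (illustrative)."""
    return text.translate(SMALL_TO_LARGE)


print(unify_kana_case("ぁあぃい"))  # ああいい
```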

Here is an example of the unify_kana_voiced_sound_mark option.

This option treats letters with and without a voiced sound mark or semi-voiced sound mark in full-width Hiragana, full-width Katakana and half-width Katakana as the same character, as shown below.

Execution example:

tokenize TokenDelimit "かがきぎくぐけげこごさざしじすずせぜそぞただちぢつづてでとどはばぱひびぴふぶぷへべぺほぼぽ" --token_filters 'TokenFilterNFKC100("unify_kana_voiced_sound_mark", true)'
# [
#   [
#     0,
#     1546906812.423493,
#     0.0003724098205566406
#   ],
#   [
#     {
#       "value": "かかききくくけけここささししすすせせそそたたちちつつててととはははひひひふふふへへへほほほ",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Execution example:

tokenize TokenDelimit "カガキギクグケゲコゴサザシジスズセゼソゾタダチヂツヅテデトドハバパヒビピフブプヘベペホボポ" --token_filters 'TokenFilterNFKC100("unify_kana_voiced_sound_mark", true)'
# [
#   [
#     0,
#     1546906950.51529,
#     0.0003533363342285156
#   ],
#   [
#     {
#       "value": "カカキキククケケココササシシススセセソソタタチチツツテテトトハハハヒヒヒフフフヘヘヘホホホ",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
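
One way to reproduce the results above is to decompose each letter, drop the combining sound marks, and recompose. The following Python sketch does that with the standard unicodedata module. Decomposing with NFD is an assumption made for illustration and says nothing about how Groonga implements the option. The function name unify_kana_voiced_sound_mark is hypothetical.

```python
import unicodedata

# U+3099 COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
# U+309A COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
SOUND_MARKS = {"\u3099", "\u309a"}


def unify_kana_voiced_sound_mark(text: str) -> str:
    """Remove voiced and semi-voiced sound marks from kana (illustrative)."""
    # NFD splits a letter such as "が" into "か" plus U+3099.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch not in SOUND_MARKS)
    return unicodedata.normalize("NFC", stripped)


print(unify_kana_voiced_sound_mark("はばぱ"))  # ははは
```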

Here is an example of the unify_hyphen option. This option normalizes hyphen-like characters to “-” (U+002D HYPHEN-MINUS), as shown below.

Execution example:

tokenize TokenDelimit "-˗֊‐‑‒–⁃⁻₋−" --token_filters 'TokenFilterNFKC100("unify_hyphen", true)'
# [
#   [
#     0,
#     1546907023.849045,
#     0.0003139972686767578
#   ],
#   [
#     {
#       "value": "-----------",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Here is an example of the unify_prolonged_sound_mark option. This option normalizes prolonged sound marks to “ー” (U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK), as shown below.

Execution example:

tokenize TokenDelimit "ー—―─━ー" --token_filters 'TokenFilterNFKC100("unify_prolonged_sound_mark", true)'
# [
#   [
#     0,
#     1546907076.575454,
#     0.0003325939178466797
#   ],
#   [
#     {
#       "value": "ーーーーーー",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Here is an example of the unify_hyphen_and_prolonged_sound_mark option. This option normalizes both hyphen-like characters and prolonged sound marks to “-” (U+002D HYPHEN-MINUS), as shown below.

Execution example:

tokenize TokenDelimit "-˗֊‐‑‒–⁃⁻₋− ﹣- ー—―─━ー" --token_filters 'TokenFilterNFKC100("unify_hyphen_and_prolonged_sound_mark", true)'
# [
#   [
#     0,
#     1546907138.989727,
#     0.0003240108489990234
#   ],
#   [
#     {
#       "value": "-----------",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "--",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "------",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Here is an example of the unify_middle_dot option. This option normalizes middle-dot-like characters to “·” (U+00B7 MIDDLE DOT), as shown below.

Execution example:

tokenize TokenDelimit "·ᐧ•∙⋅⸱・・" --token_filters 'TokenFilterNFKC100("unify_middle_dot", true)'
# [
#   [
#     0,
#     1546907221.227195,
#     0.0003573894500732422
#   ],
#   [
#     {
#       "value": "········",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Here is an example of the unify_katakana_v_sounds option. This option normalizes “ヴァヴィヴヴェヴォ” to “バビブベボ”, as shown below.

Execution example:

tokenize TokenDelimit "ヴァヴィヴヴェヴォヴ" --token_filters 'TokenFilterNFKC100("unify_katakana_v_sounds", true)'
# [
#   [
#     0,
#     1546907295.776949,
#     0.0003447532653808594
#   ],
#   [
#     {
#       "value": "バビブベボブ",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

Here is an example of the unify_katakana_bu_sound option. This option normalizes each of the sounds in “ヴァヴィヴゥヴェヴォ” to “ブ”, as shown below.

Execution example:

tokenize TokenDelimit "ヴァヴィヴヴェヴォヴ" --token_filters 'TokenFilterNFKC100("unify_katakana_bu_sound", true)'
# [
#   [
#     0,
#     1546907361.518968,
#     0.0002958774566650391
#   ],
#   [
#     {
#       "value": "ブブブブブブ",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
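
The difference between unify_katakana_v_sounds and unify_katakana_bu_sound is easiest to see side by side. The following Python sketch reproduces the two results shown above with regular expressions. The patterns and function names are hypothetical and cover only the sequences named in this section, so treat them as an illustration rather than Groonga's implementation.

```python
import re

# unify_katakana_v_sounds: each "ヴ" sound maps to the matching "バ" row letter.
V_SOUNDS = {"ヴァ": "バ", "ヴィ": "ビ", "ヴェ": "ベ", "ヴォ": "ボ", "ヴ": "ブ"}
V_PATTERN = re.compile("ヴ[ァィェォ]?")

# unify_katakana_bu_sound: every "ヴ" sound collapses to "ブ".
BU_PATTERN = re.compile("ヴ[ァィゥェォ]?")


def unify_katakana_v_sounds(text: str) -> str:
    """Map ヴァ, ヴィ, ヴ, ヴェ and ヴォ to バ, ビ, ブ, ベ and ボ (illustrative)."""
    return V_PATTERN.sub(lambda m: V_SOUNDS[m.group(0)], text)


def unify_katakana_bu_sound(text: str) -> str:
    """Map ヴ and ヴ followed by a small vowel to ブ (illustrative)."""
    return BU_PATTERN.sub("ブ", text)


print(unify_katakana_v_sounds("ヴァヴィヴヴェヴォヴ"))  # バビブベボブ
print(unify_katakana_bu_sound("ヴァヴィヴヴェヴォヴ"))  # ブブブブブブ
```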

Here is an example of the unify_to_romaji option. This option normalizes Hiragana and Katakana to romaji, as shown below.

Execution example:

tokenize TokenDelimit "アァイィウゥエェオォ" --token_filters 'TokenFilterNFKC100("unify_to_romaji", true)'
# [
#   [
#     0,
#     1546907415.47742,
#     0.0003619194030761719
#   ],
#   [
#     {
#       "value": "axaixiuxuexeoxo",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
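
In the output above, each small letter is written with an “x” prefix (“ァ” becomes “xa”). The following Python sketch reproduces only the ten mappings visible in this example. The table is deliberately partial, its name is hypothetical, and it makes no claim about how Groonga romanizes other kana.

```python
# Partial table: only the letters that appear in the example above.
ROMAJI_TABLE = {
    "ア": "a", "ァ": "xa",
    "イ": "i", "ィ": "xi",
    "ウ": "u", "ゥ": "xu",
    "エ": "e", "ェ": "xe",
    "オ": "o", "ォ": "xo",
}


def unify_to_romaji(text: str) -> str:
    """Romanize the Katakana vowels in ROMAJI_TABLE; keep other text as-is."""
    return "".join(ROMAJI_TABLE.get(ch, ch) for ch in text)


print(unify_to_romaji("アァイィウゥエェオォ"))  # axaixiuxuexeoxo
```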

7.9.2.5. Advanced usage

You can output the whole input string as Hiragana by combining TokenFilterNFKC100 with the use_reading option of TokenMecab, as shown below.

Execution example:

tokenize 'TokenMecab("use_reading", true)' "私は林檎を食べます。" --token_filters 'TokenFilterNFKC100("unify_kana", true)'
# [
#   [
#     0,
#     1545901819.377275,
#     0.0003833770751953125
#   ],
#   [
#     {
#       "value": "わたし",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "は",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "りんご",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "を",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "たべ",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ます",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "。",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

7.9.2.6. Parameters

7.9.2.6.1. Optional parameter

The following optional parameters are available.

7.9.2.6.1.1. unify_kana

This option treats characters with the same pronunciation in full-width Hiragana, full-width Katakana and half-width Katakana as the same character.

7.9.2.6.1.2. unify_kana_case

This option treats the large and small versions of the same letter in full-width Hiragana, full-width Katakana and half-width Katakana as the same character.

7.9.2.6.1.3. unify_kana_voiced_sound_mark

This option treats letters with and without a voiced sound mark or semi-voiced sound mark in full-width Hiragana, full-width Katakana and half-width Katakana as the same character.

7.9.2.6.1.4. unify_hyphen

This option normalizes hyphen-like characters to “-” (U+002D HYPHEN-MINUS).

The following characters are normalized.

  • “-” (U+002D HYPHEN-MINUS)
  • “֊” (U+058A ARMENIAN HYPHEN)
  • “˗” (U+02D7 MODIFIER LETTER MINUS SIGN)
  • “‐” (U+2010 HYPHEN)
  • “—” (U+2014 EM DASH)
  • “⁃” (U+2043 HYPHEN BULLET)
  • “⁻” (U+207B SUPERSCRIPT MINUS)
  • “₋” (U+208B SUBSCRIPT MINUS)
  • “−” (U+2212 MINUS SIGN)
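
The list above can be read as a translation table from each listed code point to U+002D. The following Python sketch builds such a table from exactly the code points in this list. It is an illustration of the mapping, not Groonga's implementation, and the names HYPHEN_LIKE, HYPHEN_TABLE and unify_hyphen are hypothetical.

```python
# Code points copied from the list above, written as escapes to avoid
# confusing visually similar characters.
HYPHEN_LIKE = (
    "\u002d"  # HYPHEN-MINUS
    "\u058a"  # ARMENIAN HYPHEN
    "\u02d7"  # MODIFIER LETTER MINUS SIGN
    "\u2010"  # HYPHEN
    "\u2014"  # EM DASH
    "\u2043"  # HYPHEN BULLET
    "\u207b"  # SUPERSCRIPT MINUS
    "\u208b"  # SUBSCRIPT MINUS
    "\u2212"  # MINUS SIGN
)

HYPHEN_TABLE = {ord(ch): "-" for ch in HYPHEN_LIKE}


def unify_hyphen(text: str) -> str:
    """Replace every listed hyphen-like character with U+002D (illustrative)."""
    return text.translate(HYPHEN_TABLE)


print(unify_hyphen("A\u2010B\u2212C"))  # A-B-C
```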

7.9.2.6.1.5. unify_prolonged_sound_mark

This option normalizes prolonged sound marks to “ー” (U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK).

The following characters are normalized.

  • “—” (U+2014 EM DASH)
  • “―” (U+2015 HORIZONTAL BAR)
  • “─” (U+2500 BOX DRAWINGS LIGHT HORIZONTAL)
  • “━” (U+2501 BOX DRAWINGS HEAVY HORIZONTAL)
  • “ー” (U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK)
  • “ー” (U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK)

7.9.2.6.1.6. unify_hyphen_and_prolonged_sound_mark

This option normalizes both hyphen-like characters and prolonged sound marks to “-” (U+002D HYPHEN-MINUS).

The following characters are normalized.

  • “-” (U+002D HYPHEN-MINUS)
  • “֊” (U+058A ARMENIAN HYPHEN)
  • “˗” (U+02D7 MODIFIER LETTER MINUS SIGN)
  • “‐” (U+2010 HYPHEN)
  • “—” (U+2014 EM DASH)
  • “⁃” (U+2043 HYPHEN BULLET)
  • “⁻” (U+207B SUPERSCRIPT MINUS)
  • “₋” (U+208B SUBSCRIPT MINUS)
  • “−” (U+2212 MINUS SIGN)
  • “—” (U+2014 EM DASH)
  • “―” (U+2015 HORIZONTAL BAR)
  • “─” (U+2500 BOX DRAWINGS LIGHT HORIZONTAL)
  • “━” (U+2501 BOX DRAWINGS HEAVY HORIZONTAL)
  • “ー” (U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK)
  • “ー” (U+FF70 HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK)

7.9.2.6.1.7. unify_middle_dot

This option normalizes middle-dot-like characters to “·” (U+00B7 MIDDLE DOT).

The following characters are normalized.

  • “·” (U+00B7 MIDDLE DOT)
  • “ᐧ” (U+1427 CANADIAN SYLLABICS FINAL MIDDLE DOT)
  • “•” (U+2022 BULLET)
  • “∙” (U+2219 BULLET OPERATOR)
  • “⋅” (U+22C5 DOT OPERATOR)
  • “⸱” (U+2E31 WORD SEPARATOR MIDDLE DOT)
  • “・” (U+30FB KATAKANA MIDDLE DOT)
  • “・” (U+FF65 HALFWIDTH KATAKANA MIDDLE DOT)

7.9.2.6.1.8. unify_katakana_v_sounds

This option normalizes “ヴァヴィヴヴェヴォ” to “バビブベボ”.

7.9.2.6.1.9. unify_katakana_bu_sound

This option normalizes each of the sounds in “ヴァヴィヴゥヴェヴォ” to “ブ”.

7.9.2.6.1.10. unify_to_romaji

This option normalizes Hiragana and Katakana to romaji.