7.8.12. TokenMecab

7.8.12.1. Summary

TokenMecab is a tokenizer based on the MeCab part-of-speech and morphological analyzer.

MeCab itself doesn't depend on Japanese. You can use MeCab for other languages by creating a dictionary for each language. For Japanese, you can use NAIST Japanese Dictionary.

You need to install an additional package to use TokenMecab. For details on how to install the additional package, see how to install for each OS.

TokenMecab favors precision over recall. With TokenBigram, a 京都 query finds both 東京都 and 京都 texts, but the 東京都 match isn't expected. With TokenMecab, a 京都 query finds only the 京都 text.
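The difference can be sketched in a few lines of Python. This is only an illustration, not how Groonga is implemented: bigram_tokens is a simplified stand-in for TokenBigram (plain overlapping two-character windows), the word tokens for 東京都 are copied from the TokenMecab output shown later in this section, the word token for 京都 is assumed, and matches is a hypothetical helper that ignores token positions.

```python
def bigram_tokens(text):
    """Return overlapping two-character windows (a simplification of TokenBigram)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# Word tokens per text. 東京都 follows the TokenMecab example in this section;
# the entry for 京都 is an assumption made for this illustration.
WORD_TOKENS = {"東京都": ["東京", "都"], "京都": ["京都"]}


def word_tokens(text):
    """Look up pre-computed word tokens (stands in for a dictionary-based tokenizer)."""
    return WORD_TOKENS[text]


def matches(query, text, tokenize):
    """A text matches when it contains every token of the query (positions ignored)."""
    text_tokens = tokenize(text)
    return all(token in text_tokens for token in tokenize(query))


for text in ["東京都", "京都"]:
    print(text, matches("京都", text, bigram_tokens), matches("京都", text, word_tokens))
```

With the bigram sketch both texts match the 京都 query, while with word tokens only the 京都 text matches.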

If you want to support neologisms, you need to keep updating your MeCab dictionary, which has a maintenance cost. (TokenBigram doesn't require dictionary maintenance because TokenBigram doesn't use a dictionary.) mecab-ipadic-NEologd, a neologism dictionary for MeCab, may help you.

7.8.12.2. Syntax

TokenMecab has optional parameters.

No options:

TokenMecab

Specify option:

TokenMecab("include_class", true)

TokenMecab("target_class", "a_part_of_speech")

TokenMecab("include_reading", true)

TokenMecab("include_form", true)

TokenMecab("use_reading", true)

Specify multiple options:

TokenMecab("target_class", "名詞", "include_reading", true)

TokenMecab accepts multiple options as shown above. You can also combine options in ways other than the one in this example.
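If you generate tokenizer specifications from a program, a small helper can keep the quoting consistent. The following is a minimal sketch; format_tokenizer is a hypothetical helper and not part of Groonga. It takes a list of pairs so that the same option name can appear more than once, and it does not escape quotes inside values.

```python
def format_tokenizer(name, options):
    """Build a tokenizer specification string such as TokenMecab("include_class", true)."""
    if not options:
        # No options: the bare tokenizer name is used.
        return name
    args = []
    for key, value in options:
        args.append(f'"{key}"')
        if isinstance(value, bool):
            # Boolean option values are written as bare true / false.
            args.append("true" if value else "false")
        else:
            # Other values are written as double-quoted strings (no escaping here).
            args.append(f'"{value}"')
    return name + "(" + ", ".join(args) + ")"


print(format_tokenizer("TokenMecab", []))
print(format_tokenizer("TokenMecab", [("target_class", "名詞"), ("include_reading", True)]))
```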

7.8.12.3. Usage

7.8.12.4. Simple usage

Here is an example of TokenMecab. 東京都 is tokenized to 東京 and 都. The tokens don't include 京都:

Execution example:

tokenize TokenMecab "東京都"
# [
#   [
#     0,
#     1545812631.661493,
#     0.0002415180206298828
#   ],
#   [
#     {
#       "value": "東京",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "都",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

You can also specify options for TokenMecab. TokenMecab has the target_class, include_class, include_reading, include_form and use_reading options.

The target_class option outputs only tokens of the specified part-of-speech. For example, you can extract only nouns as below.

Execution example:

tokenize 'TokenMecab("target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
#   [
#     0,
#     1545810238.195525,
#     0.0003066062927246094
#   ],
#   [
#     {
#       "value": "彼",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "名前",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "山田",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "さん",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "はず",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

The include_class option outputs the class and subclasses in MeCab’s metadata as below.

Execution example:

tokenize 'TokenMecab("include_class", true)' '彼の名前は山田さんのはずです。'
# [
#   [
#     0,
#     1545892715.887472,
#     0.03757452964782715
#   ],
#   [
#     {
#       "value": "彼",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "名詞",
#         "subclass0": "代名詞",
#         "subclass1": "一般"
#       }
#     },
#     {
#       "value": "の",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "助詞",
#         "subclass0": "連体化"
#       }
#     },
#     {
#       "value": "名前",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "名詞",
#         "subclass0": "一般"
#       }
#     },
#     {
#       "value": "は",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "助詞",
#         "subclass0": "係助詞"
#       }
#     },
#     {
#       "value": "山田",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "名詞",
#         "subclass0": "固有名詞",
#         "subclass1": "人名",
#         "subclass2": "姓"
#       }
#     },
#     {
#       "value": "さん",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "名詞",
#         "subclass0": "接尾",
#         "subclass1": "人名"
#       }
#     },
#     {
#       "value": "の",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "助詞",
#         "subclass0": "連体化"
#       }
#     },
#     {
#       "value": "はず",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "名詞",
#         "subclass0": "非自立",
#         "subclass1": "一般"
#       }
#     },
#     {
#       "value": "です",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "助動詞"
#       }
#     },
#     {
#       "value": "。",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "class": "記号",
#         "subclass0": "句点"
#       }
#     }
#   ]
# ]

You can exclude needless tokens by passing the class and subclasses that this option outputs to target_class.
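To see which values could be passed to target_class, it can help to join the class and subclasses of each token into a single path. The sketch below assumes the metadata keys shown in the output above (class, subclass0, subclass1, ...) and the slash-separated form used with target_class in the advanced usage section, such as 名詞/接尾/人名. class_path is a hypothetical helper, and the sample tokens are copied from the output above.

```python
def class_path(metadata):
    """Join "class" and consecutive "subclassN" entries into a slash-separated path."""
    parts = [metadata["class"]]
    index = 0
    while f"subclass{index}" in metadata:
        parts.append(metadata[f"subclass{index}"])
        index += 1
    return "/".join(parts)


# A few tokens taken from the include_class output above.
tokens = [
    {"value": "山田", "metadata": {"class": "名詞", "subclass0": "固有名詞",
                                  "subclass1": "人名", "subclass2": "姓"}},
    {"value": "さん", "metadata": {"class": "名詞", "subclass0": "接尾",
                                  "subclass1": "人名"}},
    {"value": "はず", "metadata": {"class": "名詞", "subclass0": "非自立",
                                  "subclass1": "一般"}},
    {"value": "です", "metadata": {"class": "助動詞"}},
]

for token in tokens:
    print(token["value"], class_path(token["metadata"]))
```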

The include_reading option outputs the reading in MeCab’s metadata as below.

Execution example:

tokenize 'TokenMecab("include_reading", true)' '彼の名前は山田さんのはずです。'
# [
#   [
#     0,
#     1545892913.226588,
#     0.0003414154052734375
#   ],
#   [
#     {
#       "value": "彼",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "カレ"
#       }
#     },
#     {
#       "value": "の",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "ノ"
#       }
#     },
#     {
#       "value": "名前",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "ナマエ"
#       }
#     },
#     {
#       "value": "は",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "ハ"
#       }
#     },
#     {
#       "value": "山田",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "ヤマダ"
#       }
#     },
#     {
#       "value": "さん",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "サン"
#       }
#     },
#     {
#       "value": "の",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "ノ"
#       }
#     },
#     {
#       "value": "はず",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "ハズ"
#       }
#     },
#     {
#       "value": "です",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "デス"
#       }
#     },
#     {
#       "value": "。",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "。"
#       }
#     }
#   ]
# ]

You can get the reading of each token with this option.

The include_form option outputs inflected_type, inflected_form and base_form in MeCab’s metadata as below.

Execution example:

tokenize 'TokenMecab("include_form", true)' '彼の名前は山田さんのはずです。'
# [
#   [
#     0,
#     1545892987.209944,
#     0.0004286766052246094
#   ],
#   [
#     {
#       "value": "彼",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "base_form": "彼"
#       }
#     },
#     {
#       "value": "の",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "base_form": "の"
#       }
#     },
#     {
#       "value": "名前",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "base_form": "名前"
#       }
#     },
#     {
#       "value": "は",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "base_form": "は"
#       }
#     },
#     {
#       "value": "山田",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "base_form": "山田"
#       }
#     },
#     {
#       "value": "さん",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "base_form": "さん"
#       }
#     },
#     {
#       "value": "の",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "base_form": "の"
#       }
#     },
#     {
#       "value": "はず",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "base_form": "はず"
#       }
#     },
#     {
#       "value": "です",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "inflected_type": "特殊・デス",
#         "inflected_form": "基本形",
#         "base_form": "です"
#       }
#     },
#     {
#       "value": "。",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "base_form": "。"
#       }
#     }
#   ]
# ]

The use_reading option supports search by kana. Because each token is replaced with its kana reading, this option is useful for handling orthographical variants.

Execution example:

tokenize 'TokenMecab("use_reading", true)' '彼の名前は山田さんのはずです。'
# [
#   [
#     0,
#     1545893087.556662,
#     0.0003693103790283203
#   ],
#   [
#     {
#       "value": "カレ",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ノ",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ナマエ",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ハ",
#       "position": 3,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ヤマダ",
#       "position": 4,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "サン",
#       "position": 5,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ノ",
#       "position": 6,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "ハズ",
#       "position": 7,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "デス",
#       "position": 8,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "。",
#       "position": 9,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]

7.8.12.5. Advanced usage

The target_class option can also specify subclasses, and can add or exclude a specific part-of-speech by using + or -. So, you can also extract nouns while excluding non-independent words and person-name suffixes as below.

In this way you can exclude noise tokens from a search.

Execution example:

tokenize 'TokenMecab("target_class", "-名詞/非自立", "target_class", "-名詞/接尾/人名", "target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
#   [
#     0,
#     1545810363.771334,
#     0.0003197193145751953
#   ],
#   [
#     {
#       "value": "彼",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "名前",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false
#     },
#     {
#       "value": "山田",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false
#     }
#   ]
# ]
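The effect of combining these rules can be sketched in Python as follows. The matching semantics here are an assumption made for illustration: a rule matches when it equals the leading components of a token's class path, and any matching - rule wins over the others. Groonga's actual evaluation may differ. path_matches and is_target are hypothetical helpers, and the class paths are taken from the include_class output in the simple usage section.

```python
def path_matches(path, pattern):
    """True when the pattern equals the leading components of the class path."""
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    return path_parts[:len(pattern_parts)] == pattern_parts


def is_target(path, rules):
    """Assumed semantics: keep a path that matches an include rule and no exclude rule."""
    excludes = [rule[1:] for rule in rules if rule.startswith("-")]
    includes = [rule[1:] if rule.startswith("+") else rule
                for rule in rules if not rule.startswith("-")]
    if any(path_matches(path, pattern) for pattern in excludes):
        return False
    return any(path_matches(path, pattern) for pattern in includes)


# Token values and class paths from the include_class example.
tokens = [
    ("彼", "名詞/代名詞/一般"), ("の", "助詞/連体化"), ("名前", "名詞/一般"),
    ("は", "助詞/係助詞"), ("山田", "名詞/固有名詞/人名/姓"),
    ("さん", "名詞/接尾/人名"), ("の", "助詞/連体化"),
    ("はず", "名詞/非自立/一般"), ("です", "助動詞"), ("。", "記号/句点"),
]
rules = ["-名詞/非自立", "-名詞/接尾/人名", "名詞"]

kept = [value for value, path in tokens if is_target(path, rules)]
print(kept)
```

Under these assumptions the sketch keeps 彼, 名前 and 山田, which agrees with the execution example above.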

In addition, you can get the readings of the tokens that remain after excluding the noise by adding the include_reading option as below.

Execution example:

tokenize 'TokenMecab("target_class", "-名詞/非自立", "target_class", "-名詞/接尾/人名", "target_class", "名詞", "include_reading", true)' '彼の名前は山田さんのはずです。'
# [
#   [
#     0,
#     1545893197.914959,
#     0.0003139972686767578
#   ],
#   [
#     {
#       "value": "彼",
#       "position": 0,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "カレ"
#       }
#     },
#     {
#       "value": "名前",
#       "position": 1,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "ナマエ"
#       }
#     },
#     {
#       "value": "山田",
#       "position": 2,
#       "force_prefix": false,
#       "force_prefix_search": false,
#       "metadata": {
#         "reading": "ヤマダ"
#       }
#     }
#   ]
# ]

7.8.12.6. Parameters

7.8.12.6.1. Optional parameter

There are five optional parameters: include_class, target_class, include_reading, include_form and use_reading.

7.8.12.6.1.1. include_class

Outputs the class and subclasses in MeCab’s metadata.

7.8.12.6.1.2. target_class

Outputs only tokens of the specified part-of-speech.

7.8.12.6.1.3. include_reading

Outputs the reading in MeCab’s metadata.

7.8.12.6.1.4. include_form

Outputs inflected_type, inflected_form and base_form in MeCab’s metadata.

7.8.12.6.1.5. use_reading

Outputs the reading of each token as the token value.

7.8.12.7. See also