7.8.12. TokenMecab#
7.8.12.1. Summary#
TokenMecab is a tokenizer based on MeCab, a part-of-speech and morphological analyzer.
MeCab isn't tied to Japanese. You can use MeCab for other languages by creating a dictionary for each language. For Japanese, you can use NAIST Japanese Dictionary.
You need to install an additional package to use TokenMecab. For details on how to install the package, see Install.
TokenMecab favors precision over recall. With TokenBigram, a 京都 query finds both 東京都 and 京都 texts, even though 東京都 isn't an expected result. With TokenMecab, a 京都 query finds only 京都 text.
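The difference can be illustrated with a small Python sketch. It is not Groonga code: `bigram_tokens` is a simplified stand-in for TokenBigram (overlapping two-character tokens only), `matches` is a hypothetical helper, and the morphological tokens are copied from the TokenMecab example later on this page. The sketch also assumes that the query 京都 stays a single token.

```python
def bigram_tokens(text):
    """Split text into overlapping two-character tokens (simplified bigram)."""
    if len(text) < 2:
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(query_tokens, text_tokens):
    """True if the query tokens appear as a contiguous run in the text tokens."""
    n = len(query_tokens)
    return any(text_tokens[i:i + n] == query_tokens
               for i in range(len(text_tokens) - n + 1))


# Bigram tokens of "東京都" include "京都", so the query hits the text.
bigram_hit = matches(bigram_tokens("京都"), bigram_tokens("東京都"))

# Morphological tokens copied from this page: "東京都" -> ["東京", "都"].
# The query "京都" is assumed to be a single token.
mecab_hit = matches(["京都"], ["東京", "都"])

print(bigram_hit, mecab_hit)
```

In this sketch the bigram comparison reports a hit and the morphological comparison does not, which is the recall versus precision trade-off described above.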
If you want to support neologisms, you need to keep updating your MeCab dictionary, which has a maintenance cost. (TokenBigram doesn't require dictionary maintenance because it doesn't use a dictionary.) mecab-ipadic-NEologd, a neologism dictionary for MeCab, may help you.
7.8.12.2. Syntax#
TokenMecab has optional parameters.
No options:
TokenMecab
Specify option:
TokenMecab("include_class", true)
TokenMecab("target_class", "a_part_of_speech")
TokenMecab("include_reading", true)
TokenMecab("include_form", true)
TokenMecab("use_reading", true)
Specify multiple options:
TokenMecab("target_class", "名詞", "include_reading", true)
TokenMecab also accepts multiple options at once, as shown above. You can combine the options in other ways than the example above.
7.8.12.3. Usage#
7.8.12.4. Simple usage#
Here is an example of TokenMecab. 東京都 is tokenized to 東京 and 都. The tokens don't include 京都:
Execution example:
tokenize TokenMecab "東京都"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "東京",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "都",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
TokenMecab also accepts options: target_class, include_class, include_reading, include_form and use_reading.
The target_class option extracts only tokens of the specified part-of-speech. For example, you can extract only nouns as below.
Execution example:
tokenize 'TokenMecab("target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "名前",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "山田",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "さん",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "はず",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
The include_class option outputs the class and subclasses from MeCab in each token's metadata, as below.
Execution example:
tokenize 'TokenMecab("include_class", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "代名詞",
# "subclass1": "一般"
# }
# },
# {
# "value": "の",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "助詞",
# "subclass0": "連体化"
# }
# },
# {
# "value": "名前",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "一般"
# }
# },
# {
# "value": "は",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "助詞",
# "subclass0": "係助詞"
# }
# },
# {
# "value": "山田",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "固有名詞",
# "subclass1": "人名",
# "subclass2": "姓"
# }
# },
# {
# "value": "さん",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "接尾",
# "subclass1": "人名"
# }
# },
# {
# "value": "の",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "助詞",
# "subclass0": "連体化"
# }
# },
# {
# "value": "はず",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "名詞",
# "subclass0": "非自立",
# "subclass1": "一般"
# }
# },
# {
# "value": "です",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "助動詞"
# }
# },
# {
# "value": "。",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "class": "記号",
# "subclass0": "句点"
# }
# }
# ]
# ]
You can exclude unneeded tokens by passing the class and subclass values that this option outputs to target_class.
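As a rough illustration, the following Python sketch joins the class and subclass values of an include_class response into the slash-separated form that the target_class examples on this page use (for example 名詞/接尾/人名). The response is a shortened copy of the output above, and `class_path` is a hypothetical helper, not part of Groonga.

```python
import json

# A shortened response in the shape shown above: [header, [token, ...]].
response = json.loads("""
[
  [0, 1337566253.89858, 0.000355720520019531],
  [
    {"value": "山田", "position": 4,
     "metadata": {"class": "名詞", "subclass0": "固有名詞",
                  "subclass1": "人名", "subclass2": "姓"}},
    {"value": "さん", "position": 5,
     "metadata": {"class": "名詞", "subclass0": "接尾", "subclass1": "人名"}},
    {"value": "です", "position": 8,
     "metadata": {"class": "助動詞"}}
  ]
]
""")


def class_path(metadata):
    """Join class and subclasses as "class/subclass0/subclass1/..."."""
    parts = [metadata["class"]]
    index = 0
    while f"subclass{index}" in metadata:
        parts.append(metadata[f"subclass{index}"])
        index += 1
    return "/".join(parts)


# Map each token value to its class path.
paths = {token["value"]: class_path(token["metadata"])
         for token in response[1]}
print(paths["さん"])
```

Listing the paths this way makes it easier to see which classes produce noise tokens before writing target_class options.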
The include_reading option outputs the reading from MeCab in each token's metadata, as below.
Execution example:
tokenize 'TokenMecab("include_reading", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "カレ"
# }
# },
# {
# "value": "の",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ノ"
# }
# },
# {
# "value": "名前",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ナマエ"
# }
# },
# {
# "value": "は",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ハ"
# }
# },
# {
# "value": "山田",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ヤマダ"
# }
# },
# {
# "value": "さん",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "サン"
# }
# },
# {
# "value": "の",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ノ"
# }
# },
# {
# "value": "はず",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ハズ"
# }
# },
# {
# "value": "です",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "デス"
# }
# },
# {
# "value": "。",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "。"
# }
# }
# ]
# ]
You can get the reading of each token with this option.
The include_form option outputs inflected_type, inflected_form and base_form from MeCab in each token's metadata, as below.
Execution example:
tokenize 'TokenMecab("include_form", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "彼"
# }
# },
# {
# "value": "の",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "の"
# }
# },
# {
# "value": "名前",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "名前"
# }
# },
# {
# "value": "は",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "は"
# }
# },
# {
# "value": "山田",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "山田"
# }
# },
# {
# "value": "さん",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "さん"
# }
# },
# {
# "value": "の",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "の"
# }
# },
# {
# "value": "はず",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "はず"
# }
# },
# {
# "value": "です",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "inflected_type": "特殊・デス",
# "inflected_form": "基本形",
# "base_form": "です"
# }
# },
# {
# "value": "。",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "base_form": "。"
# }
# }
# ]
# ]
The use_reading option supports search by kana.
This option is useful for handling orthographic variants because tokens are matched by their kana readings.
Execution example:
tokenize 'TokenMecab("use_reading", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "カレ",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ノ",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ナマエ",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ハ",
# "position": 3,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ヤマダ",
# "position": 4,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "サン",
# "position": 5,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ノ",
# "position": 6,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "ハズ",
# "position": 7,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "デス",
# "position": 8,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "。",
# "position": 9,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
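The effect on orthographic variants can be sketched in Python. This is an illustration only: `READINGS` is a hypothetical table standing in for a MeCab dictionary lookup, and only the 山田 entry comes from this page. The kana spellings and their readings are assumptions.

```python
# Hypothetical reading table standing in for a MeCab dictionary lookup.
# Only the 山田 entry appears on this page; the other entries are assumed.
READINGS = {"山田": "ヤマダ", "やまだ": "ヤマダ", "ヤマダ": "ヤマダ"}


def index_key(surface, use_reading):
    """Return the key a token would be stored under in this sketch."""
    if use_reading:
        return READINGS.get(surface, surface)
    return surface


# Without readings, each spelling is a distinct key.
surface_keys = {index_key(s, False) for s in READINGS}
# With readings, all spellings collapse to one kana key.
reading_keys = {index_key(s, True) for s in READINGS}

print(len(surface_keys), len(reading_keys))
```

Under these assumptions, three spellings share one key once readings are used, so a query in any of the spellings can reach the same entry.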
7.8.12.5. Advanced usage#
The target_class option can also specify subclasses, and can add or exclude a specific part-of-speech with a + or - prefix.
So you can extract nouns while excluding non-independent words and person-name suffixes, as below.
In this way you can exclude noise tokens from search.
Execution example:
tokenize 'TokenMecab("target_class", "-名詞/非自立", "target_class", "-名詞/接尾/人名", "target_class", "名詞")' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "名前",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false
# },
# {
# "value": "山田",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false
# }
# ]
# ]
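The filtering in this example can be approximated with a short Python sketch. The class paths are taken from the include_class output earlier on this page. The matching rule is an assumption: a token is dropped if any "-" rule covers its class path, and kept if a "+" or unprefixed rule covers it. How Groonga itself resolves conflicting rules isn't stated here, and `covers` and `is_target` are hypothetical helpers.

```python
# (value, class path) pairs taken from the include_class output on this page.
TOKENS = [
    ("彼", "名詞/代名詞/一般"),
    ("の", "助詞/連体化"),
    ("名前", "名詞/一般"),
    ("は", "助詞/係助詞"),
    ("山田", "名詞/固有名詞/人名/姓"),
    ("さん", "名詞/接尾/人名"),
    ("の", "助詞/連体化"),
    ("はず", "名詞/非自立/一般"),
    ("です", "助動詞"),
    ("。", "記号/句点"),
]


def covers(rule, path):
    """True if rule equals path or is a leading "/"-separated part of it."""
    return path == rule or path.startswith(rule + "/")


def is_target(path, rules):
    """Assumed semantics: a matching "-" rule rejects the token;
    otherwise a matching "+" or unprefixed rule accepts it."""
    excluded = [r[1:] for r in rules if r.startswith("-")]
    included = [r.lstrip("+") for r in rules if not r.startswith("-")]
    if any(covers(r, path) for r in excluded):
        return False
    return any(covers(r, path) for r in included)


rules = ["-名詞/非自立", "-名詞/接尾/人名", "名詞"]
kept = [value for value, path in TOKENS if is_target(path, rules)]
print(kept)
```

With these rules the sketch keeps 彼, 名前 and 山田, the same tokens as the execution example above.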
In addition, you can get the readings of the tokens that remain after excluding noise by adding the include_reading option, as below.
Execution example:
tokenize 'TokenMecab("target_class", "-名詞/非自立", "target_class", "-名詞/接尾/人名", "target_class", "名詞", "include_reading", true)' '彼の名前は山田さんのはずです。'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# {
# "value": "彼",
# "position": 0,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "カレ"
# }
# },
# {
# "value": "名前",
# "position": 1,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ナマエ"
# }
# },
# {
# "value": "山田",
# "position": 2,
# "force_prefix": false,
# "force_prefix_search": false,
# "metadata": {
# "reading": "ヤマダ"
# }
# }
# ]
# ]
7.8.12.6. Parameters#
7.8.12.6.1. Optional parameter#
There are five optional parameters: include_class, target_class, include_reading, include_form and use_reading.
7.8.12.6.1.1. include_class#
Outputs class and subclass in MeCab’s metadata.
7.8.12.6.1.2. target_class#
Outputs only tokens of the specified part-of-speech.
7.8.12.6.1.3. include_reading#
Outputs reading in MeCab’s metadata.
7.8.12.6.1.4. include_form#
Outputs inflected_type, inflected_form and base_form in MeCab’s metadata.
7.8.12.6.1.5. use_reading#
Outputs the reading of each token as the token value.