7.9.3. TokenFilterStem

7.9.3.1. Summary

TokenFilterStem stems the tokens produced by the tokenizer.

7.9.3.2. Syntax

TokenFilterStem has an optional parameter:

TokenFilterStem

TokenFilterStem("algorithm", "stemming_algorithm")

7.9.3.3. Usage

Here is an example that uses the TokenFilterStem token filter:

Execution example:

plugin_register token_filters/stem
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Memos TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Memos content COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters TokenFilterStem
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
# [[0, 1337566253.89858, 0.000355720520019531], true]
load --table Memos
[
{"content": "I develop Groonga"},
{"content": "I'm developing Groonga"},
{"content": "I developed Groonga"}
]
# [[0, 1337566253.89858, 0.000355720520019531], 3]
select Memos --match_columns content --query "develops"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "I develop Groonga"
#       ],
#       [
#         2,
#         "I'm developing Groonga"
#       ],
#       [
#         3,
#         "I developed Groonga"
#       ]
#     ]
#   ]
# ]

The tokens develop, developing, developed and develops are all stemmed to develop. So the query develops finds the records that contain develop, developing and developed.
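
The matching behavior above can be sketched outside Groonga. The following Python snippet is a toy illustration only: the suffix list and the names toy_stem, build_index and search are hypothetical, and the suffix stripping is deliberately naive. It is not the algorithm Groonga uses. It only shows why stemming both the indexed words and the query to the same form lets develops match all three records.

```python
# Toy illustration only: a naive suffix stripper, NOT Groonga's stemmer.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, minimal suffix list


def toy_stem(token):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(documents):
    """Map each stemmed, lowercased word to the ids of the records containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index.setdefault(toy_stem(word), set()).add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the indexed words, then look it up."""
    return sorted(index.get(toy_stem(query.lower()), set()))


memos = {
    1: "I develop Groonga",
    2: "I'm developing Groonga",
    3: "I developed Groonga",
}
index = build_index(memos)
matches = search(index, "develops")
print(matches)
```

Because the index stores only the stemmed form, a query must pass through the same stemming step before lookup. Otherwise develops would never be found as a key.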

You can specify a stemming algorithm for a language other than English with the algorithm option, as below.

Execution example:

plugin_register token_filters/stem
table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters 'TokenFilterStem("algorithm", "french")'
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
load --table Memos
[
{"content": "maintenait"},
{"content": "maintenant"}
]
select Memos --match_columns content --query "maintenir"
# [
#   [
#     0,
#     0.0,
#     0.0
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "maintenait"
#       ],
#       [
#         2,
#         "maintenant"
#       ]
#     ]
#   ]
# ]

7.9.3.4. Parameters

7.9.3.4.1. Optional parameter

There is one optional parameter, algorithm.

7.9.3.4.1.1. algorithm

Specify a stemming algorithm.

A stemming algorithm extracts the stem of a word. One is provided for each supported language.

You can extract stems for a different language by changing the stemming algorithm. For example, to extract the stems of French words, specify French for the algorithm option.

Here are the supported stemming algorithms:

French
Spanish
Portuguese
Italian
Romanian
German
Dutch
Swedish
Norwegian
Danish
Russian
Finnish
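
When the language is chosen at run time, the value passed to --token_filters can be built programmatically. The following Python sketch is an assumption-laden helper, not part of Groonga: the function name token_filters_option and the SUPPORTED_ALGORITHMS set are hypothetical, and the set contains only the names listed above. The sketch lowercases the name because the usage example passes "french" in lower case. Whether other spellings are accepted is not stated here.

```python
# Hypothetical helper: builds the --token_filters value for TokenFilterStem.
# The set mirrors the list of algorithms in this section, nothing more.
SUPPORTED_ALGORITHMS = {
    "french", "spanish", "portuguese", "italian", "romanian", "german",
    "dutch", "swedish", "norwegian", "danish", "russian", "finnish",
}


def token_filters_option(algorithm=None):
    """Return the --token_filters value, with or without the algorithm option."""
    if algorithm is None:
        # No option given: use the filter without parameters.
        return "TokenFilterStem"
    name = algorithm.lower()  # assumption: lower case, as in the usage example
    if name not in SUPPORTED_ALGORITHMS:
        raise ValueError(f"algorithm not in the list above: {algorithm!r}")
    return f'TokenFilterStem("algorithm", "{name}")'


option = token_filters_option("French")
print(option)
```

The returned string still has to be quoted for the command line, as the usage example does with single quotes around the whole value.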