7.9.4. TokenFilterStem#

7.9.4.1. Summary#

TokenFilterStem stems each token produced by the tokenizer.

You need to install an additional package to use TokenFilterStem. For details on how to install an additional package, see Install.

7.9.4.2. Syntax#

TokenFilterStem has one optional parameter:

TokenFilterStem

TokenFilterStem("algorithm", "stemming_algorithm")

7.9.4.3. Usage#

Here is an example that uses the TokenFilterStem token filter:

Execution example:

plugin_register token_filters/stem
# [[0,1337566253.89858,0.000355720520019531],true]
table_create Memos TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Memos content COLUMN_SCALAR ShortText
# [[0,1337566253.89858,0.000355720520019531],true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters TokenFilterStem
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
# [[0,1337566253.89858,0.000355720520019531],true]
load --table Memos
[
{"content": "I develop Groonga"},
{"content": "I'm developing Groonga"},
{"content": "I developed Groonga"}
]
# [[0,1337566253.89858,0.000355720520019531],3]
select Memos --match_columns content --query "develops"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         3
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "I develop Groonga"
#       ],
#       [
#         2,
#         "I'm developing Groonga"
#       ],
#       [
#         3,
#         "I developed Groonga"
#       ]
#     ]
#   ]
# ]

The tokens develop, developing, developed, and develops are all stemmed to develop. As a result, the query develops matches records that contain develop, developing, or developed.

The default stemming algorithm is for English. You can specify a stemming algorithm for another language with the algorithm option, as shown below.
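To check what a piece of text is reduced to before it is indexed, the tokenize command can be run with the same tokenizer, normalizer, and token filter as the Terms table. This is only a sketch: it assumes the token_filters/stem plugin is installed, and the output is omitted because it depends on the environment.

```
plugin_register token_filters/stem
# NOTE (assumption): the arguments mirror the Terms table definition above.
tokenize TokenBigram "I developed Groonga" NormalizerAuto --token_filters TokenFilterStem
```

Comparing the tokens returned for "developed" and for "develops" is a quick way to confirm that both end up as the same term.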

Execution example:

plugin_register token_filters/stem
# [[0,1337566253.89858,0.000355720520019531],true]
table_create FrenchMemos TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create FrenchMemos content COLUMN_SCALAR ShortText
# [[0,1337566253.89858,0.000355720520019531],true]
table_create FrenchTerms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters 'TokenFilterStem("algorithm", "french")'
# [[0,1337566253.89858,0.000355720520019531],true]
column_create FrenchTerms french_memos_content \
   COLUMN_INDEX|WITH_POSITION FrenchMemos content
# [[0,1337566253.89858,0.000355720520019531],true]
load --table FrenchMemos
[
{"content": "maintenait"},
{"content": "maintenant"}
]
# [[0,1337566253.89858,0.000355720520019531],2]
select FrenchMemos --match_columns content --query "maintenir"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         1,
#         "maintenait"
#       ],
#       [
#         2,
#         "maintenant"
#       ]
#     ]
#   ]
# ]
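To inspect how the query word itself is processed by the FrenchTerms lexicon, one option is the table_tokenize command, which applies the tokenizer, normalizer, and token filters configured on the table. This is a minimal sketch that assumes the FrenchTerms table and its data from the example above already exist. The output is omitted.

```
# NOTE (assumption): FrenchTerms has been created and populated as in the example above.
# With --mode GET, only tokens that already exist in the lexicon are returned.
table_tokenize FrenchTerms "maintenir" --mode GET
```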

7.9.4.4. Parameters#

7.9.4.4.1. Optional parameter#

There is one optional parameter, algorithm.

7.9.4.4.1.1. algorithm#

Specifies a stemming algorithm.

A stemming algorithm extracts the stem of a word. A separate algorithm is provided for each language.

You can extract stems for a different language by changing the stemming algorithm. For example, to extract the stems of French words, specify french for the algorithm option, as in the usage example.

Here are the supported stemming algorithms:

French
Spanish
Portuguese
Italian
Romanian
German
Dutch
Swedish
Norwegian
Danish
Russian
Finnish
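Following the pattern of the French usage example, a lexicon for another language from this list might be declared as follows. This is a sketch under two assumptions: GermanTerms is a hypothetical table name, and the lowercase value german is chosen by analogy with french in the usage example.

```
plugin_register token_filters/stem
# NOTE (assumption): GermanTerms is a hypothetical table name.
# NOTE (assumption): "german" is written in lowercase by analogy with "french".
table_create GermanTerms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer NormalizerAuto \
  --token_filters 'TokenFilterStem("algorithm", "german")'
```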