7.7. Extractors

7.7.1. Summary

Added in version 16.0.3.

Note

This is an experimental feature. It is not stable yet.

Groonga has an extractor module that extracts plain text or values from structured data such as HTML and JSON. It is applied to a value before the value is tokenized and indexed. For example, ExtractorHTML extracts only the text content from HTML by removing tags, so that markup such as <p> isn't indexed as a token.
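To make the idea of HTML text extraction concrete, here is a minimal Python sketch. It is not Groonga's implementation: TextOnlyParser and extract_html_text are hypothetical names, and the sketch only assumes that extraction means keeping text nodes, dropping tags, and decoding character references.

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores tags."""

    def __init__(self):
        # convert_charrefs=True decodes references such as &lt; into "<".
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.chunks.append(data)


def extract_html_text(html):
    """Return the text content of html with all tags removed."""
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


print(extract_html_text("<p>Hello <b>Groonga</b></p>"))
```

Because the tags never reach the output, a tokenizer that runs on the result sees Hello and Groonga but not p or b.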

An extractor module can be added as a plugin. You can customize value extraction by registering your own extractor plugins with Groonga.

An extractor module is attached to a table. The table is normally a lexicon for an index. A table can have zero or more extractor modules. You can attach extractor modules to a table with the extractors option of table_create.

Here is an example table_create that uses the ExtractorHTML extractor module:

Execution example:

table_create Terms TABLE_PAT_KEY ShortText --extractors ExtractorHTML
# [[0,1337566253.89858,0.000355720520019531],true]

When extractors are set on a lexicon, they are applied automatically whenever an index that uses the lexicon is updated. The extracted value is tokenized and indexed instead of the original value. The original value is still stored as-is in its data column, so you can search against the extracted content while keeping the original structured data.
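The split between "what is stored" and "what is indexed" can be illustrated with a toy inverted index in Python. This is only a conceptual sketch, not how Groonga stores data: strip_tags is a crude regular-expression stand-in for an extractor, and records, index, and add_record are hypothetical names.

```python
import re
from collections import defaultdict


def strip_tags(value):
    # Crude stand-in for an extractor: drop anything that looks like a tag.
    return re.sub(r"<[^>]*>", " ", value)


def tokenize(text):
    # Very simple tokenizer: lowercase word characters.
    return re.findall(r"\w+", text.lower())


records = {}               # record id -> original value, stored as-is
index = defaultdict(set)   # token -> record ids, built from extracted text


def add_record(record_id, value):
    records[record_id] = value                     # keep the original
    for token in tokenize(strip_tags(value)):      # index the extracted text
        index[token].add(record_id)


add_record(1, "<p>Hello Groonga</p>")
print(sorted(index.get("groonga", set())))  # record 1 matches the text
print(sorted(index.get("p", set())))        # the tag name was never indexed
print(records[1])                           # original markup is intact
```

A search for groonga finds the record, a search for p does not, and the stored value still contains the <p> markup.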

If a table has multiple extractors, they are applied in the order in which they are specified. The output of each extractor is passed to the next extractor as its input. This is useful when you need to combine extractors. For example, you can extract a string from JSON with ExtractorJSON and then remove HTML tags from that string with ExtractorHTML.
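The chaining rule can be sketched as a simple pipeline in Python. The two extractor functions below are hypothetical stand-ins, not the behavior of ExtractorJSON or ExtractorHTML: json_extractor is assumed to join every string value found in the JSON document, and html_extractor is assumed to remove tags with a regular expression.

```python
import json
import re


def collect_strings(node):
    # Walk a decoded JSON value and gather every string it contains.
    if isinstance(node, str):
        return [node]
    if isinstance(node, dict):
        return [s for child in node.values() for s in collect_strings(child)]
    if isinstance(node, list):
        return [s for child in node for s in collect_strings(child)]
    return []


def json_extractor(value):
    # Assumption: a JSON extractor yields the string values of the document.
    return " ".join(collect_strings(json.loads(value)))


def html_extractor(value):
    # Assumption: an HTML extractor removes tags and keeps the text.
    return re.sub(r"<[^>]*>", "", value)


def apply_extractors(value, extractors):
    # Each extractor receives the previous extractor's output as its input.
    for extractor in extractors:
        value = extractor(value)
    return value


doc = '{"body": "<p>Hello</p>"}'
print(apply_extractors(doc, [json_extractor, html_extractor]))
```

With an empty extractor list the value passes through unchanged, which corresponds to a table that has no extractors.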

You can use the extract command to check how extractors process a value. The extract command applies the specified extractors to the given value and returns the extracted value. It doesn't need a table:

Execution example:

extract \
  --extractors 'ExtractorHTML' \
  --value "<html><body>He&lt;ll&gt;o</body></html>"
# [[0,1337566253.89858,0.000355720520019531],{"extracted":"He<ll>o"}]

The extract command is useful for confirming the result of extractors before you attach them to a lexicon.
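If you run the check from a script, the following Python sketch shows one way to prepare the request and read the result. It rests on assumptions that are not stated on this page: that a Groonga HTTP server is reachable at http://localhost:10041 and exposes commands under /d/<command>, and that a successful response has the [[return_code, start_time, elapsed_time], body] shape shown in the execution example above. build_extract_url and parse_extract_response are hypothetical helper names, and the sketch does not send the request.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_extract_url(base_url, extractors, value):
    # urlencode percent-encodes characters such as "<", ">" and "&".
    query = urlencode({"extractors": extractors, "value": value})
    return f"{base_url}/d/extract?{query}"


def parse_extract_response(body):
    # Assumed response shape: [[return_code, start, elapsed, ...], result]
    response = json.loads(body)
    return_code = response[0][0]
    if return_code != 0:
        raise RuntimeError(f"extract failed with return code {return_code}")
    return response[1]["extracted"]


# The base URL is an assumption; adjust it to your own server.
url = build_extract_url(
    "http://localhost:10041",
    "ExtractorHTML",
    "<html><body>He&lt;ll&gt;o</body></html>",
)
# Decoding the query string gives back the value that was passed in.
print(parse_qs(urlsplit(url).query)["value"][0])

sample = '[[0,1337566253.89858,0.000355720520019531],{"extracted":"He<ll>o"}]'
print(parse_extract_response(sample))
```

Checking the return code before reading the body keeps a failed command from being mistaken for an empty extraction result.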

7.7.2. Built-in extractors

Here is a list of built-in extractors:

7.7.3. See also