7.3.25. extract

7.3.25.1. Summary

Added in version 16.0.3.

Note

This is an experimental feature. It is not stable yet.

The extract command extracts plain text or values from structured data such as HTML and JSON by using the specified extractors.

You don't need to create a table to use the extract command. It is useful for checking the results of extractors before you attach them to a lexicon with the extractors option of table_create.

See Extractors for details of extractors.

7.3.25.2. Syntax

This command takes two parameters.

Both extractors and value are required:

extract extractors
        value

7.3.25.3. Usage

Here is an example that extracts text content from HTML with ExtractorHTML. It removes HTML tags and expands character references:

Execution example:

extract \
  --extractors 'ExtractorHTML' \
  --value "<html><body>He&lt;ll&gt;o</body></html>"
# [[0,1337566253.89858,0.000355720520019531],{"extracted":"He<ll>o"}]
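To make the two steps concrete, the following Python sketch reproduces the same tag removal and character-reference expansion with the standard library's html.parser module. It only illustrates the expected result for this input. It is not how ExtractorHTML is implemented, and the names TextCollector and strip_html are hypothetical:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes and ignore tags (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True expands references such as &lt; before
        # the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(value):
    """Return the text content of value with tags removed."""
    parser = TextCollector()
    parser.feed(value)
    parser.close()
    return "".join(parser.chunks)


print(strip_html("<html><body>He&lt;ll&gt;o</body></html>"))
# He<ll>o
```

Note that the expanded text may contain < and > again, as it does here, so the result is plain text and not safe to treat as HTML.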

Here is an example that extracts values from JSON with ExtractorJSON. The $.tags[*] JSONPath matches all elements of the tags array:

Execution example:

extract \
  --extractors 'ExtractorJSON("path", "$.tags[*]")' \
  --value '{"tags": ["groonga", "search", "engine"], "title": "ignored"}'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "extracted": [
#       "groonga",
#       "search",
#       "engine"
#     ]
#   }
# ]
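For readers less familiar with JSONPath, the selection made by $.tags[*] can be sketched in Python as follows. This is an illustration under a strong simplifying assumption: it handles only paths of the form $.KEY[*] on a top-level key, and select_array_elements is a hypothetical helper, not part of Groonga:

```python
import json


def select_array_elements(value, key):
    """Mimic a path of the form $.KEY[*]: every element of one top-level array."""
    document = json.loads(value)
    elements = document[key]
    if not isinstance(elements, list):
        raise TypeError(f"{key!r} is not an array")
    return elements


value = '{"tags": ["groonga", "search", "engine"], "title": "ignored"}'
print(select_array_elements(value, "tags"))
# ['groonga', 'search', 'engine']
```

The title member is not selected because the path names only tags.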

7.3.25.4. Parameters

This section describes parameters of extract.

7.3.25.4.1. Required parameters

There are two required parameters: extractors and value.

7.3.25.4.1.1. extractors

Specifies extractors separated by commas (,). The extract command applies the extractors to value in order. The output of each extractor is passed to the next extractor as its input.
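The in-order application described above can be pictured as a left fold over the list of extractors. The Python sketch below shows only that chaining idea. The functions first_body and expand_references are hypothetical stand-ins written for this illustration, not Groonga extractors, and the sketch makes no claim about which real extractors can be combined:

```python
import html
import json
from functools import reduce


def apply_in_order(extractors, value):
    """Feed value through each extractor; each output becomes the next input."""
    return reduce(lambda current, extractor: extractor(current), extractors, value)


# Hypothetical stand-ins used only to demonstrate chaining.
def first_body(value):
    """Take the "body" member of a JSON object."""
    return json.loads(value)["body"]


def expand_references(value):
    """Expand character references such as &amp;."""
    return html.unescape(value)


print(apply_in_order([first_body, expand_references], '{"body": "a &amp; b"}'))
# a & b
```

Order matters: reversing the list would hand the already unescaped text to the JSON step, which for other inputs can change or break the result.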

See Extractors for all extractors.

7.3.25.4.1.2. value

Specifies the value from which you want to extract plain text or values.

If you want to include spaces in value, you need to quote value with single quotes (') or double quotes (").

7.3.25.5. Return value

[HEADER, {"extracted": EXTRACTED_VALUE}]

HEADER

See Output format about HEADER.

EXTRACTED_VALUE

The value extracted by the specified extractors. It's a single value when the extractors return a single value, as ExtractorHTML does. It's an array when the extractors return multiple values, as ExtractorJSON does.
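Because EXTRACTED_VALUE can be either a single value or an array, client code may want to normalize it. The following Python sketch parses a response in the documented [HEADER, {"extracted": EXTRACTED_VALUE}] shape and always returns a list. The function name extracted_values is hypothetical, and the sample responses are copied from the usage examples in this document:

```python
import json


def extracted_values(response_text):
    """Return the extracted result as a list, for a single value or an array."""
    _header, body = json.loads(response_text)
    extracted = body["extracted"]
    return extracted if isinstance(extracted, list) else [extracted]


single = '[[0,1337566253.89858,0.000355720520019531],{"extracted":"He<ll>o"}]'
multiple = (
    '[[0,1337566253.89858,0.000355720520019531],'
    '{"extracted":["groonga","search","engine"]}]'
)
print(extracted_values(single))
# ['He<ll>o']
print(extracted_values(multiple))
# ['groonga', 'search', 'engine']
```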

7.3.25.6. See also