7.7.2.1. ExtractorHTML

7.7.2.1.1. Summary

Added in version 16.0.3.

Note

This is an experimental feature. It is not stable yet.

ExtractorHTML extracts text content from HTML. By default, it removes HTML tags and expands character references (HTML entities such as &lt;). Use this extractor to index only the text in HTML, without the markup.

ExtractorHTML does nothing for a value whose type is not a text type (ShortText, Text or LongText). Such a value is returned as-is.
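The behavior described above can be modeled with a short Python sketch. This is only an illustration, not Groonga's implementation: the helper name extract_html is hypothetical, the regular expression is a crude stand-in for real HTML parsing, and a Python str stands in for Groonga's text types.

```python
import html
import re

# Simplified tag pattern: an assumption for illustration, not a real HTML parser.
_TAG = re.compile(r"<[^>]*>")


def extract_html(value, remove_tag=True, expand_character_reference=True):
    """Toy model of ExtractorHTML (hypothetical helper, not Groonga code)."""
    if not isinstance(value, str):
        # Non-text values are returned as-is.
        return value
    if remove_tag:
        # Remove tags first so that expanded "<" and ">" are not mistaken for tags.
        value = _TAG.sub("", value)
    if expand_character_reference:
        value = html.unescape(value)
    return value


print(extract_html("<html><body>He&lt;ll&gt;o</body></html>"))  # He<ll>o
print(extract_html(29))  # 29 (not text, so unchanged)
```

In this model, tags are removed before references are expanded. Doing it the other way around would turn &lt;ll&gt; into <ll> and then strip it as if it were a tag.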

7.7.2.1.2. Syntax

ExtractorHTML has optional parameters.

No options:

ExtractorHTML

Specify options:

ExtractorHTML("remove_tag", true)

ExtractorHTML("expand_character_reference", true)

7.7.2.1.3. Usage

Here is an example that extracts text content from HTML with the default parameters. HTML tags (<html>, <body> and so on) are removed and character references (&lt; and &gt;) are expanded:

Execution example:

extract \
  --extractors ExtractorHTML \
  --value "<html><body>He&lt;ll&gt;o</body></html>"
# [[0,1337566253.89858,0.000355720520019531],{"extracted":"He<ll>o"}]

You can keep HTML tags by setting remove_tag to false. Character references are still expanded in this case:

Execution example:

extract \
  --extractors 'ExtractorHTML("remove_tag", false)' \
  --value "<html><body>He&lt;ll&gt;o</body></html>"
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   {
#     "extracted": "<html><body>He<ll>o</body></html>"
#   }
# ]

You can keep character references as-is by setting expand_character_reference to false. HTML tags are still removed in this case:

Execution example:

extract \
  --extractors 'ExtractorHTML("expand_character_reference", false)' \
  --value "<html><body>He&lt;ll&gt;o</body></html>"
# [[0,1337566253.89858,0.000355720520019531],{"extracted":"He&lt;ll&gt;o"}]

When you attach ExtractorHTML to a lexicon, the lexicon indexes the extracted text content. The original HTML is kept in the data column:

Execution example:

table_create Contents TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Contents html COLUMN_SCALAR Text
# [[0,1337566253.89858,0.000355720520019531],true]
table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizers NormalizerNFKC \
  --extractors ExtractorHTML
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Terms contents_html COLUMN_INDEX|WITH_POSITION Contents html
# [[0,1337566253.89858,0.000355720520019531],true]
load --table Contents
[
{"html": "<p>Groonga is a <b>fast</b> full text search engine.</p>"}
]
# [[0,1337566253.89858,0.000355720520019531],1]
select Contents \
  --match_columns html \
  --query "fast" \
  --output_columns html
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         1
#       ],
#       [
#         [
#           "html",
#           "Text"
#         ]
#       ],
#       [
#         "<p>Groonga is a <b>fast</b> full text search engine.</p>"
#       ]
#     ]
#   ]
# ]
select Contents \
  --match_columns html \
  --query "<b>" \
  --output_columns html
# [[0,1337566253.89858,0.000355720520019531],[[[0],[["html","Text"]]]]]

The query fast matches, but the query <b> doesn't, because the indexed tokens come from the extracted text "Groonga is a fast full text search engine." rather than from the raw HTML.
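The following Python sketch illustrates why the markup is not searchable. It is a rough model under stated assumptions: a regular expression plus html.unescape stands in for ExtractorHTML, and a lowercase word split stands in for TokenBigram and NormalizerNFKC, which tokenize and normalize differently in Groonga.

```python
import html
import re

raw = "<p>Groonga is a <b>fast</b> full text search engine.</p>"

# Assumed stand-in for the extractor: strip tags, then expand references.
extracted = html.unescape(re.sub(r"<[^>]*>", "", raw))


def tokens(text):
    """Assumed stand-in for tokenizer + normalizer: lowercase word-like runs."""
    return set(re.findall(r"[^\s.]+", text.lower()))


# The lexicon only ever sees the extracted text, never the raw HTML.
indexed = tokens(extracted)

print(extracted)          # Groonga is a fast full text search engine.
print("fast" in indexed)  # True
print("<b>" in indexed)   # False
```

The raw HTML still exists in the data column, which is why the select output shows the original markup even though only the extracted text was indexed.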

7.7.2.1.4. Parameters

7.7.2.1.4.1. Optional parameters

7.7.2.1.4.1.1. remove_tag

Specifies whether HTML tags are removed.

If this is true, HTML tags such as <p> are removed. If this is false, HTML tags are kept as-is.

The default value is true.

7.7.2.1.4.1.2. expand_character_reference

Specifies whether character references (HTML entities) are expanded.

If this is true, both named character references such as &lt; and numeric character references such as &#x3042; are expanded to the corresponding characters. If this is false, character references are kept as-is.

The default value is true.
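To make the two kinds of character references concrete, here is a small Python sketch that expands them with the standard library's html.unescape. It is only a stand-in for illustration. Groonga's own handling of edge cases, such as references without a trailing semicolon, may differ.

```python
import html

# Named, hexadecimal numeric and decimal numeric character references.
references = ["&lt;", "&#x3042;", "&#12354;"]

# Expand each reference to the character it stands for.
expanded = {ref: html.unescape(ref) for ref in references}

for ref, char in expanded.items():
    print(ref, "->", char)
```

Both numeric forms here refer to the same code point, U+3042 (HIRAGANA LETTER A), so they expand to the same character.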

7.7.2.1.5. See also