7.15.23. snippet#
7.15.23.1. Summary#
This function extracts snippets of the target text around search keywords (KWIC: KeyWord In Context).
If you want to use this function in an ordinary web application, snippet_html may be more suitable. It is an HTML-specific version of this function.
7.15.23.2. Syntax#
snippet requires at least one parameter, the snippet target text:
snippet(column, ...)
You can specify one or more tuples of keyword, open tag and close tag:
snippet(column,
"keyword1", "open-tag1", "close-tag1",
"keyword2", "open-tag2", "close-tag2",
...)
If you specify a default open tag and a default close tag, you can specify only keywords:
snippet(column,
"keyword1",
"keyword2",
...,
{
"default_open_tag": "open-tag",
"default_close_tag": "close-tag"
})
Added in version 11.0.9: If you specify a default open tag and a default close tag and omit keywords, keywords are extracted from the current condition automatically, like snippet_html:
snippet(column,
{
"default_open_tag": "open-tag",
"default_close_tag": "close-tag"
})
You can specify options as the last argument with all syntaxes:
snippet(column,
...,
{
"width": 200,
"max_n_results": 3,
"skip_leading_spaces": true,
"html_escape": false,
"prefix": null,
"suffix": null,
"normalizer": null,
"default_open_tag": null,
"default_close_tag": null,
"default": null,
"delimiter_regexp": null,
})
7.15.23.3. Usage#
Here are a schema definition and sample data to show usage.
Execution example:
table_create Documents TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Documents content COLUMN_SCALAR Text
# [[0,1337566253.89858,0.000355720520019531],true]
table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Terms documents_content_index COLUMN_INDEX|WITH_POSITION Documents content
# [[0,1337566253.89858,0.000355720520019531],true]
load --table Documents
[
["content"],
["Groonga is a fast and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, groonga allows updates without read locks. These characteristics result in superior performance on real-time applications."],
["Groonga is also a column-oriented database management system (DBMS). Compared with well-known row-oriented systems, such as MySQL and PostgreSQL, column-oriented systems are more suited for aggregate queries. Due to this advantage, groonga can cover weakness of row-oriented systems."]
]
# [[0,1337566253.89858,0.000355720520019531],2]
snippet extracts keywords automatically from the conditions specified in --query and/or --filter when you specify the default_open_tag and default_close_tag options and don’t specify keywords. This is similar to snippet_html.
The following example uses --query "fast performance". In this case, fast and performance are used as keywords.
Execution example:
select Documents \
--output_columns 'snippet(content, \
{ \
"default_open_tag": "[", \
"default_close_tag": "]" \
})' \
--match_columns content \
--query "fast performance"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 1
# ],
# [
# [
# "snippet",
# null
# ]
# ],
# [
# [
# "Groonga is a [fast] and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, gro",
# "onga allows updates without read locks. These characteristics result in superior [performance] on real-time applications."
# ]
# ]
# ]
# ]
# ]
--query "fast performance" matches only the first record’s content. This snippet call extracts two text parts that include the keywords fast or performance and surrounds the keywords with [ and ].
The maximum number of text parts is 3 by default. You can change it with the max_n_results option:
Execution example:
select Documents \
--output_columns 'snippet(content, \
{ \
"default_open_tag": "[", \
"default_close_tag": "]", \
"max_n_results": 1 \
})' \
--match_columns content \
--query "fast performance"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 1
# ],
# [
# [
# "snippet",
# null
# ]
# ],
# [
# [
# "Groonga is a [fast] and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, gro"
# ]
# ]
# ]
# ]
# ]
It returns only one snippet because "max_n_results": 1 is specified.
The maximum size of a text part is 200 bytes by default. The unit is bytes, not characters. The size doesn’t include the inserted [ and ]. You can change it with the width option:
Execution example:
select Documents \
--output_columns 'snippet(content, \
{ \
"default_open_tag": "[", \
"default_close_tag": "]", \
"width": 50 \
})' \
--match_columns content \
--query "fast performance"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 1
# ],
# [
# [
# "snippet",
# null
# ]
# ],
# [
# [
# "Groonga is a [fast] and accurate full text search en",
# " result in superior [performance] on real-time appli"
# ]
# ]
# ]
# ]
# ]
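Because width counts bytes rather than characters, the same width holds fewer characters of multibyte text. The following Python sketch only illustrates that difference. It assumes the text is stored as UTF-8, and the helper name utf8_byte_length is made up for this illustration; it says nothing about how Groonga itself picks snippet boundaries.

```python
def utf8_byte_length(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


ascii_text = "full text search"
japanese_text = "全文検索エンジン"  # 8 characters, 3 bytes each in UTF-8

print(len(ascii_text), utf8_byte_length(ascii_text))        # 16 16
print(len(japanese_text), utf8_byte_length(japanese_text))  # 8 24

# Rough upper bound on how many 3-byte characters fit in the default width.
default_width = 200
print(default_width // 3)  # 66
```

If your documents are mostly multibyte text, you may want a larger width than you would choose for ASCII text.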
You can specify a regular expression that detects snippet delimiters with the delimiter_regexp option. For example, \.\s* limits each snippet to the text of the sentence that contains the keyword. Note that you need to escape \ in the string:
Execution example:
select Documents \
--output_columns 'snippet(content, \
{ \
"default_open_tag": "[", \
"default_close_tag": "]", \
"delimiter_regexp": "\\\\.\\\\s*" \
})' \
--match_columns content \
--query "fast performance"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 1
# ],
# [
# [
# "snippet",
# null
# ]
# ],
# [
# [
# "Groonga is a [fast] and accurate full text search engine based on inverted index",
# "These characteristics result in superior [performance] on real-time applications"
# ]
# ]
# ]
# ]
# ]
You can see that the detected delimiters (. and the following white spaces) aren’t included in the result snippets. This is intentional behavior.
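To see which parts of the text the pattern \.\s* treats as delimiters, you can try it outside Groonga. The following Python sketch splits the first sample record with the same pattern. It uses Python's re module, which is not the regular expression engine Groonga uses, so treat it as an approximation; the pattern is simple enough that engine differences are unlikely to matter.

```python
import re

content = (
    "Groonga is a fast and accurate full text search engine based on "
    "inverted index. One of the characteristics of groonga is that a newly "
    "registered document instantly appears in search results. Also, groonga "
    "allows updates without read locks. These characteristics result in "
    "superior performance on real-time applications."
)

# The same pattern as the example: a period followed by optional whitespace.
delimiter = re.compile(r"\.\s*")

# Splitting drops the matched delimiters; empty trailing parts are discarded.
sentences = [part for part in delimiter.split(content) if part]

# Show the sentence that contains each keyword.
for keyword in ("fast", "performance"):
    for sentence in sentences:
        if keyword in sentence:
            print(f"{keyword}: {sentence}")
```

In Python the raw string r"\.\s*" needs no extra escaping. In the Groonga command line above, each backslash has to be escaped because the pattern is written inside a string.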
You can specify keywords explicitly instead of extracting keywords from the current condition:
Execution example:
select Documents \
--output_columns 'snippet(content, \
"fast", \
"performance", \
{ \
"default_open_tag": "[", \
"default_close_tag": "]" \
})'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 2
# ],
# [
# [
# "snippet",
# null
# ]
# ],
# [
# [
# "Groonga is a [fast] and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, gro",
# "onga allows updates without read locks. These characteristics result in superior [performance] on real-time applications."
# ]
# ],
# [
# null
# ]
# ]
# ]
# ]
This snippet call returns two snippets for the first record and null for the second record, because the second record doesn’t contain any of the specified keywords.
You can specify open tag and close tag for each keyword:
Execution example:
select Documents \
--output_columns 'snippet(content, \
"fast", "[", "]", \
"performance", "(", ")")'
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 2
# ],
# [
# [
# "snippet",
# null
# ]
# ],
# [
# [
# "Groonga is a [fast] and accurate full text search engine based on inverted index. One of the characteristics of groonga is that a newly registered document instantly appears in search results. Also, gro",
# "onga allows updates without read locks. These characteristics result in superior (performance) on real-time applications."
# ]
# ],
# [
# null
# ]
# ]
# ]
# ]
This snippet call surrounds fast with [ and ] and performance with ( and ).
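When a client program generates the keyword, open tag and close tag tuples, it can be convenient to build the snippet(...) expression from a list. The following Python sketch is one way to do that. The function name build_snippet_call is hypothetical, and it assumes that JSON-style double-quoted strings are acceptable as string literals for simple keywords and tags like the ones used above.

```python
import json


def build_snippet_call(column, keyword_tags):
    """Build a snippet(...) expression from (keyword, open_tag, close_tag) tuples."""
    args = [column]
    for keyword, open_tag, close_tag in keyword_tags:
        # json.dumps adds double quotes and escapes embedded quotes/backslashes.
        # ensure_ascii=False keeps non-ASCII keywords as they are.
        args.extend(
            json.dumps(value, ensure_ascii=False)
            for value in (keyword, open_tag, close_tag)
        )
    return "snippet(" + ", ".join(args) + ")"


expression = build_snippet_call(
    "content",
    [("fast", "[", "]"), ("performance", "(", ")")],
)
print(expression)
# snippet(content, "fast", "[", "]", "performance", "(", ")")
```

The resulting expression still has to be quoted appropriately when it is passed as the value of --output_columns.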
TODO: html_escape option and so on
7.15.23.4. Parameters#
7.15.23.4.1. Required parameters#
TODO
7.15.23.4.2. Optional parameters#
TODO
7.15.23.4.2.1. max_n_results#
TODO
7.15.23.4.2.2. width#
TODO
7.15.23.5. Return value#
This function returns an array of strings or null. If this function can’t find any snippets, it returns null.
Each element of the array is a snippet:
[SNIPPET1, SNIPPET2, ...]
A snippet includes one or more keywords. The maximum size of a snippet, excluding open tags and close tags, is 200 bytes by default. The unit is bytes, not the number of characters.
You can change this with the width option.
The array size is between 1 and 3 inclusive by default.
You can change the upper limit with the max_n_results option.
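A client has to handle both cases of the return value: an array of snippets and null. The following Python sketch parses a response in the same shape as the execution examples in the usage section. The response text is hand-written for this illustration, and the function name extract_snippets is hypothetical; it also assumes that snippet() is the only output column.

```python
import json

# Hand-written response in the same shape as the select examples above.
response_text = """
[
  [0, 1337566253.89858, 0.000355720520019531],
  [
    [
      [2],
      [["snippet", null]],
      [["Groonga is a [fast] and accurate full text search en",
        " result in superior [performance] on real-time appli"]],
      [null]
    ]
  ]
]
"""


def extract_snippets(text):
    """Return one list of snippets per record; null becomes an empty list."""
    header, body = json.loads(text)
    result_set = body[0]
    records = result_set[2:]  # skip the hit count and the column definitions
    # Each record here has a single output column: the snippet() value.
    return [record[0] or [] for record in records]


for snippets in extract_snippets(response_text):
    print(len(snippets), snippets)
```

Mapping null to an empty list lets the caller loop over the result without a separate null check for records that contain no keywords.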