7.11.3.2. scorer_tf_idf#
Added in version 5.0.1.
7.11.3.2.1. Summary#
scorer_tf_idf is a scorer based on the TF-IDF (term frequency-inverse document frequency) score function.
To put it simply, TF-IDF is TF (term frequency) divided by DF (document frequency). Scoring by TF alone means “the more often a term occurs, the more important the match is”. Dividing TF by DF means “the more often an important term occurs, the more important the match is”.
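The idea can be sketched in a few lines of Python. This is a minimal illustration that assumes the textbook form `tf * log(N / df)`; it is not Groonga's internal implementation, and the function name `tf_idf` and the toy corpus are hypothetical.

```python
import math

def tf_idf(term, document, corpus):
    """Textbook TF-IDF weight of `term` in `document` (a list of tokens)."""
    tf = document.count(term)                      # occurrences in this document
    df = sum(1 for doc in corpus if term in doc)   # documents containing the term
    if tf == 0 or df == 0:
        return 0.0
    # "Dividing by DF" is commonly done on a log scale: IDF = log(N / DF).
    idf = math.log(len(corpus) / df)
    return tf * idf

# Hypothetical corpus: each document is a list of tokens.
corpus = [
    ["disk", "failure"],
    ["disk", "usage", "report"],
    ["disk", "disk", "quota"],
    ["network", "report"],
]

rare_weight = tf_idf("failure", corpus[0], corpus)   # term found in 1 of 4 documents
common_weight = tf_idf("disk", corpus[0], corpus)    # term found in 3 of 4 documents
print(rare_weight > common_weight)  # True
```

Both terms occur once in the first document, so TF alone cannot tell them apart. The rarer term gets the larger weight only because its document frequency is lower.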
The default score function in Groonga is TF (term frequency). It is fast but doesn’t take term importance into account.
TF-IDF takes term importance into account but is slower than TF.
TF-IDF computes a more suitable score than TF in many cases, but it isn’t perfect.
If a document contains the same keyword many times, such as “They are keyword, keyword, keyword … and keyword”, both TF and TF-IDF give it a higher score. Search engine spammers may use this technique, and TF-IDF doesn’t guard against it.
Okapi BM25 can handle this case, but it is slower than TF-IDF and is not implemented in Groonga yet.
Groonga provides the scorer_tf_at_most scorer, which can also handle this case.
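The following Python sketch shows why an upper bound on TF blunts keyword stuffing. It illustrates the capping idea only, under the assumption that the score is limited to a maximum value; it is not the implementation of scorer_tf_at_most, and the names `tf` and `tf_capped` are hypothetical.

```python
def tf(tokens, term):
    """Plain term frequency: the number of occurrences of `term`."""
    return float(tokens.count(term))

def tf_capped(tokens, term, max_score):
    """Term frequency limited to `max_score` (hypothetical helper)."""
    return min(tf(tokens, term), max_score)

honest = "this page explains one keyword".split()
stuffed = ("keyword " * 50).split()   # a document that repeats the keyword 50 times

print(tf(honest, "keyword"), tf(stuffed, "keyword"))
# 1.0 50.0
print(tf_capped(honest, "keyword", 2.0), tf_capped(stuffed, "keyword", 2.0))
# 1.0 2.0
```

With the cap, repeating a keyword beyond the limit no longer raises the score, so the stuffed document gains far less over the honest one.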
You don’t need to solve scoring with the score function alone. A score function depends heavily on the search query. You may also be able to use metadata of the matched records.
For example, Google uses PageRank for scoring. You may be able to use the data type (“title” data is more important than “memo” data), tags, geolocation and so on.
Don’t think only about the score function when you design scoring.
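As a rough illustration of using metadata, the Python sketch below multiplies a text score by a per-field weight. The weights, the field names and the `combined_score` helper are all hypothetical; they only show the shape of the idea and are not part of Groonga.

```python
# Hypothetical weights: a hit in "title" counts ten times as much as one in "memo".
FIELD_WEIGHTS = {"title": 10.0, "memo": 1.0}

def combined_score(text_score, field, field_weights=FIELD_WEIGHTS):
    """Scale a text score by the weight of the field it was found in."""
    return text_score * field_weights.get(field, 1.0)

title_hit = combined_score(2.0, "title")   # lower text score, important field
memo_hit = combined_score(3.0, "memo")     # higher text score, minor field
print(title_hit > memo_hit)  # True
```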
7.11.3.2.2. Syntax#
This scorer has only one parameter:
scorer_tf_idf(column)
scorer_tf_idf(index)
7.11.3.2.3. Usage#
This section describes how to use this scorer.
Here are a schema definition and sample data to show usage.
Sample schema:
Execution example:
table_create Logs TABLE_NO_KEY
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Logs message COLUMN_SCALAR Text
# [[0,1337566253.89858,0.000355720520019531],true]
table_create Terms TABLE_PAT_KEY ShortText \
--default_tokenizer TokenBigram \
--normalizer NormalizerAuto
# [[0,1337566253.89858,0.000355720520019531],true]
column_create Terms message_index COLUMN_INDEX|WITH_POSITION Logs message
# [[0,1337566253.89858,0.000355720520019531],true]
Sample data:
Execution example:
load --table Logs
[
{"message": "Error"},
{"message": "Warning"},
{"message": "Warning Warning"},
{"message": "Warning Warning Warning"},
{"message": "Info"},
{"message": "Info Info"},
{"message": "Info Info Info"},
{"message": "Info Info Info Info"},
{"message": "Notice"},
{"message": "Notice Notice"},
{"message": "Notice Notice Notice"},
{"message": "Notice Notice Notice Notice"},
{"message": "Notice Notice Notice Notice Notice"}
]
# [[0,1337566253.89858,0.000355720520019531],13]
You specify scorer_tf_idf in match_columns like the following:
Execution example:
select Logs \
--match_columns "scorer_tf_idf(message)" \
--query "Error OR Info" \
--output_columns "message, _score" \
--sort_keys "-_score"
# [
# [
# 0,
# 1337566253.89858,
# 0.000355720520019531
# ],
# [
# [
# [
# 5
# ],
# [
# [
# "message",
# "Text"
# ],
# [
# "_score",
# "Int32"
# ]
# ],
# [
# "Info Info Info Info",
# 3
# ],
# [
# "Error",
# 2
# ],
# [
# "Info Info Info",
# 2
# ],
# [
# "Info Info",
# 1
# ],
# [
# "Info",
# 1
# ]
# ]
# ]
# ]
The score of Info Info Info and the score of Error are both 2, even though Info Info Info includes three Info terms. This is because Error is a more important term than Info. The number of documents that include Info is 4. The number of documents that include Error is 1. A term that is included in fewer documents is a more characteristic term, and a characteristic term is an important term.
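The document counts can be checked with a short Python sketch over the same 13 messages. It assumes simple lowercase, whitespace-split tokens instead of TokenBigram and NormalizerAuto, and a textbook `log(N / DF)` weight, so it does not reproduce the exact scores that Groonga returns. The helper `document_frequency` is hypothetical.

```python
import math

# The sample data loaded into Logs: 1 Error, 3 Warning, 4 Info, 5 Notice messages.
messages = (
    ["Error"]
    + [" ".join(["Warning"] * n) for n in range(1, 4)]
    + [" ".join(["Info"] * n) for n in range(1, 5)]
    + [" ".join(["Notice"] * n) for n in range(1, 6)]
)

def document_frequency(term, documents):
    """Count the documents that contain `term` at least once."""
    term = term.lower()
    return sum(1 for text in documents if term in text.lower().split())

n_documents = len(messages)
df_error = document_frequency("Error", messages)
df_info = document_frequency("Info", messages)
print(n_documents, df_error, df_info)  # 13 1 4

# A lower document frequency gives a larger weight per occurrence.
idf_error = math.log(n_documents / df_error)
idf_info = math.log(n_documents / df_info)
print(idf_error > idf_info)  # True
```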
7.11.3.2.4. Parameters#
This section describes all parameters.
7.11.3.2.4.1. Required parameters#
There is only one required parameter: either column or index.
7.11.3.2.4.1.1. column#
The data column that is the match target. The data column must be indexed.
7.11.3.2.4.1.2. index#
The index column to be used for search.
7.11.3.2.4.2. Optional parameters#
There are no optional parameters.
7.11.3.2.5. Return value#
This scorer returns the score as Float32.
select returns _score as Int32, not Float, because it casts the score from Float to Int32 to keep backward compatibility.
The score is computed by a TF-IDF based algorithm.