7.20. Indexing

Groonga has supported both online index construction and offline index construction since 2.0.0.

7.20.1. Online index construction

In online index construction, registered documents become searchable quickly, even while indexing is still in progress. However, indexing costs more than it does with offline index construction.

Online index construction is suitable for a search system that values freshness, such as a search system for tweets, news, or blog posts. Online index construction makes fresh documents searchable and keeps existing documents searchable while indexing.

7.20.2. Offline index construction

In offline index construction, indexing costs less than it does with online index construction: indexing time is shorter, the index is smaller, and fewer resources are required for indexing. However, a newly registered document is not searchable until all registered documents have been indexed.

Offline index construction is suitable for a search system that values low resource usage over freshness. For example, a reference manual search system doesn't need freshness because a reference manual is updated only at a release.
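The trade-off between the two modes can be illustrated with a toy inverted index in Python. This is only a conceptual sketch and not how Groonga implements indexing: the word-level tokenizer, the OnlineIndex class, and the build_offline function are hypothetical names introduced for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase words (Groonga's tokenizers differ).
    return re.findall(r"[a-z0-9']+", text.lower())


class OnlineIndex:
    """Updates the index on every registration, so each document is searchable at once."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, doc_id, text):
        for term in set(tokenize(text)):
            self.postings[term].append(doc_id)

    def search(self, term):
        return self.postings.get(term, [])


def build_offline(docs):
    """Builds the index in one batch; nothing is searchable until the batch finishes."""
    pairs = sorted(
        (term, doc_id) for doc_id, text in docs for term in set(tokenize(text))
    )
    postings = defaultdict(list)
    for term, doc_id in pairs:
        postings[term].append(doc_id)
    return postings


docs = [(1, "Hello!"), (2, "I just start it!"), (3, "Have a nice day")]

online = OnlineIndex()
for doc_id, text in docs:
    online.add(doc_id, text)
    # The document just added can already be found.
    assert doc_id in online.search(tokenize(text)[0])

offline = build_offline(docs)  # searchable only after this call returns
print(dict(offline) == dict(online.postings))
```

Both paths end with the same postings in this model. The difference is when documents become searchable and how much work is done per registration.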

7.20.3. How to use

Groonga uses online index construction by default. Once we register a document, we can search for it quickly.

Groonga uses offline index construction when an index is added to a column that already has data.

We define a schema:

Execution example:

table_create Tweets TABLE_NO_KEY
# [[0, 1337566253.89858, 0.000355720520019531], true]
column_create Tweets content COLUMN_SCALAR ShortText
# [[0, 1337566253.89858, 0.000355720520019531], true]
table_create Lexicon TABLE_HASH_KEY ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
# [[0, 1337566253.89858, 0.000355720520019531], true]

We register data:

Execution example:

load --table Tweets
[
{"content":"Hello!"},
{"content":"I just start it!"},
{"content":"I'm sleepy... Have a nice day... Good night..."}
]
# [[0, 1337566253.89858, 0.000355720520019531], 3]

Even without an index, we can search by sequential search:

Execution example:

select Tweets --match_columns content --query 'good nice'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         1
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         3,
#         "I'm sleepy... Have a nice day... Good night..."
#       ]
#     ]
#   ]
# ]
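
Conceptually, a sequential search examines every record and tests it against the query. The following Python sketch is a rough model only: sequential_search is a hypothetical helper, and it assumes case-insensitive substring matching with all query terms required. That assumption is consistent with the result above but is not Groonga's actual matching code.

```python
def sequential_search(records, query):
    # Assumption: every whitespace-separated query term must appear (AND semantics),
    # compared case-insensitively as a plain substring.
    terms = query.lower().split()
    return [
        (record_id, content)
        for record_id, content in records
        if all(term in content.lower() for term in terms)
    ]


tweets = [
    (1, "Hello!"),
    (2, "I just start it!"),
    (3, "I'm sleepy... Have a nice day... Good night..."),
]

# Every record is examined, so the cost grows with the number of records.
matches = sequential_search(tweets, "good nice")
print(matches)
```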

We create an index for Tweets.content. Data already registered in Tweets.content is indexed by offline index construction:

Execution example:

column_create Lexicon tweet COLUMN_INDEX|WITH_POSITION Tweets content
# [[0, 1337566253.89858, 0.000355720520019531], true]

We search with the index and get the matching record:

Execution example:

select Tweets --match_columns content --query 'good nice'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         1
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         3,
#         "I'm sleepy... Have a nice day... Good night..."
#       ]
#     ]
#   ]
# ]

We register more data. The new records are indexed by online index construction:

Execution example:

load --table Tweets
[
{"content":"Good morning! Nice day."},
{"content":"Let's go shopping."}
]
# [[0, 1337566253.89858, 0.000355720520019531], 2]

Searching now also returns the newly registered record that matches:

Execution example:

select Tweets --match_columns content --query 'good nice'
# [
#   [
#     0,
#     1337566253.89858,
#     0.000355720520019531
#   ],
#   [
#     [
#       [
#         2
#       ],
#       [
#         [
#           "_id",
#           "UInt32"
#         ],
#         [
#           "content",
#           "ShortText"
#         ]
#       ],
#       [
#         3,
#         "I'm sleepy... Have a nice day... Good night..."
#       ],
#       [
#         4,
#         "Good morning! Nice day."
#       ]
#     ]
#   ]
# ]
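
The sequence of steps in this walkthrough can be summarized with a small Python model. It is a sketch under stated assumptions, not Groonga's implementation: ToyTable and its load, create_index, and select methods are hypothetical, and it tokenizes by word rather than with TokenBigram.

```python
import re
from collections import defaultdict


class ToyTable:
    """Hypothetical stand-in for a table with one text column and an optional index."""

    def __init__(self):
        self.rows = {}     # record id -> content
        self.index = None  # term -> set of record ids, once created

    @staticmethod
    def _terms(text):
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def load(self, contents):
        for content in contents:
            record_id = len(self.rows) + 1
            self.rows[record_id] = content
            if self.index is not None:
                # Online path: index each record as it is registered.
                for term in self._terms(content):
                    self.index[term].add(record_id)

    def create_index(self):
        # Offline path: index all existing records in one batch.
        index = defaultdict(set)
        for record_id, content in self.rows.items():
            for term in self._terms(content):
                index[term].add(record_id)
        self.index = index

    def select(self, query):
        terms = self._terms(query)
        if not terms:
            return []
        if self.index is None:
            # No index yet: fall back to scanning every record.
            return sorted(i for i, c in self.rows.items() if terms <= self._terms(c))
        return sorted(set.intersection(*(self.index.get(t, set()) for t in terms)))


tweets = ToyTable()
tweets.load(["Hello!", "I just start it!",
             "I'm sleepy... Have a nice day... Good night..."])
before = tweets.select("good nice")  # sequential scan, no index yet
tweets.create_index()                # offline construction over existing rows
tweets.load(["Good morning! Nice day.", "Let's go shopping."])  # online construction
after = tweets.select("good nice")
print(before, after)
```

In this model, as in the walkthrough, the record IDs returned before the second load are [3] and afterwards [3, 4].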