BloGroonga

2023-04-13

PGroonga (fast full text search module for PostgreSQL) 3.0.0 has been released

PGroonga 3.0.0 has been released! PGroonga turns PostgreSQL into a fast full text search platform for all languages. This is the third major version!

About PGroonga

Because this is the first release announcement of the 3.Y.Z series, we start with an introduction to PGroonga. The highlights of this release are summarized after that.

PGroonga is a PostgreSQL extension that makes PostgreSQL a fast full text search platform for all languages. Several PostgreSQL extensions improve PostgreSQL's full text search feature. Among them, PGroonga provides rich full text search features and is very fast because it uses Groonga, a full-fledged full text search engine, as its backend.

Performance

PGroonga is faster than pg_bigm and faster than textsearch, the full text search feature bundled in PostgreSQL, at both index creation and full text search.

Here is a benchmark result comparing PGroonga and pg_bigm. Both use Japanese Wikipedia data.

Extension | Index creation time
PGroonga | About 19m
pg_bigm | About 33m

In this case, PGroonga is about 1.7 times faster than pg_bigm.

Here is a benchmark result comparing PGroonga and textsearch. This benchmark uses English Wikipedia data because textsearch doesn't support Japanese.

You can't compare this result directly with the one above because the amount of data is different.

Extension | Index creation time
PGroonga | About 1h 24m
textsearch | About 2h 53m

In this case, PGroonga is about 2 times faster than textsearch.

Here is a benchmark result for full text search between PGroonga and pg_bigm:

Search keywords | N hits | PGroonga | pg_bigm
"PostgreSQL" or "MySQL" | About 300 | About 2ms | About 49ms
データベース (database in Japanese) | About 15000 | About 49ms | About 1300ms
テレビアニメ (TV animation in Japanese) | About 20000 | About 65ms | About 2800ms
日本 (Japan in Japanese) | About 530000 | About 560ms | About 480ms

In the "日本" (Japan in Japanese) case, pg_bigm is a bit faster (*1) than PGroonga. But PGroonga is 24 to 43 times faster than pg_bigm in the other cases. The results show that PGroonga provides consistently fast full text search regardless of the keyword.

(*1) pg_bigm searches keywords that have 2 or fewer characters faster than keywords that have 3 or more characters.
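The "24 times to 43 times" range can be checked with a short calculation. The following Python sketch simply divides the pg_bigm time by the PGroonga time for each row. The timings are the approximate values from the table above, and the names `timings_ms` and `speedup` are made up for this illustration.

```python
# Approximate timings in milliseconds, copied from the table above:
# keyword -> (PGroonga, pg_bigm)
timings_ms = {
    '"PostgreSQL" or "MySQL"': (2, 49),
    "データベース": (49, 1300),
    "テレビアニメ": (65, 2800),
    "日本": (560, 480),
}


def speedup(pgroonga_ms, pg_bigm_ms):
    """Return how many times faster PGroonga is (below 1 means pg_bigm is faster)."""
    return pg_bigm_ms / pgroonga_ms


ratios = {keyword: speedup(*pair) for keyword, pair in timings_ms.items()}
for keyword, ratio in ratios.items():
    print(f"{keyword}: {ratio:.1f}x")
```

The first three rows come out at roughly 24.5, 26.5 and 43.1, and the "日本" row is below 1, which matches the statement that pg_bigm is a bit faster only in that case.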

Here is a benchmark result for full text search between PGroonga and textsearch:

Search keywords | N hits | PGroonga | textsearch | Groonga
"PostgreSQL" or "MySQL" | About 1600 | About 6ms | About 3ms | About 3ms
database | About 210000 | About 698ms | About 602ms | About 19ms
animation | About 40000 | About 173ms | About 1000ms (*2) | About 6ms
America | About 470000 | About 1300ms | About 1200ms | About 45ms

(*2) textsearch is slow here because "animation" hits about 420 thousand items with textsearch, about 10 times as many as with PGroonga. This is caused by stemming: "animation" is stemmed to "anim".

The search performance of PGroonga and textsearch is almost the same. Textsearch is slower for "animation" because of the difference in the number of hits, not because of an essential difference in search performance.

Groonga's results are included as a reference. Groonga is the full text search engine that PGroonga uses. Groonga finishes every case in less than 50ms. This shows that full text search itself isn't the dominant cost for PGroonga and textsearch in these cases: there is common overhead in PostgreSQL, and it has a greater impact than the full text search.

You can see more details of these benchmark results:

Features

PGroonga provides the following features. Some of these features aren't provided by other extensions:

  • Normalization feature
  • Custom tokenizer feature
  • Custom token filter feature
  • Search using query language
  • HTML highlight feature
  • HTML snippet feature
  • JSON search
  • Auto complete feature
  • Similar document search feature
  • Synonym expansion feature
  • Weight feature

The normalization feature unifies texts written in different notations into a single notation.

As a simple example, both "POSTGRESQL" (uppercase only) and "PostgreSQL" (mixed case) are converted to "postgresql" (lowercase only). You can find "PostgreSQL" (mixed case) by searching for "postgresql" (lowercase).

As a more complex example, "¼" (U+00BC VULGAR FRACTION ONE QUARTER) is converted to "1/4" ("1", "/" and "4"). This normalization is based on Unicode NFKC.
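The idea can be tried with Python's standard `unicodedata` module. This is only a minimal sketch of NFKC plus lowercasing, not PGroonga's normalizer, and the function name `normalize` is an assumption for this illustration.

```python
import unicodedata


def normalize(text):
    """Simplified normalizer: NFKC compatibility mapping, then lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()


# Different notations of the same word end up as the same string.
print(normalize("POSTGRESQL"))
print(normalize("PostgreSQL"))

# "¼" (one code point) is decomposed into three code points.
# Note: plain NFKC puts U+2044 FRACTION SLASH between the digits;
# mapping it to the ASCII "/" would be an additional step not shown here.
quarter = normalize("¼")
print([f"U+{ord(ch):04X}" for ch in quarter])
```

Because both the indexed text and the search keyword go through the same normalization, a lowercase keyword can match mixed-case text.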

The custom tokenizer feature customizes the process that extracts search keywords from text (tokenization). If you can customize tokenization, you can control the trade-off between search precision and search performance.

For example, if you use a tokenizer based on character and character type based N-gram instead of a tokenizer based on character based N-gram, you can get better search precision and search performance but may not find some texts. You can find "123" by "2" with character based N-gram but not with character and character type based N-gram.
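The "123" and "2" example can be illustrated with a toy model in Python. This is a deliberately simplified sketch, not Groonga's tokenizers: it assumes bigrams (N=2), treats a run of digits or ASCII letters as one token in the type-aware variant, and assumes a document matches when some token starts with the query. All function names are made up for this illustration.

```python
from itertools import groupby


def char_bigrams(text):
    """Character based bigram: overlapping 2-character tokens (the last may be shorter)."""
    return [text[i:i + 2] for i in range(len(text))]


def char_type(ch):
    """Classify a character so that runs of the same type can be grouped."""
    if ch.isdigit():
        return "digit"
    if ch.isascii() and ch.isalpha():
        return "alpha"
    return "other"


def type_aware_tokens(text):
    """Keep each run of digits or ASCII letters as one token; bigram everything else."""
    tokens = []
    for kind, run in groupby(text, key=char_type):
        run = "".join(run)
        if kind == "other":
            tokens.extend(char_bigrams(run))
        else:
            tokens.append(run)
    return tokens


def matches(tokens, query):
    """Toy matching rule: some token must start with the query."""
    return any(token.startswith(query) for token in tokens)


print(char_bigrams("123"), matches(char_bigrams("123"), "2"))
print(type_aware_tokens("123"), matches(type_aware_tokens("123"), "2"))
```

In this model the character based tokenizer produces more tokens and finds "123" by "2", while the type-aware tokenizer produces a single token "123" and does not. Fewer tokens mean a smaller index and fewer accidental matches, at the cost of missing such partial matches.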

The custom token filter feature customizes how the keywords extracted by the tokenizer are processed. Textsearch has the same kind of feature under the name "dictionary". Both PGroonga and textsearch implement stemming with this mechanism.

Search using query language lets users specify AND/OR/NOT searches with a mini language like "A OR (B - 1)". The syntax is similar to Google's.

The HTML highlight feature marks up search keywords with <span class="keyword">...</span>. The result is safe to use in HTML as it is. It's useful for Web application development.
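To show why the result is "safe to use in HTML as it is", here is a minimal Python sketch of the idea: escape everything, and wrap only the matched keywords in the span. It is a conceptual illustration, not PGroonga's implementation. The function name `highlight_html` and the case-insensitive matching are assumptions of this sketch.

```python
import html
import re


def highlight_html(text, keywords):
    """Escape text for HTML and wrap each keyword in <span class="keyword">."""
    keywords = [keyword for keyword in keywords if keyword]
    if not keywords:
        return html.escape(text)
    # Try longer keywords first so that they win over their own prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the text between matches, then the matched keyword itself.
        parts.append(html.escape(text[last:match.start()]))
        parts.append('<span class="keyword">' + html.escape(match.group(0)) + "</span>")
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(highlight_html("Use <b>PostgreSQL</b> & PGroonga", ["PostgreSQL"]))
```

Markup that was already in the text is escaped, so the only tags in the output are the spans added around the keywords.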

The HTML snippet feature returns the texts around search keywords. This kind of feature is used in Google search results too. The keywords are marked up with <span class="keyword">...</span>. It's safe to use the result in HTML as it is.

The JSON search feature searches JSON contents flexibly. You can index a jsonb type column as is. You don't need to use an expression for indexing. You can perform full text search against all texts in JSON. It's useful when you insert logs with differing structures as JSON and search them later.

The auto complete feature completes user input in a text box for entering a search keyword. Google implements it too. You can also support completion by romaji like Google does.

The similar document search feature searches for texts whose contents are similar. You can use this feature to show similar entries in a blog system.

The synonym expansion feature searches for keywords that have the same meaning but different expressions. For example, a search for "PostgreSQL" can be performed as a search for "PostgreSQL" or "PG".
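Conceptually, synonym expansion rewrites the query before searching. The Python sketch below assumes a hypothetical in-memory synonym table and whitespace-separated terms, and it borrows the OR syntax of the query language described above. It is an illustration of the idea, not PGroonga's API.

```python
def expand_query(query, synonyms):
    """Replace each term that has synonyms with an OR group of all its variants."""
    expanded = []
    for term in query.split():
        variants = synonyms.get(term)
        if variants:
            expanded.append("(" + " OR ".join(variants) + ")")
        else:
            expanded.append(term)
    return " ".join(expanded)


# Hypothetical synonym table: term -> all expressions with the same meaning.
synonyms = {"PostgreSQL": ["PostgreSQL", "PG"]}

print(expand_query("PostgreSQL search", synonyms))
```

The user still types only "PostgreSQL", while the expanded query also covers documents that say "PG".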

The weight feature adjusts scores. You can show more important records first with this feature. For example, suppose that content in the "title" column is more important than content in the "body" column. You can add +10 to the score when the given query matches the "title" column and +1 when it matches the "body" column. With this adjustment and a sort by score, more important records come at the front of the result set.
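The +10/+1 example can be written out as a small Python sketch. It uses a plain case-insensitive substring check instead of real full text search, and the records, constants and function names are invented for this illustration.

```python
TITLE_WEIGHT = 10  # added when the query matches the "title" column
BODY_WEIGHT = 1    # added when the query matches the "body" column


def score(record, query):
    """Sum the weights of the columns that contain the query."""
    needle = query.lower()
    total = 0
    if needle in record["title"].lower():
        total += TITLE_WEIGHT
    if needle in record["body"].lower():
        total += BODY_WEIGHT
    return total


def search(records, query):
    """Return (score, record) pairs for matching records, highest score first."""
    hits = [(score(record, query), record) for record in records]
    hits = [(s, record) for s, record in hits if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits


records = [
    {"id": 1, "title": "Groonga news", "body": "PGroonga 3.0.0 has been released."},
    {"id": 2, "title": "PGroonga 3.0.0", "body": "A new major version."},
    {"id": 3, "title": "PGroonga tutorial", "body": "How to use PGroonga."},
]

for s, record in search(records, "PGroonga"):
    print(s, record["id"], record["title"])
```

Records whose title matches get at least 10 points, so they are always listed before records that match only in the body.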

See the reference manual and the how-to documents for details of these features.

Usage

You can use PGroonga without full text search knowledge. You just create an index and put a condition into WHERE:

CREATE INDEX index_name ON table USING pgroonga (column);

SELECT * FROM table WHERE column &@~ 'PostgreSQL';

You can also use PGroonga through LIKE. PGroonga can perform LIKE with its index, and LIKE with a PGroonga index is faster than LIKE without an index. It means that you can improve performance without changing an application that uses SQL like the following:

SELECT * FROM table WHERE column LIKE '%PostgreSQL%';

It's recommended that you migrate from LIKE to &@~, an operator for full text search, because &@~ is faster than LIKE.

Are you interested in PGroonga? Please install it and try the tutorial to learn about all PGroonga features.

You can install PGroonga easily because PGroonga provides packages for major platforms. There are also binaries for Windows.

Highlight

PGroonga 3.0.0 has a backward incompatible change, but most PGroonga users, including Supabase users, will not be affected. So you can upgrade to PGroonga 3.0.0 as usual:

ALTER EXTENSION pgroonga UPDATE;

If you're a PGroonga 1.Y.Z user, the backward incompatible change in PGroonga 3.0.0 affects you. You need to stop using the pgroonga schema and use pgroonga_* in the current schema instead.

Here are the highlights since PGroonga 2.4.7:

  • Added support for the query_allow_column=true index option in pgroonga_query_extract_keywords(). You can pass an index name to pgroonga_query_extract_keywords() to use the query_allow_column=true index option.

  • Added support for using query language for term search against text[] and varchar[] columns. See the &=? operator for details.

How to upgrade

If you're using PGroonga 2.0.0 or later, you can upgrade by following the steps in "Compatible case" in the Upgrade document.

If you're using PGroonga 1.Y.Z, you can upgrade by following the steps in "Incompatible case" in the Upgrade document.

Support service

If you need commercial support for PGroonga, contact us.

Conclusion

A new PGroonga major version has been released. PGroonga 3 provides a more PostgreSQL friendly interface.

Try PGroonga when you want to perform fast full text search against all languages on PostgreSQL!