BloGroonga

2017-10-10

PGroonga (fast full text search module for PostgreSQL) 2.0.2 has been released

PGroonga 2.0.2 has been released! PGroonga makes PostgreSQL fast full text search for all languages. This is major version up!

About PGroonga

I will explain about PGroonga. Because it is the first release announcement after the major version upgrade. The highlight is summarized after this.

PGroonga is a PostgreSQL extension that makes PostgreSQL fast full text search platform for all languages. There are some PostgreSQL extensions that improve full text search feature of PostgreSQL. PGroonga provides full text search related rich features. PGroonga is very fast. Because PGroonga uses Groonga that is a real full text search engine as backend.

Performance

PGroonga is faster than pg_bigm. PGroonga is faster than textsearch bundled in PostgreSQL. PGroonga is faster than them about index creation and full text search.

Here is a benchmark result between PGroonga and pg_bigm. They use Japanese Wikipedia data.

Extension Index creation time
PGroonga About 19m
pg_bigm About 33m

In this case, PGroonga is about 2 times faster than pg_bigm.

Here is a benchmark result between PGroonga and textsearch. They use English Wikipedia data. Because textsearch isn't support Japanese.

You can't comparison directly against the above result. Because the amount of data is different.

Extension Index creation time
PGroonga About 1h 24m
textsearch About 2h 53m

In this case, PGroonga is about 2 times faster than textsearch.

Here is a benchmark result for full text search between PGroonga and pg_bigm:

Search keywords N hits PGroonga pg_bigm
"PostgreSQL" or "MySQL" About 300 About 2ms About 49ms
データベース (database in Japanese) About 15000 About 49ms About 1300ms
テレビアニメ (TV animation in Japanese) About 20000 About 65ms About 2800ms
日本 (Japan in Japanese) About 530000 About 560ms About 480ms

In "日本" (Japan in Japanese) case, pg_bigm is a bit faster (*1) than PGroonga. But PGroonga is 24 times to 43 times faster than pg_bigm in other cases. The result shows that PGroonga can perform stable high performance fast full text search against all keywords.

(*1) pg_bigm can perform faster full text search against keywords that have 2 or less characters rather than keywords that have 3 or more characters.

Here is a benchmark result for full text search between PGroonga and textsearch:

Search keywords N hits PGroonga textsearch Groonga
"PostgreSQL" or "MySQL" About 1600 About 6ms About 3ms About 3ms
database About 210000 About 698ms About 602ms About 19ms
animation About 40000 About 173ms About 1000ms (*2) About 6ms
America About 470000 About 1300ms About 1200ms About 45ms

(*2) textsearch is slow because hit about 420 thousand items (about 10 times larger of PGroonga) with "animation". This is caused by stemming. "animation" is stemmed as "anim".

The search performance of PGroonga and textsearch are almost the same. Textsearch is slower in "animation" because it comes from the difference in the number of hits, not the essential search performance difference.

There are Groonga's results as reference. Groonga is the full text search engine of PGroonga. Groonga can search every cases in less than 50ms. It shows that the main processes of PGroonga and textsearch aren't full text search in these cases. It shows there are common overhead in PostgreSQL. It has greater impact than full text search.

You can see more details of these benchmark results:

PGroonga provides the following features that aren't provided by other extensions:

  • Normalize feature
  • Custom tokenizer feature
  • Custom token filter feature
  • Search using query language
  • HTML highlight feature
  • HTML snippet feature
  • JSON search
  • Auto complete feature
  • Similar document search feature
  • Synonym expansion feature

Normalize feature is a feature that unifies different notation texts to an unified notation text.

For simple example, both "POSTGRESQL" (uppercase only) and "PostgreSQL" (mixed case) are converted to "postgresql" (lowercase only). You can search "PostgreSQL" (mixed case) by "postgresql" (lowercase).

For more complex example, "¼" (U+00BC VULGAR FRACTION ONE QUARTER) is converted to "1/4" ("1", "/" and "4"). This normalization is based on Unicode NFKC

Custom tokenizer feature is a feature that customizes search keyword extraction process (tokenization). If you can custom tokenization, you can control trade-off between search precision and search performance.

For example, if you use "tokenizer that is based of character based N-gram" instead of "tokenizer that is based on character and character type based N-gram", you can get better search precision and search performance but may not find some texts. You can search "123" by "2" with character based N-gram but not with character and character type based N-gram.

Custom token filter feature is a function that customizes how to process keywords extracted by tokenizer. Textsearch has the same function with the name dictionary. Both PGroonga and textsearch implements the stemming function with this mechanism.

Search using query language is a function that specifies AND/OR/NOT search by user with a mini language like "A OR (B - 1)". The syntax is similar to Google's one.

HTML highlight feature is a function to mark up search keywords with <span class="keyword">...</span>. The result is safe to use in HTML as it is. It's useful for Web application development.

HTML snippet feature is a function to return texts around search keyword. The feature is used in Google search results too. The keyword is marked up with <span class="keyword">...</span>. It's safe to use the result in HTML as it is.

JSON search feature is a function that searches JSON contents flexible. You can index jsonb type column as is. You don't need to use expression for indexing. You can perform full text search against all texts in JSON. It's useful to insert all logs as JSON that have some different structured and search them later.

Auto complete feature is a function to completes an input in a text box for entering a search keyword. Google implements it too. You can support completing by romaji like Google does.

Similar document search feature is a function to search texts whose contents are similar. You can use this feature to show similar entries in blog system.

Synonym expansion feature is a function that searches keywords that have the same meaning but different expressions. For example, You can search "PostgreSQL" with "PostgreSQL" or "PG".

See reference manual and how to for details of these features.

Here are features that will be implemented in the feature. They are already implemented in Groonga.

  • Weight feature

Usage

You can use PGroonga without full text search knowledge. You just create an index and puts a condition into WHERE:

CREATE INDEX index_name ON table USING pgroonga (column);

SELECT * FROM table WHERE column &@~ 'PostgreSQL';

You can also use LIKE to use PGroonga. PGroonga provides a feature that performs LIKE with index. LIKE with PGroonga index is faster than LIKE without index. It means that you can improve performance without changing your application that uses the following SQL:

SELECT * FROM table WHERE column LIKE '%PostgreSQL%';

It's recommend that you migrate to &@~, an operator for full text search, from LIKE. Because &@~ is faster than LIKE.

Are you interested in PGroonga? Please install and try tutorial. You can know all PGroonga features.

You can install PGroonga easily. Because PGroonga provides packages for major platforms. There are binaries for Windows.

Highlight

Here are highlights after PGroonga 1.2.3:

  • Support PostgreSQL 10

  • Improve accuracy of query execution plan (performance is improved)

  • pgroonga schema is deprecated

    • You can still use pgroonga scheme. Because PGroonga 2.x supports backward compatibility with PGroonga 1.x.

Support PostgreSQL 10

PGroonga supports PostgreSQL 10. You can use PGroonga in the latest PostgreSQL!

PGroonga supports logical replication. Logical replication is a new feature in PostgreSQL 10. You can also use physical replication for replication. You can choose physical replication or logical replication in PostgreSQL 10.

Physical replication uses more disk space, but crash recovery works in many cases.

On the other hand, you can use more flexible schema with logical replication. For example, you can use master only for update and use slaves for all search. You just create PGroonga indexes in slave. It improves update performance and supports scale-out for search.

We'll publish benchmark results for each replication. Stay tuned!

If you need commercial support about PostgreSQL cluster and PGroonga, contact us.

Improve accuracy of query execution plan (performance is improved)

PostgreSQL planner estimates execution cost based on information from each index including PGroonga and selects the best execution plan. If PGroonga returns more accurate information, PostgreSQL can selects more effective execution plan.

In this release, PGroonga improves index information of expression that uses a STABLE function or IMMUTABLE function. PGroonga provides pgroonga_query_expand() as an IMMUTABLE function. It's a function that expands query. It's used like the following. PostgreSQL will select a good execution plan when you use like the following SQL:

SELECT *
  FROM diaries
 WHERE content &@~ pgroonga_query_expand('synonyms', 'term', 'synonyms',
                                         'SEARCH QUERY BY THE USER');

pgroonga schema is deprecated

PGroonga 1.x defines functions and operator classes in the pgroonga schema. Some users say that it's useful to use the current schema (public in most cases) with prefixed names instead of using pgroonga schema. pgroonga schema is deprecated in PGroonga 2.x. PGroonga 2.x defines functions and operator classes with pgroonga _ prefix.

pgroonga schema is deprecated but you can still use pgroonga schema. You can upgrade to PGroonga 2.x safely. pgroonga schema is maintained at least in PGroonga 2.x. Please migrate to pgroonga_ prefixed name gradually.

How to upgrade

This version is compatible with 1.0 or later. You can upgrade by steps in "Compatible case" in Upgrade document.

Announce

Sessions

Both sessions are about PGroonga 2. The session of PostgreSQL Conference Japan 2017 is for people who are not using PGroonga yet. The session of PGConf.ASIA 2017 is for people who already use PGroonga.

Support service

If you need commercial support for PGroonga, contact us.

Conclusion

New PGroonga version has been released. PGroonga 2 provides more PostgreSQL friendly interface.

See also release note for all changes.

Try PGroonga when you want to perform fast full text search against all languages on PostgreSQL!