BloGroonga

2017-10-10

PGroonga (fast full text search module for PostgreSQL) 2.0.2 has been released

PGroonga 2.0.2 has been released! PGroonga makes PostgreSQL fast full text search for all languages. This is major version up!

About PGroonga

I will explain about PGroonga. Because it is the first release announcement after the major version upgrade. The highlight is summarized after this.

PGroonga is a PostgreSQL extension that makes PostgreSQL fast full text search platform for all languages. There are some PostgreSQL extensions that improves full text search feature of PostgreSQL. PGroonga provides rich full text search related features and is very fast. Because PGroonga uses Groonga that is a full-fledged full text search engine as backend.

Speed

PGroonga is faster than pg_bigm. PGroonga is faster than textsearch contained in PostgreSQL. PGroonga is 10 times over faster about index creation and full text search.

Here are benchmark results between PGroonga and pg_bigm. They use Japanese Wikipedia data.

Extension Index creation time
PGroonga About 19m
pg_bigm About 33m

In this case, PGroonga is about 2 times faster than pg_bigm.

Next is benchmark results between PGroonga and pg_bigm. They use English Wikipedia data. Because textsearch isn't support Japanese.

You can't comparison directly against the above result. Because the amount of data is different.

Extension Index creation time
PGroonga About 1h 24m
textsearch About 2h 53m

In this case, PGroonga is about 2 times faster than textsearch.

Here is a benchmark result for full text search between PGroonga and pg_bigm:

Search keywords N hits PGroonga pg_bigm
"PostgreSQL" or "MySQL" About 300 About 2ms About 49ms
データベース(database in Japanese) About 15000 About 49ms About 1300ms
テレビアニメ(TV animation in Japanese) About 20000 About 65ms About 2800ms
日本(Japan in Japanese) About 530000 About 560ms About 480ms

In "日本" (Japan in Japanese) case, pg_bigm is a bit faster(*1) than PGroonga. But PGroonga is 24 times to 43 times faster than pg_bigm in other cases. The result shows that PGroonga can perform stable high performance fast full text search against all keywords.

(*1) pg_bigm can perform faster full text search against keywords that have 2 or less characters rather than keywords that have 3 or more characters.

Here is a benchmark result for full text search between PGroonga and textsearch:

Search keywords N hits PGroonga textsearch Groonga
"PostgreSQL" or "MySQL" About 1600 About 6ms About 3ms About 3ms
database About 210000 About 698ms About 602ms About 19ms
animation About 40000 About 173ms About 1000ms(*2) About 6ms
America About 470000 About 1300ms About 1200ms About 45ms

(*2) textsearch is slow because hit about 420 thousand items (about 10 times of PGroonga) with "animation". The cause is because "animation" is stemmed as "anim".

The search times of PGroonga and textsearch are almost the same. Textsearch is slower in "animation" because it comes from the difference in the number of hits, not the essential search speed difference.

For reference, the rsult of Groonga. Groonga is the full text search engine of PGroonga. As Groonga searches every few tens ms, you can see that the search speed of PGroonga and textsearch is not the speed of full-text search. So, The impact of common overhead in PostgreSQL is great.

Details of benchmark is below reference.

PGroonga provides the following features that aren't provided by other extensions:

  • Normalize feature
  • Custom tokenizer feature
  • Custom tokenfilter feature
  • Search using query language
  • HTML highlight feature
  • HTML snippet feature
  • JSON search
  • Autocomplate feature
  • Synonyms search feature
  • Query expand feature

Normalize feature is a feature that converts different notation texts to unified notation text.

For simple example, both "ポスグレ" (HALFWIDTH KATAKANA) and "ポスグレ" (FULLWIDTH KATAKANA) are converted to "ポスグレ" (FULLWIDTH KATAKANA). ("ポスグレ" is an abbreviation of PostgreSQL in Japanese.) if you search by "ポスグレ"(FULLWIDTH KATAKANA), even texts written in "ポスグレ"(HALFWIDTH KATAKANA) will also be hit.

For more complex example, "㍊"is converted to"ミリバール". This normalization is based on Unicode NFKC

Custom tokenizer feature is a feature that customizes search keyword extraction process (tokenization). If you can custom tokenization, you can control trade-off between search precision and search performance.

For example, if you use "tokenizer that is based of morphological analyzer", you can get better search precision and search performance but may not find some texts.

FYI: There is no other extension that supports morphological analyzer based tokenizer. PGroonga is the only extension that supports it.

Custom tokenfilter feature is a function that you can customize the process of processing keywords extracted with tokenizer. textsearch has the same function with the name dictionary. Both PGroonga and textsearch realize the stemming function with this mechanism.

Search using query language is a function that you can specify AND/OR/NOT search. You can use it in the same grammar as Google.

HTML highlight feature is a function to mark up search keywords with <span class =" keyword "> ... </ span>. The result is safe to use in HTML as it is. It is useful for developing web applications.

HTML snippet feature is a function to return search keyword surrounding text, which is also used in Google search results. Of course, the keyword is marked up with <span class =" keywor d "> ... </ span>. It is also safe to use the result in HTML as it is.

JSON search feature is a function that allows you to flexibly search by registering the entire column of jsonb type as an index. Since it is also possible to search all the texts in JSON in full text, it is easy to save the logs of various structures in JSON format and search later.

Autocomplete feature is a function to supplement an input keyword in a text box for entering a search keyword. It is also implemented by Google. You can complement it by entering in Roman characters just like Google.

Synonyms search feature is a function to search for texts whose contents are similar. It can be used to display similar articles on a blog.

Query expand feature is a function that searches keywords that have the same meaning but different expressions at once. For example, When searching for "牛乳" it will also search for "牛乳" and "ミルク".

About each these function see also PHP manual high speed full text search system made with PostgreSQL and PGroonga

Here are features that will be implemented in the feature. They are already implemented in Groonga.

  • Weight feature

Usage

You can use PGroonga without full text search knowledge. You just create an index and puts a condition into WHERE:

CREATE INDEX index_name ON table USING pgroonga (column);

SELECT * FROM table WHERE column &@~ '全文検索';

You can also use LIKE to use PGroonga. PGroonga provides a feature that performs LIKE with index. LIKE with PGroonga index is faster than LIKE without index. It means that you can improve performance without changing your application that uses the following SQL: But &@~ is faster than LIKE, I recommend you to fix it.

SELECT * FROM table WHERE column LIKE '%全文検索%';

Are you interested in PGroonga? Please install and try tutorial. You can know all PGroonga features.

You can install PGroonga easily. Because PGroonga provides packages for major platforms. There are binaries for Windows.

Highlight

Here are highlights form 1.2.3:

  • PostgreSQL 10 supported

  • Improve accuracy of query execution plan (speedup)

  • pgroonga schema is deprecated

    • You can also use pgroonga scheme. Because PGroonga 2.x is maintain compatibility with PGroonga 1.x.

PostgreSQL 10 supported

PGroonga is supported PostgreSQL 10 by the other day released. You can use PGroonga in latest PostgreSQL!

PGroonga is supported logical replication. logical replication is new feature for PostgreSQL 10. physical replication was able to be replication, but with PostgreSQL 10 compatible, new choices have increased.

Physical replication is disk usage is large, but crash recovery will become effective to some extent.

On the other hand, you can constract more flexible schema by logical replication. For example, master is using only to update purpose and then slave is using only to read purpose. You can make PGroonga's index in only slaves. It makes you can scale out against read while improving to update throughput.

We are planning to benchmark each replication method.

About configuring PostgreSQL cluster If you need commercial support, please refer to PGroonga Support Page.

Improve accuracy of query execution plan (speedup)

Planner of PostgreSQL obtains information from each index including PGroonga, calculates the processing cost, and selects the optimal execution plan. So, If PGroonga return more accurate results, PostgreSQL can selects more high speed plan.

In this release, a STABLE function (if the arguments are the same regardless of the contents of the DB, this function with the same result) or ` IMMUTABLE function (if the arguments are the same within the same transaction, this function with the same result) Improved accuracy of execution plans when searching using results of these function. The representative of the IMMUTABLE function by PGroonga is [ pgroonga_query_expand () `](https://pgroonga.github.io/reference/functions/pgroonga-query-expand.html). This is a function to expand the query and use it for search queries as follows. This makes it easier to select a faster execution plan even when using functions that process search queries.

SELECT *
  FROM diaries
 WHERE content &@~ pgroonga_query_expand('synonyms', 'term', 'synonyms',
                                         'Search query entered by the user');

pgroonga schema is deprecated

PGroonga 1.x defined functions and operator classes under the pgroonga schema, but it was easier to use if you defined under the current schema (in most cases public) with prefixes. So, pgroonga schema is deprecated in PGroonga 2.x. It was defined with a name with pgroonga _ prefix.

pgroonga schema is deprecated. But you can also use pgroonga schema for maintain compatibility. Since at least PGroonga 2.x supports pgroonga schema, please gradually update to usage with pgroonga _ prefix.

How to upgrade

This version is compatible with 1.0 or later. You can upgrade by steps in "Compatible case" in Upgrade document.

Announce

Session

On Friday, November 3, 2017PostgreSQL Conference Japan 2017 there is a session called PGroonga 2.0 - The definitive edition of full-text search in PostgreSQL

On Friday, December 5, 2017PGConf.ASIA 2017 there is session called PGroonga 2.0 – Make PostgreSQL rich full text search system backend!

Both session is about PGroonga 2. Session of PostgreSQL Conference Japan 2017 is for people who are not using PGroonga yet. Session of PGConf.ASIA 2017 is for people who already use PGroonga and developing extension functions.

Support servies

We provide Support of PGroonga. Please consult us when you need support related to PGroonga, such as consulting on index and search design method, investigation at trouble, technical support such as performance improvement, addition of new function etc.

Conclusion

New PGroonga version has been released. PGroonga 2 become it is made to be able to use it more like PostgreSQL.

See also release note for all changes.

Try PGroonga when you want to perform fast full text search against all languages on PostgreSQL!

2017-09-29

Groonga 7.0.7 has been released

Groonga 7.0.7 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • Fixed the case that --query '+' doesn't work for QUERY_NO_SYNTAX_ERROR flag
  • --default-command-version 3 has been supported
  • Caching select result with function call has been supported.

Fixed the case that --query '+' doesn't work for QUERY_NO_SYNTAX_ERROR flag

In the previous version, QUERY_NO_SYNTAX_ERROR flag is introduced. If this flag is set, query never causes syntax error.

But there is a case that it causes an error when this flag is used for --query '+'. In this release, this bug was fixed.

--default-command-version 3 has been supported

In this release, groonga executable now supports --default-command-version 3. In the previous versions, groonga executable only supports --command_version 3 but not for --default-command-version 3.

Caching select result with function call has been supported.

In this release, caching select result with function call feature has been supported.

Now, most of existing functions supports this feature.

But there are two exception. When now() and rand() are used in a query, select result will not be cached.

Conclusion

See Release 7.0.7 2017-09-29 about detailed changes since 7.0.6.

Let's search by Groonga!

2017-08-29

Groonga 7.0.6 has been released

Groonga 7.0.6 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • object_inspect command has been supported to show disk usage
  • Falllback feature when parsing query has been supported
  • The score adjusting about keyword in query has been supported

object_inspect command has been supported to show disk usage

In this release, object_inspect command has been supported to show disk usage.

In the previous versions, there is no easy way to calculate the disk usage about each objects such as tables, index columns and so on.

object_inspect command returns disk_usage parameter in response. It returns size in bytes.

table_create --name Site --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Site --name title --type ShortText
load --table Site
[
{"_key":"http://example.org/","title":"This is test record 1!"},
{"_key":"http://example.net/","title":"test record 2."},
{"_key":"http://example.com/","title":"test test record three."},
{"_key":"http://example.net/afr","title":"test record four."},
{"_key":"http://example.org/aba","title":"test test test record five."},
{"_key":"http://example.com/rab","title":"test test test test record six."},
{"_key":"http://example.net/atv","title":"test test test record seven."},
{"_key":"http://example.org/gat","title":"test test record eight."},
{"_key":"http://example.com/vdw","title":"test test record nine."},
]
table_create --name Terms --flags TABLE_PAT_KEY --key_type ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
column_create --table Terms --name blog_title --flags COLUMN_INDEX|WITH_POSITION --type Site --source title

Execute the following command to check the disk usage about Terms table.

object_inspect --output_pretty yes Terms

Then, object_inspect command returns the following result.

{
  "id": 258,
  "name": "Terms",
  "type": {
    "id": 49,
    "name": "table:pat_key"
  },
  "key": {
    "type": {
      "id": 14,
      "name": "ShortText",
      "type": {
        "id": 32,
        "name": "type"
      },
      "size": 4096
    },
    "total_size": 21,
    "max_total_size": 4294967294
  },
  "value": {
    "type": null
  },
  "n_records": 15,
  "disk_usage": 8437760
}

It turns out that the disk usage is "disk_usage": 8437760.

Let's check the disk usage about index column.

Execute the following command to check blog_title column in Terms table.

object_inspect --output_pretty yes Terms.blog_title

object_inspect command returns the following result.

{
  "id": 259,
  "name": "blog_title",
  "table": {
    "id": 258,
    "name": "Terms",
    "type": {
      "id": 49,
      "name": "table:pat_key"
    },
  (省略)
  ],
  "disk_usage": 5283840
}

It turns out that the disk usage is "disk_usage": 5283840.

Falllback feature when parsing query has been supported

It is enabled when QUERY_NO_SYNTAX_ERROR flag is set to query_flags.

This feature is disabled by default.

If this flag is set, query never causes syntax error. For example, "A +" is parsed and escaped automatically into "A" and "+". This behavior is useful when application uses user input directly and doesn't want to show syntax error to user and in log.

Here is the example how to use QUERY_NO_SYNTAX_ERROR.

table_create --name Magazine --flags TABLE_HASH_KEY --key_type ShortText
column_create --table Magazine --name title --type ShortText
load --table Magazine
[
{"_key":"http://gihyo.jp/magazine/wdpress","title":"WEB+DB PRESS"},
{"_key":"http://gihyo.jp/magazine/SD","title":"Software Design"},
]
table_create --name Terms --flags TABLE_PAT_KEY --key_type ShortText --default_tokenizer TokenBigram --normalizer NormalizerAuto
column_create --table Terms --name title --flags COLUMN_INDEX|WITH_POSITION --type Magazine --source title

Let's search by keyword - WEB +.

select Magazine --output_pretty yes --query 'WEB +' --match_columns title"

It causes an syntax error.

[
  [
    -63,
    1503902587.063566,
    0.0007965564727783203,
    "Syntax error: <WEB +||>",
    [
      [
        "yy_syntax_error",
        "grn_ecmascript.lemon",
        37
      ]
    ]
  ]
]

Let's try with QUERY_NO_SYNTAX_ERROR flag.

select Magazine --output_pretty yes --match_columns title --query 'WEB +'  --query_flags ALLOW_PRAGMA|ALLOW_COLUMN|QUERY_NO_SYNTAX_ERROR

It turns out that there is no syntax error.

[
  [
    0,
    1503902343.382929,
    0.0419621467590332
  ],
  [
    [
      [
        1
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "title",
          "ShortText"
        ]
      ],
      [
        1,
        "http://gihyo.jp/magazine/wdpress",
        "WEB+DB PRESS"
      ]
    ]
  ]
]

With QUERY_NO_SYNTAX_ERROR flag in query, The keyword in above query is parsed into WEB and +. So, it doesn't cause an syntax error.

The score adjusting about keyword in query has been supported

In this release, The feature which adjusts score for term in query has been supported. Actually, >, <, and ~ operators are supported.

For example, >Groonga increments score of Groonga, <Groonga decrements score of Groonga. ~Groonga decreases score of matched document in the current search result. ~ operator doesn't change search result itself.

Here is the sample to show usage.

table_create --name Shops --flags TABLE_NO_KEY
column_create --table Shops --name keyword --type ShortText
load --table Shops
[
{"keyword":"restraunt western food"},
{"keyword":"restraunt japanese food"},
{"keyword":"restraunt chinese food"},
{"keyword":"cafe western food"},
]

Let's search restraunt by the following query.

select Shops --output_pretty yes --match_columns keyword --output_columns keyword,_score --sort_keys -_score --query 'restraunt'

It returns the following result.

[
  [
    3
  ],
  [
    [
      "keyword",
      "ShortText"
    ],
    [
      "_score",
      "Int32"
    ]
  ],
  [
    "restraunt western food",
    1
  ],
  [
    "restraunt chinese food",
    1
  ],
  [
    "restraunt japanese food",
    1
  ]
]

The query returns response which contains same score - 1.

Let's search japanese food with > to adjust score.

select Shops --output_pretty yes --match_columns keyword --output_columns keyword,_score --sort_keys -_score --query 'restraunt (>japanese OR western OR chinese)'

[
  [
    3
  ],
  [
    [
      "keyword",
      "ShortText"
    ],
    [
      "_score",
      "Int32"
    ]
  ],
  [
    "restraunt japanese food",
    8
  ],
  [
    "restraunt chinese food",
    2
  ],
  [
    "restraunt western food",
    2
  ]
]

Now that score of japanese food is largest in the tree restraunt.

Then, try to adjust score with < to raise western food.

select Shops --output_pretty yes --match_columns keyword --output_columns keyword,_score --sort_keys -_score --query 'restraunt (>japanese OR <western OR chinese)'

[
  [
    3
  ],
  [
    [
      "keyword",
      "ShortText"
    ],
    [
      "_score",
      "Int32"
    ]
  ],
  [
    "restraunt japanese food",
    8
  ],
  [
    "restraunt western food",
    7
  ],
  [
    "restraunt chinese food",
    2
  ]
]

As you can see, the score is adjustable by <, and > combination with each keyword.

Conclusion

See Release 7.0.6 2017-08-29 about detailed changes since 7.0.5.

Let's search by Groonga!

2017-07-29

Groonga 7.0.5 has been released

Groonga 7.0.5 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • [dump] --sort_hash_table option has been supported
  • math_abs() function has been supported
  • Ubuntu 16.10 (Yakkety Yak) support has been dropped

[dump] --sort_hash_table option has been supported

In this release, dump command supported --sort_hash_table option.

If this option is specified as --sort_hash_table yes, dump command shows result sorted by _key.

It is useful to check difference about dumped results of hash table.

math_abs() function has been supported

In this release, math_abs() function was supported. This function is used to calculate absolute value.

Register math plugin to use this function.

plugin_register functions/math

Here is the sample schema and data:

table_create City TABLE_HASH_KEY ShortText
column_create City utc COLUMN_SCALAR Int32
  
load --table City
[
{"_key": "Tokyo",    "utc": 9},
{"_key": "New York", "utc": -4}, 
{"_key": "Paris",    "utc": 2}
]

Let's search near city from Singapore. In this context, near means that the value of time difference is smaller one.

You can search it by the following query:

select City \
  --filter true \
  --output_columns '_key, utc, _score' \
  --scorer '_score = math_abs(8 - utc)' \
  --sort_keys _score

Ubuntu 16.10 (Yakkety Yak) support has been dropped

In this release, Ubuntu 16.10 (Yakkety Yak) support was dropped. It has reached EOL on July 20, 2017.

Conclusion

See Release 7.0.5 2017-07-29 about detailed changes since 7.0.4.

Let's search by Groonga!

2017-07-03

PGroonga (fast full text search module for PostgreSQL) 1.2.3 has been released

PGroonga (píːzí:lúnɡά) (fast full text search module for PostgreSQL) 1.2.3 has been released!

See PGroonga 1.0.0 release announce about PGroonga.

Highlight

Here are highlights from 1.2.1:

  • Cleared operator names

  • Supported query expansion

  • Supported auto complete

  • Supported similar search

  • Improved usage of full text search against jsonb

Cleared operator names

PGroonga has used @@ as operator name for full text search with query language. It's the operator name used by textsearch. It's for easy to use for existing PostgreSQL users. But it's not easy to use because there are differences between textserach's @@ and PGroonga's @@.

PGroonga introduced &? as alternative of @@ and deprecated @@. It works well when we run SQL directly on psql, but doesn't work well with client libraries. Because ? is used as placeholder character in client libraries.

Finally, PGroonga introduces &@~ and deprecates @@ and &?.

Existing operator classes that doesn't have _v2 suffix can use &@~. So users can migrate new operator gradually.

Supported query expansion

Query expansion is used for implementing synonym search. (*)

(*) Synonym search is a search to find by meaning rather than surface. For example, you can find "PostgreSQL" with "PostgreSQL", "Postgres" or "PG".

Query expansion rules (synonyms) are managed in normal table. So you can use normal SELECT, INSERT, UPDATE and DELETE to manage them. Textsearch uses synonym dictionary to manage synonyms. So you need to use special way to manage synonyms.

See pgroonga.query_expand function for details.

Supported auto complete

Simple full text search only full text search system is too naive as the recent full text search system. The recent full text search systems have auto complete feature as one of standard features. In this release, PGroonga supports auto complete feature.

Auto complete is implemented by prefix search and full text search. PGroonga also provides Japanese specialized auto complete feature. PGroonga supports auto complete by Romaji.

Complete candidates are managed in normal table. It's easy to maintain.

See how to implement auto complete for details.

Similar search is useful search for providing related documents to users. _v2 operator classes support similar search.

See &~? operator for how to use similar search. TokenMecab tokenizer is recommended for similar search against Japanese documents. It's described at the end of the document. If your documents are written in Japanese, see the document carefully.

Improved usage of full text search against jsonb

You can search all texts in jsonb with old PGroonga but the new PGroonga improves usability. You can do full text search from all texts in jsonb like you do from a text column:

-- For text column
SELECT * FROM memos WHERE content &@~ 'KEYWORD1 OR KEYWORD2';
-- For jsonb column
SELECT * FROM logs WHERE record &@~ 'KEYWORD1 OR KEYWORD2';

See &@~ operator for jsonb for details.

How to upgrade

This version is compatible with 1.0 or later. You can upgrade by steps in "Compatible case" in Upgrade document.

Conclusion

New PGroonga version has been released.

See also release note for all changes.

Try PGroonga when you want to perform fast full text search against all languages on PostgreSQL!