BloGroonga

2021-10-04

PGroonga (fast full text search module for PostgreSQL) 2.3.2 has been released

PGroonga 2.3.2 has been released! PGroonga makes PostgreSQL a fast full text search platform for all languages.

If you are a new user, see also About PGroonga.

In this release, we added support for PostgreSQL 14, which was just released!

Highlight

Here are the highlights in PGroonga 2.3.2:

  • Added support for PostgreSQL 14.

  • Added support for parallel scan.

  • Added support for parallel scan against declarative partitioning.

  • [CREATE INDEX USING PGroonga] Added index_flags_mapping option that can be used to customize index flags for each indexed target.

  • [CREATE INDEX USING PGroonga] Added support for ${table:INDEX_NAME} substitution in normalizers_mapping option.

  • [Ubuntu] Added support for Ubuntu 21.04.

  • [pgroonga_highlight_html function] Fixed a bug that a lexicon may not be updated when we recreate it.

How to upgrade

This version is compatible with previous versions. You can upgrade by following the steps in the "Compatible case" section of the Upgrade document.

Announce

Session

This tutorial session is for people who have already used PGroonga. We will introduce how to improve search results by using PGroonga.

Conclusion

Try PGroonga when you want to perform fast full text search for all languages on PostgreSQL!

2021-09-29

Groonga 11.0.7 has been released

Groonga 11.0.7 has been released!

How to install: Install

Changes

Here are important changes in this release:

Improvements

  • load Added support for casting a string like "[int, int,…]" to a vector of integers like [int, int,…].

    For example, Groonga handles the value as a vector of integers, [1, -2], even if we load it as a string, "[1, -2]", as below.

      table_create Data TABLE_NO_KEY
      column_create Data numbers COLUMN_VECTOR Int16
      table_create Numbers TABLE_PAT_KEY Int16
      column_create Numbers data_numbers COLUMN_INDEX Data numbers
    
      load --table Data
      [
      {"numbers": "[1, -2]"},
      {"numbers": "[-3, 4]"}
      ]
    
      dump   --dump_plugins no   --dump_schema no
      load --table Data
      [
      ["_id","numbers"],
      [1,[1,-2]],
      [2,[-3,4]]
      ]
    
      column_create Numbers data_numbers COLUMN_INDEX Data numbers
      select Data --filter 'numbers @ -2'
      [[0,0.0,0.0],[[[1],[["_id","UInt32"],["numbers","Int16"]],[1,[1,-2]]]]]
    

    This feature supports the following types.

    • Int8
    • UInt8
    • Int16
    • UInt16
    • Int32
    • UInt32
    • Int64
    • UInt64
  • load Added support for loading a JSON array expressed as a text string as a vector of strings.

    For example, if we load a JSON array expressed as a text string such as "[\"hello\", \"world\"]", Groonga handles it as a vector with two elements, ["hello", "world"], as below.

      table_create Data TABLE_NO_KEY
      [[0,0.0,0.0],true]
      column_create Data strings COLUMN_VECTOR ShortText
      [[0,0.0,0.0],true]
      table_create Terms TABLE_PAT_KEY ShortText   --normalizer NormalizerNFKC130   --default_tokenizer TokenNgram
      [[0,0.0,0.0],true]
      column_create Terms data_strings COLUMN_INDEX Data strings
      [[0,0.0,0.0],true]
      load --table Data
      [
      {"strings": "[\"Hello\", \"World\"]"},
      {"strings": "[\"Good-bye\", \"World\"]"}
      ]
      [[0,0.0,0.0],2]
      dump   --dump_plugins no   --dump_schema no
      load --table Data
      [
      ["_id","strings"],
      [1,["Hello","World"]],
      [2,["Good-bye","World"]]
      ]
    
      column_create Terms data_strings COLUMN_INDEX Data strings
      select Data --filter 'strings @ "bye"'
      [
        [
          0,
          0.0,
          0.0
        ],
        [
          [
            [
              1
            ],
            [
              [
                "_id",
                "UInt32"
              ],
              [
                "strings",
                "ShortText"
              ]
            ],
            [
              2,
              [
                "Good-bye",
                "World"
              ]
            ]
          ]
        ]
      ]
    

    In earlier versions, Groonga handled such a text string as a vector with a single element, ["[\"hello\", \"world\"]"].

  • [Documentation] Added documentation about previously missing items.

  • Updated the version of Apache Arrow that Groonga requires to 3.0.0.

Fixes

  • Fixed a memory leak that occurred when we created a table with a tokenizer that had an invalid option.

  • Fixed a bug that a new entry may not be added to a hash table.

    This bug only occurs in Groonga 11.0.6, and it may occur if we add and delete a lot of data. If this bug occurs in your environment, you can resolve the problem by executing the following steps.

    1. Upgrade Groonga from 11.0.6 to 11.0.7 or later.
    2. Make a new table that has the same schema as the original table.
    3. Copy the data from the original table to the new table.
  • [Windows] Fixed a resource leak that occurred when Groonga failed to open a new file because it ran out of memory.

Known Issues

  • Currently, Groonga has a bug where data may be corrupted when we execute many additions, deletions, and updates against a vector column.

  • [The browser based administration tool] Currently, Groonga has a bug where a search query entered in non-administration mode is sent even if we check the checkbox for administration mode on the record list.

  • *< and *> are only valid when we use query() on the right side of a filter condition. If we specify them as below, *< and *> work as &&.

    • 'content @ "Groonga" *< content @ "Mroonga"'
  • Groonga may not return records that should match because of GRN_II_CURSOR_SET_MIN_ENABLE.

Conclusion

Please refer to the following news for more details.

News Release 11.0.7

Let's search by Groonga!

2021-08-29

Groonga 11.0.6 has been released

Groonga 11.0.6 has been released!

How to install: Install

Important notice

Groonga 11.0.6 has a bug where a new entry may not be added to a hash table.

We fixed this bug in Groonga 11.0.7. This bug only occurs in Groonga 11.0.6. Therefore, if you are using Groonga 11.0.6, we highly recommend that you upgrade to Groonga 11.0.7 or later.

Changes

Here are important changes in this release:

Improvements

  • Added support for recovering on crash. (experimental)

    This is an experimental feature. Currently, this feature is still not stable.

    If Groonga crashes, it recovers the database automatically when it opens the database for the first time after the crash. However, this feature can't recover the database automatically in all crash cases. Depending on the timing of the crash, we may still need to recover the database manually even if this feature is enabled.

  • cache_limit Groonga removes the query cache when we execute cache_limit 0.

    Groonga stores its query cache in an internal table. Because this table is a hash table, the maximum total size of its keys is 4GiB. Therefore, if we execute many huge queries, Groonga may become unable to store the query cache because the total key size may exceed 4GiB. In such cases, we can clear the query cache table by executing cache_limit 0 so that Groonga can store the query cache again.
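
    As a reference, here is a minimal hedged sketch of clearing the cache as described above; restoring the limit to 100 afterwards assumes that 100 is the default cache limit.

      # Remove all current query cache entries by setting the limit to 0.
      cache_limit 0
      # Restore a limit (100 is assumed to be the default) so that Groonga
      # can store new query cache entries again.
      cache_limit 100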

Fixes

  • Fixed a bug that Groonga didn't clear a lock when multiple threads opened the same object at around the same time.

  • query_parallel_or Fixed a bug that the result may be different from that of query().

    For example, if we used query("tags || tags2", "beginner man"), the following record matched, but if we used query_parallel_or("tags || tags2", "beginner man"), the following record didn't match until now.

      {"_key": "Bob",   "comment": "Hey!",       "tags": ["expert", "man"], "tags2": ["beginner"]}
    

    With this fix, the above record matches even when we use query_parallel_or("tags || tags2", "beginner man").
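
    For reference, here is a minimal hedged sketch of this case; the schema and index definitions are assumptions based on the record shown above.

      table_create Users TABLE_HASH_KEY ShortText
      column_create Users comment COLUMN_SCALAR ShortText
      column_create Users tags COLUMN_VECTOR ShortText
      column_create Users tags2 COLUMN_VECTOR ShortText

      # An assumed multi-column index over both tag columns; sequential
      # search would also work for this small example.
      table_create Terms TABLE_PAT_KEY ShortText \
        --default_tokenizer TokenBigram \
        --normalizer NormalizerAuto
      column_create Terms users_tags COLUMN_INDEX|WITH_POSITION|WITH_SECTION Users tags,tags2

      load --table Users
      [
      {"_key": "Bob",   "comment": "Hey!",       "tags": ["expert", "man"], "tags2": ["beginner"]}
      ]

      # "beginner" matches tags2 and "man" matches tags, so with this fix
      # both calls are expected to return the "Bob" record.
      select Users --filter 'query("tags || tags2", "beginner man")'
      select Users --filter 'query_parallel_or("tags || tags2", "beginner man")'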

Known Issues

  • Currently, Groonga has a bug where data may be corrupted when we execute many additions, deletions, and updates against a vector column.

  • [The browser based administration tool] Currently, Groonga has a bug where a search query entered in non-administration mode is sent even if we check the checkbox for administration mode on the record list.

  • *< and *> are only valid when we use query() on the right side of a filter condition. If we specify them as below, *< and *> work as &&.

    • 'content @ "Groonga" *< content @ "Mroonga"'
  • Groonga may not return records that should match because of GRN_II_CURSOR_SET_MIN_ENABLE.

Conclusion

Please refer to the following news for more details.

News Release 11.0.6

Let's search by Groonga!

2021-07-29

Groonga 11.0.5 has been released

Groonga 11.0.5 has been released!

How to install: Install

Changes

Here are important changes in this release:

Improvements

  • Normalizers Added support for multiple normalizers.

    Since this release, we can specify multiple normalizers with the --normalizers option when we create a table. For compatibility, we can also specify them with the existing --normalizer option.

    We added NormalizerTable for customizing a normalizer in Groonga 11.0.4. By combining NormalizerTable with an existing normalizer, we can control normalization behavior more flexibly.

    For example, this feature is useful in the following case.

    • Searching for telephone numbers in data imported from handwritten sources via OCR. If data is handwritten, OCR may confuse a number and a letter (e.g. 5 and S).

    The details are as follows.

       table_create Normalizations TABLE_PAT_KEY ShortText
       column_create Normalizations normalized COLUMN_SCALAR ShortText
       load --table Normalizations
       [
       {"_key": "s", "normalized": "5"}
       ]
    
    
       table_create Tels TABLE_NO_KEY
       column_create Tels tel COLUMN_SCALAR ShortText
    
       table_create TelsIndex TABLE_PAT_KEY ShortText \
         --normalizers 'NormalizerNFKC130("unify_hyphen_and_prolonged_sound_mark", true), \
                        NormalizerTable("column", "Normalizations.normalized")' \
         --default_tokenizer 'TokenNgram("loose_symbol", true, "loose_blank", true)'
       column_create TelsIndex tel_index COLUMN_INDEX|WITH_SECTION Tels tel
    
       load --table Tels
       [
       {"tel": "03-4S-1234"}
       {"tel": "03-45-9876"}
       ]
    
       select --table Tels \
         --filter 'tel @ "03-45-1234"'
       [
         [
           0,
           1625227424.560146,
           0.0001730918884277344
         ],
         [
           [
             [
               1
             ],
             [
               [
                 "_id",
                 "UInt32"
               ],
               [
                 "tel",
                 "ShortText"
               ]
             ],
             [
               1,
               "03-4S-1234"
             ]
           ]
         ]
       ]
    

    Existing normalizers alone can't handle such a case, but since this release we can handle it by combining NormalizerTable with an existing normalizer.

  • query_parallel_or, query Added support for customizing thresholds for sequential search.

    We can customize the thresholds that decide whether to use sequential search for each query with the following options.

    • {"max_n_enough_filtered_records": xx}

      max_n_enough_filtered_records specifies a number of records. query or query_parallel_or uses sequential search when it seems that the result can be narrowed down to fewer than this number of records.

    • {"enough_filtered_ratio": x.x}

      enough_filtered_ratio specifies a ratio of the total number of records. query or query_parallel_or uses sequential search when it seems that the result can be narrowed down to less than this ratio of the total. For example, if we specify {"enough_filtered_ratio": 0.5}, query or query_parallel_or uses sequential search when it seems that the result can be narrowed down to less than half of the whole.

    The details are as follows.

     ```
     table_create Products TABLE_NO_KEY
     column_create Products name COLUMN_SCALAR ShortText
    
     table_create Terms TABLE_PAT_KEY ShortText --normalizer NormalizerAuto
     column_create Terms products_name COLUMN_INDEX Products name
    
     load --table Products
     [
     ["name"],
     ["Groonga"],
     ["Mroonga"],
     ["Rroonga"],
     ["PGroonga"],
     ["Ruby"],
     ["PostgreSQL"]
     ]
    
     select \
       --table Products \
       --filter 'query("name", "r name:Ruby", {"enough_filtered_ratio": 0.5})'
     ```
    
     ```
     table_create Products TABLE_NO_KEY
     column_create Products name COLUMN_SCALAR ShortText
    
     table_create Terms TABLE_PAT_KEY ShortText --normalizer NormalizerAuto
     column_create Terms products_name COLUMN_INDEX Products name
    
     load --table Products
     [
     ["name"],
     ["Groonga"],
     ["Mroonga"],
     ["Rroonga"],
     ["PGroonga"],
     ["Ruby"],
     ["PostgreSQL"]
     ]
    
     select \
       --table Products \
       --filter 'query("name", "r name:Ruby", {"max_n_enough_filtered_records": 10})'
     ```
    
  • between, in_values Added support for customizing thresholds for sequential search.

    between and in_values have a feature that switches to sequential search when the set of target records has been narrowed down enough.

    The value of GRN_IN_VALUES_TOO_MANY_INDEX_MATCH_RATIO / GRN_BETWEEN_TOO_MANY_INDEX_MATCH_RATIO is used as the threshold that decides whether Groonga executes a sequential search or a search with indexes in such a case.

    Until now, this behavior could only be customized with the following environment variables.

    in_values()

      # Don't use auto sequential search
      GRN_IN_VALUES_TOO_MANY_INDEX_MATCH_RATIO=-1
      # Set threshold to 0.02
      GRN_IN_VALUES_TOO_MANY_INDEX_MATCH_RATIO=0.02
    

    between()

      # Don't use auto sequential search
      GRN_BETWEEN_TOO_MANY_INDEX_MATCH_RATIO=-1
      # Set threshold to 0.02
      GRN_BETWEEN_TOO_MANY_INDEX_MATCH_RATIO=0.02
    

    If we customize the threshold with an environment variable, it applies to all queries, but with this feature we can specify it for each query.

    The details are as follows. We can specify the threshold with the {"too_many_index_match_ratio": x.xx} option. The value type of this option is double.

       table_create Memos TABLE_HASH_KEY ShortText
       column_create Memos timestamp COLUMN_SCALAR Time
    
       table_create Times TABLE_PAT_KEY Time
       column_create Times memos_timestamp COLUMN_INDEX Memos timestamp
    
       load --table Memos
       [
       {"_key": "001", "timestamp": "2014-11-10 07:25:23"},
       {"_key": "002", "timestamp": "2014-11-10 07:25:24"},
       {"_key": "003", "timestamp": "2014-11-10 07:25:25"},
       {"_key": "004", "timestamp": "2014-11-10 07:25:26"},
       {"_key": "005", "timestamp": "2014-11-10 07:25:27"},
       {"_key": "006", "timestamp": "2014-11-10 07:25:28"},
       {"_key": "007", "timestamp": "2014-11-10 07:25:29"},
       {"_key": "008", "timestamp": "2014-11-10 07:25:30"},
       {"_key": "009", "timestamp": "2014-11-10 07:25:31"},
       {"_key": "010", "timestamp": "2014-11-10 07:25:32"},
       {"_key": "011", "timestamp": "2014-11-10 07:25:33"},
       {"_key": "012", "timestamp": "2014-11-10 07:25:34"},
       {"_key": "013", "timestamp": "2014-11-10 07:25:35"},
       {"_key": "014", "timestamp": "2014-11-10 07:25:36"},
       {"_key": "015", "timestamp": "2014-11-10 07:25:37"},
       {"_key": "016", "timestamp": "2014-11-10 07:25:38"},
       {"_key": "017", "timestamp": "2014-11-10 07:25:39"},
       {"_key": "018", "timestamp": "2014-11-10 07:25:40"},
       {"_key": "019", "timestamp": "2014-11-10 07:25:41"},
       {"_key": "020", "timestamp": "2014-11-10 07:25:42"},
       {"_key": "021", "timestamp": "2014-11-10 07:25:43"},
       {"_key": "022", "timestamp": "2014-11-10 07:25:44"},
       {"_key": "023", "timestamp": "2014-11-10 07:25:45"},
       {"_key": "024", "timestamp": "2014-11-10 07:25:46"},
       {"_key": "025", "timestamp": "2014-11-10 07:25:47"},
       {"_key": "026", "timestamp": "2014-11-10 07:25:48"},
       {"_key": "027", "timestamp": "2014-11-10 07:25:49"},
       {"_key": "028", "timestamp": "2014-11-10 07:25:50"},
       {"_key": "029", "timestamp": "2014-11-10 07:25:51"},
       {"_key": "030", "timestamp": "2014-11-10 07:25:52"},
       {"_key": "031", "timestamp": "2014-11-10 07:25:53"},
       {"_key": "032", "timestamp": "2014-11-10 07:25:54"},
       {"_key": "033", "timestamp": "2014-11-10 07:25:55"},
       {"_key": "034", "timestamp": "2014-11-10 07:25:56"},
       {"_key": "035", "timestamp": "2014-11-10 07:25:57"},
       {"_key": "036", "timestamp": "2014-11-10 07:25:58"},
       {"_key": "037", "timestamp": "2014-11-10 07:25:59"},
       {"_key": "038", "timestamp": "2014-11-10 07:26:00"},
       {"_key": "039", "timestamp": "2014-11-10 07:26:01"},
       {"_key": "040", "timestamp": "2014-11-10 07:26:02"},
       {"_key": "041", "timestamp": "2014-11-10 07:26:03"},
       {"_key": "042", "timestamp": "2014-11-10 07:26:04"},
       {"_key": "043", "timestamp": "2014-11-10 07:26:05"},
       {"_key": "044", "timestamp": "2014-11-10 07:26:06"},
       {"_key": "045", "timestamp": "2014-11-10 07:26:07"},
       {"_key": "046", "timestamp": "2014-11-10 07:26:08"},
       {"_key": "047", "timestamp": "2014-11-10 07:26:09"},
       {"_key": "048", "timestamp": "2014-11-10 07:26:10"},
       {"_key": "049", "timestamp": "2014-11-10 07:26:11"},
       {"_key": "050", "timestamp": "2014-11-10 07:26:12"}
       ]
    
       select Memos \
         --filter '_key == "003" && \
                   between(timestamp, \
                           "2014-11-10 07:25:24", \
                           "include", \
                           "2014-11-10 07:27:26", \
                           "exclude", \
                           {"too_many_index_match_ratio": 0.03})'
    
       table_create Tags TABLE_HASH_KEY ShortText
    
       table_create Memos TABLE_HASH_KEY ShortText
       column_create Memos tag COLUMN_SCALAR Tags
    
       load --table Memos
       [
       {"_key": "Rroonga is fast!", "tag": "Rroonga"},
       {"_key": "Groonga is fast!", "tag": "Groonga"},
       {"_key": "Mroonga is fast!", "tag": "Mroonga"},
       {"_key": "Groonga sticker!", "tag": "Groonga"},
       {"_key": "Groonga is good!", "tag": "Groonga"}
       ]
    
       column_create Tags memos_tag COLUMN_INDEX Memos tag
    
       select \
         Memos \
         --filter '_id >= 3 && \
                   in_values(tag, \
                            "Groonga", \
                            {"too_many_index_match_ratio": 0.7})' \
         --output_columns _id,_score,_key,tag
    
  • between Added support for GRN_EXPR_OPTIMIZE=yes.

    between() now supports optimizing the order of evaluation of a conditional expression.
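
    A minimal hedged sketch follows; the schema is an assumption, and GRN_EXPR_OPTIMIZE=yes must be set in Groonga's environment before it starts.

      # Assumes the following environment variable is exported before
      # starting Groonga:
      #   GRN_EXPR_OPTIMIZE=yes
      table_create Logs TABLE_NO_KEY
      column_create Logs timestamp COLUMN_SCALAR Time
      column_create Logs message COLUMN_SCALAR ShortText

      table_create Times TABLE_PAT_KEY Time
      column_create Times logs_timestamp COLUMN_INDEX Logs timestamp

      load --table Logs
      [
      {"timestamp": "2021-07-28 10:00:00", "message": "start"},
      {"timestamp": "2021-07-28 11:00:00", "message": "stop"}
      ]

      # With the optimizer enabled, the evaluation order of the between()
      # condition and the other condition may be rearranged automatically.
      select Logs \
        --filter 'between(timestamp, \
                          "2021-07-28 09:00:00", "include", \
                          "2021-07-28 10:30:00", "exclude") && \
                  message @ "start"'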

  • query_parallel_or, query Added support for specifying a group of match_columns as a vector.

    We can use a vector in the match_columns argument of query and query_parallel_or as below.

       table_create Users TABLE_NO_KEY
       column_create Users name COLUMN_SCALAR ShortText
       column_create Users memo COLUMN_SCALAR ShortText
       column_create Users tag COLUMN_SCALAR ShortText
    
       table_create Terms TABLE_PAT_KEY ShortText \
         --default_tokenizer TokenNgram \
         --normalizer NormalizerNFKC130
       column_create Terms name COLUMN_INDEX|WITH_POSITION Users name
       column_create Terms memo COLUMN_INDEX|WITH_POSITION Users memo
       column_create Terms tag COLUMN_INDEX|WITH_POSITION Users tag
    
       load --table Users
       [
       {"name": "Alice", "memo": "Groonga user", "tag": "Groonga"},
       {"name": "Bob",   "memo": "Rroonga user", "tag": "Rroonga"}
       ]
    
       select Users \
         --output_columns _score,name \
         --filter 'query(["name * 100", "memo", "tag * 10"], \
                         "Alice OR Groonga")'
    
  • select Added support for section and weight in prefix search.

    We can use a multi-column index and adjust scores in prefix search.

       table_create Memos TABLE_NO_KEY
       column_create Memos title COLUMN_SCALAR ShortText
       column_create Memos tags COLUMN_VECTOR ShortText
    
       table_create Terms TABLE_PAT_KEY ShortText
       column_create Terms index COLUMN_INDEX|WITH_SECTION Memos title,tags
    
       load --table Memos
       [
       {"title": "Groonga", "tags": ["Groonga"]},
       {"title": "Rroonga", "tags": ["Groonga", "Rroonga", "Ruby"]},
       {"title": "Mroonga", "tags": ["Groonga", "Mroonga", "MySQL"]}
       ]
    
       select Memos \
         --match_columns "Terms.index.title * 2" \
         --query 'G*' \
         --output_columns title,tags,_score
       [
         [
           0,
           0.0,
           0.0
         ],
         [
           [
             [
               1
             ],
             [
               [
                 "title",
                 "ShortText"
               ],
               [
                 "tags",
                 "ShortText"
               ],
               [
                 "_score",
                 "Int32"
               ]
             ],
             [
               "Groonga",
               [
                 "Groonga"
               ],
               2
             ]
           ]
         ]
       ]
    
  • grndb Added support for closing used objects immediately in grndb recover.

    This reduces memory usage. It may decrease performance, but the decrease should be acceptable.

    Note that grndb check doesn't close used objects immediately yet.
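
    For reference, a hedged sketch of invoking grndb from a shell; the database path is a placeholder.

      # grndb recover now closes the objects it used as soon as possible,
      # which keeps peak memory usage low while recovering.
      grndb recover /path/to/db/groonga.db

      # grndb check does not close used objects immediately yet.
      grndb check /path/to/db/groonga.db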

  • query_parallel_or, query Added support for specifying scorer_tf_idf in match_columns as below.

       table_create Tags TABLE_HASH_KEY ShortText
    
       table_create Users TABLE_HASH_KEY ShortText
       column_create Users tags COLUMN_VECTOR Tags
    
       load --table Users
       [
       {"_key": "Alice",
        "tags": ["beginner", "active"]},
       {"_key": "Bob",
        "tags": ["expert", "passive"]},
       {"_key": "Chris",
        "tags": ["beginner", "passive"]}
       ]
    
       column_create Tags users COLUMN_INDEX Users tags
    
       select Users \
         --output_columns _key,_score \
         --sort_keys _id \
         --command_version 3 \
         --filter 'query_parallel_or("scorer_tf_idf(tags)", \
                                     "beginner active")'
       {
         "header": {
           "return_code": 0,
           "start_time": 0.0,
           "elapsed_time": 0.0
         },
         "body": {
           "n_hits": 1,
           "columns": [
             {
               "name": "_key",
               "type": "ShortText"
             },
             {
               "name": "_score",
               "type": "Float"
             }
           ],
           "records": [
             [
               "Alice",
               2.098612308502197
             ]
           ]
         }
       }
    
  • query_expand Added support for weighted increment, decrement, and negative weights.

    We can specify a weight for expanded words.

    If we want to increment the score, we use >. If we want to decrement the score, we use <.

    We can specify the score quantity as a number. We can also use negative numbers.

       table_create TermExpansions TABLE_NO_KEY
       column_create TermExpansions term COLUMN_SCALAR ShortText
       column_create TermExpansions expansions COLUMN_VECTOR ShortText
    
       load --table TermExpansions
       [
       {"term": "Rroonga", "expansions": ["Rroonga", "Ruby Groonga"]}
       ]
    
       query_expand TermExpansions "Groonga <-0.2Rroonga Mroonga" \
         --term_column term \
         --expanded_term_column expansions
       [[0,0.0,0.0],"Groonga <-0.2((Rroonga) OR (Ruby Groonga)) Mroonga"]
    
  • [httpd] Updated bundled nginx to 1.21.1.

  • Updated bundled Apache Arrow to 5.0.0.

  • Ubuntu Dropped Ubuntu 20.10 (Groovy Gorilla) support.

    • Because Ubuntu 20.10 reached EOL on July 22, 2021.

Fixes

  • query_parallel_or, query Fixed a bug that if we specified query_options along with other options, the other options were ignored.

    For example, "default_operator": "OR" option had been ignored in the following case.

       plugin_register token_filters/stop_word
    
       table_create Memos TABLE_NO_KEY
       column_create Memos content COLUMN_SCALAR ShortText
    
       table_create Terms TABLE_PAT_KEY ShortText \
         --default_tokenizer TokenBigram \
         --normalizer NormalizerAuto \
         --token_filters TokenFilterStopWord
       column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
       column_create Terms is_stop_word COLUMN_SCALAR Bool
    
       load --table Terms
       [
       {"_key": "and", "is_stop_word": true}
       ]
    
       load --table Memos
       [
       {"content": "Hello"},
       {"content": "Hello and Good-bye"},
       {"content": "and"},
       {"content": "Good-bye"}
       ]
    
       select Memos \
         --filter 'query_parallel_or( \
                     "content", \
                     "Hello and", \
                     {"default_operator": "OR", \
                      "options": {"TokenFilterStopWord.enable": false}})' \
         --match_escalation_threshold -1 \
         --sort_keys -_score
       [
         [
           0,
           0.0,
           0.0
         ],
         [
           [
             [
               1
             ],
             [
               [
                 "_id",
                 "UInt32"
               ],
               [
                 "content",
                 "ShortText"
               ]
             ],
             [
               2,
               "Hello and Good-bye"
             ]
           ]
         ]
       ]
    

Known Issues

  • Currently, Groonga has a bug where data may be corrupted when we execute many additions, deletions, and updates against a vector column.

  • [The browser based administration tool] Currently, Groonga has a bug where a search query entered in non-administration mode is sent even if we check the checkbox for administration mode on the record list.

  • *< and *> are only valid when we use query() on the right side of a filter condition. If we specify them as below, *< and *> work as &&.

    • 'content @ "Groonga" *< content @ "Mroonga"'
  • If we repeatedly remove data and load it again, Groonga may not return records that should match.

Conclusion

Please refer to the following news for more details.

News Release 11.0.5

Let's search by Groonga!

2021-06-29

Groonga 11.0.4 has been released

Groonga 11.0.4 has been released!

How to install: Install

Changes

Here are important changes in this release:

Improvements

  • [Normalizer] Added support for customized normalizers.

  • Added a new command object_warm.

    This command loads Groonga's DB into the OS's page cache.

    If Groonga has not been started since the OS started up, Groonga's DB is not in the OS's page cache when Groonga runs for the first time. Therefore, the first operation against Groonga is slow.

    If we execute this command in advance, the first operation against Groonga is fast. On Linux, we can achieve the same effect by executing cat *.db > /dev/null. However, until now we could not do the same thing on Windows.

    By using this command, we can load Groonga's DB into the OS's page cache on both Linux and Windows. We can also do this per table, column, or index. Therefore, we can load only the tables, columns, and indexes that we use often into the OS's page cache.
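
    For example, here is a hedged sketch; the object names are hypothetical, and warming the whole database when --name is omitted is an assumption.

      # Warming the whole database (assumed to be the behavior when --name
      # is omitted).
      object_warm

      # Warming only a specific table, column, or index (hypothetical names).
      object_warm --name Memos
      object_warm --name Memos.content
      object_warm --name Terms.memos_content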

  • select Added support for adjusting the score of a specific record in --filter.

    We can adjust the score of a specific record by using an operator named *~. *~ is a logical operator like && and ||. Therefore, we can use *~ in the same way as && and ||. The default weight of *~ is -1.

    Therefore, for example, 'content @ "Groonga" *~ content @ "Mroonga"' means the following operations.

    1. Extract records that match 'content @ "Groonga"' and 'content @ "Mroonga"'.
    2. Add a score as below.
    a. Calculate the score of 'content @ "Groonga"'.
    b. Calculate the score of 'content @ "Mroonga"'.
    c. b's score is multiplied by -1 by *~.
    d. The score of this record is a + c.
       Therefore, if a's score is 1 and b's score is 1, the score of this record is 1 + (1 * -1) = 0.
    

    We can also specify the score quantity with *~${score_quantity}.

    For example, the following query adjusts the score of matched records with the condition 'content @ "Groonga" *~2.5 content @ "Mroonga"'.

     ```
     table_create Memos TABLE_NO_KEY
     column_create Memos content COLUMN_SCALAR ShortText
    
     table_create Terms TABLE_PAT_KEY ShortText \
       --default_tokenizer TokenBigram \
       --normalizer NormalizerAuto
     column_create Terms index COLUMN_INDEX|WITH_POSITION Memos content
    
     load --table Memos
     [
     {"content": "Groonga is a full text search engine."},
     {"content": "Rroonga is the Ruby bindings of Groonga."},
     {"content": "Mroonga is a MySQL storage engine based of Groonga."}
     ]
    
     select Memos \
       --command_version 3 \
       --filter 'content @ "Groonga" *~2.5 content @ "Mroonga"' \
       --output_columns 'content, _score' \
       --sort_keys -_score,_id
     {
       "header": {
         "return_code": 0,
         "start_time": 1624605205.641078,
         "elapsed_time": 0.002965450286865234
       },
       "body": {
         "n_hits": 3,
         "columns": [
           {
             "name": "content",
             "type": "ShortText"
           },
           {
             "name": "_score",
             "type": "Float"
           }
         ],
         "records": [
           [
             "Groonga is a full text search engine.",
             1.0
           ],
           [
             "Rroonga is the Ruby bindings of Groonga.",
             1.0
           ],
           [
             "Mroonga is a MySQL storage engine based of Groonga.",
             -1.5
           ]
         ]
       }
     }
     ```
    

    We can also do the same with adjuster. However, with adjuster we need to build both a --filter condition and an --adjuster condition in our application, while with this improvement we only need to build a --filter condition.

    We can also describe the filter condition as below by using query().

    • --filter 'content @ "Groonga" *~2.5 content @ "Mroonga"'
  • select Added support for && with weight.

    We can use && with a weight by using *< or *>. The default weight of *< is 0.5. The default weight of *> is 2.0.

    We can specify the score quantity with *<${score_quantity} and *>${score_quantity}. If we specify *<${score_quantity}, the sign of ${score_quantity} is reversed.

    For example, 'content @ "Groonga" *<2.5 query("content", "MySQL")' is as below.

    1. Extract records that match 'content @ "Groonga"' and 'query("content", "MySQL")'.
    2. Add a score as below.
    a. Calculate the score of 'content @ "Groonga"'.
    b. Calculate the score of 'query("content", "MySQL")'.
    c. b's score is multiplied by -2.5 by *<.
    d. The score of this record is a + c.
       Therefore, if a's score is 1 and b's score is 1, the score of this record is 1 + (1 * -2.5) = -1.5.
    

    For example, the following query adjusts the score of matched records with the condition 'content @ "Groonga" *<2.5 query("content", "Mroonga")'.

     ```
     table_create Memos TABLE_NO_KEY
     column_create Memos content COLUMN_SCALAR ShortText
    
     table_create Terms TABLE_PAT_KEY ShortText \
       --default_tokenizer TokenBigram \
       --normalizer NormalizerAuto
     column_create Terms index COLUMN_INDEX|WITH_POSITION Memos content
    
     load --table Memos
     [
     {"content": "Groonga is a full text search engine."},
     {"content": "Rroonga is the Ruby bindings of Groonga."},
     {"content": "Mroonga is a MySQL storage engine based of Groonga."}
     ]
    
     select Memos \
       --command_version 3 \
       --filter 'content @ "Groonga" *<2.5 query("content", "Mroonga")' \
       --output_columns 'content, _score' \
       --sort_keys -_score,_id
     {
       "header": {
         "return_code": 0,
         "start_time": 1624605205.641078,
         "elapsed_time": 0.002965450286865234
       },
       "body": {
         "n_hits": 3,
         "columns": [
           {
             "name": "content",
             "type": "ShortText"
           },
           {
             "name": "_score",
             "type": "Float"
           }
         ],
         "records": [
           [
             "Groonga is a full text search engine.",
             1.0
           ],
           [
             "Rroonga is the Ruby bindings of Groonga.",
             1.0
           ],
           [
             "Mroonga is a MySQL storage engine based of Groonga.",
             -1.5
           ]
         ]
       }
     }
     ```
    
  • Log Added support for outputting to stdout and stderr.

    The process log and the query log now support output to stdout and stderr.

    • If we specify --log-path - or --query-log-path -, Groonga outputs the log to stdout.
    • If we specify --log-path + or --query-log-path +, Groonga outputs the log to stderr.

    The process log is for all of Groonga's work. The query log is just for query processing.

    This feature is useful when we run Groonga on Docker. Docker records stdout and stderr by default. Therefore, we don't need to log in to the Docker environment to get Groonga's logs.

    For example, this feature is useful in the following case.

    • If we want to analyze slow queries of Groonga on Docker.

      If we specify --query-log-path - when starting Groonga, we can analyze slow queries just by executing the following command.

      • docker logs ${container_name} | groonga-query-log-analyze

    In this way, we can easily analyze slow queries using the query log output from Groonga on Docker.

  • [Documentation] Filled in missing documentation for string_substring.

Known Issues

  • Currently, Groonga has a bug where data may be corrupted when we execute many additions, deletions, and updates against a vector column.

  • [The browser based administration tool] Currently, Groonga has a bug where a search query entered in non-administration mode is sent even if we check the checkbox for administration mode on the record list.

  • *< and *> are only valid when we use query() on the right side of a filter condition. If we specify them as below, *< and *> work as &&.

    • 'content @ "Groonga" *< content @ "Mroonga"'

Conclusion

Please refer to the following news for more details.

News Release 11.0.4

Let's search by Groonga!