BloGroonga

2020-09-29

Groonga 10.0.7 has been released

Groonga 10.0.7 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • [highlight], [highlight_full] Added support for normalizer options.

  • return code Added a new return code GRN_CONNECTION_RESET for resetting a connection.

    • It is returned when an existing connection is forcibly closed by the remote host.
  • Dropped Ubuntu 19.10 (Eoan Ermine) support.

    • Because this version has reached EOL.
  • [httpd] Updated bundled nginx to 1.19.2.

  • grndb Added support for detecting duplicate keys; see the sketch after this list.

    • grndb check is also able to detect duplicate keys since this release.
    • This check is valid for all tables except TABLE_NO_KEY tables.
    • If a table in which grndb check detected duplicate keys has only index columns, we can recover it with grndb recover.
  • [table_create], [column_create] Added a new option --path.

  • [dump] Added a new option --dump_paths.

  • Added a new function string_tokenize().

    • It tokenizes the column value that is specified in the second argument with the tokenizer that is specified in the first argument.
  • [tokenizer] Added a new tokenizer TokenDocumentVectorTFIDF (experimental).

    • It automatically generates document vectors by TF-IDF.
  • [tokenizer] Added a new tokenizer TokenDocumentVectorBM25 (experimental).

    • It automatically generates document vectors by BM25.
  • [select] Added support for near search in the same sentence.

  • Fixed a bug that load didn't return a response when we executed it against 257 columns.

    • This bug may occur in 10.0.4 or later.
    • This bug only occurs when we load data by using the [a, b, c, ...] format.

      • If we load data by using [{...}], this bug doesn't occur.
  • [MessagePack] Fixed a bug that a float32 value wasn't unpacked correctly.

  • Fixed the following bugs related to multi column indexes.

    • _score may be broken with full text search.
    • Records that shouldn't hit might hit.
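
For reference, the grndb checks above can be run as below (a minimal sketch; the database path is hypothetical):

    grndb check /var/lib/groonga/db
    grndb recover /var/lib/groonga/db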

[highlight], [highlight_full] Added support for normalizer options

  • We can also specify normalizer options in highlight() and highlight_full().
  • Please refer to the following for the options that can be set.

    • https://groonga.org/docs/reference/normalizers/normalizer_nfkc100.html#parameters
  • For example, we can unify hyphens that have different code points by using unify_hyphen.

    table_create Entries TABLE_NO_KEY
    column_create Entries body COLUMN_SCALAR ShortText
    
    load --table Entries
    [
    {"body": "full-text-search. Use U+002D HYPHEN-MINUS"},
    {"body": "full֊text֊search. Use U+058A ARMENIAN HYPHEN"},
    {"body": "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN"}
    ]
    
    select Entries --output_columns \
      'highlight_full(body, \
                      "NormalizerNFKC121(\\"unify_hyphen\\", true)", \
                      true, \
                      "full-text-search", \
                      "<span class=\\"keyword1\\">", \
                      "</span>")' --output-pretty yes
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            3
          ],
          [
            [
              "highlight_full",
              null
            ]
          ],
          [
            "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS"
          ],
          [
            "<span class=\"keyword1\">full֊text֊search</span>. Use U+058A ARMENIAN HYPHEN"
          ],
          [
            "<span class=\"keyword1\">full˗text˗search</span>. Use U+02D7 MODIFIER LETTER MINUS SIGN"
          ]
        ]
      ]
    ]
    
  • If we don't specify the unify_hyphen option, only {"body": "full-text-search. Use U+002D HYPHEN-MINUS"} is highlighted, as below.

    • Because the other records use a code point different from the hyphen that is included in the search keyword.
    select Entries --output_columns \
      'highlight_full(body, \
                      "NormalizerNFKC121()", \
                      true, \
                      "full-text-search", \
                      "<span class=\\"keyword1\\">", \
                      "</span>")'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            3
          ],
          [
            [
              "highlight_full",
              null
            ]
          ],
          [
            "<span class=\"keyword1\">full-text-search</span>. Use U+002D HYPHEN-MINUS"
          ],
          [
            "full֊text֊search. Use U+058A ARMENIAN HYPHEN"
          ],
          [
            "full˗text˗search. Use U+02D7 MODIFIER LETTER MINUS SIGN"
          ]
        ]
      ]
    ]
    

table_create, column_create Added a new option --path.

  • We can store a specified table or column to any path using this option.

  • This option is useful if we want to store tables or columns that we use often on fast storage (e.g. SSD) and tables or columns that we use rarely on slow storage (e.g. HDD).

  • We can specify both a relative path and an absolute path in this option.

    • If we specify a relative path in this option, the path is resolved with the location of the groonga process as the origin.
  • However, if we specify --path, the result of the dump command includes --path information.

    • Therefore, if we specify --path, we can't restore the dump to a host with a different environment.
    • If we don't want to include --path information in a dump, we need to specify --dump_paths no in the dump command.
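
  • For example, --path can be used as below (a minimal sketch; the paths are hypothetical):

    table_create Logs TABLE_NO_KEY --path /mnt/fast-ssd/logs.grn
    column_create Logs message COLUMN_SCALAR ShortText --path /mnt/slow-hdd/logs.message.grn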

dump Added a new option --dump_paths.

  • The --dump_paths option controls whether --path is dumped or not.

  • Its default value is yes.

  • If we specified --path when we created tables or columns and we don't want to include --path information in a dump, we specify no for --dump_paths when we execute the dump command, as below.
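
    dump --dump_paths no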

select Added support for near search in the same sentence.

  • Until now, the near search couldn't search within the same sentence.

  • It can search in the same sentence as below since this release.

    table_create Memos TABLE_PAT_KEY ShortText
    column_create Memos content COLUMN_SCALAR ShortText
    
    table_create Terms TABLE_PAT_KEY ShortText \
      --default_tokenizer TokenBigram \
      --normalizer NormalizerAuto
    column_create Terms memos_content COLUMN_INDEX|WITH_POSITION Memos content
    
    load --table Memos
    [
    {"_key":"alphabets1", "content": "a c d ."},
    {"_key":"alphabets2", "content": "a b c d e f ."},
    {"_key":"alphabets3", "content": "a b x c d e f ."},
    {"_key":"alphabets4", "content": "a b x x c d e f ."}
    ]
    
    select \
      --table Memos \
      --match_columns content \
      --query '*NP3,-1"a c .$"' \
      --output_columns _score,_key,content
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            2
          ],
          [
            [
              "_score",
              "Int32"
            ],
            [
              "_key",
              "ShortText"
            ],
            [
              "content",
              "ShortText"
            ]
          ],
          [
            1,
            "alphabets1",
            "a c d ."
          ],
          [
            1,
            "alphabets2",
            "a b x c ."
          ]
        ]
      ]
    ]
    
  • We use the following syntax for a near search in the same sentence.

    • *NP${MAX_INTERVAL},${ADDITIONAL_LAST_INTERVAL}"${FIRST_PHRASE} ${LAST_PHRASE} ${SEPARATOR}$"

      • If we specify -1 for ${ADDITIONAL_LAST_INTERVAL}, a record hits when the interval between the first phrase and the last phrase is less than or equal to ${MAX_INTERVAL}.

        • In this case, a record hits no matter how far apart the last phrase and the separator are.
      • If we specify an integer not smaller than 1 for ${ADDITIONAL_LAST_INTERVAL}, a record hits under the following conditions.

        • The interval between the first phrase and the last phrase is less than or equal to ${MAX_INTERVAL}.
        • The interval between the first phrase and the separator is less than or equal to ${MAX_INTERVAL}+${ADDITIONAL_LAST_INTERVAL}.
      • If we specify 0 for ${ADDITIONAL_LAST_INTERVAL}, the near search behaves the same as before.

        • The default value of ${ADDITIONAL_LAST_INTERVAL} is 0.
    • We can specify any character as ${SEPARATOR}.
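
  • For reference, the query '*NP3,-1"a c .$"' in the example above can be read with this syntax as: ${MAX_INTERVAL} is 3, ${ADDITIONAL_LAST_INTERVAL} is -1, the phrases are "a" and "c", and ${SEPARATOR} is ".".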

Fixed the following bugs related to multi column indexes.

  • _score may be broken with full text search.
  • Records that shouldn't hit might hit.

    • For example, if we execute the following query, this bug occurs.

      select TABLE \
        --match_columns 'LEXICON.INDEX[10]' \
        --query 'XXX' \
        --output_columns _score
      
      • In this case, LEXICON.INDEX[0] - LEXICON.INDEX[9] were not initialized.

        • Groonga decides the search target index by whether the value of each section is 0 or not.
        • Therefore, if LEXICON.INDEX[0] - LEXICON.INDEX[9] were not initialized, Groonga may choose a wrong target.

          • Because the values of LEXICON.INDEX[0] - LEXICON.INDEX[9] are indefinite.
        • In addition, these values are used as the weight of the score.

          • Therefore, the value of _score is also indefinite.
    • However, if we execute the following query, this bug doesn't occur, because areas of indefinite values don't occur.

      select TABLE \
        --match_columns 'LEXICON.INDEX[0]' \
        --query 'XXX' \
        --output_columns _score
      
    • In other words, it occurred in the following situations.

      • We specified sections as below when there is a multi column index that has source columns a, b, and c.

        • We specified a and c.
        • We specified b and c.
        • We only specified b.
        • We only specified c.
    • However, it didn't occur in the following situations, because areas of indefinite values don't occur in them.

      • We specified sections by filling them from the first section as below.

        • We specified a, b, and c.
        • We specified a and b.
        • We only specified a.
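
  • For reference, a multi column index like LEXICON.INDEX above can be created as below (a minimal sketch; all names are hypothetical):

    table_create Entries TABLE_NO_KEY
    column_create Entries a COLUMN_SCALAR ShortText
    column_create Entries b COLUMN_SCALAR ShortText
    column_create Entries c COLUMN_SCALAR ShortText

    table_create Lexicon TABLE_PAT_KEY ShortText \
      --default_tokenizer TokenBigram \
      --normalizer NormalizerAuto

    # One index column with three sources: a multi column index.
    column_create Lexicon entries_index COLUMN_INDEX|WITH_POSITION|WITH_SECTION Entries a,b,c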

Conclusion

Let's search by Groonga!

2020-08-29

Groonga 10.0.6 has been released

Groonga 10.0.6 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • logical_range_filter Improved search plan for large data.

    • Normally, logical_range_filter is faster than logical_select. However, it had been slower than logical_select in the below case.

      • Groonga has a feature that switches from sequential search to index search when it can't easily get the number of required records.
        • Normally, logical_range_filter uses a sequential search when there are many records in the search target.
      • The search process is almost the same as logical_select if the above switching occurs.
      • So, logical_range_filter was severalfold slower than logical_select in the above case when the search target is large data, because logical_range_filter executes a sort after the search.
    • Since this release, Groonga uses the sequential search more readily than before when we search large data.
    • Therefore, logical_range_filter performance will improve, because the cases where its search process is almost the same as logical_select decrease.
  • [httpd] Updated bundled nginx to 1.19.1.

  • Modified how to install Groonga on Debian GNU/Linux; see the sketch after this list.

    • We modified it to use groonga-apt-source instead of groonga-archive-keyring.
    • Because the lintian command recommends using an apt-source package when a package puts files under /etc/apt/sources.list.d/.

      • The lintian command is a command which checks for many common packaging errors.
      • Please also refer to the install documentation for the details about installation procedures.

  • logical_select Added support for highlight_html and highlight_full.

  • Added support for recycling the IDs of deleted records when deleting from an array without value space.

    • Until now, when records were deleted from an array that doesn't have value space, the deleted IDs were never recycled.
    • Therefore, Groonga had used large storage space, because a large ID by itself uses large storage space.

      • For example, large IDs are caused by many additions and deletions, as with Mroonga's mroonga_operations.
  • select Improved performance of full-text-search without index.

  • function Improved performance for calling a function whose arguments are all variable references or literals.

  • indexing Improved performance of offline index construction when using a token column.

  • Improved performance for "_score = func(...)".

    • Performance improves when the _score value is calculated by using only a function, like "_score = func(...)".
  • Fixed a bug that garbage may be included in a response after a response send error.

    • It may occur if a client didn't read all responses and closed the connection.
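
For reference, the new Debian GNU/Linux installation procedure looks like the following (a minimal sketch based on the groonga-apt-source package mentioned above; the exact package file name and Debian release in the URL are assumptions):

    wget https://packages.groonga.org/debian/groonga-apt-source-latest-buster.deb
    sudo apt install -y -V ./groonga-apt-source-latest-buster.deb
    sudo apt update
    sudo apt install -y -V groonga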

logical_select Added support for highlight_html and highlight_full.

  • Until now, highlight_html and highlight_full could be used only in --output_columns of select.

  • They can also be used in logical_select since this release, as below.

    plugin_register sharding
    plugin_register functions/number
    
    table_create Memos_20170315 TABLE_NO_KEY
    column_create Memos_20170315 timestamp COLUMN_SCALAR Time
    column_create Memos_20170315 content COLUMN_SCALAR Text
    
    table_create Memos_20170316 TABLE_NO_KEY
    column_create Memos_20170316 timestamp COLUMN_SCALAR Time
    column_create Memos_20170316 content COLUMN_SCALAR Text
    
    table_create Memos_20170317 TABLE_NO_KEY
    column_create Memos_20170317 timestamp COLUMN_SCALAR Time
    column_create Memos_20170317 content COLUMN_SCALAR Text
    
    load --table Memos_20170315
    [
    {"timestamp": "2017/03/15 00:00:00", "content": "Groonga is fast."},
    {"timestamp": "2017/03/15 01:00:00", "content": "Mroonga is fast and easy to use."}
    ]
    
    load --table Memos_20170316
    [
    {"timestamp": "2017/03/16 10:00:00", "content": "PGroonga is fast and easy to use."},
    {"timestamp": "2017/03/16 11:00:00", "content": "Rroonga is fast and easy to use."}
    ]
    
    logical_select Memos \
      --shard_key timestamp \
      --query 'content:@easy' \
      --output_columns 'content, highlight_html(content)'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            3
          ],
          [
            [
              "content",
              "Text"
            ],
            [
              "highlight_html",
              null
            ]
          ],
          [
            "Mroonga is fast and easy to use.",
            "Mroonga is fast and <span class=\"keyword\">easy</span> to use."
          ],
          [
            "PGroonga is fast and easy to use.",
            "PGroonga is fast and <span class=\"keyword\">easy</span> to use."
          ],
          [
            "Rroonga is fast and easy to use.",
            "Rroonga is fast and <span class=\"keyword\">easy</span> to use."
          ]
        ]
      ]
    ]
    

Conclusion

Let's search by Groonga!

2020-07-30

Groonga 10.0.5 has been released

Groonga 10.0.5 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • select Added support for storing references in the table that we specify with --load_table.

  • select Improved sort performance.

  • select Improved performance a bit in the below cases.

    • A case where a search matches many records.
    • A case of drilldown against many records.
  • [aggregator] Added support for the score accessor as an aggregation target.

  • indexing Improved performance of offline index construction in the VC++ version.

  • select Uses null instead of NaN, Infinity, and -Infinity when Groonga outputs results in JSON format.

    • Because JSON doesn't support them.
  • select Added support for aggregating standard deviation values.

  • [Windows] Dropped Visual Studio 2013 support.

  • Groonga HTTP Server Fixed a bug that a request couldn't halt even if we executed shutdown?mode=immediate when the response was halted by an error.

  • Fixed a crash bug when an error occurs while processing a request.

    • It only occurs when we use the Apache Arrow format.
    • Groonga crashed when we sent a request to Groonga again after the previous request was halted by an error.
  • between Fixed a crash bug when a temporary table is used.

    • For example, if we specified a dynamic column in the first argument of between, Groonga crashed.
  • Fixed a bug that a procedure created by a plugin was freed unexpectedly.

    • It only occurred in the reference count mode.
    • It didn't occur if we don't use plugin_register.
    • It didn't occur in the process that executes plugin_register.
    • It occurred in processes that don't execute plugin_register.
  • Fixed a bug that a normalization error occurred during static index construction with token_column.

select Added support for storing references in the table that we specify with --load_table.

  • --load_table is a feature that stores search results to a prepared table.

    • If the searches are executed multiple times, we can cache the search results by storing them in this table.
    • We can shorten the time of searches after the first one by using this table.
  • We can store a reference to another table in the key of this table as below since this release.

    • We can make this table smaller, because we only store references without storing column values.
    • If we search against this table, we can search by using the indexes of the reference destination.

      table_create Logs TABLE_HASH_KEY ShortText
      column_create Logs timestamp COLUMN_SCALAR Time
      
      table_create Times TABLE_PAT_KEY Time
      column_create Times logs_timestamp COLUMN_INDEX Logs timestamp
      
      table_create LoadedLogs TABLE_HASH_KEY Logs
      
      load --table Logs
      [
      {
        "_key": "2015-02-03:1",
        "timestamp": "2015-02-03 10:49:00"
      },
      {
        "_key": "2015-02-03:2",
        "timestamp": "2015-02-03 12:49:00"
      },
      {
        "_key": "2015-02-04:1",
        "timestamp": "2015-02-04 00:00:00"
      }
      ]
      
      select \
        Logs \
        --load_table LoadedLogs \
        --load_columns "_key" \
        --load_values "_key" \
        --limit 0
      
      select \
        --table LoadedLogs \
        --filter 'timestamp >= "2015-02-03 12:49:00"'
      [
        [
          0,
          0.0,
          0.0
        ],
        [
          [
            [
              2
            ],
            [
              [
                "_id",
                "UInt32"
              ],
              [
                "_key",
                "ShortText"
              ],
              [
                "timestamp",
                "Time"
              ]
            ],
            [
              2,
              "2015-02-03:2",
              1422935340.0
            ],
            [
              3,
              "2015-02-04:1",
              1422975600.0
            ]
          ]
        ]
      ]
      

select Improved sort performance in the below cases.

  • When many sort keys need ID resolution.

    • For example, the following expression needs ID resolution.

      • --filter true --sort_keys column
    • For example, the following expression doesn't need ID resolution, because the _score pseudo column exists in the result table, not the source table.

      • --filter true --sort_keys _score
  • When a sort target table has a key.

    • Therefore, this improvement doesn't apply to TABLE_NO_KEY tables.
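
  • For example, the below query hits both cases: the sort key price needs ID resolution and the target table has a key (a minimal sketch; the schema is illustrative):

    table_create Items TABLE_HASH_KEY ShortText
    column_create Items price COLUMN_SCALAR UInt32

    select Items --filter true --sort_keys price --output_columns _key,price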

[aggregator] Added support for the score accessor as an aggregation target.

  • For example, we can pass _score to aggregator_* as below.

    table_create Items TABLE_HASH_KEY ShortText
    column_create Items price COLUMN_SCALAR UInt32
    column_create Items tag COLUMN_SCALAR ShortText
    
    load --table Items
    [
    {"_key": "Book",  "price": 1000, "tag": "A"},
    {"_key": "Note",  "price": 1000, "tag": "B"},
    {"_key": "Box",   "price": 500,  "tag": "B"},
    {"_key": "Pen",   "price": 500,  "tag": "A"},
    {"_key": "Food",  "price": 500,  "tag": "C"},
    {"_key": "Drink", "price": 300,  "tag": "B"}
    ]
    
    select Items \
      --filter true \
      --drilldowns[tag].keys tag \
      --drilldowns[tag].output_columns _key,_nsubrecs,score_mean \
      --drilldowns[tag].columns[score_mean].stage group \
      --drilldowns[tag].columns[score_mean].type Float \
      --drilldowns[tag].columns[score_mean].flags COLUMN_SCALAR \
      --drilldowns[tag].columns[score_mean].value 'aggregator_mean(_score)'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            6
          ],
          [
            [
              "_id",
              "UInt32"
            ],
            [
              "_key",
              "ShortText"
            ],
            [
              "price",
              "UInt32"
            ],
            [
              "tag",
              "ShortText"
            ]
          ],
          [
            1,
            "Book",
            1000,
            "A"
          ],
          [
            2,
            "Note",
            1000,
            "B"
          ],
          [
            3,
            "Box",
            500,
            "B"
          ],
          [
            4,
            "Pen",
            500,
            "A"
          ],
          [
            5,
            "Food",
            500,
            "C"
          ],
          [
            6,
            "Drink",
            300,
            "B"
          ]
        ],
        {
          "tag": [
            [
              3
            ],
            [
              [
                "_key",
                "ShortText"
              ],
              [
                "_nsubrecs",
                "Int32"
              ],
              [
                "score_mean",
                "Float"
              ]
            ],
            [
              "A",
              2,
              1.0
            ],
            [
              "B",
              3,
              1.0
            ],
            [
              "C",
              1,
              1.0
            ]
          ]
        }
      ]
    ]
    

select Added support for aggregating standard deviation values.

  • For example, we can calculate a standard deviation for every group as below.

    table_create Items TABLE_HASH_KEY ShortText
    column_create Items price COLUMN_SCALAR UInt32
    column_create Items tag COLUMN_SCALAR ShortText
    
    load --table Items
    [
    {"_key": "Book",  "price": 1000, "tag": "A"},
    {"_key": "Note",  "price": 1000, "tag": "B"},
    {"_key": "Box",   "price": 500,  "tag": "B"},
    {"_key": "Pen",   "price": 500,  "tag": "A"},
    {"_key": "Food",  "price": 500,  "tag": "C"},
    {"_key": "Drink", "price": 300,  "tag": "B"}
    ]
    
    select Items \
      --drilldowns[tag].keys tag \
      --drilldowns[tag].output_columns _key,_nsubrecs,price_sd \
      --drilldowns[tag].columns[price_sd].stage group \
      --drilldowns[tag].columns[price_sd].type Float \
      --drilldowns[tag].columns[price_sd].flags COLUMN_SCALAR \
      --drilldowns[tag].columns[price_sd].value 'aggregator_sd(price)' \
      --output_pretty yes
    [
      [
        0,
        1594339851.924836,
        0.002813816070556641
      ],
      [
        [
          [
            6
          ],
          [
            [
              "_id",
              "UInt32"
            ],
            [
              "_key",
              "ShortText"
            ],
            [
              "price",
              "UInt32"
            ],
            [
              "tag",
              "ShortText"
            ]
          ],
          [
            1,
            "Book",
            1000,
            "A"
          ],
          [
            2,
            "Note",
            1000,
            "B"
          ],
          [
            3,
            "Box",
            500,
            "B"
          ],
          [
            4,
            "Pen",
            500,
            "A"
          ],
          [
            5,
            "Food",
            500,
            "C"
          ],
          [
            6,
            "Drink",
            300,
            "B"
          ]
        ],
        {
          "tag": [
            [
              3
            ],
            [
              [
                "_key",
                "ShortText"
              ],
              [
                "_nsubrecs",
                "Int32"
              ],
              [
                "price_sd",
                "Float"
              ]
            ],
            [
              "A",
              2,
              250.0
            ],
            [
              "B",
              3,
              294.3920288775949
            ],
            [
              "C",
              1,
              0.0
            ]
          ]
        }
      ]
    ]
    
  • We can also calculate the sample standard deviation by specifying aggregator_sd(target, {"unbiased": true}).
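
  • For example, the price_sd definition in the query above becomes the following (same schema as above):

    --drilldowns[tag].columns[price_sd].value 'aggregator_sd(price, {"unbiased": true})'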

Conclusion

Let's search by Groonga!

2020-06-29

Groonga 10.0.4 has been released

Groonga 10.0.4 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • Added support for registering 400M records into a hash table.

  • select Improved scorer performance when _score doesn't fetch values recursively.

    • Groonga fetches the value of _score recursively when a search result is the search target.
    • For example, the search targets of slices are search results. Therefore, if we use slices in a query, this improvement is not effective.
  • log Improved query logs to output drilldown keys.

  • reference_acquire, reference_release Added new commands for the reference count mode; see the sketch after this list.

    • If we need to call load multiple times in a short time, the auto close by the reference count mode will degrade performance.
    • We can avoid the performance degradation by calling reference_acquire before the multiple load calls and calling reference_release after them. Between reference_acquire and reference_release, auto close is disabled.

      • Because reference_acquire acquires a reference to the target objects.
    • We must call reference_release after we finish the performance-critical operations.
    • If we don't call reference_release, the reference count mode doesn't work.
  • select Added support for aggregating multiple groups in one drilldown.

  • groonga-executable-file Added support for --pid-path in standalone mode.

    • Because --pid-path had been ignored in standalone mode in previous versions.
  • io_flush Added support for reference count mode.

  • logical_range_filter, logical_count Added support for reference count mode.

  • groonga-server-http We no longer add headers after the last chunk.

    • Because some HTTP clients may ignore headers after the last chunk.
  • [vector_slice] Added support for a vector that has the value of the Float32 type.

  • Added support for parallel offline index construction using a token column.

    • We can now construct an offline index in parallel threads from data that are tokenized in advance.

    • We can tune parallel offline construction by the following environment variables:

      • GRN_TOKEN_COLUMN_PARALLEL_CHUNK_SIZE: How many records are processed per thread.

        • The default value is 1024 records.
      • GRN_TOKEN_COLUMN_PARALLEL_TABLE_SIZE_THRESHOLD: How many source records are required for parallel offline construction.

        • The default value is 102400 records.
  • select Improved performance of load_table in the reference count mode.

  • Fixed a bug that the Groonga database was broken when we searched across shards by using dynamic columns without specifying --filter.

  • Fixed a bug that the Float32 type wasn't displayed in the result of the schema command.

  • Fixed a bug that _nsubrecs was overcounted when a reference uvector has no elements.
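
For reference, the reference_acquire / reference_release flow mentioned above looks like the following (a minimal sketch; the table name is hypothetical and the parameters reflect our reading of the commands):

    reference_acquire --target_name Logs

    load --table Logs
    [
    {"message": "log 1"}
    ]

    load --table Logs
    [
    {"message": "log 2"}
    ]

    reference_release --target_name Logs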

select Added support for aggregating multiple groups in one drilldown.

  • We can now calculate the sum or the arithmetic mean for multiple different groups in one drilldown, as below.

    table_create Items TABLE_HASH_KEY ShortText
    column_create Items price COLUMN_SCALAR UInt32
    column_create Items quantity COLUMN_SCALAR UInt32
    column_create Items tag COLUMN_SCALAR ShortText
    
    load --table Items
    [
    {"_key": "Book",  "price": 1000, "quantity": 100, "tag": "A"},
    {"_key": "Note",  "price": 1000, "quantity": 10,  "tag": "B"},
    {"_key": "Box",   "price": 500,  "quantity": 15,  "tag": "B"},
    {"_key": "Pen",   "price": 500,  "quantity": 12,  "tag": "A"},
    {"_key": "Food",  "price": 500,  "quantity": 111, "tag": "C"},
    {"_key": "Drink", "price": 300,  "quantity": 22,  "tag": "B"}
    ]
    
    select Items \
      --drilldowns[tag].keys tag \
      --drilldowns[tag].output_columns _key,_nsubrecs,price_sum,quantity_sum \
      --drilldowns[tag].columns[price_sum].stage group \
      --drilldowns[tag].columns[price_sum].type UInt32 \
      --drilldowns[tag].columns[price_sum].flags COLUMN_SCALAR \
      --drilldowns[tag].columns[price_sum].value 'aggregator_sum(price)' \
      --drilldowns[tag].columns[quantity_sum].stage group \
      --drilldowns[tag].columns[quantity_sum].type UInt32 \
      --drilldowns[tag].columns[quantity_sum].flags COLUMN_SCALAR \
      --drilldowns[tag].columns[quantity_sum].value 'aggregator_sum(quantity)'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            6
          ],
          [
            [
              "_id",
              "UInt32"
            ],
            [
              "_key",
              "ShortText"
            ],
            [
              "price",
              "UInt32"
            ],
            [
              "quantity",
              "UInt32"
            ],
            [
              "tag",
              "ShortText"
            ]
          ],
          [
            1,
            "Book",
            1000,
            100,
            "A"
          ],
          [
            2,
            "Note",
            1000,
            10,
            "B"
          ],
          [
            3,
            "Box",
            500,
            15,
            "B"
          ],
          [
            4,
            "Pen",
            500,
            12,
            "A"
          ],
          [
            5,
            "Food",
            500,
            111,
            "C"
          ],
          [
            6,
            "Drink",
            300,
            22,
            "B"
          ]
        ],
        {
          "tag": [
            [
              3
            ],
            [
              [
                "_key",
                "ShortText"
              ],
              [
                "_nsubrecs",
                "Int32"
              ],
              [
                "price_sum",
                "UInt32"
              ],
              [
                "quantity_sum",
                "UInt32"
              ]
            ],
            [
              "A",
              2,
              1500,
              112
            ],
            [
              "B",
              3,
              1800,
              47
            ],
            [
              "C",
              1,
              500,
              111
            ]
          ]
        }
      ]
    ]
    

Conclusion

Let's search by Groonga!

2020-05-29

Groonga 10.0.3 has been released

Groonga 10.0.3 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • We can now construct an inverted index from data that are tokenized in advance.

  • select We can now specify a vector as the argument of a function.

  • select Added a new stage result_set for dynamic columns.

    • This stage generates a column in the result set table. Therefore, the column is not generated if query or filter doesn't exist.

      • Because if query or filter doesn't exist, Groonga doesn't make a result set table.
    • We can't use _value for this stage. The result_set stage is for storing values by score_column.

  • [vector_slice] Added support for weight vectors that have weights of Float32 type.

  • select Added support for the filtered stage and output stage of dynamic columns in drilldowns.

    • We can use the filtered and output stages of dynamic columns in drilldowns, as with drilldowns[Label].stage filtered and drilldowns[Label].stage output.
  • select Added support for Float type values in aggregating on drilldown.

    • We can aggregate the max value, min value, and sum value of Float type values using MAX, MIN, and SUM.
  • query, geo_in_rectangle, geo_in_circle Added a new option score_column for query(), geo_in_rectangle(), and geo_in_circle().

  • [Windows] Groonga can now output a backtrace when an error occurs, even if it doesn't crash.

  • [Windows] Dropped support for old Windows.

    • Groonga for Windows requires Windows 8 (Windows Server 2012) or later since 10.0.3.
  • select Improved sort performance when sort keys were a mix of referable sort keys and other sort keys.

  • select Improved sort performance when all sort keys are referable keys.

  • select Improved scorer performance for the _score = column1*X + column2*Y + ... case.

    • This optimization is effective when there are many + or * operators in _score.
    • At the moment, it is only effective for + and *.
  • select Added support for phrase near search.

  • vector Added support for float32 weight vector.

  • Fixed a crash bug if the modules (tokenizers, normalizers, and token filters) are used at the same time from multiple threads.

  • Fixed the precision of Float32 values when they are output.

    • The precision changes from 8 digits to 7 digits in 10.0.3.
  • Fixed a bug that Groonga used a wrong cache when we executed queries that differ only in their dynamic column parameters.

We can now construct an inverted index from data that are tokenized in advance.

  • This speeds up index construction.

  • We need to prepare a token column to use this improvement.

  • A token column is an automatically generated value column, like an index column.

  • A token column's value is generated by tokenizing the source column's value.

  • We can create a token column by setting the source column as below.

    table_create Terms TABLE_PAT_KEY ShortText \
      --normalizer NormalizerNFKC121 \
      --default_tokenizer TokenNgram
    
    table_create Notes TABLE_NO_KEY
    column_create Notes title COLUMN_SCALAR Text
    
    # The last "title" is the source column.
    column_create Notes title_terms COLUMN_VECTOR Terms title
    

select We can now specify a vector as the argument of a function.

  • For example, the flags option of query can be described by a vector as below.

    select \
      --table Memos \
      --filter 'query("content", "-content:@mroonga", \
                      { \
                        "expander": "QueryExpanderTSV", \
                        "flags": ["ALLOW_LEADING_NOT", "ALLOW_COLUMN"] \
                      })'
    

query, geo_in_rectangle, geo_in_circle Added a new option score_column for query(), geo_in_rectangle(), and geo_in_circle().

  • We can store the score value of each condition by using score_column.

  • Normally, Groonga calculates a score by adding the scores of all conditions. However, we sometimes want to get the score value of an individual condition.

  • For example, if we want to use only the distance from the central coordinate as the score, we use score_column as below.

    table_create LandMarks TABLE_NO_KEY
    column_create LandMarks name COLUMN_SCALAR ShortText
    column_create LandMarks category COLUMN_SCALAR ShortText
    column_create LandMarks point COLUMN_SCALAR WGS84GeoPoint
    
    table_create Points TABLE_PAT_KEY WGS84GeoPoint
    column_create Points land_mark_index COLUMN_INDEX LandMarks point
    
    load --table LandMarks
    [
      {"name": "Aries"      , "category": "Tower"     , "point": "11x11"},
      {"name": "Taurus"     , "category": "Lighthouse", "point": "9x10" },
      {"name": "Gemini"     , "category": "Lighthouse", "point": "8x8"  },
      {"name": "Cancer"     , "category": "Tower"     , "point": "12x12"},
      {"name": "Leo"        , "category": "Tower"     , "point": "11x13"},
      {"name": "Virgo"      , "category": "Temple"    , "point": "22x10"},
      {"name": "Libra"      , "category": "Tower"     , "point": "14x14"},
      {"name": "Scorpio"    , "category": "Temple"    , "point": "21x9" },
      {"name": "Sagittarius", "category": "Temple"    , "point": "43x12"},
      {"name": "Capricorn"  , "category": "Tower"     , "point": "33x12"},
      {"name": "Aquarius"   , "category": "mountain"  , "point": "55x11"},
      {"name": "Pisces"     , "category": "Tower"     , "point": "9x9"  },
      {"name": "Ophiuchus"  , "category": "mountain"  , "point": "21x21"}
    ]
    
    select LandMarks \
      --sort_keys 'distance' \
      --columns[distance].stage initial \
      --columns[distance].type Float \
      --columns[distance].flags COLUMN_SCALAR \
      --columns[distance].value 0.0 \
      --output_columns 'name, category, point, distance, _score' \
      --limit -1 \
      --filter 'geo_in_circle(point, "11x11", "11x1", {"score_column": distance}) && category == "Tower"'
    [
      [
        0,
        1590647445.406149,
        0.0002503395080566406
      ],
      [
        [
          [
            5
          ],
          [
            [
              "name",
              "ShortText"
            ],
            [
              "category","ShortText"
            ],
            [
              "point",
              "WGS84GeoPoint"
            ],
            [
              "distance",
              "Float"
            ],
            [
              "_score",
              "Int32"
            ]
          ],
          [
            "Aries",
            "Tower",
            "11x11",
            0.0,
            1
          ],
          [
            "Cancer",
            "Tower",
            "12x12",
            0.0435875803232193,
            1
          ],
          [
            "Leo",
            "Tower",
            "11x13",
            0.06164214760065079,
            1
          ],
          [
            "Pisces",
            "Tower",
            "9x9",
            0.0871751606464386,
            1
          ],
          [
            "Libra",
            "Tower",
            "14x14",
            0.1307627409696579,
            1
          ]
        ]
      ]
    ]
    
  • Sorting by _score is meaningless in the above example, because the value of _score is all 1 due to category == "Tower". However, we can sort by the distance from the central coordinate by using score_column.

select Improved sort performance when sort keys were mixed referable sort keys and the other sort keys.

  • We improved sort performance when referable sort keys and other sort keys are mixed and there are two or more referable keys.

    • Referable sort keys are all sort keys except the following:

      • Compressed columns
      • _value against the result of a drilldown that specifies multiple values as the drilldown key.
      • _key against a patricia trie table whose key is not of ShortText type.
      • _score
  • The more sort keys there are other than strings, the lower the memory usage for sorting.

select Added support for phrase near search.

  • We can search phrase by phrase with a near search.

    • The query syntax for a near phrase search is *NP"phrase1 phrase2 ...".
    • The script syntax for a near phrase search is column *NP "phrase1 phrase2 ...".

    • If a search target phrase includes spaces, we can search for it by surrounding it with " as below.

      table_create Entries TABLE_NO_KEY
      column_create Entries content COLUMN_SCALAR Text
      
      table_create Terms TABLE_PAT_KEY ShortText \
        --default_tokenizer 'TokenNgram("unify_alphabet", false, \
                                        "unify_digit", false)' \
        --normalizer NormalizerNFKC121
      column_create Terms entries_content COLUMN_INDEX|WITH_POSITION Entries content
      
      load --table Entries
      [
      {"content": "I started to use Groonga. It's very fast!"},
      {"content": "I also started to use Groonga. It's also very fast! Really fast!"}
      ]
      
      select Entries --filter 'content *NP "\\"I started\\" \\"use Groonga\\""' --output_columns 'content'
      [
        [
          0,
          1590469700.715882,
          0.03997230529785156
        ],
        [
          [
            [
              1
            ],
            [
              [
                "content",
                "Text"
              ]
            ],
            [
              "I started to use Groonga. It's very fast!"
            ]
          ]
        ]
      ]
      

vector Added support for float32 weight vector.

  • We can store weights as float32 instead of uint32.
  • We need to add the WEIGHT_FLOAT32 flag when executing column_create to use this feature.

    column_create Records tags COLUMN_VECTOR|WITH_WEIGHT|WEIGHT_FLOAT32 Tags
    
  • However, the WEIGHT_FLOAT32 flag isn't available with the COLUMN_INDEX flag for now.
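
  • For example, float weights can then be loaded as below (a minimal sketch; it assumes the Records table above and a Tags table exist):

    load --table Records
    [
    {"tags": {"groonga": 2.5, "full-text-search": 0.25}}
    ]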

Conclusion

Let's search by Groonga!