BloGroonga

2023-09-29

Groonga 13.0.8 has been released

Groonga 13.0.8 has been released!

How to install: Install

Changes

Here are important changes in this release:

Improvements

  • [column_create] Improved column_create flags with additional new flags COLUMN_FILTER_SHUFFLE, COLUMN_FILTER_BYTE_DELTA, COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE, and COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES.

    These new flags improve the compression ratio of COMPRESS_ZLIB and COMPRESS_ZSTD by filtering the data before compression. Even with these flags, the compression ratio may not improve depending on the data. Also, because these filters add an extra processing step, applying them makes saving and referencing the column slower.

    It is important to use only the filters that are effective for your data.

    Even if COMPRESS_ZLIB, COMPRESS_LZ4, or COMPRESS_ZSTD is not set, these filters use BloscLZ as the compression algorithm, so in most cases the column becomes smaller than an uncompressed column. Still, some data may compress sufficiently by setting only COMPRESS_ZLIB, COMPRESS_LZ4, or COMPRESS_ZSTD. So it is advised to set flags that suit the data.

    These flags are only available with COLUMN_VECTOR and are ignored with COLUMN_SCALAR.
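    For example, a column definition along the following lines would combine one of these filters with a compression flag. This is only a sketch: the Data table and numbers column names are made up, and the exact flag combination is an assumption based on the flag names listed above.

    table_create Data TABLE_NO_KEY
    column_create Data numbers \
      COLUMN_VECTOR|COMPRESS_ZSTD|COLUMN_FILTER_SHUFFLE \
      Float32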

    Here is a list of which kind of data is suitable for each filter.

    • COLUMN_FILTER_SHUFFLE

      This filter reorders the data byte by byte. It gathers the 0th byte of every element in the vector column and places them consecutively, then the 1st byte of every element, then the 2nd byte, and so on for every byte position.

      The point of this filter is that gathering the N-th bytes together can produce runs of the same value, and the compression algorithms of COMPRESS_ZLIB, COMPRESS_LZ4, and COMPRESS_ZSTD tend to achieve a better compression ratio on runs of the same value.

      Here is a concrete example.

      In the case of a UInt32 vector column [1,1051,515], the elements can be represented as the following byte sequence (each element is shown with 2 bytes here for simplicity):

      | Byte 0  | Byte 1  |  | Byte 0  | Byte 1  |  | Byte 0  | Byte 1  |
      | ------- | ------- |  | ------- | ------- |  | ------- | ------- |
      | 0x00    | 0x01    |, | 0x04    | 0x1b    |, | 0x02    | 0x03    |
      

      This flag gathers the Byte 0 values first and then the Byte 1 values to create the data below. This operation of gathering values by the N-th byte is called "shuffle" from here on.

      | Byte 0 | Byte 0 | Byte 0 |  | Byte 1 | Byte 1 | Byte 1 |
      |--------|--------|--------|  |--------|--------|--------|
      | 0x00   | 0x04   | 0x02   |, | 0x01   | 0x1b   | 0x03   |

      The point for judging whether this flag suits the data is: "does the shuffled data contain runs of the same value?". In the sample above, the data contains no runs of the same value after the shuffle. This data is therefore not suitable for this flag, because the shuffle does not improve the compression ratio without such runs.

      On the other hand, consider a UInt32 vector column [1,2,3]. Its byte representation is as follows.

      | Byte 0 | Byte 1 |  | Byte 0 | Byte 1 |  | Byte 0 | Byte 1 |
      |--------|--------|  |--------|--------|  |--------|--------|
      | 0x00   | 0x01   |, | 0x00   | 0x02   |, | 0x00   | 0x03   |
      

      After the shuffle with this flag, it becomes as follows.

      | Byte 0 | Byte 0 | Byte 0 |  | Byte 1 | Byte 1 | Byte 1 |
      |--------|--------|--------|  |--------|--------|--------|
      | 0x00   | 0x00   | 0x00   |, | 0x01   | 0x02   | 0x03   |
      

      After the shuffle, all Byte 0 values of the UInt32:[1,2,3] data form a run of 0. This data is therefore suitable for this flag, because the runs of the same value improve the compression ratio.

      Floating-point data whose sign and characteristic (the exponent field) match is also suitable. In an IEEE 754 single-precision floating-point number, the sign is stored in bit 31 and the characteristic in bits 30 to 23. Since the sign bit and the top 7 bits of the characteristic form the highest byte, the shuffled data contains runs of the same value as long as the sign bit and the top 7 bits of the characteristic match.

      For example, the sign bits and characteristic bits of the data Float32:[0.5,0.6] are identical in the IEEE 754 single-precision representation, as shown below.

      | Sign(1bit) | characteristic(8bit) | significand(23bit) |
      |------------|-------------|------------------------------|
      | 0          | 0111 1110   | 000 0000 0000 0000 0000 0000 |
      | 0          | 0111 1110   | 001 1001 1001 1001 1001 1010 |
      

      Then, after the shuffle, Byte 3 contains a run of the same value, as shown below.

      | Byte 0 | Byte 0 |  | Byte 1 | Byte 1 |  | Byte 2 | Byte 2 |  | Byte 3 | Byte 3 |
      |--------|--------|  |--------|--------|  |--------|--------|  |--------|--------|
      | 0x00   | 0x9a   |, | 0x00   | 0x99   |, | 0x00   | 0x19   |, | 0x3f   | 0x3f   |
      
    • COLUMN_FILTER_BYTE_DELTA

      This flag computes the differences between adjacent bytes of the compression target values. For example, in the case of a UInt8 vector column [1,2,3,4,5], the data becomes [1,(2-1),(3-2),(4-3),(5-4)]=[1,1,1,1,1] after applying this flag.

      As in the example above, this flag aims to improve the compression ratio by turning the data into differences that contain runs of the same value. The point is whether taking the differences between adjacent elements produces such runs.

      In other words, this flag is not suitable for data whose differences do not contain runs of the same value. Here are examples of data whose differences do contain such runs.

      First, there is data that increases at a fixed pace, such as UInt8:[1,2,3,4,5] in the previous example. The increment can be any value as long as it is fixed, such as UInt8:[1,11,21,31,41], which adds 10 each time.

      Next, there is data that is a run of the same value, such as UInt8:[5,5,5,5,5]. Taking the differences turns it into [5,0,0,0,0], which contains a run of 0.

      In contrast, data such as UInt8:[1,255,2,200,1] is not suitable for this flag, since taking the differences does not create any run of the same value.
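      For monotonically increasing data like the examples above, a column could be defined along these lines. This is only a sketch: the Logs table and elapsed_times column names are made up, and combining the flag with COMPRESS_ZSTD is an assumption.

      table_create Logs TABLE_NO_KEY
      column_create Logs elapsed_times \
        COLUMN_VECTOR|COMPRESS_ZSTD|COLUMN_FILTER_BYTE_DELTA \
        UInt32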

    • COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE

      This flag is available only for Float/Float32. It is expected to be used in combination with COLUMN_FILTER_SHUFFLE.

      This flag reduces the precision of each element in a Float/Float32 vector column by 1 byte. Reducing the precision means setting the bottom byte of each floating-point number to 0.

      For example, the data Float32:[1.25,3.67,4.55] is represented as follows as IEEE 754 single-precision floating-point numbers.

      | Sign(1bit) | characteristic(8bit) | significand(23bit) |
      |------------|-------------|------------------------------|
      | 0          | 0111 1111   | 010 0000 0000 0000 0000 0000 |
      | 0          | 1000 0000   | 110 1010 1110 0001 0100 1000 |
      | 0          | 1000 0001   | 001 0001 1001 1001 1001 1010 |
      

      After applying this flag, the bottom byte of each element becomes 0, as follows.

      | Sign(1bit) | characteristic(8bit) | significand(23bit) |
      |------------|-------------|------------------------------|
      | 0          | 0111 1111   | 010 0000 0000 0000 0000 0000 |
      | 0          | 1000 0000   | 110 1010 1110 0001 0000 0000 |
      | 0          | 1000 0001   | 001 0001 1001 1001 0000 0000 |
      

      This is how this flag works. As mentioned, this flag is expected to be used in combination with COLUMN_FILTER_SHUFFLE, so the data is expected to compress better after the shuffle by COLUMN_FILTER_SHUFFLE.

      Here is the data after applying COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE and then the shuffle by COLUMN_FILTER_SHUFFLE.

      | Byte 0 | Byte 0 | Byte 0 |  | Byte 1 | Byte 1 | Byte 1 |  | Byte 2 | Byte 2 | Byte 2 |  | Byte 3 | Byte 3 | Byte 3 |
      |--------|--------|--------|  |--------|--------|--------|  |--------|--------|--------|  |--------|--------|--------|
      | 0x00   | 0x00   | 0x00   |, | 0x00   | 0xe1   | 0x99   |, | 0xa0   | 0x6a   | 0x91   |, | 0x3f   | 0x40   | 0x40   |
      

      Check Byte 0 in the data: it is a run of 0. If the data is shuffled without applying COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE, it has no runs of the same value, as shown below. This means the compression ratio can be improved with COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE in cases where the shuffle alone is not effective.

      | Byte 0 | Byte 0 | Byte 0 |  | Byte 1 | Byte 1 | Byte 1 |  | Byte 2 | Byte 2 | Byte 2 |  | Byte 3 | Byte 3 | Byte 3 |
      |--------|--------|--------|  |--------|--------|--------|  |--------|--------|--------|  |--------|--------|--------|
      | 0x00   | 0x48   | 0x9a   |, | 0x00   | 0xe1   | 0x99   |, | 0xa0   | 0x6a   | 0x91   |, | 0x3f   | 0x40   | 0x40   |
      

      Note, however, that the data loses precision, since each compressed floating-point number has 1 byte less precision.

    • COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES

      This flag is available only for Float/Float32. It is expected to be used in combination with COLUMN_FILTER_SHUFFLE.

      This flag reduces the precision of each element in a Float/Float32 vector column by 2 bytes. Reducing the precision means setting the bottom two bytes of each floating-point number to 0.

      For example, the data Float32:[1.25,3.67,4.55] is represented as follows as IEEE 754 single-precision floating-point numbers.

      | Sign(1bit) | characteristic(8bit) | significand(23bit) |
      |------------|-------------|------------------------------|
      | 0          | 0111 1111   | 010 0000 0000 0000 0000 0000 |
      | 0          | 1000 0000   | 110 1010 1110 0001 0100 1000 |
      | 0          | 1000 0001   | 001 0001 1001 1001 1001 1010 |
      

      After applying this flag, the bottom two bytes of each element become 0, as follows.

      | Sign(1bit) | characteristic(8bit) | significand(23bit) |
      |------------|-------------|------------------------------|
      | 0          | 0111 1111   | 010 0000 0000 0000 0000 0000 |
      | 0          | 1000 0000   | 110 1010 0000 0000 0000 0000 |
      | 0          | 1000 0001   | 001 0001 0000 0000 0000 0000 |
      

      This is how this flag works.

      As mentioned, this flag is expected to be used in combination with COLUMN_FILTER_SHUFFLE, so the data is expected to compress better after the shuffle by COLUMN_FILTER_SHUFFLE.

      Here is the data after applying COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES and then the shuffle by COLUMN_FILTER_SHUFFLE.

      | Byte 0 | Byte 0 | Byte 0 |  | Byte 1 | Byte 1 | Byte 1 |  | Byte 2 | Byte 2 | Byte 2 |  | Byte 3 | Byte 3 | Byte 3 |
      |--------|--------|--------|  |--------|--------|--------|  |--------|--------|--------|  |--------|--------|--------|
      | 0x00   | 0x00   | 0x00   |, | 0x00   | 0x00   | 0x00   |, | 0xa0   | 0x6a   | 0x91   |, | 0x3f   | 0x40   | 0x40   |
      

      Check Byte 0 and Byte 1 in the data: both are runs of 0. If the data is shuffled without applying COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES, it has no runs of the same value, as shown below. This means the compression ratio can be improved with COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES in cases where the shuffle alone is not effective.

      | Byte 0 | Byte 0 | Byte 0 |  | Byte 1 | Byte 1 | Byte 1 |  | Byte 2 | Byte 2 | Byte 2 |  | Byte 3 | Byte 3 | Byte 3 |
      |--------|--------|--------|  |--------|--------|--------|  |--------|--------|--------|  |--------|--------|--------|
      | 0x00   | 0x48   | 0x9a   |, | 0x00   | 0xe1   | 0x99   |, | 0xa0   | 0x6a   | 0x91   |, | 0x3f   | 0x40   | 0x40   |
      

      Note, however, that the data loses precision, since each compressed floating-point number has 2 bytes less precision.
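      For Float32 data where the reduced precision is acceptable, a column could be defined along these lines. This is only a sketch: the Sensors table and values column names are made up, and combining the flags with COMPRESS_ZSTD is an assumption.

      table_create Sensors TABLE_NO_KEY
      column_create Sensors values \
        COLUMN_VECTOR|COMPRESS_ZSTD|COLUMN_FILTER_SHUFFLE|COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES \
        Float32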

    • Added the Blosc library as a new bundled library.

      Blosc is required to use the COLUMN_FILTER_SHUFFLE, COLUMN_FILTER_BYTE_DELTA, COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE, and COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES flags.

  • [status] Improved status output with an additional new entry "blosc" in features.

  • [groonga] Improved groonga --version output with an additional new value blosc.

  • [select] Improved select arguments with an additional new argument --fuzzy_max_distance (experimental).

    This argument is useful when the search should tolerate typographical errors; it makes Groonga compensate for typos in users' queries.

    When this argument is set and the query does not match, Groonga looks up words in the vocabulary list whose edit distance from the query is within --fuzzy_max_distance and searches again with them, as shown in the following example.

    table_create Memos TABLE_NO_KEY
    column_create Memos content COLUMN_SCALAR ShortText
    
    table_create Lexicon TABLE_PAT_KEY ShortText   --default_tokenizer TokenNgram   --normalizer NormalizerNFKC150
    column_create Lexicon memos_content   COLUMN_INDEX|WITH_POSITION Memos content
    load --table Memos
    [
    {"content": "This is a pen"},
    {"content": "That is a pen"}
    ]
    
    select Memos   --match_columns content   --query "Thjs"   --fuzzy_max_distance 1   --output_columns *,_score
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            1
          ],
          [
            [
              "content",
              "ShortText"
            ],
            [
              "_score",
              "Int32"
            ]
          ],
          [
            "This is a pen",
            1
          ]
        ]
      ]
    ]
    
  • [select] Improved select arguments with an additional new argument --fuzzy_max_expansions (experimental).

    As mentioned, --fuzzy_max_distance makes Groonga search again with words from the vocabulary list whose edit distance is within the indicated value.

    For example, if the vocabulary list contains This and That, the query Thjs does not match directly; with --fuzzy_max_distance 10, Groonga searches again with This and That from the vocabulary list.

    In the example above, the vocabulary list has only two words whose edit distance is smaller than 10. If there were many words within that edit distance, performance would drop because all of those candidate words are searched.

    However, --fuzzy_max_expansions limits the number of words with a close edit distance that are used in the search. This argument helps balance the number of hits and the search performance.

    The default value of --fuzzy_max_expansions is 0. When it is 0, the search uses all words in the vocabulary list whose edit distance is within --fuzzy_max_distance.
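    As a sketch reusing the Memos example above (the value 1 for --fuzzy_max_expansions is only an illustration), the following limits the fuzzy candidates to a single word:

    select Memos \
      --match_columns content \
      --query "Thjs" \
      --fuzzy_max_distance 1 \
      --fuzzy_max_expansions 1 \
      --output_columns *,_score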

  • [select] Improved select arguments with an additional new argument --fuzzy_max_distance_ratio (experimental).

    --fuzzy_max_distance sets the same value for every keyword in a query. Thus, for the query "abc OR defghijklmn", the same --fuzzy_max_distance value applies to both "abc" and "defghijklmn".

    However, the length of a keyword may affect how many typographical errors should be tolerated. For example, "abc" has just 3 characters, so only 1 typo should be tolerated. On the other hand, "defghijklmn" has many characters, so 3 typos could be tolerated.

    --fuzzy_max_distance_ratio changes the edit distance automatically depending on the number of characters. This argument allows a different edit distance for each keyword.

    If --fuzzy_max_distance_ratio FUZZY_MAX_DISTANCE_RATIO is set, the edit distance is calculated as "number of characters * FUZZY_MAX_DISTANCE_RATIO" (rounded down). 0.34 is recommended for FUZZY_MAX_DISTANCE_RATIO because it is about the same as fuzziness=AUTO of Elasticsearch.
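    As a sketch reusing the Memos example above (using this argument on its own, without --fuzzy_max_distance, is an assumption), the recommended ratio could be set like this:

    select Memos \
      --match_columns content \
      --query "Thjs" \
      --fuzzy_max_distance_ratio 0.34 \
      --output_columns *,_score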

  • [select] Improved select arguments with an additional new argument --fuzzy_prefix_length (experimental).

    This argument accelerates searches that use --fuzzy_max_distance.

    When this argument is set, only words whose leading characters match the first --fuzzy_prefix_length characters of the keyword become candidates.

    Setting a large --fuzzy_max_distance expands the number of candidates and slows the search, so limiting the candidates to strings that share the leading N characters speeds up the process.
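    As a sketch reusing the Memos example above (the prefix length of 2 is only an illustration), candidates would be limited to words starting with "Th":

    select Memos \
      --match_columns content \
      --query "Thjs" \
      --fuzzy_max_distance 1 \
      --fuzzy_prefix_length 2 \
      --output_columns *,_score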

  • [cast] Added support for casting "[0.0, 1.0, 1.1, ...]" to Float/Float32 vector as below.

    plugin_register functions/vector
    
    table_create Data TABLE_NO_KEY
    column_create Data numbers COLUMN_VECTOR Float32
    
    table_create Numbers TABLE_PAT_KEY Float32
    column_create Numbers data_numbers COLUMN_INDEX Data numbers
    
    load --table Data
    [
    {"numbers": "[0.1, 0.0, -0.2]"},
    {"numbers": "[-3, 2, -4294967296, 4294967296]"}
    ]
    
    dump   --dump_plugins no   --dump_schema no
    load --table Data
    [
    ["_id","numbers"],
    [1,[0.1,0.0,-0.2]],
    [2,[-3.0,2.0,-4.294967e+09,4.294967e+09]]
    ]
    
    column_create Numbers data_numbers COLUMN_INDEX Data numbers
    select Data --filter 'vector_size(numbers) == 3'
    [
      [
        0,
        0.0,
        0.0
      ],
      [
        [
          [
            1
          ],
          [
            [
              "_id",
              "UInt32"
            ],
            [
              "numbers",
              "Float32"
            ]
          ],
          [
            1,
            [
              0.1,
              0.0,
              -0.2
            ]
          ]
        ]
      ]
    ]
    
  • [fuzzy_search] Renamed the max_expansion option to max_expansions.

    The max_expansion option is deprecated since this release. However, max_expansion can still be used for backward compatibility.

  • Renamed the master branch to main.

  • [RPM] Used CMake for building.

  • [Debian] Added support for Debian trixie.

Fixes

  • [fuzzy_search] Fixed a bug that Groonga might return records that should not match.

  • [near-phrase-search] Fixed a bug that Groonga crashed when the first phrase group doesn’t match anything as below.

    table_create Entries TABLE_NO_KEY
    column_create Entries content COLUMN_SCALAR Text
    
    table_create Terms TABLE_PAT_KEY ShortText \
      --default_tokenizer TokenNgram \
      --normalizer NormalizerNFKC121
    column_create Terms entries_content COLUMN_INDEX|WITH_POSITION \
      Entries content
    
    load --table Entries
    [
      {"content": "x y z"}
    ]
    
    select Entries \
      --match_columns Terms.entries_content.content \
      --query '*NPP1"(NONEXISTENT) (z)"' \
      --output_columns '_score, content'
    

Conclusion

Please refer to the following news for more details: News Release 13.0.8

Let's search by Groonga!