Groonga 13.0.8 has been released
Groonga 13.0.8 has been released!
How to install: Install
Changes
Here are important changes in this release:
Improvements
- [column_create] Improved `column_create` flags with additional new flags: `COLUMN_FILTER_SHUFFLE`, `COLUMN_FILTER_BYTE_DELTA`, `COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE`, and `COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES`.

These new flags improve the compression ratio of `COMPRESS_ZLIB`, `COMPRESS_LZ4`, and `COMPRESS_ZSTD` by filtering the data before compression. Note that, depending on the data, applying these flags may not improve the compression ratio. Also, applying these filters slows down writing and reading the column, because the filters add extra processing. So it is important to choose and use only the filters that are effective for your data.
Even when `COMPRESS_ZLIB`, `COMPRESS_LZ4`, or `COMPRESS_ZSTD` is not set, these filters use BloscLZ as the compression algorithm, so in most cases the column becomes smaller than a non-compressed column. That said, for some data, setting only `COMPRESS_ZLIB`, `COMPRESS_LZ4`, or `COMPRESS_ZSTD` may already work well enough. So it is advised to set suitable flags depending on the data.

These flags are only available with `COLUMN_VECTOR` and are ignored with `COLUMN_SCALAR`.

Here is a list of which kind of data is suitable for each filter.
- `COLUMN_FILTER_SHUFFLE`

This filter reorders the data byte by byte: it gathers the 0th byte of every element in the vector column and places them contiguously, then does the same with the 1st byte of every element, then the 2nd byte, and so on for every byte.

The point of this filter is that gathering the N-th bytes together can produce runs of identical values, and the compression algorithms of `COMPRESS_ZLIB`, `COMPRESS_LZ4`, and `COMPRESS_ZSTD` tend to achieve a better compression ratio on runs of identical values. Here is a concrete example.
In the case of a `UInt16` vector column `[1, 1051, 515]`, the byte representation (most significant byte first) is as follows:

| Element | Byte 0 | Byte 1 |
|---------|--------|--------|
| 1       | 0x00   | 0x01   |
| 1051    | 0x04   | 0x1b   |
| 515     | 0x02   | 0x03   |

This flag collects the values of `Byte 0`, then the values of `Byte 1`, producing the data below. From now on, this operation of collecting values by each N-th byte is called "shuffle".

| Byte 0 | Byte 0 | Byte 0 | Byte 1 | Byte 1 | Byte 1 |
|--------|--------|--------|--------|--------|--------|
| 0x00   | 0x04   | 0x02   | 0x01   | 0x1b   | 0x03   |

The point for judging whether this flag suits the data is: "are there runs of the same value after the shuffle?". In the sample above, the shuffled data contains no runs of the same value. This data is therefore not suitable for this flag, because without such runs the shuffle does not improve the compression ratio.
On the other hand, consider a `UInt16` vector column `[1, 2, 3]`. Its byte representation is as follows:

| Element | Byte 0 | Byte 1 |
|---------|--------|--------|
| 1       | 0x00   | 0x01   |
| 2       | 0x00   | 0x02   |
| 3       | 0x00   | 0x03   |

After the shuffle with this flag, it becomes:

| Byte 0 | Byte 0 | Byte 0 | Byte 1 | Byte 1 | Byte 1 |
|--------|--------|--------|--------|--------|--------|
| 0x00   | 0x00   | 0x00   | 0x01   | 0x02   | 0x03   |
After the shuffle, the `[1, 2, 3]` data has a run of the same value 0 across all of `Byte 0`. This data is therefore suitable for this flag, because the runs of identical values improve the compression ratio.

Floating-point data whose sign and exponent match is also suitable. In the IEEE 754 single-precision format, the sign is placed in bit 31 and the exponent in bits 30 to 23. Since the highest byte consists of the sign bit and the top 7 bits of the exponent, the shuffle produces runs of identical values as long as the sign bit and the top 7 bits of the exponent match.

For example, the elements of `Float32` `[0.5, 0.6]` have the same sign bit and exponent bits in the IEEE 754 single-precision representation:

| Sign (1 bit) | Exponent (8 bits) | Significand (23 bits) |
|--------------|-------------------|------------------------------|
| 0 | 0111 1110 | 000 0000 0000 0000 0000 0000 |
| 0 | 0111 1110 | 001 1001 1001 1001 1001 1010 |

Then, after the shuffle (least significant byte first), there is a run of the same value at `Byte 3`:

| Byte 0 | Byte 0 | Byte 1 | Byte 1 | Byte 2 | Byte 2 | Byte 3 | Byte 3 |
|--------|--------|--------|--------|--------|--------|--------|--------|
| 0x00 | 0x9a | 0x00 | 0x99 | 0x00 | 0x19 | 0x3f | 0x3f |
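To enable this filter, specify it in the `column_create` flags together with a compression flag. A minimal sketch (the `Data` table and `values` column names are made up for illustration):

```
table_create Data TABLE_NO_KEY
column_create Data values COLUMN_VECTOR|COMPRESS_ZSTD|COLUMN_FILTER_SHUFFLE Float32
```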
- `COLUMN_FILTER_BYTE_DELTA`

This flag replaces the bytes of the compression target with the differences between adjacent bytes. For example, in the case of a `UInt8` vector column `[1, 2, 3, 4, 5]`, the data after applying this flag is `[1, (2-1), (3-2), (4-3), (5-4)] = [1, 1, 1, 1, 1]`.

As in the example above, this flag aims to improve the compression ratio by taking differences and thereby creating runs of identical values. The point is whether taking the difference between adjacent elements produces such runs; this flag is not suitable for data whose differences do not. Here are examples of data that do produce runs of identical values. (A column definition sketch follows after the examples.)

First, there is data that increases at a fixed pace, such as `UInt8` `[1, 2, 3, 4, 5]` in the previous example. The increment can be any value as long as it is fixed, such as `UInt8` `[1, 11, 21, 31, 41]`, which adds 10 each time.

Next, there is data consisting of runs of the same value, such as `UInt8` `[5, 5, 5, 5, 5]`. Taking differences turns this into `[5, 0, 0, 0, 0]`, which contains a run of the same value 0.

In contrast, data such as `UInt8` `[1, 255, 2, 200, 1]` is not suitable for this flag, because taking differences does not create any runs of identical values.
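Like the other filters, this one is set via the `column_create` flags. A minimal sketch (the `Logs` table and `elapsed_seconds` column names are made up; steadily increasing values are the kind of data this filter suits):

```
table_create Logs TABLE_NO_KEY
column_create Logs elapsed_seconds COLUMN_VECTOR|COMPRESS_ZSTD|COLUMN_FILTER_BYTE_DELTA UInt32
```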
- `COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE`

This flag is available only for `Float`/`Float32`. It is expected to be used in combination with `COLUMN_FILTER_SHUFFLE`.

This flag truncates 1 byte of precision from each element of a `Float`/`Float32` vector column: it sets the bottom byte of each floating-point number to 0.

For example, the data `Float32` `[1.25, 3.67, 4.55]` is represented as follows in the IEEE 754 single-precision format:

| Sign (1 bit) | Exponent (8 bits) | Significand (23 bits) |
|--------------|-------------------|------------------------------|
| 0 | 0111 1111 | 010 0000 0000 0000 0000 0000 |
| 0 | 1000 0000 | 110 1010 1110 0001 0100 1000 |
| 0 | 1000 0001 | 001 0001 1001 1001 1001 1010 |

After applying this flag, the bottom byte of each element is 0:

| Sign (1 bit) | Exponent (8 bits) | Significand (23 bits) |
|--------------|-------------------|------------------------------|
| 0 | 0111 1111 | 010 0000 0000 0000 0000 0000 |
| 0 | 1000 0000 | 110 1010 1110 0001 0000 0000 |
| 0 | 1000 0001 | 001 0001 1001 1001 0000 0000 |

This is how this flag works. As mentioned, it is expected to be used in combination with `COLUMN_FILTER_SHUFFLE`, because the truncated data compresses better after the shuffle.

Here is the data after applying `COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE` and then shuffling with `COLUMN_FILTER_SHUFFLE`. Here, `Byte 0` is the least significant byte of each element.

| Byte 0 | Byte 0 | Byte 0 | Byte 1 | Byte 1 | Byte 1 | Byte 2 | Byte 2 | Byte 2 | Byte 3 | Byte 3 | Byte 3 |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0x00 | 0x00 | 0x00 | 0x00 | 0xe1 | 0x99 | 0xa0 | 0x6a | 0x91 | 0x3f | 0x40 | 0x40 |

Check `Byte 0` in the data: it contains a run of 0. If `COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE` is not applied before the shuffle, the data contains no runs of identical values, as shown below. So, when the shuffle alone is not effective, the compression ratio may be improved by adding `COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE`.

| Byte 0 | Byte 0 | Byte 0 | Byte 1 | Byte 1 | Byte 1 | Byte 2 | Byte 2 | Byte 2 | Byte 3 | Byte 3 | Byte 3 |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0x00 | 0x48 | 0x9a | 0x00 | 0xe1 | 0x99 | 0xa0 | 0x6a | 0x91 | 0x3f | 0x40 | 0x40 |

Note, however, that this loses accuracy: the bottom byte of precision of each compression-target floating-point number is gone.
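A column combining these flags could be defined as follows (a minimal sketch; the `Sensors` table and `readings` column names are made up):

```
table_create Sensors TABLE_NO_KEY
column_create Sensors readings COLUMN_VECTOR|COMPRESS_ZSTD|COLUMN_FILTER_SHUFFLE|COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE Float32
```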
- `COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES`

This flag is available only for `Float`/`Float32`. It is expected to be used in combination with `COLUMN_FILTER_SHUFFLE`.

This flag truncates 2 bytes of precision from each element of a `Float`/`Float32` vector column: it sets the bottom two bytes of each floating-point number to 0.

For example, the data `Float32` `[1.25, 3.67, 4.55]` is represented as follows in the IEEE 754 single-precision format:

| Sign (1 bit) | Exponent (8 bits) | Significand (23 bits) |
|--------------|-------------------|------------------------------|
| 0 | 0111 1111 | 010 0000 0000 0000 0000 0000 |
| 0 | 1000 0000 | 110 1010 1110 0001 0100 1000 |
| 0 | 1000 0001 | 001 0001 1001 1001 1001 1010 |

After applying this flag, the bottom two bytes of each element are 0:

| Sign (1 bit) | Exponent (8 bits) | Significand (23 bits) |
|--------------|-------------------|------------------------------|
| 0 | 0111 1111 | 010 0000 0000 0000 0000 0000 |
| 0 | 1000 0000 | 110 1010 0000 0000 0000 0000 |
| 0 | 1000 0001 | 001 0001 0000 0000 0000 0000 |

This is how this flag works.

As mentioned, it is expected to be used in combination with `COLUMN_FILTER_SHUFFLE`, because the truncated data compresses better after the shuffle.

Here is the data after applying `COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES` and then shuffling with `COLUMN_FILTER_SHUFFLE`:

| Byte 0 | Byte 0 | Byte 0 | Byte 1 | Byte 1 | Byte 1 | Byte 2 | Byte 2 | Byte 2 | Byte 3 | Byte 3 | Byte 3 |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0x00 | 0x00 | 0x00 | 0x00 | 0x00 | 0x00 | 0xa0 | 0x6a | 0x91 | 0x3f | 0x40 | 0x40 |

Check `Byte 0` and `Byte 1` in the data: they contain runs of 0. If `COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES` is not applied before the shuffle, the data contains no runs of identical values, as shown below. So, when the shuffle alone is not effective, the compression ratio may be improved by adding `COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES`.

| Byte 0 | Byte 0 | Byte 0 | Byte 1 | Byte 1 | Byte 1 | Byte 2 | Byte 2 | Byte 2 | Byte 3 | Byte 3 | Byte 3 |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| 0x00 | 0x48 | 0x9a | 0x00 | 0xe1 | 0x99 | 0xa0 | 0x6a | 0x91 | 0x3f | 0x40 | 0x40 |

Note, however, that this loses accuracy: the bottom 2 bytes of precision of each compression-target floating-point number are gone.
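To illustrate the accuracy loss: zeroing the bottom two bytes changes the stored values of the example data from `[1.25, 3.67, 4.55]` to `[1.25, 3.65625, 4.53125]`, which follows directly from the truncated bit patterns shown above. The column definition is analogous to the 1-byte variant (a minimal sketch; the `Metrics` table and `values` column names are made up):

```
table_create Metrics TABLE_NO_KEY
column_create Metrics values COLUMN_VECTOR|COMPRESS_ZSTD|COLUMN_FILTER_SHUFFLE|COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES Float32
```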
- Added a bundled Blosc library.

Blosc is required to use the `COLUMN_FILTER_SHUFFLE`, `COLUMN_FILTER_BYTE_DELTA`, `COMPRESS_FILTER_TRUNCATE_PRECISION_1BYTE`, and `COMPRESS_FILTER_TRUNCATE_PRECISION_2BYTES` flags.
- [status] Improved `status` output with an additional new entry `"blosc"`.

- [groonga] Improved `groonga --version` output with an additional new value `blosc`.
- [select] Improved `select` arguments with an additional new argument `--fuzzy_max_distance` (experimental).

This argument is useful when the search process should tolerate typographical errors: it compensates for typos made by users.

When this argument is set and the query does not match, Groonga checks the vocabulary list for words whose edit distance from the query is within `--fuzzy_max_distance`, and then searches again with them, as shown in the following example.

```
table_create Memos TABLE_NO_KEY
column_create Memos content COLUMN_SCALAR ShortText

table_create Lexicon TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenNgram \
  --normalizer NormalizerNFKC150
column_create Lexicon memos_content COLUMN_INDEX|WITH_POSITION Memos content

load --table Memos
[
  {"content": "This is a pen"},
  {"content": "That is a pen"}
]

select Memos \
  --match_columns content \
  --query "Thjs" \
  --fuzzy_max_distance 1 \
  --output_columns *,_score
[
  [0, 0.0, 0.0],
  [
    [
      [1],
      [["content", "ShortText"], ["_score", "Int32"]],
      ["This is a pen", 1]
    ]
  ]
]
```
- [select] Improved `select` arguments with an additional new argument `--fuzzy_max_expansions` (experimental).

As mentioned above, `--fuzzy_max_distance` makes Groonga search again with words within the specified edit distance.

For example, if the vocabulary list contains `This` and `That`, the query `Thjs` with `--fuzzy_max_distance 10` does not match by itself, so Groonga searches the vocabulary list again using `This` and `That`.

In the example above, the vocabulary list has only 2 words within edit distance 10. If there were too many words within that edit distance, performance would drop, since all of those candidate words are searched.

`--fuzzy_max_expansions` limits the number of candidate words (those with the closest edit distances) used in the search. This argument can help balance the number of hits against the performance of the search.

The default value of `--fuzzy_max_expansions` is 0. When it is 0, the search uses all words in the vocabulary list whose edit distance is within `--fuzzy_max_distance`.
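For example, reusing the `Memos` table from the previous example (a minimal sketch; the limit of 10 is illustrative):

```
select Memos \
  --match_columns content \
  --query "Thjs" \
  --fuzzy_max_distance 1 \
  --fuzzy_max_expansions 10 \
  --output_columns *,_score
```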
- [select] Improved `select` arguments with an additional new argument `--fuzzy_max_distance_ratio` (experimental).

`--fuzzy_max_distance` applies the same value to every keyword in a query. So, given the query "abc OR defghijklmn", the same `--fuzzy_max_distance` value applies to both "abc" and "defghijklmn".

However, how many typographical errors should be tolerated depends on the length of the keyword. For example, "abc" has just 3 characters, so only 1 typo should be tolerated. On the other hand, "defghijklmn" has many characters, so 3 typos can be tolerated.

`--fuzzy_max_distance_ratio` changes the edit distance automatically depending on the number of characters. This argument allows a different edit distance for each keyword.

If `--fuzzy_max_distance_ratio FUZZY_MAX_DISTANCE_RATIO` is set, the edit distance is calculated as "number of characters * `FUZZY_MAX_DISTANCE_RATIO`", rounded down to an integer. 0.34 is the recommended value for `FUZZY_MAX_DISTANCE_RATIO`, because it behaves about the same as `fuzziness=AUTO` of Elasticsearch.
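With 0.34, the edit distance per keyword works out as floor(3 * 0.34) = 1 for "abc" and floor(11 * 0.34) = 3 for "defghijklmn". A minimal sketch, reusing the `Memos` table from the earlier example (the query is illustrative):

```
select Memos \
  --match_columns content \
  --query "abc OR defghijklmn" \
  --fuzzy_max_distance_ratio 0.34 \
  --output_columns *,_score
```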
- [select] Improved `select` arguments with an additional new argument `--fuzzy_prefix_length` (experimental).

This argument accelerates searches that use `--fuzzy_max_distance`.

When it is set, only words whose leading characters match the leading characters of the keyword, for the number of characters set in `--fuzzy_prefix_length`, become candidates.

Setting a large `--fuzzy_max_distance` expands the number of candidate words and slows down the search, so limiting the candidates to character strings that share the leading N characters speeds up the process.
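A minimal sketch, reusing the `Memos` table from the earlier example: with `--fuzzy_prefix_length 1`, only vocabulary words starting with "T", the first character of "Thjs", become candidates.

```
select Memos \
  --match_columns content \
  --query "Thjs" \
  --fuzzy_max_distance 2 \
  --fuzzy_prefix_length 1 \
  --output_columns *,_score
```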
- [cast] Added support for casting a string such as `"[0.0, 1.0, 1.1, ...]"` to a `Float`/`Float32` vector, as below.

```
plugin_register functions/vector

table_create Data TABLE_NO_KEY
column_create Data numbers COLUMN_VECTOR Float32

table_create Numbers TABLE_PAT_KEY Float32
column_create Numbers data_numbers COLUMN_INDEX Data numbers

load --table Data
[
  {"numbers": "[0.1, 0.0, -0.2]"},
  {"numbers": "[-3, 2, -4294967296, 4294967296]"}
]

dump --dump_plugins no --dump_schema no
load --table Data
[
["_id","numbers"],
[1,[0.1,0.0,-0.2]],
[2,[-3.0,2.0,-4.294967e+09,4.294967e+09]]
]
column_create Numbers data_numbers COLUMN_INDEX Data numbers

select Data --filter 'vector_size(numbers) == 3'
[
  [0, 0.0, 0.0],
  [
    [
      [1],
      [["_id", "UInt32"], ["numbers", "Float32"]],
      [1, [0.1, 0.0, -0.2]]
    ]
  ]
]
```
- [fuzzy_search] Renamed the `max_expansion` option to `max_expansions`.

The `max_expansion` option is deprecated since this release. However, `max_expansion` can still be used for the time being for backward compatibility.
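For example, in a `fuzzy_search` call (a minimal sketch reusing the `Memos` table from the earlier example; the options-map form shown here is an assumption based on the option name in this release note):

```
select Memos \
  --filter 'fuzzy_search(content, "Thjs", {"max_expansions": 10})' \
  --output_columns 'content, _score'
```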
- Renamed the `master` branch to `main`.

- [RPM] Used CMake for building.

- [Debian] Added support for Debian trixie.
Fixes
- [fuzzy_search] Fixed a bug that Groonga might match records that should not match.
- [near-phrase-search] Fixed a bug that Groonga crashed when the first phrase group doesn't match anything, as below.

```
table_create Entries TABLE_NO_KEY
column_create Entries content COLUMN_SCALAR Text

table_create Terms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenNgram \
  --normalizer NormalizerNFKC121
column_create Terms entries_content COLUMN_INDEX|WITH_POSITION \
  Entries content

load --table Entries
[
  {"content": "x y z"}
]

select Entries \
  --match_columns Terms.entries_content.content \
  --query '*NPP1"(NONEXISTENT) (z)"' \
  --output_columns '_score, content'
```
Conclusion
Please refer to the following news for more details: News Release 13.0.8
Let's search with Groonga!