BloGroonga

2019-04-29

Groonga 9.0.2 has been released

Groonga 9.0.2 has been released!

Starting with this release, we provide a package for Windows built with VC++.

We also continue to provide a package for Windows built with MinGW, as before.

However, we plan to eventually provide only the VC++ package instead of the MinGW one.

How to install: Install

Changes

Here are important changes in this release:

  • column_create Added a new flag INDEX_LARGE for index columns.

  • object_inspect Added new statistics next_physical_segment_id and max_n_physical_segments for physical segment information.

  • logical_select Added support for window function over shard.

  • logical_range_filter Added support for window function over shard.

  • logical_count Added support for window function over shard.

  • io_flush Added a new option --recursive dependent.

  • Fixed "unknown type name 'bool'" compilation error in some environments.

  • Fixed a bug that commands executed via mruby (e.g. logical_select, logical_range_filter, logical_count, etc.) output incorrect numbers for values larger than Int32.

column_create Added a new flag INDEX_LARGE for index columns.

This flag creates an index column with twice as much space as the default. However, note that it also uses twice as much memory.

This flag is useful when the index target data is large. Large data here means many records (normally at least 10 million) and at least one of the following characteristics:

  • The index targets multiple columns
  • The index table has a tokenizer

Here is an example of creating a large index column:

  column_create \
  --table Terms \
  --name people_roles_large_index \
  --flags COLUMN_INDEX|WITH_POSITION|WITH_SECTION|INDEX_LARGE \
  --type People \
  --source roles
  [[0, 1337566253.89858, 0.000355720520019531], true]

object_inspect Added new statistics next_physical_segment_id and max_n_physical_segments for physical segment information.

next_physical_segment_id is the ID of the segment that the inspected index column will use next. In other words, this number shows the current segment usage.

max_n_physical_segments is the maximum number of segments available to the inspected index column.

The maximum number of segments depends on the index column size:

Index column size  Max number of segments
INDEX_SMALL        2**9 (512)
INDEX_MEDIUM       2**16 (65536)
INDEX_LARGE        2**17 * 2 (262144)
Default            2**17 (131072)
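
A minimal sketch of checking these statistics on the people_roles_large_index column created above (the statistic names are from this release; the surrounding output layout may differ):

  object_inspect Terms.people_roles_large_index
  # The "value" section of the output is expected to include segment
  # statistics such as:
  #   "next_physical_segment_id": 0,
  #   "max_n_physical_segments": 262144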

logical_select Added support for window function over shard.

We can now apply a window function over multiple tables. However, the shard key and the leading group key or sort key must be in the same order.

For example, the window function can be applied over multiple tables in the case below, because the shard key and the leading group key are in the same order.

In the example below, the leading group key is price and the shard key is timestamp:

  plugin_register sharding
  
  table_create Logs_20170415 TABLE_NO_KEY
  column_create Logs_20170415 timestamp COLUMN_SCALAR Time
  column_create Logs_20170415 price COLUMN_SCALAR UInt32
  column_create Logs_20170415 n_likes COLUMN_SCALAR UInt32
  
  table_create Logs_20170416 TABLE_NO_KEY
  column_create Logs_20170416 timestamp COLUMN_SCALAR Time
  column_create Logs_20170416 price COLUMN_SCALAR UInt32
  column_create Logs_20170416 n_likes COLUMN_SCALAR UInt32
  
  load --table Logs_20170415
  [
  {"timestamp": "2017/04/15 00:00:00", "n_likes": 2, "price": 100},
  {"timestamp": "2017/04/15 01:00:00", "n_likes": 1, "price": 100},
  {"timestamp": "2017/04/15 01:00:00", "n_likes": 2, "price": 200}
  ]
  
  load --table Logs_20170416
  [
  {"timestamp": "2017/04/16 10:00:00", "n_likes": 1, "price": 200},
  {"timestamp": "2017/04/16 11:00:00", "n_likes": 2, "price": 300},
  {"timestamp": "2017/04/16 11:00:00", "n_likes": 1, "price": 300}
  ]
  
  logical_select Logs \
    --shard_key timestamp \
    --columns[count].stage initial \
    --columns[count].type UInt32 \
    --columns[count].flags COLUMN_SCALAR \
    --columns[count].value 'window_count()' \
    --columns[count].window.group_keys price \
    --output_columns price,count
  [
    [
      0,
      0.0,
      0.0
    ],
    [
      [
        [
          6
        ],
        [
          [
            "price",
            "UInt32"
          ],
          [
            "count",
            "UInt32"
          ]
        ],
        [
          100,
          2
        ],
        [
          100,
          2
        ],
        [
          200,
          2
        ],
        [
          200,
          2
        ],
        [
          300,
          2
        ],
        [
          300,
          2
        ]
      ]
    ]
  ]
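
The same constraint applies when ordering by a sort key instead of a group key. Here is a minimal sketch under that assumption, reusing the schema and data above and substituting window.sort_keys for window.group_keys (output omitted):

  logical_select Logs \
    --shard_key timestamp \
    --columns[count].stage initial \
    --columns[count].type UInt32 \
    --columns[count].flags COLUMN_SCALAR \
    --columns[count].value 'window_count()' \
    --columns[count].window.sort_keys price \
    --output_columns price,count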

logical_range_filter Added support for window function over shard.

We can apply a window function over multiple tables. As with logical_select, the shard key and the leading group key or sort key must be in the same order.

Here is an example that applies a window function over multiple tables with logical_range_filter:

  plugin_register sharding
  
  table_create Logs_20170415 TABLE_NO_KEY
  column_create Logs_20170415 timestamp COLUMN_SCALAR Time
  column_create Logs_20170415 price COLUMN_SCALAR UInt32
  column_create Logs_20170415 n_likes COLUMN_SCALAR UInt32
  
  table_create Logs_20170416 TABLE_NO_KEY
  column_create Logs_20170416 timestamp COLUMN_SCALAR Time
  column_create Logs_20170416 price COLUMN_SCALAR UInt32
  column_create Logs_20170416 n_likes COLUMN_SCALAR UInt32
  
  load --table Logs_20170415
  [
  {"timestamp": "2017/04/15 00:00:00", "n_likes": 2, "price": 100},
  {"timestamp": "2017/04/15 01:00:00", "n_likes": 1, "price": 100},
  {"timestamp": "2017/04/15 01:00:00", "n_likes": 2, "price": 200}
  ]
  
  load --table Logs_20170416
  [
  {"timestamp": "2017/04/16 10:00:00", "n_likes": 1, "price": 200},
  {"timestamp": "2017/04/16 11:00:00", "n_likes": 2, "price": 300},
  {"timestamp": "2017/04/16 11:00:00", "n_likes": 1, "price": 300}
  ]
  
  logical_range_filter Logs \
    --shard_key timestamp \
    --columns[count].stage initial \
    --columns[count].type UInt32 \
    --columns[count].flags COLUMN_SCALAR \
    --columns[count].value 'window_count()' \
    --columns[count].window.group_keys price \
    --output_columns price,count
  [
    [
      0,
      0.0,
      0.0
    ],
    [
      [
        [
          6
        ],
        [
          [
            "price",
            "UInt32"
          ],
          [
            "count",
            "UInt32"
          ]
        ],
        [
          100,
          2
        ],
        [
          100,
          2
        ],
        [
          200,
          2
        ],
        [
          200,
          2
        ],
        [
          300,
          2
        ],
        [
          300,
          2
        ]
      ]
    ]
  ]

logical_count Added support for window function over shard.

We can apply a window function over multiple tables. As with logical_select, the shard key and the leading group key or sort key must be in the same order.

Here is an example that applies a window function over multiple tables with logical_count:

  plugin_register sharding
  
  table_create Logs_20170415 TABLE_NO_KEY
  column_create Logs_20170415 timestamp COLUMN_SCALAR Time
  column_create Logs_20170415 price COLUMN_SCALAR UInt32
  column_create Logs_20170415 n_likes COLUMN_SCALAR UInt32
  
  table_create Logs_20170416 TABLE_NO_KEY
  column_create Logs_20170416 timestamp COLUMN_SCALAR Time
  column_create Logs_20170416 price COLUMN_SCALAR UInt32
  column_create Logs_20170416 n_likes COLUMN_SCALAR UInt32
  
  load --table Logs_20170415
  [
  {"timestamp": "2017/04/15 00:00:00", "n_likes": 2, "price": 100},
  {"timestamp": "2017/04/15 01:00:00", "n_likes": 1, "price": 100},
  {"timestamp": "2017/04/15 01:00:00", "n_likes": 2, "price": 200}
  ]
  
  load --table Logs_20170416
  [
  {"timestamp": "2017/04/16 10:00:00", "n_likes": 1, "price": 200},
  {"timestamp": "2017/04/16 11:00:00", "n_likes": 2, "price": 300},
  {"timestamp": "2017/04/16 11:00:00", "n_likes": 1, "price": 300}
  ]
  
  logical_count Logs \
    --shard_key timestamp \
    --columns[count].stage initial \
    --columns[count].type UInt32 \
    --columns[count].flags COLUMN_SCALAR \
    --columns[count].value 'window_count()' \
    --columns[count].window.group_keys price \
    --filter 'count >= 1'
  [
    [
      0,
      0.0,
      0.0
    ],
    [
      4
    ]
  ]

io_flush Added a new option --recursive dependent.

With this option, we can flush not only the target object and its child objects but also related objects.

The related objects are:

  • Tables referenced by the target
  • Related index columns (index columns that have a source column in the target TABLE_NAME)
  • The tables that hold those related index columns

Here is an example of using this option:

  io_flush --recursive "dependent" --target_name "Users"
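
As a hedged sketch of what dependent pulls in (hypothetical schema; Terms.users_name_index is a made-up name for illustration): if an index column in another table has one of Users' columns as a source, flushing Users should also flush that index column and its table:

  # Hypothetical schema: Terms.users_name_index has Users.name as a source.
  table_create Users TABLE_HASH_KEY ShortText
  column_create Users name COLUMN_SCALAR ShortText
  table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram
  column_create Terms users_name_index COLUMN_INDEX|WITH_POSITION Users name

  # Flushing Users with --recursive dependent should also flush
  # Terms.users_name_index and the Terms table.
  io_flush --recursive "dependent" --target_name "Users"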

Conclusion

See Release 9.0.2 2019-04-29 for detailed changes since 9.0.1.

Let's search by Groonga!

2019-03-29

Groonga 9.0.1 has been released

Groonga 9.0.1 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • select Added new arguments --load_table, --load_columns and --load_values.

  • Added the index_column_diff command to check for broken index columns. (This feature is still being verified.)

  • Fixed a bug that deleted records could still match because indexes were updated incorrectly.

    • It may occur when a large number of records are added or deleted.
  • Fixed a memory leak when logical_range_filter returns no records.

  • Fixed a bug that queries would not match because loaded data was not normalized correctly.

    • This bug occurred when the loaded data contained whitespace after KATAKANA and the unify_kana option was used for the normalizer.
  • Fixed a bug that indexes could be broken while being updated.

    • It may occur when large numbers of records are repeatedly added or deleted over a long period.
  • Fixed a crash caused by allocating a working area of insufficient size when updating indexes.

select Added new arguments --load_table, --load_columns and --load_values.

We can store the result of select in the table specified by --load_table. The arguments are explained below.

  • --load_table option: Specifies the table that stores the result of select.
  • --load_values option: Specifies the columns of the select result.
  • --load_columns option: Specifies the columns of the table specified by --load_table.

In this way, we can store the values of the columns specified by --load_values into the columns specified by --load_columns.

For example, we can store _id and timestamp from the select result into the Logs table specified by --load_table, as below.

table_create Logs_20150203 TABLE_HASH_KEY ShortText
column_create Logs_20150203 timestamp COLUMN_SCALAR Time

table_create Logs TABLE_HASH_KEY ShortText
column_create Logs original_id COLUMN_SCALAR UInt32
column_create Logs timestamp_text COLUMN_SCALAR ShortText

load --table Logs_20150203
[
{
  "_key": "2015-02-03:1",
  "timestamp": "2015-02-03 10:49:00"
},
{
  "_key": "2015-02-03:2",
  "timestamp": "2015-02-03 12:49:00"
}
]

select \
  --table Logs_20150203 \
  --load_table Logs \
  --load_columns "original_id, timestamp_text" \
  --load_values "_id, timestamp"
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        2
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "timestamp",
          "Time"
        ]
      ],
      [
        1,
        "2015-02-03:1",
        1422928140.0
      ],
      [
        2,
        "2015-02-03:2",
        1422935340.0
      ]
    ]
  ]
]

select --table Logs
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        2
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "original_id",
          "UInt32"
        ],
        [
          "timestamp_text",
          "ShortText"
        ]
      ],
      [
        1,
        "2015-02-03:1",
        1,
        "1422928140000000"
      ],
      [
        2,
        "2015-02-03:2",
        2,
        "1422935340000000"
      ]
    ]
  ]
]

Added index_column_diff command to check broken index column. (This feature is still being verified.)

We can check for broken indexes with this command. However, this feature is still being verified.

This command compares the values of an index column with the tokenized values of its sources and displays the differences between them.

We can use this command as below.

  • The first argument is the name of the index table that contains the target index column.
  • The second argument is the name of the target index column.

index_column_diff index_table_name index_column_name

The result of this command has three items:

  • token : the broken token.
  • remains : postings that unintentionally remain in the index.
  • missings : postings that were unintentionally deleted from the index.

If the index is intact, this command returns an empty value, as below.

index_column_diff --table Term --name data_index
[[0,1553654816.796513,0.001804113388061523],[]]
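
For a broken index, each entry in the result pairs a token with its unintended postings. A hedged sketch of one entry's shape, inferred only from the three items above (not actual output; the real field layout may differ):

# Not actual output; an illustrative entry for one broken token:
#   {
#     "token": {"id": 29, "value": "search"},
#     "remains": [{"record_id": 6}],
#     "missings": []
#   }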

Conclusion

See Release 9.0.1 2019-03-29 about detailed changes since 9.0.0

Let's search by Groonga!

2019-02-09

Groonga 9.0.0 has been released

Groonga 9.0.0 has been released!

This is a major version upgrade! But it keeps backward compatibility. You can upgrade to 9.0.0 without rebuilding your database.

How to install: Install

Changes

Here are important changes in this release:

Tokenizers Added a new tokenizer TokenPattern.

You can extract tokens with regular expressions, as below. This tokenizer extracts only tokens that match the regular expressions.

You can also specify multiple regular expression patterns.

tokenize 'TokenPattern("pattern", "\\\\$[0-9]", "pattern", "apples|oranges")' "I bought apples for $3 and oranges for $4."
[
  [
    0,
    1549612606.784344,
    0.0003230571746826172
  ],
  [
    {
      "value": "apples",
      "position": 0,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$3",
      "position": 1,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "oranges",
      "position": 2,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$4",
      "position": 3,
      "force_prefix": false,
      "force_prefix_search": false
    }
  ]
]

Tokenizers Added a new tokenizer TokenTable.

You can extract tokens using the keys of an existing table, as below.

table_create Keywords TABLE_PAT_KEY ShortText --normalizer NormalizerNFKC100
load --table Keywords
[
{"_key": "$4"},
{"_key": "apples"},
{"_key": "$3"}
]
tokenize 'TokenTable("table", "Keywords")' "I bought apples for $4 at $3."
[
  [
    0,
    1549613095.146393,
    0.0003008842468261719
  ],
  [
    {
      "value": "apples",
      "position": 0,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$4",
      "position": 1,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$3",
      "position": 2,
      "force_prefix": false,
      "force_prefix_search": false
    }
  ]
]

select Supported similar search against index columns.

If you use a multi-column index, this feature lets you run a similar search against all of its source columns.

table_create Documents TABLE_HASH_KEY ShortText
column_create Documents content1 COLUMN_SCALAR Text
column_create Documents content2 COLUMN_SCALAR Text
table_create Terms TABLE_PAT_KEY|KEY_NORMALIZE ShortText --default_tokenizer TokenBigram
column_create Terms document_index COLUMN_INDEX|WITH_POSITION|WITH_SECTION Documents content1,content2
load --table Documents
[
["_key", "content1"],
["Groonga overview", "Groonga is a fast and accurate full text search engine based on inverted index. One of the characteristics of Groonga is that a newly registered document instantly appears in search results."],
["Full text search and Instant update", "In widely used DBMSs, updates are immediately processed, for example, a newly registered record appears in the result of the next query. In contrast, some full text search engines do not support instant updates, because it is difficult to dynamically update inverted indexes, the underlying data structure."],
["Column store and aggregate query", "People can collect more than enough data in the Internet era."]
]
load --table Documents
[
["_key", "content2"],
["Inverted index and tokenizer", "An inverted index is a traditional data structure used for large-scale full text search."],
["Sharable storage and read lock-free", "Multi-core processors are mainstream today and the number of cores per processor is increasing."],
["Geo-location (latitude and longitude) search", "Location services are getting more convenient because of mobile devices with GPS."],
["Groonga library", "The basic functions of Groonga are provided in a C library and any application can use Groonga as a full text search engine or a column-oriented database."],
["Groonga server", "Groonga provides a built-in server command which supports HTTP, the memcached binary protocol and the Groonga Query Transfer Protocol (GQTP)."],
["Mroonga storage engine", "Groonga works not only as an independent column-oriented DBMS but also as storage engines of well-known DBMSs."]
]
select Documents --filter 'Terms.document_index *S "Full text seach by MySQL"' --output_columns '_key, _score, content1, content2'
[
  [
    0,
    1549615598.381915,
    0.0007889270782470703
  ],
  [
    [
      [
        4
      ],
      [
        [
          "_key",
          "ShortText"
        ],
        [
          "_score",
          "Int32"
        ],
        [
          "content1",
          "Text"
        ],
        [
          "content2",
          "Text"
        ]
      ],
      [
        "Groonga overview",
        87382,
        "Groonga is a fast and accurate full text search engine based on inverted index. One of the characteristics of Groonga is that a newly registered document instantly appears in search results.",
        ""
      ],
      [
        "Full text search and Instant update",
        87382,
        "In widely used DBMSs, updates are immediately processed, for example, a newly registered record appears in the result of the next query. In contrast, some full text search engines do not support instant updates, because it is difficult to dynamically update inverted indexes, the underlying data structure.",
        ""
      ],
      [
        "Inverted index and tokenizer",
        87382,
        "",
        "An inverted index is a traditional data structure used for large-scale full text search."
      ],
      [
        "Groonga library",
        87382,
        "",
        "The basic functions of Groonga are provided in a C library and any application can use Groonga as a full text search engine or a column-oriented database."
      ]
    ]
  ]
]

Normalizers Added new option remove_blank for NormalizerNFKC100.

This option removes white space, as below.

normalize 'NormalizerNFKC100("remove_blank", true)' "This is a pen."
[
  [
    0,
    1549528178.608151,
    0.0002171993255615234
  ],
  {
    "normalized": "thisisapen.",
    "types": [
    ],
    "checks": [
    ]
  }
]

groonga executable file Improved the display of thread ids in logs.

Because it was easy to confuse thread ids and process ids in the Windows version, the log now makes clear which is which.

  • (Before): |2436|1032:
    • 2436 is the process id and 1032 is the thread id.
  • (After): |2436|00001032:
    • 2436 is the process id and 00001032 is the thread id.

Conclusion

See Release 9.0.0 2019-02-09 for detailed changes since 8.1.1.

Let's search by Groonga!

2019-01-29

Groonga 8.1.1 has been released

Groonga 8.1.1 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • logical_select Added new arguments --load_table, --load_columns and --load_values.

  • groonga executable file Added a new option --log-flags.

  • Fixed a memory leak that occurred on index update errors.

  • Normalizers Fixed a bug that stateless normalizers and stateful normalizers returned wrong results when used at the same time (a sketch follows the lists below).

    • The stateless normalizers are:

      • unify_kana
      • unify_kana_case
      • unify_kana_voiced_sound_mark
      • unify_hyphen
      • unify_prolonged_sound_mark
      • unify_hyphen_and_prolonged_sound_mark
      • unify_middle_dot
    • The stateful normalizers are:

      • unify_katakana_v_sounds
      • unify_katakana_bu_sound
      • unify_to_romaji
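
A minimal sketch of a combination that was affected, assuming NormalizerNFKC100 with one stateless option (unify_kana) and one stateful option (unify_katakana_v_sounds); with this fix both options should apply correctly:

normalize 'NormalizerNFKC100("unify_kana", true, "unify_katakana_v_sounds", true)' "ヴァイオリン"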

logical_select Added new arguments --load_table, --load_columns and --load_values.

We can store the result of logical_select in the table specified by --load_table.

The --load_values option specifies the columns of the logical_select result.

The --load_columns option specifies the columns of the table specified by --load_table.

In this way, you can store the values of the columns specified by --load_values into the columns specified by --load_columns.

For example, we can store _id and timestamp from the logical_select result into the Logs table specified by --load_table, as below.

table_create Logs_20150203 TABLE_HASH_KEY ShortText
column_create Logs_20150203 timestamp COLUMN_SCALAR Time

table_create Logs_20150204 TABLE_HASH_KEY ShortText
column_create Logs_20150204 timestamp COLUMN_SCALAR Time

table_create Logs TABLE_HASH_KEY ShortText
column_create Logs original_id COLUMN_SCALAR UInt32
column_create Logs timestamp_text COLUMN_SCALAR ShortText

load --table Logs_20150203
[
{
  "_key": "2015-02-03:1",
  "timestamp": "2015-02-03 10:49:00"
},
{
  "_key": "2015-02-03:2",
  "timestamp": "2015-02-03 12:49:00"
}
]

load --table Logs_20150204
[
{
  "_key": "2015-02-04:1",
  "timestamp": "2015-02-04 00:00:00"
}
]

logical_select \
  --logical_table Logs \
  --shard_key timestamp \
  --load_table Logs \
  --load_columns "original_id, timestamp_text" \
  --load_values "_id, timestamp"
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        3
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "timestamp",
          "Time"
        ]
      ],
      [
        1,
        "2015-02-03:1",
        1422928140.0
      ],
      [
        2,
        "2015-02-03:2",
        1422935340.0
      ],
      [
        1,
        "2015-02-04:1",
        1422975600.0
      ]
    ]
  ]
]

select --table Logs
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        3
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "original_id",
          "UInt32"
        ],
        [
          "timestamp_text",
          "ShortText"
        ]
      ],
      [
        1,
        "2015-02-03:1",
        1,
        "1422928140000000"
      ],
      [
        2,
        "2015-02-03:2",
        2,
        "1422935340000000"
      ],
      [
        3,
        "2015-02-04:1",
        1,
        "1422975600000000"
      ]
    ]
  ]
]

groonga executable file Added a new option --log-flags.

We can specify which items Groonga outputs to its log.

The following items can be output:

  • Timestamp
  • Log message
  • Location(the location where the log was output)
  • Process id
  • Thread id

We can specify a prefix as below.

  • +

    • This prefix means that "add the flag".
  • -

    • This prefix means that "remove the flag".
  • No prefix means that "replace existing flags".

Specifically, the following flags can be specified:

  • none

    • Output nothing into the log.
  • time

    • Output a timestamp into the log.
  • message

    • Output log messages into the log.
  • location

    • Output the location where the log was output (a file name, a line and a function name) and the process id.
  • process_id

    • Output a process id into the log.
  • pid

    • This flag is an alias of process_id.
  • thread_id

    • Output a thread id into the log.
  • all

    • This flag specifies all flags except the none and default flags.
  • default

    • Output a timestamp and log messages into the log.

We can also specify multiple log flags by separating flags with |.

For example, we can additionally output the process id and thread id as below.

Execute command
% groonga --log-path groonga.log --log-flags "+pid|+thread_id" db/test.db

Result format
Timestamp|Log level|process id|thread id: Log message

Result
2019-01-29 08:53:03.587000|n|2344|3228: grn_init: <8.1.1-xx-xxxxxxxx>
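
As another sketch, flags without a prefix replace the current set entirely; assuming the same database, the following should limit the log to timestamps and messages (the default set):

% groonga --log-path groonga.log --log-flags "time|message" db/test.db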

Conclusion

See Release 8.1.1 2019-01-29 for detailed changes since 8.1.0.

Let's search by Groonga!

2018-12-29

Groonga 8.1.0 has been released

Groonga 8.1.0 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • Fixed io_flush so that the lock on the DB is always released after flushing.
    • The OS flushes the unlock information to storage at some later time. However, if Groonga exited before the OS flushed to storage, the lock could remain in the DB.
    • This problem occurred only on Windows.
  • Fixed a bug that the reindex command did not finish when executed against a table that has records with no references.

Conclusion

See Release 8.1.0 2018-12-29 for detailed changes since 8.0.9.

Let's search by Groonga!