BloGroonga

2019-02-09

Groonga 9.0.0 has been released

Groonga 9.0.0 has been released!

This is a major version upgrade, but it keeps backward compatibility. You can upgrade to 9.0.0 without rebuilding your database.

How to install: Install

Changes

Here are important changes in this release:

Tokenizers: Added a new tokenizer TokenPattern.

You can extract tokens with a regular expression as shown below. This tokenizer extracts only the tokens that match the regular expression.

You can also specify multiple regular expression patterns.

tokenize 'TokenPattern("pattern", "\\\\$[0-9]", "pattern", "apples|oranges")' "I bought apples for $3 and oranges for $4."
[
  [
    0,
    1549612606.784344,
    0.0003230571746826172
  ],
  [
    {
      "value": "apples",
      "position": 0,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$3",
      "position": 1,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "oranges",
      "position": 2,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$4",
      "position": 3,
      "force_prefix": false,
      "force_prefix_search": false
    }
  ]
]
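
To make the matching behavior concrete, here is a rough Python sketch of the same extraction. It is not how Groonga implements TokenPattern: it uses Python's re module, and the helper name pattern_tokens is made up for illustration. It assumes that multiple patterns behave like a single alternation scanned from left to right, which is consistent with the output above.

```python
import re


def pattern_tokens(text, patterns):
    """Return every substring of text that matches any of the patterns.

    Hypothetical helper for illustration; not part of Groonga.
    """
    # Assumption: several patterns act like one alternation, scanned left to right.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    return [match.group(0) for match in combined.finditer(text)]


tokens = pattern_tokens(
    "I bought apples for $3 and oranges for $4.",
    [r"\$[0-9]", r"apples|oranges"],
)
print(tokens)
```

With this input the sketch yields the same four values as the tokenize command: apples, $3, oranges and $4.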

Tokenizers: Added a new tokenizer TokenTable.

You can extract tokens that match the keys of an existing table as shown below.

table_create Keywords TABLE_PAT_KEY ShortText --normalizer NormalizerNFKC100
load --table Keywords
[
{"_key": "$4"},
{"_key": "apples"},
{"_key": "$3"}
]
tokenize 'TokenTable("table", "Keywords")' "I bought apples for $4 at $3."
[
  [
    0,
    1549613095.146393,
    0.0003008842468261719
  ],
  [
    {
      "value": "apples",
      "position": 0,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$4",
      "position": 1,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$3",
      "position": 2,
      "force_prefix": false,
      "force_prefix_search": false
    }
  ]
]
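
The idea can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the function table_tokens is hypothetical, NormalizerNFKC100 is reduced to lowercasing, and a naive longest-match scan stands in for whatever lookup Groonga performs on the table.

```python
def table_tokens(text, keys):
    """Extract the substrings of text that are keys of a keyword table.

    Hypothetical helper for illustration; not part of Groonga.
    """
    # Assumption: lowercasing is enough to mimic the normalizer for this input.
    normalized = text.lower()
    # Try longer keys first so the longest key wins at each position.
    candidates = sorted((k for k in keys if k), key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(normalized):
        for key in candidates:
            if normalized.startswith(key, i):
                tokens.append(key)
                i += len(key)
                break
        else:
            # No key starts here, so skip this character.
            i += 1
    return tokens


keywords = ["$4", "apples", "$3"]
found = table_tokens("I bought apples for $4 at $3.", keywords)
print(found)
```

The tokens come back in the order they appear in the text (apples, $4, $3), not in the order the keys were loaded.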

select: Supported similar search against index column.

If you use a multi-column index, this feature lets you run a similar search against all of its source columns.

table_create Documents TABLE_HASH_KEY ShortText
column_create Documents content1 COLUMN_SCALAR Text
column_create Documents content2 COLUMN_SCALAR Text
table_create Terms TABLE_PAT_KEY|KEY_NORMALIZE ShortText --default_tokenizer TokenBigram
column_create Terms document_index COLUMN_INDEX|WITH_POSITION|WITH_SECTION Documents content1,content2
load --table Documents
[
["_key", "content1"],
["Groonga overview", "Groonga is a fast and accurate full text search engine based on inverted index. One of the characteristics of Groonga is that a newly registered document instantly appears in search results."],
["Full text search and Instant update", "In widely used DBMSs, updates are immediately processed, for example, a newly registered record appears in the result of the next query. In contrast, some full text search engines do not support instant updates, because it is difficult to dynamically update inverted indexes, the underlying data structure."],
["Column store and aggregate query", "People can collect more than enough data in the Internet era."]
]
load --table Documents
[
["_key", "content2"],
["Inverted index and tokenizer", "An inverted index is a traditional data structure used for large-scale full text search."],
["Sharable storage and read lock-free", "Multi-core processors are mainstream today and the number of cores per processor is increasing."],
["Geo-location (latitude and longitude) search", "Location services are getting more convenient because of mobile devices with GPS."],
["Groonga library", "The basic functions of Groonga are provided in a C library and any application can use Groonga as a full text search engine or a column-oriented database."],
["Groonga server", "Groonga provides a built-in server command which supports HTTP, the memcached binary protocol and the Groonga Query Transfer Protocol (GQTP)."],
["Mroonga storage engine", "Groonga works not only as an independent column-oriented DBMS but also as storage engines of well-known DBMSs."]
]
select Documents --filter 'Terms.document_index *S "Full text seach by MySQL"' --output_columns '_key, _score, content1, content2'
[
  [
    0,
    1549615598.381915,
    0.0007889270782470703
  ],
  [
    [
      [
        4
      ],
      [
        [
          "_key",
          "ShortText"
        ],
        [
          "_score",
          "Int32"
        ],
        [
          "content1",
          "Text"
        ],
        [
          "content2",
          "Text"
        ]
      ],
      [
        "Groonga overview",
        87382,
        "Groonga is a fast and accurate full text search engine based on inverted index. One of the characteristics of Groonga is that a newly registered document instantly appears in search results.",
        ""
      ],
      [
        "Full text search and Instant update",
        87382,
        "In widely used DBMSs, updates are immediately processed, for example, a newly registered record appears in the result of the next query. In contrast, some full text search engines do not support instant updates, because it is difficult to dynamically update inverted indexes, the underlying data structure.",
        ""
      ],
      [
        "Inverted index and tokenizer",
        87382,
        "",
        "An inverted index is a traditional data structure used for large-scale full text search."
      ],
      [
        "Groonga library",
        87382,
        "",
        "The basic functions of Groonga are provided in a C library and any application can use Groonga as a full text search engine or a column-oriented database."
      ]
    ]
  ]
]

Normalizers: Added a new option remove_blank for NormalizerNFKC100.

This option removes white space as shown below.

normalize 'NormalizerNFKC100("remove_blank", true)' "This is a pen."
[
  [
    0,
    1549528178.608151,
    0.0002171993255615234
  ],
  {
    "normalized": "thisisapen.",
    "types": [
    ],
    "checks": [
    ]
  }
]
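
For readers who want to reproduce the effect outside Groonga, the following Python sketch approximates it. It is an assumption-laden stand-in: NormalizerNFKC100 does more than Python's unicodedata.normalize, and the function name normalize_remove_blank is invented for this example.

```python
import unicodedata


def normalize_remove_blank(text):
    """Approximate NormalizerNFKC100 with remove_blank enabled.

    Hypothetical helper for illustration; not part of Groonga.
    """
    # Assumption: NFKC plus lowercasing is close enough for plain ASCII input.
    folded = unicodedata.normalize("NFKC", text).lower()
    # Drop every white space character, which is what remove_blank is for.
    return "".join(ch for ch in folded if not ch.isspace())


normalized = normalize_remove_blank("This is a pen.")
print(normalized)
```

For the input above the sketch produces thisisapen., matching the "normalized" value in the command output.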

groonga executable file: Improved display of thread id in log.

Because it was easy to confuse the thread id with the process id on the Windows version, the log now makes clear which is which.

  • (Before): |2436|1032:
    • 2436 is a process id. 1032 is a thread id.
  • (After): |2436|00001032:
    • 2436 is a process id, 00001032 is a thread id.

Conclusion

See Release 9.0.0 2019-02-09 for detailed changes since 8.1.1.

Let's search by Groonga!

2019-01-29

Groonga 8.1.1 has been released

Groonga 8.1.1 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • logical_select: Added new arguments --load_table, --load_columns and --load_values.

  • groonga executable file: Added a new option --log-flags.

  • Fixed a memory leak that occurred when an index update error happened.

  • Normalizers: Fixed a bug where stateless normalizers and stateful normalizers returned wrong results when used at the same time.

    • The stateless normalizers are as follows.

      • unify_kana
      • unify_kana_case
      • unify_kana_voiced_sound_mark
      • unify_hyphen
      • unify_prolonged_sound_mark
      • unify_hyphen_and_prolonged_sound_mark
      • unify_middle_dot
    • The stateful normalizers are as follows.

      • unify_katakana_v_sounds
      • unify_katakana_bu_sound
      • unify_to_romaji

logical_select: Added new arguments --load_table, --load_columns and --load_values.

We can store the result of logical_select in the table specified by --load_table.

The --load_values option specifies the columns of the logical_select result to store.

The --load_columns option specifies the columns of the table given by --load_table.

In this way, the values of the columns specified with --load_values are stored into the columns specified with --load_columns.

For example, we can store the _id and timestamp values from a logical_select result in the Logs table specified by --load_table as shown below.

table_create Logs_20150203 TABLE_HASH_KEY ShortText
column_create Logs_20150203 timestamp COLUMN_SCALAR Time

table_create Logs_20150204 TABLE_HASH_KEY ShortText
column_create Logs_20150204 timestamp COLUMN_SCALAR Time

table_create Logs TABLE_HASH_KEY ShortText
column_create Logs original_id COLUMN_SCALAR UInt32
column_create Logs timestamp_text COLUMN_SCALAR ShortText

load --table Logs_20150203
[
{
  "_key": "2015-02-03:1",
  "timestamp": "2015-02-03 10:49:00"
},
{
  "_key": "2015-02-03:2",
  "timestamp": "2015-02-03 12:49:00"
}
]

load --table Logs_20150204
[
{
  "_key": "2015-02-04:1",
  "timestamp": "2015-02-04 00:00:00"
}
]

logical_select \
  --logical_table Logs \
  --shard_key timestamp \
  --load_table Logs \
  --load_columns "original_id, timestamp_text" \
  --load_values "_id, timestamp"
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        3
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "timestamp",
          "Time"
        ]
      ],
      [
        1,
        "2015-02-03:1",
        1422928140.0
      ],
      [
        2,
        "2015-02-03:2",
        1422935340.0
      ],
      [
        1,
        "2015-02-04:1",
        1422975600.0
      ]
    ]
  ]
]
select --table Logs
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        3
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "original_id",
          "UInt32"
        ],
        [
          "timestamp_text",
          "ShortText"
        ]
      ],
      [
        1,
        "2015-02-03:1",
        1,
        "1422928140000000"
      ],
      [
        2,
        "2015-02-03:2",
        2,
        "1422935340000000"
      ],
      [
        3,
        "2015-02-04:1",
        1,
        "1422975600000000"
      ]
    ]
  ]
]
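
The pairing between --load_values and --load_columns is positional, which the following Python sketch tries to illustrate. It only models the mapping, not Groonga itself: load_result and time_to_text are hypothetical names, carrying _key over to the destination table is inferred from the output above, and the microsecond text form of the timestamp simply mirrors what that output shows.

```python
def time_to_text(seconds):
    """Render a Time value (in seconds) as microsecond text, as in the output above."""
    return str(round(seconds * 1_000_000))


def load_result(rows, load_columns, load_values, converters=None):
    """Copy the load_values columns of each row into load_columns, pairwise.

    Hypothetical helper for illustration; not part of Groonga.
    """
    converters = converters or {}
    table = {}
    for row in rows:
        record = {}
        # The n-th source column feeds the n-th destination column.
        for column, source in zip(load_columns, load_values):
            convert = converters.get(column, lambda value: value)
            record[column] = convert(row[source])
        # Assumption: the record key is carried over unchanged.
        table[row["_key"]] = record
    return table


rows = [
    {"_id": 1, "_key": "2015-02-03:1", "timestamp": 1422928140.0},
    {"_id": 2, "_key": "2015-02-03:2", "timestamp": 1422935340.0},
    {"_id": 1, "_key": "2015-02-04:1", "timestamp": 1422975600.0},
]
logs = load_result(
    rows,
    load_columns=["original_id", "timestamp_text"],
    load_values=["_id", "timestamp"],
    converters={"timestamp_text": time_to_text},
)
print(logs)
```

Note that the third record keeps original_id 1: the _id comes from its own shard, while the Logs table assigns its own _id 3.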

groonga executable file: Added a new option --log-flags.

We can specify which items Groonga outputs to its log.

The following items can be output.

  • Timestamp
  • Log message
  • Location (the location where the log was output)
  • Process id
  • Thread id

Each flag can take one of the following prefixes.

  • +

    • This prefix means "add the flag".
  • -

    • This prefix means "remove the flag".
  • No prefix means "replace existing flags".

Specifically, the following flags are available.

  • none

    • Output nothing into the log.
  • time

    • Output a timestamp into the log.
  • message

    • Output log messages into the log.
  • location

    • Output the location where the log was output (a file name, a line and a function name) and process id.
  • process_id

    • Output a process id into the log.
  • pid

    • This flag is an alias of process_id.
  • thread_id

    • Output thread id into the log.
  • all

    • This flag specifies all flags except none and default flags.
  • default

    • Output a timestamp and log messages into the log.

We can also specify multiple log flags by separating flags with |.

For example, we can additionally output the process id and thread id as shown below.

Execute command
% groonga --log-path groonga.log --log-flags "+pid|+thread_id" db/test.db

Result format
Timestamp|Log level|process id|thread id: Log message

Result
2019-01-29 08:53:03.587000|n|2344|3228: grn_init: <8.1.1-xx-xxxxxxxx>
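
The prefix rules can be read as set operations, as in this Python sketch. It is a model of the description above, not Groonga's parser: the function apply_log_flags is hypothetical, it assumes parsing starts from the default flags (which matches the example result), and it assumes each unprefixed flag replaces whatever set has been built so far.

```python
DEFAULT_FLAGS = {"time", "message"}
ALL_FLAGS = {"time", "message", "location", "process_id", "thread_id"}
ALIASES = {"pid": "process_id"}


def apply_log_flags(spec, current=None):
    """Apply a --log-flags style string to a set of flag names.

    Hypothetical helper for illustration; not part of Groonga.
    """
    # Assumption: we start from the default flags unless told otherwise.
    flags = set(DEFAULT_FLAGS if current is None else current)
    for item in spec.split("|"):
        item = item.strip()
        if not item:
            continue
        prefix = item[0] if item[0] in "+-" else ""
        name = item[len(prefix):]
        name = ALIASES.get(name, name)
        if name == "none":
            selected = set()
        elif name == "all":
            selected = set(ALL_FLAGS)
        elif name == "default":
            selected = set(DEFAULT_FLAGS)
        elif name in ALL_FLAGS:
            selected = {name}
        else:
            raise ValueError(f"unknown log flag: {name}")
        if prefix == "+":
            flags |= selected  # add the flag
        elif prefix == "-":
            flags -= selected  # remove the flag
        else:
            flags = selected  # assumption: replace the flags built so far
    return flags


enabled = apply_log_flags("+pid|+thread_id")
print(sorted(enabled))
```

For "+pid|+thread_id" the sketch ends up with time, message, process_id and thread_id enabled, which lines up with the items in the result format above.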

Conclusion

See Release 8.1.1 2019-01-29 for detailed changes since 8.1.0.

Let's search by Groonga!

2018-12-29

Groonga 8.1.0 has been released

Groonga 8.1.0 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • Fixed a bug where unlocking the DB was always executed after the flush when executing an io_flush command.
    • The OS flushes the unlock information to storage at some later time. However, if Groonga exits before the OS flushes it to storage, the lock remains in the DB.
    • This problem occurs only on Windows.
  • Fixed a bug where the reindex command did not finish when executed against a table containing a record that has no references.

Conclusion

See Release 8.1.0 2018-12-29 for detailed changes since 8.0.9.

Let's search by Groonga!

2018-11-29

Groonga 8.0.9 has been released

Groonga 8.0.9 has been released!

How to install: Install

Changes

Here are important changes in this release:

The TokenDelimit tokenizer now supports any delimiter, not only whitespace

New options delimiter and pattern are now available for TokenDelimit to specify any delimiter, like:

% groonga
> tokenize 'TokenDelimit("delimiter", ",")' "A,B"
=> "A", "B"
> tokenize 'TokenDelimit("delimiter", ",")' "A , B"
=> "A ", " B" (whitespace still there)
> tokenize 'TokenDelimit("pattern", "\\\\s*,\\\\s*")' "A, B  ,C"
=> "A", "B", "C"

Please note that characters not specified by the delimiter option are not treated as delimiters, as the second example shows. The pattern option accepts a regular expression, which is useful for input like the third example, where the amount of whitespace around each delimiter varies.
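
The difference between the two options can be mimicked with plain string splitting, as in this Python sketch. It is an analogy rather than Groonga's implementation: the function delimit is hypothetical, Python's re module stands in for Groonga's regular expression engine, and dropping empty tokens is an assumption.

```python
import re


def delimit(text, delimiter=None, pattern=None):
    """Split text on a literal delimiter or on a regular expression.

    Hypothetical helper for illustration; not part of Groonga.
    """
    if pattern is not None:
        parts = re.split(pattern, text)
    elif delimiter is not None:
        parts = text.split(delimiter)
    else:
        # With no option, fall back to splitting on whitespace.
        parts = text.split()
    # Assumption: empty tokens are dropped.
    return [part for part in parts if part]


print(delimit("A,B", delimiter=","))
print(delimit("A , B", delimiter=","))
print(delimit("A, B  ,C", pattern=r"\s*,\s*"))
```

The second call keeps the surrounding spaces because only the comma is a delimiter, while the third call absorbs them into the pattern.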

Improvements around normalizers and token filters mainly for better internationalization

The NormalizerNFKC100 normalizer now supports a new option unify_to_romaji to convert both hiragana and katakana to romaji, like:

% groonga
> normalize 'NormalizerNFKC100("unify_to_romaji", true)' "リンゴ みかん"
=> "ringo mikan"

A new built-in token filter, TokenFilterNFKC100, has also been added. Like NormalizerNFKC100, it can convert katakana to hiragana with the unify_kana option:

% groonga
> tokenize TokenMecab "リンゴおいしい" --token_filters TokenFilterNFKC100
=> "リンゴ", "おいしい" ("リンゴ" was normalized)
> tokenize TokenMecab "リンゴおいしい" --token_filters 'TokenFilterNFKC100("unify_kana", true)'
=> "りんご", "おいしい" ("リンゴ" was normalized and converted)
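
The katakana to hiragana part of this conversion relies on the two scripts sitting at a fixed offset from each other in Unicode. The Python sketch below shows only that idea; the function katakana_to_hiragana is hypothetical and covers just the basic katakana letters, whereas the real unify_kana option may handle more cases.

```python
def katakana_to_hiragana(text):
    """Map basic katakana letters to their hiragana counterparts.

    Hypothetical helper for illustration; not part of Groonga.
    """
    converted = []
    for ch in text:
        code = ord(ch)
        # Katakana U+30A1..U+30F6 sit exactly 0x60 above hiragana U+3041..U+3096.
        if 0x30A1 <= code <= 0x30F6:
            converted.append(chr(code - 0x60))
        else:
            converted.append(ch)
    return "".join(converted)


print(katakana_to_hiragana("リンゴ"))
```

Characters outside that range, including text that is already hiragana, pass through unchanged.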

The TokenFilterStem filter now supports a new option algorithm for stemming not only in English but also in other languages: French, Spanish, Portuguese, Italian, Romanian, German, Dutch, Swedish, Norwegian, Danish, Russian, and Finnish. The test for the option describes its usage.

The TokenFilterStopWord filter now supports a new option column to change the name of a column for stop words from is_stop_word to any other. The test for the option describes its usage.

Conclusion

See Release 8.0.9 2018-11-29 for detailed changes since 8.0.8.

Let's search by Groonga!

2018-10-29

Groonga 8.0.8 has been released

Groonga 8.0.8 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • New options for the TokenMecab tokenizer.
  • Supported locking of a database during an io_flush.

New options for the TokenMecab tokenizer

TokenMecab now accepts a target_class option:

The target_class option limits the search to tokens of a specified part of speech. This option can also specify subclasses, and it can add or exclude a specific part of speech using + or -.

  • + adds a part of speech to the search target.
    • If you specify only + or "" (an empty string), all tokens are search targets.
  • - excludes a part of speech from the search target.

For example, you can search all tokens except pronouns as shown below.

'TokenMecab("target_class", "-名詞/代名詞", "target_class", "+")'
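
One way to picture how these rules combine is the Python sketch below. It is a guess at the semantics based on the description above, not TokenMecab's actual logic: the function is_target is hypothetical, the part-of-speech tags are hand-written for illustration, and it assumes a token is a target when some + rule covers it and no - rule does.

```python
def is_target(token_class, rules):
    """Decide whether a token's part of speech is selected by the rules.

    Hypothetical helper for illustration; not part of Groonga.
    """
    def covers(rule_class):
        # An empty class covers everything; otherwise match a class or its subclasses.
        return (
            rule_class == ""
            or token_class == rule_class
            or token_class.startswith(rule_class + "/")
        )

    included = False
    excluded = False
    for rule in rules:
        if rule.startswith("-"):
            excluded = excluded or covers(rule[1:])
        else:
            # Assumption: a rule without a prefix behaves like a + rule.
            name = rule[1:] if rule.startswith("+") else rule
            included = included or covers(name)
    return included and not excluded


# Hand-written tokens and tags, for illustration only.
tagged = [("彼", "名詞/代名詞/一般"), ("の", "助詞/連体化"), ("本", "名詞/一般")]
rules = ["-名詞/代名詞", "+"]
targets = [surface for surface, cls in tagged if is_target(cls, rules)]
print(targets)
```

With these rules the pronoun is dropped and the other two tokens remain search targets.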

Supported locking of a database during an io_flush

This feature was added to fix a bug where Groonga crashed when a table targeted by io_flush was deleted while io_flush was running. io_flush now locks the Groonga database while flushing, so you can't run the following commands while io_flush is running:

  • column_create
  • column_remove
  • column_rename
  • logical_table_remove
  • object_remove
  • plugin_register
  • plugin_unregister
  • table_create
  • table_remove
  • table_rename

Conclusion

See Release 8.0.8 2018-10-29 for detailed changes since 8.0.7.

Let's search by Groonga!