BloGroonga

2019-02-09

Groonga 9.0.0 has been released

Groonga 9.0.0 has been released!

This is a major version upgrade, but it keeps backward compatibility. You can upgrade to 9.0.0 without rebuilding your database.

How to install: Install

Changes

Here are important changes in this release:

Tokenizers: Added a new tokenizer, TokenPattern.

You can extract tokens with a regular expression, as shown below. This tokenizer extracts only the tokens that match the regular expression.

You can also specify multiple regular expression patterns.

tokenize 'TokenPattern("pattern", "\\\\$[0-9]", "pattern", "apples|oranges")' "I bought apples for $3 and oranges for $4."
[
  [
    0,
    1549612606.784344,
    0.0003230571746826172
  ],
  [
    {
      "value": "apples",
      "position": 0,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$3",
      "position": 1,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "oranges",
      "position": 2,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$4",
      "position": 3,
      "force_prefix": false,
      "force_prefix_search": false
    }
  ]
]
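
As a rough illustration of the behavior shown above (not Groonga's implementation), the same extraction can be sketched in Python with the standard `re` module. The function name `pattern_tokens` is hypothetical, and treating multiple patterns as one alternation scanned left to right is an assumption made for this sketch.

```python
import re


def pattern_tokens(text, patterns):
    """Return (position, value) pairs for every substring matching any pattern."""
    # Assumption: multiple patterns behave like one alternation scanned left to right.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    return [(i, m.group(0)) for i, m in enumerate(combined.finditer(text))]


tokens = pattern_tokens(
    "I bought apples for $3 and oranges for $4.",
    [r"\$[0-9]", r"apples|oranges"],
)
for position, value in tokens:
    print(position, value)
```

Text that matches none of the patterns, such as "I bought", is simply skipped, which mirrors the "extracts only the tokens that match" behavior described above.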

Tokenizers: Added a new tokenizer, TokenTable.

You can extract tokens that match the keys of an existing table, as shown below.

table_create Keywords TABLE_PAT_KEY ShortText --normalizer NormalizerNFKC100
load --table Keywords
[
{"_key": "$4"},
{"_key": "apples"},
{"_key": "$3"}
]
tokenize 'TokenTable("table", "Keywords")' "I bought apples for $4 at $3."
[
  [
    0,
    1549613095.146393,
    0.0003008842468261719
  ],
  [
    {
      "value": "apples",
      "position": 0,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$4",
      "position": 1,
      "force_prefix": false,
      "force_prefix_search": false
    },
    {
      "value": "$3",
      "position": 2,
      "force_prefix": false,
      "force_prefix_search": false
    }
  ]
]
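
The idea of extracting only the text that matches table keys can be sketched in Python as follows. This is a toy model, not how Groonga implements TokenTable: the function name `table_tokens` is hypothetical, lowercasing stands in for the normalizer, and preferring the longest key at each offset is an assumption of this sketch.

```python
def table_tokens(text, keys):
    """Return the table keys found in text, in order of appearance."""
    # Assumption: matching is case-insensitive (standing in for the normalizer)
    # and the longest key wins when several keys start at the same offset.
    normalized = text.lower()
    ordered_keys = sorted((k.lower() for k in keys), key=len, reverse=True)
    found = []
    offset = 0
    while offset < len(normalized):
        for key in ordered_keys:
            if key and normalized.startswith(key, offset):
                found.append(key)
                offset += len(key)
                break
        else:
            # No key starts here, so skip one character and keep scanning.
            offset += 1
    return found


keywords = ["$4", "apples", "$3"]
result = table_tokens("I bought apples for $4 at $3.", keywords)
print(result)
```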

select: Supported similar search against an index column.

If you use a multi-column index, this feature lets you run a similar search against all of its source columns.

table_create Documents TABLE_HASH_KEY ShortText
column_create Documents content1 COLUMN_SCALAR Text
column_create Documents content2 COLUMN_SCALAR Text
table_create Terms TABLE_PAT_KEY|KEY_NORMALIZE ShortText --default_tokenizer TokenBigram
column_create Terms document_index COLUMN_INDEX|WITH_POSITION|WITH_SECTION Documents content1,content2
load --table Documents
[
["_key", "content1"],
["Groonga overview", "Groonga is a fast and accurate full text search engine based on inverted index. One of the characteristics of Groonga is that a newly registered document instantly appears in search results."],
["Full text search and Instant update", "In widely used DBMSs, updates are immediately processed, for example, a newly registered record appears in the result of the next query. In contrast, some full text search engines do not support instant updates, because it is difficult to dynamically update inverted indexes, the underlying data structure."],
["Column store and aggregate query", "People can collect more than enough data in the Internet era."]
]
load --table Documents
[
["_key", "content2"],
["Inverted index and tokenizer", "An inverted index is a traditional data structure used for large-scale full text search."],
["Sharable storage and read lock-free", "Multi-core processors are mainstream today and the number of cores per processor is increasing."],
["Geo-location (latitude and longitude) search", "Location services are getting more convenient because of mobile devices with GPS."],
["Groonga library", "The basic functions of Groonga are provided in a C library and any application can use Groonga as a full text search engine or a column-oriented database."],
["Groonga server", "Groonga provides a built-in server command which supports HTTP, the memcached binary protocol and the Groonga Query Transfer Protocol (GQTP)."],
["Mroonga storage engine", "Groonga works not only as an independent column-oriented DBMS but also as storage engines of well-known DBMSs."]
]
select Documents --filter 'Terms.document_index *S "Full text seach by MySQL"' --output_columns '_key, _score, content1, content2'
[
  [
    0,
    1549615598.381915,
    0.0007889270782470703
  ],
  [
    [
      [
        4
      ],
      [
        [
          "_key",
          "ShortText"
        ],
        [
          "_score",
          "Int32"
        ],
        [
          "content1",
          "Text"
        ],
        [
          "content2",
          "Text"
        ]
      ],
      [
        "Groonga overview",
        87382,
        "Groonga is a fast and accurate full text search engine based on inverted index. One of the characteristics of Groonga is that a newly registered document instantly appears in search results.",
        ""
      ],
      [
        "Full text search and Instant update",
        87382,
        "In widely used DBMSs, updates are immediately processed, for example, a newly registered record appears in the result of the next query. In contrast, some full text search engines do not support instant updates, because it is difficult to dynamically update inverted indexes, the underlying data structure.",
        ""
      ],
      [
        "Inverted index and tokenizer",
        87382,
        "",
        "An inverted index is a traditional data structure used for large-scale full text search."
      ],
      [
        "Groonga library",
        87382,
        "",
        "The basic functions of Groonga are provided in a C library and any application can use Groonga as a full text search engine or a column-oriented database."
      ]
    ]
  ]
]
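
To show why searching through the index column reaches every source column, here is a toy Python sketch. It is not Groonga's similar search algorithm: the helper names, the shortened sample texts, the word-level matching, and the "count shared words" score are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def words(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents, columns):
    """Map each word to the set of (key, column) pairs where it occurs."""
    index = defaultdict(set)
    for key, record in documents.items():
        for column in columns:
            for word in words(record.get(column, "")):
                index[word].add((key, column))
    return index


def similar_search(index, query):
    """Score each document by how many query words it shares, across all columns."""
    scores = defaultdict(int)
    for word in words(query):
        for key in {k for k, _ in index.get(word, ())}:
            scores[key] += 1
    return dict(scores)


# Shortened, hypothetical sample records; each one fills only one column.
documents = {
    "Groonga overview": {"content1": "Groonga is a fast and accurate full text search engine."},
    "Inverted index and tokenizer": {"content2": "An inverted index is used for large-scale full text search."},
    "Groonga server": {"content2": "Groonga provides a built-in server command."},
}
index = build_index(documents, ["content1", "content2"])
scores = similar_search(index, "Full text seach by MySQL")
print(scores)
```

Because one index covers both columns, a record matches whether the shared words are stored in content1 or in content2.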

Normalizers: Added a new option, remove_blank, for NormalizerNFKC100.

This option removes white space, as shown below.

normalize 'NormalizerNFKC100("remove_blank", true)' "This is a pen."
[
  [
    0,
    1549528178.608151,
    0.0002171993255615234
  ],
  {
    "normalized": "thisisapen.",
    "types": [
    ],
    "checks": [
    ]
  }
]
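
The effect of the option can be approximated in Python with the standard `unicodedata` module. This is only a sketch: the function name `normalize_remove_blank` is hypothetical, and Python's NFKC tables (whatever Unicode version the interpreter ships) stand in for NormalizerNFKC100, so results may differ for some characters.

```python
import unicodedata


def normalize_remove_blank(text):
    """Approximate NFKC normalization, lowercasing, and blank removal."""
    # Assumption: unicodedata's NFKC plus lower() stands in for NormalizerNFKC100.
    normalized = unicodedata.normalize("NFKC", text).lower()
    # Drop every whitespace character, as remove_blank does in the example above.
    return "".join(ch for ch in normalized if not ch.isspace())


print(normalize_remove_blank("This is a pen."))
```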

groonga executable file: Improved the display of the thread id in the log.

On the Windows version, it was easy to confuse the thread id with the process id, so the log now makes clear which is which.

  • (Before): |2436|1032:
    • 2436 is a process id. 1032 is a thread id.
  • (After): |2436|00001032:
    • 2436 is a process id. 00001032 is a thread id.
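
The new prefix format can be mimicked with a short Python helper. The name `log_prefix` is hypothetical, and the sketch assumes the thread id is printed in decimal and zero-padded to 8 characters, which matches the example above but is not confirmed beyond it.

```python
def log_prefix(process_id, thread_id):
    """Build a log prefix in the style of the "After" example."""
    # Assumption: decimal thread id, zero-padded to a fixed width of 8.
    return f"|{process_id}|{thread_id:08d}:"


print(log_prefix(2436, 1032))
```

The fixed width makes the thread id field visually distinct from the unpadded process id.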

Conclusion

See Release 9.0.0 2019-02-09 for detailed changes since 8.1.1.

Let's search by Groonga!