BloGroonga

2018-05-29

Groonga 8.0.3 has been released

Groonga 8.0.3 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • [highlight_html] Supported highlighting of search results matched via NormalizerNFKC100 or TokenNgram.
  • [normalizers] Added a new unify_middle_dot option for NormalizerNFKC100.
  • [normalizers] Added a new unify_katakana_v_sounds option for NormalizerNFKC100.
  • [normalizers] Added a new unify_katakana_bu_sound option for NormalizerNFKC100.
  • [sub_filter] Supported a sub_filter optimization for the case where records are already narrowed down enough before sub_filter runs.
  • [delete] Added a new limit option.
  • [normalizers] Fixed a bug that FULLWIDTH LATIN CAPITAL LETTERs such as U+FF21 FULLWIDTH LATIN CAPITAL LETTER A weren't normalized to LATIN SMALL LETTERs such as U+0061 LATIN SMALL LETTER A. If you have been using NormalizerNFKC100, you must recreate your indexes.

[highlight_html] Supported highlighting of search results matched via NormalizerNFKC100 or TokenNgram

You can highlight keywords that were matched via NormalizerNFKC100 or TokenNgram, as in the example below.

table_create Entries TABLE_NO_KEY
column_create Entries body COLUMN_SCALAR ShortText
table_create Terms TABLE_PAT_KEY ShortText   --default_tokenizer 'TokenNgram("report_source_location", true)'   --normalizer 'NormalizerNFKC100'
column_create Terms document_index COLUMN_INDEX|WITH_POSITION Entries body
load --table Entries
[
{"body": "ア㌕Az"}
]
[[0,0.0,0.0],1]
select Entries   --match_columns body   --query 'グラム'   --output_columns 'highlight_html(body, Terms)'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        1
      ],
      [
        [
          "highlight_html",
          null
        ]
      ],
      [
        "ア<span class=\"keyword\">㌕</span>Az"
      ]
    ]
  ]
]

[normalizers] Added a new unify_middle_dot option for NormalizerNFKC100

This option normalizes middle dot characters as in the example below.

normalize   'NormalizerNFKC100("unify_middle_dot", true)'   "·ᐧ•∙⋅⸱・・"   WITH_TYPES
[
  [
    0,
    0.0,
    0.0
  ],
  {
    "normalized": "········",
    "types": [
      "symbol",
      "symbol",
      "symbol",
      "symbol",
      "symbol",
      "symbol",
      "symbol",
      "symbol"
    ],
    "checks": [

    ]
  }
]

With this option, you can search with or without a middle dot, and regardless of which middle dot variant appears in the text.

[normalizers] Added a new unify_katakana_v_sounds option for NormalizerNFKC100

This option normalizes ヴァヴィヴヴェヴォ (katakana) to バビブベボ (katakana) as in the example below.

normalize   'NormalizerNFKC100("unify_katakana_v_sounds", true)'   "ヴァヴィヴヴェヴォヴ"   WITH_TYPES
[
  [
    0,
    0.0,
    0.0
  ],
  {
    "normalized": "バビブベボブ",
    "types": [
      "katakana",
      "katakana",
      "katakana",
      "katakana",
      "katakana",
      "katakana"
    ],
    "checks": [

    ]
  }
]

For example, a search for バイオリン (violin) also matches ヴァイオリン (violin).
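
A minimal sketch of such a search (the Instruments table and its data are hypothetical; only features described in this post are used):

table_create Instruments TABLE_NO_KEY
column_create Instruments name COLUMN_SCALAR ShortText
table_create InstrumentTerms TABLE_PAT_KEY ShortText \
  --default_tokenizer TokenBigram \
  --normalizer 'NormalizerNFKC100("unify_katakana_v_sounds", true)'
column_create InstrumentTerms instruments_name COLUMN_INDEX|WITH_POSITION Instruments name
load --table Instruments
[
{"name": "ヴァイオリン"}
]
select Instruments --match_columns name --query 'バイオリン'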

[normalizers] Added a new unify_katakana_bu_sound option for NormalizerNFKC100

This option normalizes ヴァヴィヴゥヴェヴォ (katakana) to ブ (katakana) as in the example below.

normalize   'NormalizerNFKC100("unify_katakana_bu_sound", true)'   "ヴァヴィヴヴェヴォヴ"   WITH_TYPES
[
  [
    0,
    0.0,
    0.0
  ],
  {
    "normalized": "ブブブブブブ",
    "types": [
      "katakana",
      "katakana",
      "katakana",
      "katakana",
      "katakana",
      "katakana"
    ],
    "checks": [

    ]
  }
]

For example, a search for セーブル (katakana) or セーヴル (katakana) also matches セーヴェル (katakana).

[sub_filter] Supported a sub_filter optimization for the case where records are already narrowed down

This optimization takes effect when records have already been narrowed down enough before sub_filter is executed, as below.

table_create Files TABLE_PAT_KEY ShortText
column_create Files revision COLUMN_SCALAR UInt32

table_create Packages TABLE_PAT_KEY ShortText
column_create Packages files COLUMN_VECTOR Files

column_create Files packages_files_index COLUMN_INDEX Packages files

table_create Revisions TABLE_PAT_KEY UInt32
column_create Revisions files_revision COLUMN_INDEX Files revision

load --table Files
[
{"_key": "include/groonga.h", "revision": 100},
{"_key": "src/groonga.c",     "revision": 29},
{"_key": "lib/groonga.rb",    "revision": 12},
{"_key": "README.textile",    "revision": 24},
{"_key": "ha_mroonga.cc",     "revision": 40},
{"_key": "ha_mroonga.hpp",    "revision": 6}
]

load --table Packages
[
{"_key": "groonga", "files": ["include/groonga.h", "src/groonga.c"]},
{"_key": "rroonga", "files": ["lib/groonga.rb", "README.textile"]},
{"_key": "mroonga", "files": ["ha_mroonga.cc", "ha_mroonga.hpp"]}
]

select Packages \
  --filter '_key == "rroonga" && \
            sub_filter(files, "revision >= 10 && revision < 40")' \
  --output_columns '_key, files, files.revision'

[delete] Added a new limit option

You can limit the number of records to delete with this option, as in the example below.

table_create Users TABLE_PAT_KEY ShortText
[[0,0.0,0.0],true]
load --table Users
[
{"_key": "alice"},
{"_key": "bob"},
{"_key": "bill"},
{"_key": "brian"}
]
[[0,0.0,0.0],4]
delete --table Users --filter '_key @^ "b"' --limit 2
[[0,0.0,0.0],true]
#>delete --filter "_key @^ \"b\"" --limit "2" --table "Users"
#:000000000000000 filter(3)
#:000000000000000 delete(2): [0][2]
#<000000000000000 rc=0
select Users
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        2
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ]
      ],
      [
        1,
        "alice"
      ],
      [
        3,
        "bill"
      ]
    ]
  ]
]

Conclusion

See Release 8.0.3 2018-05-29 for detailed changes since 8.0.2.

Let's search by Groonga!

2018-04-29

Groonga 8.0.2 has been released

Groonga 8.0.2 has been released!

In this release, you can "define" custom tokenizers and normalizers via options, without any programming. This helps you search sources that include many orthographic variants.

How to install: Install

Changes

Here are important changes in this release:

  • [logical_range_filter] Added sort_keys option.
  • Added a new function time_format(). You can specify a time format for a column of Time type, using the same format specifiers as strftime.
  • [tokenizers] Support new tokenizer TokenNgram. You can define its behavior dynamically.
  • [normalizers] Support new normalizer NormalizerNFKC100. It is based on Unicode NFKC for Unicode 10.0.
  • [normalizers] Support options for the normalizers NormalizerNFKC51 and NormalizerNFKC100. You can change a normalizer's behavior dynamically.
  • [dump][schema] Add support for options of tokenizers and normalizers. As a result, Groonga 8.0.1 and earlier versions cannot import dump and schema output generated by Groonga 8.0.2 or later; they will raise an error due to the unsupported information.

[logical_range_filter] Added sort_keys option

logical_range_filter now supports a new option sort_keys, corresponding to sort_keys in select.

Note that it works only when there is a single search target shard; it doesn't work across multiple search target shards. For more details, see the command reference.
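
A sketch of how it might be used (the Entries logical table, created_at shard key, and output columns are assumptions, mirroring the logical_* examples elsewhere on this blog):

logical_range_filter \
  --logical_table Entries \
  --shard_key created_at \
  --sort_keys '-n_likes' \
  --limit 10 \
  --output_columns '_key, n_likes'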

Added a new function time_format()

Now you can specify a time format for a column of Time type, using the same format specifiers as strftime.

For example, the following command line will output the _key column as both UNIX time and a human readable format like 2018-04-29T10:30:00:

select Timestamps --sortby _id --limit -1 --output_columns '_key, time_format(_key, "%Y-%m-%dT%H:%M:%S")'
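
For the command above to be runnable, a table of Time values is needed; a minimal hypothetical setup could look like this:

table_create Timestamps TABLE_PAT_KEY Time
load --table Timestamps
[
{"_key": "2018-04-29 10:30:00"}
]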

[tokenizers] Support new tokenizer TokenNgram

Now a new tokenizer TokenNgram is available. You can define its behavior dynamically via its options. Options are given in the style 'TokenNgram("[name 1]", [value 1], "[name 2]", [value 2], ...)'. For example:

table_create --name Terms --flags TABLE_PAT_KEY --key_type ShortText --default_tokenizer 'TokenNgram("n", 2, "loose_symbol", true)' --normalizer NormalizerAuto
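
You can also experiment with tokenizer options without creating an index, via the tokenize command; a sketch, assuming tokenize accepts the same option syntax (output omitted):

tokenize 'TokenNgram("n", 3)' "Groonga"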

[normalizers] Support new normalizer NormalizerNFKC100

Now a new normalizer NormalizerNFKC100, based on Unicode NFKC (Normalization Form Compatibility Composition) for Unicode 10.0 is available.

Both it and NormalizerNFKC51 support options. For more details, see the next section.

[normalizers] Support options for normalizers NormalizerNFKC51 and NormalizerNFKC100

Both the NormalizerNFKC51 and NormalizerNFKC100 normalizers now support options to change their behavior dynamically. Options are given in the style 'NormalizerNFKC100("[name 1]", [value 1], "[name 2]", [value 2], ...)'. For example:

table_create --name Terms --flags TABLE_PAT_KEY --key_type ShortText --default_tokenizer TokenBigram --normalizer 'NormalizerNFKC100("unify_kana", true, "unify_kana_case", true)'
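
You can also check normalizer options directly with the normalize command; a sketch, assuming unify_kana unifies hiragana and katakana (output omitted):

normalize 'NormalizerNFKC100("unify_kana", true)' "ぐるんがグルンガ" WITH_TYPES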

[dump][schema] Add support for options of tokenizer and normalizer

The dump and schema commands now report options for tokenizers (TokenNgram) and normalizers (NormalizerNFKC51 and NormalizerNFKC100), like:

table_create Site TABLE_HASH_KEY ShortText
column_create Site title COLUMN_SCALAR ShortText

table_create Terms TABLE_PAT_KEY ShortText --default_tokenizer TokenBigram --normalizer "NormalizerNFKC100(\"unify_kana\", true, \"unify_kana_case\", true)"

As a result, Groonga 8.0.1 and earlier versions cannot import the results of dump and schema that include such option information.

Tokenizers and normalizers without options are still reported the same as in older versions, so you need to be careful only when you use the new tokenizer or normalizer features described above.

Conclusion

See Release 8.0.2 2018-04-29 for detailed changes since 8.0.1.

Let's search by Groonga!

2018-03-29

Groonga 8.0.1 has been released

Groonga 8.0.1 has been released!

How to install: Install

Changes

Here are important changes in this release:

  • [log] Show filter conditions in query log.
  • [Windows] Install *.pdb into the directory where *.dll and *.exe are installed.
  • [logical_count] Support filtered stage dynamic columns.
  • [logical_count] Added a new filter timing.
  • [logical_select] Added a new filter timing.
  • [logical_range_filter] Optimize window function for large result set.
  • [select] Added --match_escalation parameter.
  • [httpd] Updated bundled nginx to 1.13.10.
  • Fixed a memory leak that occurs when a prefix query doesn't match any token.
  • Fixed a bug that a cache for different databases is used when multiple databases are opened in the same process.
  • Fixed a bug that a constant value can overflow or underflow in comparison (>,>=,<,<=,==,!=).

[log] Show filter conditions in query log.

Now you can see how many records were narrowed down, and by which conditions. For example:

2018-02-15 19:04:02.303809|0x7ffd9eedf6f0|:000000013837058 filter(17): product equal "test_product"

In the above example, we can see that records were narrowed down to 17 by product == "test_product". This feature is disabled by default. To enable it, set the environment variable below.

GRN_QUERY_LOG_SHOW_CONDITION=yes
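
For example, you could set it when launching the Groonga server; a sketch with placeholder paths:

GRN_QUERY_LOG_SHOW_CONDITION=yes \
  groonga -d --protocol http --query-log-path /path/to/query.log /path/to/db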

[logical_count] Support filtered stage dynamic columns.

Until now, logical_count supported only initial stage dynamic columns. From this release, you can also use filtered stage dynamic columns in logical_count.
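
Here is a hedged sketch modeled on the logical_select example in the next section (the logical table, shard key, and columns are assumptions); it combines a filtered stage dynamic column with the new post_filter timing described below:

logical_count \
  --logical_table Entries \
  --shard_key created_at \
  --columns[is_popular].stage filtered \
  --columns[is_popular].type Bool \
  --columns[is_popular].value 'n_likes > 10' \
  --filter 'content @ "system"' \
  --post_filter 'is_popular'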

[logical_count][logical_select] Added a new filter timing.

It's executed after filtered stage dynamic columns are generated. For example:

logical_select \
    --logical_table Entries \
    --shard_key created_at \
    --columns[n_likes_sum_per_tag].stage filtered \
    --columns[n_likes_sum_per_tag].type UInt32 \
    --columns[n_likes_sum_per_tag].value 'window_sum(n_likes)' \
    --columns[n_likes_sum_per_tag].window.group_keys 'tag' \
    --filter 'content @ "system" || content @ "use"' \
    --post_filter 'n_likes_sum_per_tag > 10' \
    --output_columns _key,n_likes,n_likes_sum_per_tag

  # [
  #   [
  #     0, 
  #     1519030779.410312,
  #     0.04758048057556152
  #   ], 
  #   [
  #     [
  #       [
  #         2
  #       ], 
  #       [
  #         [
  #           "_key", 
  #           "ShortText"
  #         ], 
  #         [
  #           "n_likes", 
  #           "UInt32"
  #         ], 
  #         [
  #           "n_likes_sum_per_tag", 
  #           "UInt32"
  #         ]
  #       ], 
  #       [
  #         "Groonga", 
  #         10, 
  #         25
  #       ], 
  #       [
  #         "Mroonga", 
  #         15, 
  #         25
  #       ]
  #     ]
  #   ]
  # ]

The point of this feature is that filtered stage dynamic columns can be used in --post_filter. The example above uses logical_select, but the feature is available in logical_count as well.

[logical_range_filter] Optimize window function for large result set.

If enough matched records are found, the window function isn't applied to the remaining windows. For small result sets, this optimization is disabled because its overhead would not be negligible.

[select] Added --match_escalation parameter.

You can force match escalation to be enabled with --match_escalation yes. It's stronger than --match_escalation_threshold 99999....999 because --match_escalation yes also works with SOME_CONDITIONS && column @ 'query'. --match_escalation_threshold isn't used in this case.

The default is --match_escalation auto. It doesn't change the current behavior.

You can disable match escalation by --match_escalation no. It's the same as --match_escalation_threshold -1.
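
A sketch of the SOME_CONDITIONS && column @ 'query' case described above (the Memos table and its columns are hypothetical):

select Memos \
  --filter 'tag == "groonga" && content @ "rare term"' \
  --match_escalation yes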

Fixed a memory leak that occurs when a prefix query doesn't match any token.

Fixed a memory leak that occurred when a prefix query, such as one used by fuzzy search, didn't match any token, as in the example below.

table_create Users TABLE_NO_KEY
[[0,0.0,0.0],true]
column_create Users name COLUMN_SCALAR ShortText
[[0,0.0,0.0],true]
table_create Names TABLE_PAT_KEY ShortText
[[0,0.0,0.0],true]
column_create Names user COLUMN_INDEX Users name
[[0,0.0,0.0],true]
load --table Users
[
{"name": "Tom"},
{"name": "Tomy"},
{"name": "Pom"},
{"name": "Tom"}
]
[[0,0.0,0.0],4]
select Users --filter 'fuzzy_search(name, "Atom", {"prefix_length": 1})'   --output_columns 'name, _score'   --match_escalation_threshold -1
[[0,0.0,0.0],[[[0],[["name","ShortText"],["_score","Int32"]]]]]

Fixed a bug that a cache for different databases is used when multiple databases are opened in the same process.

Fixed a bug that when multiple databases are opened in the same process, results are returned from the cache of another database because the cache was shared within the process.

Fixed a bug that a constant value can overflow or underflow in comparison (>,>=,<,<=,==,!=).

Fixed a bug that a constant value could overflow or underflow in comparisons, as in the example below.

table_create Values TABLE_NO_KEY
[[0,0.0,0.0],true]
column_create Values number COLUMN_SCALAR Int16
[[0,0.0,0.0],true]
load --table Values
[
{"number": 3},
{"number": 4},
{"number": -1}
]
[[0,0.0,0.0],3]
select Values   --filter 'number > 32768'   --output_columns 'number'
[[0,1522305525.361629,0.0003235340118408203],[[[3],[["number","Int16"]],[3],[4],[-1]]]]

An overflow occurred because 32768 is outside the range of Int16 (-32,768 to 32,767); as a result, number > 32768 was evaluated as number > -32768. From this release, when such an overflow or underflow occurs, no results are returned.

Conclusion

See Release 8.0.1 2018-03-29 for detailed changes since 8.0.0.

Let's search by Groonga!

2018-02-09

Groonga 8.0.0 has been released

Groonga 8.0.0 has been released!

This is a major version up, but it keeps backward compatibility. You can upgrade to 8.0.0 without rebuilding your databases.

How to install: Install

Changes

Here are important changes in this release:

  • [select] Added --drilldown_adjuster and --drilldowns[LABEL].adjuster.

  • [between] Accepted between() without borders.

  • Fixed a memory leak for normal hash table.

[select] Added --drilldown_adjuster and --drilldowns[LABEL].adjuster.

Added --drilldown_adjuster and --drilldowns[LABEL].adjuster to select's arguments. You can adjust the scores of drilldown results.

For example:

table_create Categories TABLE_PAT_KEY ShortText

table_create Tags TABLE_PAT_KEY ShortText
column_create Tags categories COLUMN_VECTOR|WITH_WEIGHT Categories

table_create Memos TABLE_HASH_KEY ShortText
column_create Memos tags COLUMN_VECTOR Tags

column_create Categories tags_categories COLUMN_INDEX|WITH_WEIGHT \
  Tags categories

load --table Tags
[
{"_key": "groonga", "categories": {"full-text-search": 100}},
{"_key": "mroonga", "categories": {"mysql": 100, "full-text-search": 80}},
{"_key": "ruby", "categories": {"language": 100}}
]

load --table Memos
[
{
  "_key": "Groonga is fast",
  "tags": ["groonga"]
},
{
  "_key": "Mroonga is also fast",
  "tags": ["mroonga", "groonga"]
},
{
  "_key": "Ruby is an object oriented script language",
  "tags": ["ruby"]
}
]

select Memos \
  --limit 0 \
  --output_columns _id \
  --drilldown tags \
  --drilldown_adjuster 'categories @ "full-text-search" * 2 + categories @ "mysql"' \
  --drilldown_output_columns _key,_nsubrecs,_score
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        3
      ],
      [
        [
          "_id",
          "UInt32"
        ]
      ]
    ],
    [
      [
        3
      ],
      [
        [
          "_key",
          "ShortText"
        ],
        [
          "_nsubrecs",
          "Int32"
        ],
        [
          "_score",
          "Int32"
        ]
      ],
      [
        "groonga",
        2,
        203
      ],
      [
        "mroonga",
        1,
        265
      ],
      [
        "ruby",
        1,
        0
      ]
    ]
  ]
]

In the above example, we adjust the score of records that have full-text-search or mysql in categories.

[between] Accepted between() without borders.

From this release, max_border and min_border are now optional. If the number of arguments passed to between() is 3, the 2nd and 3rd arguments are handled as the inclusive edges.

For example:

table_create Users TABLE_HASH_KEY ShortText
column_create Users age COLUMN_SCALAR Int32

table_create Ages TABLE_PAT_KEY Int32
column_create Ages users_age COLUMN_INDEX Users age

load --table Users
[
{"_key": "alice",  "age": 17},
{"_key": "bob",    "age": 18},
{"_key": "calros", "age": 19},
{"_key": "dave",   "age": 20},
{"_key": "eric",   "age": 21}
]

select Users --filter 'between(age, 18, 20)'
[
  [
    0,
    0.0,
    0.0
  ],
  [
    [
      [
        3
      ],
      [
        [
          "_id",
          "UInt32"
        ],
        [
          "_key",
          "ShortText"
        ],
        [
          "age",
          "Int32"
        ]
      ],
      [
        2,
        "bob",
        18
      ],
      [
        3,
        "calros",
        19
      ],
      [
        4,
        "dave",
        20
      ]
    ]
  ]
]
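
For comparison, the existing explicit-border form takes five arguments; the following sketch should select the same records as above, since both borders are "include":

select Users --filter 'between(age, 18, "include", 20, "include")'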

Fixed a memory leak for normal hash table.

This fixes a bug where you sometimes could not connect to Groonga just by continuing to send queries.

Conclusion

See Release 8.0.0 2018-02-09 for detailed changes since 7.1.1.

Let's search by Groonga!