BloGroonga

2013-06-29

Groonga 3.0.5 has been released

Groonga 3.0.5 has been released!

How to install: Install

There are two topics for this release.

  • Supported single quoted string literal in output_columns
  • Supported html_untag function experimentally

Supported single quoted string literal in output_columns

As of this release, output_columns supports single-quoted string literals.

Since the groonga 3.0.2 release, complex string concatenation has been supported in --output_columns. This feature supports expressions such as the following:

'"<" + title + ">"'

Note that title refers to the title column in this case. The expression above returns "<(CONTENT OF TITLE)>".

However, single quotes were not supported in string literals at that time.

Here is the sample schema:

table_create Entries TABLE_NO_KEY
column_create Entries title COLUMN_SCALAR ShortText

load --table Entries
[
 {"title": "Single quote and double quote"}
]

In the previous release, there were two ways to get "<(CONTENT OF TITLE)>":

  • select Entries --output_columns '_id, "<" + title + ">"' --command_version 2
  • select Entries --output_columns "_id, \"<\" + title + \">\"" --command_version 2

Here is the same query rewritten with single-quoted string literals, which groonga 3.0.5 supports:

select Entries --output_columns "_id, '<' + title + '>'" --command_version 2

Because single quotes are now supported, groonga 3.0.5 returns the intended result set even for queries that return an empty result in groonga 3.0.4.

Here are sample queries for which groonga 3.0.4 or earlier returns an empty set:

# <"(contents of title column)">
select Entries --output_columns "_id, '<\"' + title + '\">'" --command_version 2
#=> [1,"<\"Single quote and double quote\">"]

# <'(contents of title column)'>
select Entries --output_columns "_id, '<\\'' + title + '\\'>'" --command_version 2
#=> [1,"<'Single quote and double quote'>"]
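Escaping quotes by hand inside a double-quoted argument is easy to get wrong. The following is a minimal Python sketch for building such an argument, assuming that inside a double-quoted groonga command argument a backslash escapes the next character, so only backslashes and double quotes need escaping. The helper name quote_command_arg is hypothetical and is not part of groonga.

```python
def quote_command_arg(value):
    """Wrap value in double quotes for use as a command argument.

    Assumption: within a double-quoted argument, a backslash escapes the
    following character, so backslashes and double quotes are escaped here.
    """
    # Escape backslashes first so the backslashes added for quotes
    # are not escaped a second time.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# The output_columns expression as it should reach groonga:
#   _id, '<"' + title + '">'
expression = "_id, '<\"' + title + '\">'"

print(
    "select Entries --output_columns "
    + quote_command_arg(expression)
    + " --command_version 2"
)
```

The helper only handles the command-line level of quoting. Escaping a single quote inside a single-quoted string literal of the expression itself is a separate step that the sketch leaves to the caller.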

Supported html_untag function experimentally

This release adds experimental support for the html_untag function, which strips HTML tags.

For example, consider the case where HTML scraped from a web site is stored in a groonga database.

Here is a sample schema for storing scraped HTML:

table_create WebClips TABLE_NO_KEY
column_create WebClips url COLUMN_SCALAR ShortText
column_create WebClips content COLUMN_SCALAR ShortText
column_create WebClips tag COLUMN_VECTOR ShortText

Here is the sample data:

load --table WebClips
[
{"url": "http://groonga.org", "tag": ["groonga"], "content": "groonga is <span class='emphasize'>fast</span>"},
{"url": "http://mroonga.org", "tag": ["mroonga"], "content": "mroonga is <span class=\"emphasize\">fast</span>"},
]

Pass a column name as the argument of the html_untag function. With the sample schema above, use html_untag(content) to get the plain text of the content column.

Here is a sample query that returns the plain text of the content column:

select WebClips --output_columns "html_untag(content)" --command_version 2

Here is the result of the query above:

[[2],
  [
    ["html_untag", "null"]
  ],
  ["groonga is fast"],
  ["mroonga is fast"]
]

You can see that the span tag with a class attribute has been removed.
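To make the effect of tag stripping concrete, here is a rough Python sketch of the idea. It is only an illustration and not groonga's implementation of html_untag; the function name strip_tags and the input string are chosen for this example.

```python
def strip_tags(html_text):
    """Return html_text without anything between '<' and '>'.

    A naive illustration of tag stripping: it does not parse HTML and
    does not decode character references such as &amp;.
    """
    output = []
    in_tag = False
    for char in html_text:
        if char == "<":
            in_tag = True          # start skipping at the opening bracket
        elif char == ">" and in_tag:
            in_tag = False         # stop skipping at the closing bracket
        elif not in_tag:
            output.append(char)    # keep text that is outside of tags
    return "".join(output)


print(strip_tags("groonga is <span class='emphasize'>fast</span>"))
# prints: groonga is fast
```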

Note that you must specify --command_version 2 to use the html_untag function. Without it, you will not get the intended search results.

html_untag was added to meet a specific need: searching scraped HTML content stored in a groonga database, and then extracting highlighted search results that are free of noisy HTML tags.

It is intended to be used together with the snippet_html function (this combination isn't supported yet).

Here is the concrete processing flow:

original HTML -(html_untag)-> plain text -(snippet_html)-> highlighted HTML

Combining html_untag and snippet_html isn't supported yet, but it will be supported in a future release.
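The processing flow above can be sketched in Python as follows. This is only a conceptual illustration: strip_tags and highlight are hypothetical stand-ins for html_untag and snippet_html, and the span with class "keyword" used for highlighting is an assumption made for this example.

```python
import html


def strip_tags(html_text):
    """Naive stand-in for html_untag: drop everything between '<' and '>'."""
    output = []
    in_tag = False
    for char in html_text:
        if char == "<":
            in_tag = True
        elif char == ">" and in_tag:
            in_tag = False
        elif not in_tag:
            output.append(char)
    return "".join(output)


def highlight(plain_text, keyword):
    """Stand-in for the highlighting step: escape the text and wrap each
    occurrence of keyword in a span (the markup is an assumption)."""
    if not keyword:
        return html.escape(plain_text)
    marked = '<span class="keyword">' + html.escape(keyword) + "</span>"
    # Escape each piece around the keyword, then rejoin with the markup.
    return marked.join(html.escape(part) for part in plain_text.split(keyword))


original = "groonga is <span class='emphasize'>fast</span>"
plain = strip_tags(original)           # original HTML -> plain text
print(highlight(plain, "fast"))        # plain text -> highlighted HTML
# prints: groonga is <span class="keyword">fast</span>
```

Stripping the tags first means the highlighting step never has to match keywords across markup boundaries, and the only tags left in the final output are the ones added for highlighting.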

Conclusion

See Release 3.0.5 2013/06/29 for the detailed changes since 3.0.4.

Let's search by groonga!