BloGroonga

2012-11-29

Groonga 2.0.9 has been released

Groonga 2.0.9 has been released!

How to install: Install

There are four topics for this release.

  • Supported snippet_html() function
  • Supported nested index search among related table by column index
  • Supported range search by using index
  • Supported calculation across meridian, equator, the date line by geo_distance() function

Supported snippet_html() function

This release began to support snippet_html() function which extract keyword and surrounding text. Note that this is experimentally supported API, so this API would be changed in the future.

Use snippet_html() fuction following syntax:

  snippet_html(column name)

Here is the more concrete example.

Schema definition:

  table_create Documents TABLE_NO_KEY
  column_create Documents content COLUMN_SCALAR Text
  table_create Terms TABLE_PAT_KEY|KEY_NORMALIZE ShortText --default_tokenizer TokenBigram
  column_create Terms documents_content_index COLUMN_INDEX|WITH_POSITION Documents content

Sample data:

  load --table Documents
  [
  ["content"],
  ["Groonga is a fast and accurate full text search engine based on inverted index."],
  ["Groonga is also a column-oriented database management system (DBMS)."],
  ["Mroonga was called groonga storage engine."]
  ]

If you want to search 'groonga' and extract 'groonga' and surrounding text from Documents table, try following:

Here is the query to search 'groonga' with snippet_html function.

  select Documents --output_columns "snippet_html(content)" --command_version 2 --match_columns content --query "groonga" 
  [
    [0,1353893385.5454,0.000486850738525391],
    [
      [
        [3],
        [["snippet_html","null"]],
        [["Groonga is a fast and accurate full text search engine based on inverted index."]],
        [["Groonga is also a column-oriented database management system (DBMS)."]],
        [["Mroonga was called groonga storage engine."]]
      ]
    ]
  ]

As a result, specified keyword is surrounded by <span> tag, and keyword 'groonga' and surrounding text is extracted like a highlighted search results.

Note that you need to specify '--command_version 2' in the query. The reason why function call in --output_column has supported from version 2.0.9.

See following documentation about snippet_html details.

This release began to support nested index search among related table by column index.

If there are relationships among multiple table with column index, you can search multiple table by specifing column index name.

Here is the concrete example.

there are tables which store blog articles, comments for articles. The table which stores articles has columns for article and comment, and the comment column refers comments table. The table which stores comments has columns for comment and column index to article table.

In the previous release of groonga, if you want to search the articles which contain specified keyword in comment, you need to execute fulltext search for table of comment, then search the records which contains fulltext search results.

Now, you can search the records by specifing the refererence column index at once.

here is the sample how to use this feature.

Schema definition:

  table_create Comments TABLE_HASH_KEY UInt32
  column_create Comments content COLUMN_SCALAR ShortText

  table_create Articles TABLE_NO_KEY
  column_create Articles content COLUMN_SCALAR Text
  column_create Articles comment COLUMN_SCALAR Comments

  table_create Lexicon TABLE_PAT_KEY|KEY_NORMALIZE ShortText --default_tokenizer TokenBigram
  column_create Lexicon articles_content COLUMN_INDEX|WITH_POSITION Articles content
  column_create Lexicon comments_content COLUMN_INDEX|WITH_POSITION Comments content

  column_create Comments article COLUMN_INDEX Articles comment

Sample data:

  load --table Comments
  [
  {"_key": 1, "content": "I'm using groonga too!"},
  {"_key": 2, "content": "I'm using groonga and mroonga!"},
  {"_key": 3, "content": "I'm using mroonga too!"}
  ]

  load --table Articles
  [
  {"content": "Groonga is fast!", "comment": 1},
  {"content": "Groonga is useful!"},
  {"content": "Mroonga is fast!", "comment": 3}
  ]

You can write the query that search the records which contains specified keyword as a comment, then fetch the articles which refers to it.

  select Articles --match_columns comment.content --query groonga --output_columns "_id, _score, *" 

You need to concatinate comment column of articles table and content column of comments table with period(.) as --match_columns arguments.

At first, this query execute fulltext search from content of comments table, then fetch the records of articles table which refers to already searched records of comments table. (Because of this, if you comment out the query which create column index 'article' of comments table, you can't get intended search results.)

  [
    [0,1353903149.81632,0.000459432601928711],
    [
      [
       [1],
       [["_id","UInt32"],["_score","Int32"],["comment","Comments"],["content","Text"]],
       [1,1,1,"Groonga is fast!"]
      ]
    ]
  ]

Now, you can search articles which contains specific keywords as a comment.

Supported range search by using index

This release began to support range search by using index. As a result, you can search in a short time by contrast to previous release.

Here is the sample how to use this feature.

Schema definition:

  table_create Shops TABLE_HASH_KEY ShortText
  column_create Shops ranking COLUMN_SCALAR UInt32

  table_create Rankings TABLE_PAT_KEY UInt32
  column_create Rankings shops_ranking COLUMN_INDEX Shops ranking

Sample data (ranking data about 10,000,000 shops):

  load --table Shops
  [
  {"_key": "Shop1", "ranking": 1},
  {"_key": "Shop2", "ranking": 2},
  {"_key": "Shop3", "ranking": 3},
  {"_key": "Shop4", "ranking": 4},
  {"_key": "Shop5", "ranking": 5},
  {"_key": "Shop6", "ranking": 6},
  {"_key": "Shop7", "ranking": 7},
  {"_key": "Shop8", "ranking": 8},
  {"_key": "Shop9", "ranking": 9},
  {"_key": "Shop10", "ranking": 10},
  {"_key": "Shop11", "ranking": 11},
  ...
  ]

Now, registered shop name as a key, the value of ranking.

Here is the sample query to search top 10 shops of ranking.

In range search, you can specify 'Top 10' expression as 'ranking <= 10' in this case.

Here is the search results by groonga 2.0.8.

  select Shops --filter 'ranking <= 10'
  [
    [0,1355465886.15137,1.39784264564514],
    [
      [
        [10],
        [
          ["_id","UInt32"],["_key","ShortText"],["ranking","UInt32"]
        ],
        [1,"Shop1",1],
        [2,"Shop2",2],
        [3,"Shop3",3],
        [4,"Shop4",4],
        [5,"Shop5",5],
        [6,"Shop6",6],
        [7,"Shop7",7],
        [8,"Shop8",8],
        [9,"Shop9",9],
        [10,"Shop10",10]
      ]
    ]
  ]

Here is the search results by groonga 2.0.9.

  select Shops --filter 'ranking <= 10'
  [
    [0,1355465837.0779,0.00165677070617676],
    [
      [
        [10],
        [
          ["_id","UInt32"],["_key","ShortText"],["ranking","UInt32"]
        ],
        [1,"Shop1",1],
        [2,"Shop2",2],
        [3,"Shop3",3],
        [4,"Shop4",4],
        [5,"Shop5",5],
        [6,"Shop6",6],
        [7,"Shop7",7],
        [8,"Shop8",8],
        [9,"Shop9",9],
        [10,"Shop10",10]
      ]
    ]
  ]

The search result is same, but the execution time is different.

    [0,1355465886.15137,1.39784264564514],

In groonga 2.0.8, it takes 1.39784264564514 seconds.

    [0,1355465837.0779,0.00165677070617676],

In groonga 2.0.9, it takes 0.00165677070617676 seconds.

See Output Format about the output of groonga command details.

Version of groonga groonga 2.0.8 groonga 2.0.9
Execution time(seconds) 1.39784264564514 0.00165677070617676

By upgrading 2.0.8 to 2.0.9, you can see the execution time is clipped to about a few milliseconds.

Here is the measurement environment:

CPU Intel® Core i7-2640M CPU @ 2.80GHz
Memory 8GB

Supported calculation across meridian, equator, the date line by geo_distance() function

This release began to support calculation of the value of distance across meridian, equator, the date line by geo_distance() function.

This functional enhancement is applied to the case which the way to approximate is 'rectangle'.

There are some calculation method how to approximate the value of distance.

Groonga supports folowing three method which has trade-offs in point of view of speed, acculacy.

  • Rectangle This regards geographical feature between specified points as level surface. You can calculate the value of distance fast, but the error of distance increases as it approaches the pole.
  • Sphere This regards geographical feature between specified points as spherical surface. It is slower than rectangle, but the error of distance becomes smaller than rectangle.
  • Ellipsoid This regards geographical feature between specified points as ellipsoid. It is slower than sphere, but the error of distance becomes smaller than sphere.

Here is the sample how to caluculate the value of distance across meridian.

This sample shows the value of distance between Paris(France) to Madrid(Spain). The geographical feature is approximated as level surface (rectangle).

"175904000x8464000" means Paris(France) expressed in milliseconds. "145508000x-13291000" means Madrid(Spain) expressed in milliseconds.

  select Geo --output_columns distance --scorer 'distance = geo_distance("175904000x8464000", "145508000x-13291000", "rectangle")'
  [
    [
     0,
     1337566253.89858,
     0.000355720520019531
   ],
   [
     [
       [
         1
       ],
       [
         [
           "distance",
           "Int32" 
         ]
       ],
       [
         1051293
       ]
     ]
   ]
  ]

See following documentation how to express longitude and latitude in milliseconds

See following documentation how to use geo_distance

Conclusion

See Release 2.0.9 2012/11/29 about detailed changes since 2.0.8.

Let's search by groonga!