MySQL binding

What is Senna MySQL binding?

MySQL version after 3.23.23 supports FULLTEXT index. With it, MySQL can execute the full-text search for the field of VARCHAR and TEXT type. But, these problems exist. 1) Insufficient Japanese support 2) Slow phrase search 3) Slow update

Senna MySQL binding substitute for MySQL original fulltext index. Those problems are fixed. And, It enables you to use full features of Senna through MySQL.

How to use

You can install MySQL with Senna with either one of below two method.

  1. binary package
  2. patch MySQL source, compile and install.

Functions

  • Supports version 4.0, 4.1, 5.0 and 5.1
  • MATCH AGAINST query support in BOOLEAN MODE and defaulut mode(NLQ MODE)
  • In BOOLEAN MODE, you can use all operators like +, -, <, >, (, ), ~, *, ".
  • Supports Japanese encoding EUC, SJIS
  • Supports Unicode with UTF8
  • Supports normalization. In UTF8, NFKC normalization supported
  • Supports similar document search
  • Supports near words search
  • Supports MyISAM and MERGE(4.0 only) storage engines.
  • Supports snippet(KWIC) function with MySQL user defined functions.
  • Supports Japanese word's index(with MeCab), N-gram(bi-gram) index and space delimited index.
  • 2ind patch enables MySQL to use FULLTEXT index and normal b-tree index bothly at one time.
  • skipmode patch(4.0 only) accelerate queries as select count(*) and select * from table limit x,y.

Install

Please read install for English.

Original enhanced features

Senna extends SQL syntax.

Thanks to Moriyoshi Koizumi at GREE(http://gree.jp) and Kazuhiro Osawa at iYappo(http://i.yappo.jp/).

enhanced CREATE TABLE / CREATE INDEX

CREATE TABLE / CREATE INDEX query is enhanced to be able to pass Senna parameters.

You can specify index type

You can select Japanese word index, NGRAM index or DELIMITED index.

Japanese word index (with MeCab)

If no index type specified, Senna makes Japanese word index. Notice: For English users, this mode is not befitted. use NGRAM index.

ex.)

  CREATE TABLE test (
    id INTEGER AUTO_INCREMENT,
    PRIMARY KEY (id),
    text TEXT NOT NULL,
    FULLTEXT INDEX (text)
  );
NGRAM index(bigram index for non-ascii strings and word index for ascii strings)

Specify USING NGRAM.

ex.)

  CREATE TABLE test (
    id INTEGER AUTO_INCREMENT,
    PRIMARY KEY (id),
    text TEXT NOT NULL,
    FULLTEXT INDEX USING NGRAM (text)
  );
DELIMITED index(space delimited index)

If you specify USING DELIMITED, words delimited by space are indexed.

ex.)

  CREATE TABLE test (
    id INTEGER AUTO_INCREMENT,
    PRIMARY KEY (id),
    text TEXT NOT NULL,
    FULLTEXT INDEX USING DELIMITED (text)
  );
MySQL original fulltext index(without Senna)

Specify USING NO SENNA.

ex.)

  CREATE TABLE test (
    id INTEGER AUTO_INCREMENT,
    PRIMARY KEY (id),
    text TEXT NOT NULL,
    FULLTEXT INDEX USING NO SENNA (text)
  );
normalize or not

At default, alphabet normalization is enabled If you want to disable it, specify USING NO NORMALIZE.

ex.) without normalization with NGRAM index

  CREATE TABLE test (
    id INTEGER AUTO_INCREMENT,
    PRIMARY KEY (id),
    text TEXT NOT NULL,
    FULLTEXT INDEX USING SENNA, NO NORMALIZE, NGRAM (text)
  );
INITIAL_N_SEGMENTS parameter

You can speficy INITIAL_N_SEGMENTS which is an initial value of an index buffer with 'USING number'

ex.) create index with INITIAL_N_SEGMENTS=2048

  CREATE TABLE test (
    id INTEGER AUTO_INCREMENT,
    PRIMARY KEY (id),
    text TEXT NOT NULL,
    FULLTEXT INDEX USING SENNA, 2048 (text)
  );

Extensions of SHOW INDEX

 mysql> SHOW INDEX FROM test;
 +-------+------------+----------+- ... -+--------+------+--------------------------------+---------+
 | Table | Non_unique | Key_name |       | Packed | Null | Index_type                     | Comment |
 +-------+------------+----------+- ... -+--------+------+--------------------------------+---------+
 | test  |          0 | PRIMARY  |       | NULL   |      | BTREE                          |         |
 | test  |          1 | text     |       | NULL   | YES  | FULLTEXT,SENNA,NORMALIZE,NGRAM |         |
 +-------+------------+----------+--... -+--------+------+--------------------------------+---------+ 
 2 rows in set (0.04 sec)

Extensions of SHOW TABLE STATUS

 +-------+--------+ ... -------+----------------+----------------+--------------------+---------+
 | Name  | Type   | ... k_time | Create_options | Senna_key_size | Senna_lexicon_size | Comment | 
 +-------+--------+ ... -------+----------------+----------------+--------------------+---------+
 | test1 | MyISAM | ...        |                |              0 |                  0 |         |
 | test2 | MyISAM | ...        |                |              0 |                  0 |         |
 | test3 | MyISAM | ...        |                |              0 |                  0 |         |
 | test4 | MyISAM | ...        |                |              0 |                  0 |         |
 | test5 | MyISAM | ...        |                |              0 |                  0 |         |
 +-------+--------+ ... -------+----------------+----------------+--------------------+---------+

next column is added.

  • Senna_keys_size

record number of symbol table (.SEN)

  • Senna_keys_file_size

file size of symbol table (.SEN)

  • Senna_lexicon_size

record number of lexicon table (.SEN.I)

  • Senna_lexicon_file_size

file size of lexicon table (.SEN.I)

  • Senna_inv_seg_size

file size of the buffer of lexicon table (.SEN.i)

  • Senna_inv_chunk_size

file size of an inverted index table (.SEN.i.c)

2ind patch(fulltext index and b-tree index bothly used at one time)

What is 2ind patch

In MySQL, only one index is used in executing one query. Normally, that's limit is avoided with multi-column index. But using fulltext index, it is very big problem like below.

1. slow limited query

  select columns from table where match(a) against(b) limit 1000, 10

If limit offset is big, table scan occurs and response is very slow.

2. slow only getting count(*)

  select count(*) from table where match(a) against(b);

Only getting record count, table scan occurs and response is very slow.

3. slow conditional search

  select columns from table where match(a) against(b) and c like 'hoge%';

Although 'column c' is indexed, table scan occurs and response is very slow.

4. slow sort

  select columns from table where match(a) against(b) order by c;

Although 'column c' is indexed, table scan occurs and response is very slow.

2ind patch solves all of the problems.

Install 2ind patch

Please read install for English.

Using 2ind-patch

  • No additional query parameter is required to solve problem 1. 2 with 2ind-patch.
  • Additional query parameter is required to solve problem 3. 4 with 2ind-patch like below.
  select columns from table force index(c) where match(a) against(b) and c like 'hoge%';
  select columns from table force index(c) where match(a) against(b) order by c;

(If you want to use PRIMARY key, specify 'force index(PRIMARY)'.

Notice of 2ind-patch

  • 2ind-patch affects those four problem patterns, but no effect for other patterns.
  • MySQL with 2ind-patch uses it's original B-tree index.So, you have to choon parameters like key_buffer_size, which effects performance of B-tree index.

snippet UDF

You can get snippet(or KWIC, KeyWords In Context) with a snippet UDF function.

Install snippet UDF

> cd {senna-package}/bindings/mysql/udf/

If you get Senna from Subversion repository, you have to execute autogen.sh.

> ./autogen.sh

configure

> ./configure --prefix=/usr

make

> make

install

> sudo make install

Load udf functions in MySQL.

> make load

(In 'make load', 'mysql -u root -p' command is used for regist UDF.So password input is required.)

Using snippet UDF

snippet UDF is a function which have many parameters.

SELECT snippet(text or column, byte length of one snippet, number of snippets, encoding of text, 
  html encoding or not, snippet open tag, snippet close tag,
  word1, word1 open tag, word1 close tag,
  word2, word2 open tag, word2 close tag, ...)

Encoding of text is 'sjis', 'ujis', 'utf8' or 'default'. If you get html encoded text, you have to specify html encoding parameter as 1. If not, 0.

ex.)

SELECT snippet(body, 120, 3, 'utf8', 
  1, '...', '...<br>\n', 
  'search', '<span class="word1">', '</span>', 
  'test', '<span class="word2">', '</span>'
)
FROM contents
WHERE MATCH(body) AGAINST('*D+ search test' IN BOOLEAN MODE)
LIMIT 0,10;

skipmode-patch

skipmode-patch make it's time faster that you get hit count and query which is count limited. It is only available in MySQL 4.0.

Install skipmode patch

Please read install for English.

生成時間: __qwik_page_generate_time__秒