网站sphinx全文搜索

研究了差不多两个多星期,有点眉目了,已经在测试环境上把网站改造成sphinx了。。。现在搜索速度确实挺快的。。。

sphinx这个东西在张宴的博客上介绍了好几年了http://blog.s135.com/post/360/

一. 文档

http://sphinxsearch.com/docs/

我用的是最新的sphinx2.0.1-beta版。虽然为beta版,以下为官网介绍

2.0.1-beta (Apr 2011)

Generally recommended release.

Our latest and greatest stable beta release, with real-time indexes,
string attributes, optimized index format, and many other features.

What is a stable beta? It’s a release in which, to the best of our knowledge,

most features are production-quality stable;most features come with additional improvements or fixes;newly added features did not have any known major bugs at the time of release;newly added features might be incomplete and/or less tested.

Core indexing and searching functionality does, of course, fall into
the “existing features” category and should be rock solid at all times.
Examples of potentially unstable new features that we’re mentioning here
would be newly added search operators, SphinxQL syntax clauses, indexing
time settings, advanced optimizations, etc.

在许多功能上已经是比较可靠了。但在新增加的功能上还是beta。。。比如realtime索引

二. 安装

wget http://sphinxsearch.com/files/sphinx-2.0.1-beta.tar.gz

tar zxvf sphinx-2.0.1-beta.tar.gz

cd sphinx-2.0.1-beta

./configure –prefix=/usr/local/sphinx2.0.1

make

make install

三. 配置

sphinx可以使用许多种数据源,SQL databases, plain text files, HTML files, mailboxes,
and so on

被全文索引的内容被称为属性,plain index支持的属性有

sql_attr_uint

sql_attr_bigint

sql_attr_timestamp

sql_attr_string

sql_attr_bool等

realtime index由于不完善支持的属性比较少
rt_field

rt_attr_uint

rt_attr_bigint

rt_attr_float

rt_attr_timestamp
rt_attr_string

配置文件中需要配置以下几个部分:

1. 数据源

2. 索引

3. searchd配置

source src1
{
        type                    = mysql

        sql_host                = localhost
        sql_user                = test
        sql_pass                =
        sql_db                  = test
        sql_port                = 3306  # optional, default is 3306

        sql_query               =
                SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content
                FROM documents

        sql_attr_uint           = group_id
        sql_attr_timestamp      = date_added

        sql_query_info          = SELECT * FROM documents WHERE id=$id
}

index test1
{
        source                  = src1
        path                    = /usr/local/sphinx2.0.1/var/data/test1
        docinfo                 = extern
        charset_type            = sbcs
}


index testrt
{
        type                    = rt
        rt_mem_limit            = 32M

        path                    = /usr/local/sphinx2.0.1/var/data/testrt
        charset_type            = utf-8

        rt_field                = title
        rt_field                = content
        rt_attr_uint            = gid
}

indexer
{
        mem_limit               = 32M
}


searchd
{
        listen                  = 9312
        listen                  = 9306:mysql41
        log                     = /usr/local/sphinx2.0.1/var/log/searchd.log
        query_log               = /usr/local/sphinx2.0.1/var/log/query.log
        read_timeout            = 5
        max_children            = 30
        pid_file                = /usr/local/sphinx2.0.1/var/log/searchd.pid
        max_matches             = 1000
        seamless_rotate         = 1
        preopen_indexes         = 1
        unlink_old              = 1
        workers                 = threads # for RT to work
}indexer
{
        mem_limit               = 32M
}


searchd
{
        listen                  = 9312
        listen                  = 9306:mysql41
        log                     = /usr/local/sphinx2.0.1/var/log/searchd.log
        query_log               = /usr/local/sphinx2.0.1/var/log/query.log
        read_timeout            = 5
        max_children            = 30
        pid_file                = /usr/local/sphinx2.0.1/var/log/searchd.pid
        max_matches             = 1000
        seamless_rotate         = 1
        preopen_indexes         = 1
        unlink_old              = 1
        workers                 = threads # for RT to work
}

四. 建索引

建立索引      /usr/local/sphinx2.0.1/bin/indexer –all –rotate

开启守护进程    /usr/local/sphinx2.0.1/bin/searchd

测试搜索           /usr/local/sphinx2.0.1/bin/search keyword

五. 增量索引

在数据库中建一张记录max id的表

create table sph_counter

(

counter_id INTEGER PRIMARY KEY NOT NULL,

max_doc_id INTEGER NOT NULL 

)

建两个索引,一个为主索引,一个为增量索引,max_doc_id中记录主索引的最大id,在创建数据源的时候需要做类似如下的设置

# in sphinx.conf

source main
{

# …
sql_query_pre = SET NAMES utf8

sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
sql_query = SELECT id, title, body FROM documents
WHERE id<=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

source delta : main
{

sql_query_pre = SET NAMES utf8

sql_query = SELECT id, title, body FROM documents
WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

index main
{

source = main

path = /path/to/main

# … all the other settings

}

# in sphinx.conf
source main
{
    # ...
    sql_query_pre = SET NAMES utf8
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents
    sql_query = SELECT id, title, body FROM documents 
        WHERE id&lt;=( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

source delta : main
{
    sql_query_pre = SET NAMES utf8
    sql_query = SELECT id, title, body FROM documents 
        WHERE id&gt;( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 )
}

index main
{
    source = main
    path = /path/to/main
    # ... all the other settings
}

# note how all other settings are copied from main,
# but source and path are overridden (they MUST be)
index delta : main
{
    source = delta
    path = /path/to/delta
}

# note how all other settings are copied from main,
# but source and path are overridden (they MUST be)

index delta : main
{

source = delta

path = /path/to/delta

}

写两个脚本,一个会增量索引的脚本

/usr/local/sphinx2.0.1/bin/indexer delta –rotate

每两分钟执行一次,进行增量索引

一个为每天定时合并增量索引的脚本

/usr/local/sphinx2.0.1/bin/indexer –all –rotate

注意:在运行之前把max_doc_id设置成当前最大的id

 六. sphinxql

sphinx2.0.1使用了mysql协议来进行查询的sphinxql功能。默认端口为9306

mysql -h127.0.0.1 -P 9306

show tables;

mysql> show tables;
+————+——-+
| Index      | Type  |
+————+——-+
| delta      | local |
| gamesearch | local |
+————+——-+
2 rows in set (0.00 sec)

mysql> desc gamesearch;
+————-+———–+
| Field       | Type      |
+————-+———–+
| id          | integer   |
| name        | field     |
| gametype    | field     |
| keyword     | field     |
| id_attr     | uint      |
| hit_in_attr | uint      |
| size_attr   | uint      |
| date_added  | timestamp |
| gametype    | string    |
| type        | uint      |
+————-+———–+
mysql> select * from gamesearch where match(‘植物’);
+——-+——–+———+————-+———–+————+————–+——+
| id    | weight | id_attr | hit_in_attr | size_attr | date_added | gametype     | type |
+——-+——–+———+————-+———–+————+————–+——+
| 21966 |   2761 |   21966 |          85 |   2218107 | 1286856337 | 小游戏       |    2 |
| 25478 |   2761 |   25478 |          60 |   1375562 | 1286857298 | 小游戏       |    2 |
| 30519 |   2761 |   30519 |      811725 |  77072592 | 1288255393 | 单机游戏     |    6 |
| 30520 |   2761 |   30520 |      169057 |  35533104 | 1288256405 | 单机游戏     |    5 |
| 30779 |   2761 |   30779 |         445 |         0 | 1290756043 | 网页游戏     |    2 |
+——-+——–+———+————-+———–+————+————–+——+
5 rows in set (0.00 sec)

不过使用plain类型的索引不能使用insert,delete等语句。。只能进行查询

使用realtime index可以使用insert,delete等语句。不过官方文档提到realtime支持的功能没有plain index多。。这次改造中放弃使用realtime index。等稳定版本出来了再说吧。

再说一下realtime index的初始化问题,由于realtime index没有数据源,所以对数据只能进行初始化。可以使用mysqldump出来需要的字段。然后mysql -h127.0.0.1 -P 9306 < dump.sql进行导入,导入过程中需要对应相应的字段,否则会提示错误。