Introduction to sphinx with PHP – Part 1

I was using MySQL Full text indexing in one of my projects, but I noticed that after data size increased beyond several gigabytes, MySQL won’t scale well in using the index and the queries will get too slow to be used in production environments especially for high traffic websites, so I have read about sphinx search engine, and find it quite powerful for textual searching functionality.

What is Sphinx?

as sphinx official site defines:

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It’s written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems.

 

Some sphinx advantages

Sphinx has many features that make it excellent option for textual search of large data sizes, some of these advantages (specifically compared to MySQL):

  1. Scalable over large data sizes (both horizontally and vertically, using features like distributed indexes)
  2. Advanced ranking algorithm, calculating relevancy of data based on analyzing the keywords (ability to set customized weight -relevancy importance- for each field)
  3. Index data from multiple sources, including different database types (or different storage engines)
  4. Other important enhancements, like: parallel results,batch queries…etc

Install Sphinx

Sphinx is available on many platforms (including Linux, MacOS, Windows), and its relatively easy to install.
I will cover here compiling Sphinx from source code, (since I have Fedora and I didn’t find a ready made RPM package in my OS).

1- Go to sphinx download page:
http://sphinxsearch.com/downloads/release/

and download the package suitable to your OS, as I mentioned earlier I will go for the first option, which is downloading the source files on sphinx.

If you find a package for your OS, install it and you can skip the remaining installation steps and jump to the next section.

2- Compile Sphinx from source code:
After downloading the source files, extract the source files and run:

./configure

then
make install

In case you want to let sphinx work with mysql, you should have mysql development package installed to your system, you should run:
yum install mysql-devel

By doing these steps successfully, Sphinx should be installed to your machine successfully. The next step is to configure Sphinx sources and indexes to read from MySQL.

Sphinx Configuration

After having installed sphinx, we have to configure data source and indexes for sphinx.

First, let me introduce a sample MySQL table which I will run sphinx indexer to work with. Its a table called “content” which stores news portal articles, containing standard news fields (like title, body, author…etc):

CREATE TABLE `content` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `title` varchar(500) DEFAULT NULL,
  `author` varchar(255) DEFAULT NULL,
  `category_id` int(11) NOT NULL,  
  `status` int(11) NOT NULL, -- used to indicate if content is published, pending...etc  
  `body` longtext,
  `keywords` varchar(255) DEFAULT NULL,
  `is_deleted` int(11) DEFAULT NULL,
  `slug` varchar(255) DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,   
   FULLTEXT KEY `INDEX_TEXT` (`title`,`author`),
   FULLTEXT KEY `title` (`title`),
   FULLTEXT KEY `body` (`body`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

This table was MyISAM table in order to benefit from MySQL Full-Text index prior to use of Sphinx. We will use these indexes to compare sphinx and mySQL performance later.

Now let’s see sphinx configuration file,
Sphinx configuration file, is named sphinx.conf, usually you will find it in /usr/local/etc (in case you compiled sphinx from sources), or in /etc/sphinx (in case you installed a RPM package). you will find also at the same directory, sphinx.conf.dist which have standard configuration file that have all the configuration options.

In the following, the contents of the sphinx.conf file I used with my database

source content
{
        #data source type
        type                    = mysql
        sql_host                = localhost
        sql_user                = root
        sql_pass                = root
        sql_db                  = your_db
        sql_port                = 3306

        #main document fetch query
        sql_query_pre           =   SET NAMES utf8
        sql_query               = \
                                SELECT id, title, author, status, body, keywords, category_id, unix_timestamp(created_at) as created_time  \
                                from content \
                                where is_deleted=0 and status=1 \
                                ;
        #attribute declaration        
        sql_attr_uint           = status
        sql_attr_uint           = category_id
        sql_attr_timestamp      = created_time

}

#defining delta source
source delta: content {
       sql_query                = \
                                SELECT id, title, author, status, body, keywords, category_id, unix_timestamp(created_at) as created_time  \
                                from content \
                                where is_deleted=0 and status=1 and created_at >= CURRENT_TIMESTAMP() - INTERVAL 15 MINUTE\
                                ;
}

#defining indexes
index content_index
{
       source           = content
       path             = /home/zaid/sphinx/data/mdn_content
       docinfo          = extern
}
index delta: content_index
{
      source            = delta
      path              = /home/zaid/sphinx/data/mdn_content_delta
      docinfo           = extern
}

#searchd daemon configurations
searchd
{
     listen   = localhost:3321
     listen   = 9312
     log      = /var/log/searchd.log
     pid_file = /var/log/searchd.pid
}

Here is the explanation of the configuration file parts:

1- Main data source block (mandatory):

source content
{
....
}

this block defines the data source, which sphinx will index its contents, in this block you define mainly three groups of information:

  •     data source connection parameters:   containing information to connect your database, including: database type, username, password, port, database name …etc
    source content
    {
            #data source type
            type                    = mysql
            sql_host                = localhost
            sql_user                = root
            sql_pass                = root
            sql_db                  = your_db
            sql_port                = 3306
    
            ....
    }
  • Query Fetch configurations: containing the main query to fetch the data from your source in order to be indexed in Sphinx, I used
    sql_query_pre to define UTF8 encoding of the incoming data.
    sql_query to fetch the main data to be indexed, which are -in my case- non deleted, approved news articles.
    source content
    {
            ....
            #main document fetch query
            sql_query_pre           =   SET NAMES utf8
            sql_query               = \
                                    SELECT id, title, author, status, body, keywords, category_id, unix_timestamp(created_at) as created_time  \
                                    from content \
                                    where is_deleted=0 and status=1 \
            ....
    }
  • Attribute declaration: mainly data fetched into Sphinx will be full-text indexed, however, you can define other attribute that can be used for filtration, ordering and grouping as non-text fields. Here I used
    sql_attr_uint to define status and category_id columns as unsigned integer attributes,
    sql_attr_timestamp to define created_time as time stamp.
    source content
    {
            ....
            #attribute declaration        
            sql_attr_uint           = status
            sql_attr_uint           = category_id
            sql_attr_timestamp      = created_time
    }

    You may not define any attribute if there is no need for them

 

2- Delta Source Block:
This block defines the data which are recently updated, so we don’t have to run Sphinx indexer on all the data, we will run the indexer periodically only on the recently added contents (delta) in order to add them to the index.

#defining delta source
source delta: content {
       sql_query                = \
                                SELECT id, title, author, status, body, keywords, category_id, unix_timestamp(created_at) as created_time  \
                                from content \
                                where is_deleted=0 and status=1 and created_at >= CURRENT_TIMESTAMP() - INTERVAL 15 MINUTE\
                                ;
}

3- Index Block(s):
Defining the index associated with the data sources, I defined one index for the main source, and another for delta source. this block contains the path where the index will be stored

index content_index
{
       source           = content
       path             = /home/zaid/sphinx/data/mdn_content
       docinfo          = extern
}
index delta: content_index
{
      source            = delta
      path              = /home/zaid/sphinx/data/mdn_content_delta
      docinfo           = extern
}

4- Searchd Daemon Block:
Searchd is the daemon that serves the search queries issued by the clients and retrieve results, you define the port to listen to, log file path and PID (process ID) file path.

#searchd daemon configurations
searchd
{
     listen   = localhost:9312
     listen   = 9312
     log      = /var/log/searchd.log
     pid_file = /var/log/searchd.pid
}

Running Sphinx

Once you have placed your sphinx config file, you are ready to start indexing your data, and requesting search queries from Sphinx.

To index your data source, run the Indexer

indexer --all

I found the indexer pretty fast, it indexed my data source (which is about 1.5G) in about 90 seconds!

After completion of indexing data, start searchd daemon by simply typing

searchd

To make sure that your searchd daemon works, you can type
netstat -nlp | grep searchd
netstat -nlp | grep 9312

Now we need to make the delta index, indexing new data automatically and merge them to the main index, please these in your crontab:

#I added this cronjob to run each 15minutes according to my delta query
*/15 * * * * /usr/bin/indexer --rotate --config /usr/local/etc/sphinx.conf delta
*/15 * * * * /usr/bin/indexer --config /usr/local/etc/sphinx.conf --merge content_index delta --rotate

Now, you are ready to go!

Running sphinx from Command Line

now you can query search phrases from your sphinx using search command

in the following format

 search -i NAME_OF_INDEX -l LIMIT -o OFFSET SEARCH_PHRASE

Here is an example, this command search for “jordan” in content_index that we have just defined.

[zaid@localhost tmp]$ search -i content_index -l 10 -o 20 jordan
Sphinx 2.2.1-id64-dev (r4310)
Copyright (c) 2001-2013, Andrew Aksyonoff
Copyright (c) 2008-2013, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/usr/local/etc/sphinx.conf'...
index 'content_index': query 'jordan ': returned 62 matches of 62 total in 0.000 sec

displaying matches:
21. document=136385, category_id=11, created_time=Sun Feb 26 14:27:32 2012
22. document=138933, category_id=11, created_time=Mon Mar 12 15:39:15 2012
23. document=142949, category_id=11, created_time=Wed Apr  4 13:23:04 2012
24. document=152446, category_id=19, created_time=Sun May 27 14:41:34 2012
25. document=156444, category_id=11, created_time=Sun Jun 17 00:40:47 2012
26. document=180436, category_id=11, created_time=Mon Oct 22 11:03:01 2012
27. document=57574, category_id=1, created_time=Sun Oct  3 18:05:58 2010
28. document=62989, category_id=53, created_time=Tue Nov 30 19:11:22 2010
29. document=76606, category_id=11, created_time=Sat Mar 12 11:29:13 2011
30. document=80203, category_id=17, created_time=Wed Apr  6 23:59:56 2011

words:
1. 'jordan': 62 documents, 164 hits

Note that the results returned DocumentID (which is analogous to id column in our SQL), along with other attributes we defined in the config file, which include category_id and created_time. The search was pretty fast (0.0000sec) take for this search query.

I am aware that I didn’t write any PHP code in this article, I will leave it to part 2 🙂

In the next article, I will write simple PHP script that query results from the indexes we created, and talk a little bit about filtering, grouping and ranking results in sphinx.

3 thoughts on “Introduction to sphinx with PHP – Part 1

Leave a Reply

Your email address will not be published. Required fields are marked *

Human? prove it... * Time limit is exhausted. Please reload CAPTCHA.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>