Tag Archives: mysql

Pagination optimization for Symfony2 & Doctrine

In the last week, one of my websites, which was launched recently, suffered from major performance problems, even though it has little traffic. The website is built on Symfony2 and acts as a directory for financial/forex companies. I started investigating the problem, and the culprit was quickly identified: the pagination!

One of the tables pulled RSS feeds from multiple financial news sources into our CMS content table, and in about one month the table grew to approximately 50,000 rows. That is not a particularly large number of rows, but the queries used by the pagination functionality (the knp_paginator bundle) were causing problems, especially since I am deploying the site on a VPS with 2GB of RAM.

Background on limit/offset performance

Many developers working with MySQL misunderstand limit/offset: a query does not become efficient just because you apply a limit to it. MySQL processes the query conditions and applies the ordering first, and only then retrieves the desired number of rows. Internally, MySQL may perform a full table scan and filesort the result set in a file or temporary table before returning the requested rows, so limit/offset is applied at a late stage of query execution.

If you have an index on your columns (especially the ones in the “order by” clause) you will get better performance, but with a large offset (say, offset 100,000 with limit 10) MySQL still has to process 100,000 rows and throw away the unneeded ones just to return your desired 10 rows!
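
To make the cost of a deep offset concrete, here is a hypothetical pair of queries (using the cms_content table introduced below); both return 10 rows, but the second one forces MySQL to walk 100,010 index entries and discard 100,000 of them:

-- cheap: reads only the first 10 entries in created_at order
select id from cms_content order by created_at desc limit 10;
-- expensive: reads 100,010 entries, discards all but the last 10
select id from cms_content order by created_at desc limit 10 offset 100000;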

That’s why pagination over large datasets is problematic in MySQL, and when you are using an ORM you have to make some extra effort to optimize your pagination queries.

Paging Problems in Symfony2 knp_paginator

Usually, when we need to paginate one of our tables with Doctrine ORM in Symfony2, we use knp_paginator, since it’s easy to use and provides simple paging functionality with just a couple of lines of code. However, when I looked more closely at the way it performs, I found some points that form performance bottlenecks in the way it operates, especially for large data sets.

For the purpose of clarification, I will be using a cms_content table that has the following structure:

mysql> desc cms_content;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int(11)      | NO   | PRI | NULL    | auto_increment |
| category_id  | int(11)      | NO   | MUL | NULL    |                |
| is_published | tinyint(1)   | NO   | MUL | NULL    |                |
| slug         | varchar(255) | NO   | MUL | NULL    |                |
| created_at   | datetime     | NO   | MUL | NULL    |                |
....
+--------------+--------------+------+-----+---------+----------------+

The columns I use most frequently in queries are is_published and category_id, and the ordering is usually based on the created_at column.

Counting Query:

In order to get the number of available pages, any paging library constructs a query that counts the results based on the parameters passed. The simplest counting query looks something like this:

SELECT COUNT(id)
FROM cms_content
WHERE category_id = 3
  AND is_published = true
ORDER BY created_at DESC;

When you EXPLAIN this query to check its performance, you will see:
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
| id | select_type | table       | type | possible_keys                        | key             | key_len | ref         | rows  | Extra       |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | cms_content | ref  | IDX_A0293FB812469DE2,secondary_index | secondary_index | 5       | const,const | 13972 | Using index |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+

As you can see, this query is fairly optimized because it uses a covering index (note “Using index” in the Extra column of the explain result), which means MySQL answers the query by reading only the index, without reading the full data rows of the table.

Here I have created an index on the frequently used columns, namely is_published, category_id and created_at, to utilize indexing and improve all queries. See the indexes applied:
mysql> show index from cms_content;
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table       | Non_unique | Key_name             | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| cms_content |          0 | PRIMARY              |            1 | id           | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | IDX_A0293FB812469DE2 |            1 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            1 | is_published | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            2 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            3 | created_at   | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (0.25 sec)
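
For reference, a composite index like this one can be created with a statement along these lines (a sketch; the index name and column order are taken from the show index output above):

alter table cms_content add index secondary_index (is_published, category_id, created_at);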

However, when I looked at the count query generated by the knp_paginator bundle, I found this complex query:

SELECT 
  COUNT(*) AS dctrn_count 
FROM 
  (
    SELECT 
      DISTINCT id0 
    FROM 
      (
        SELECT 
          c0_.id AS id0, 
          c0_.is_published AS is_published1, 
          c0_.slug AS slug2, 
          c0_.guid AS guid3, 
          c0_.rss_link AS rss_link4, 
          c0_.category_id AS category_id5, 
          c0_.created_at AS created_at6 
        FROM 
          cms_content c0_ 
        WHERE 
          c0_.is_published = true 
          AND c0_.category_id = 3 
        ORDER BY 
          c0_.created_at DESC
      ) dctrn_result
  ) dctrn_table

And here is the explain result for that query:
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
| id | select_type | table      | type | possible_keys                        | key                  | key_len | ref  | rows  | Extra                        |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
|  1 | PRIMARY     | NULL       | NULL | NULL                                 | NULL                 | NULL    | NULL |  NULL | Select tables optimized away |
|  2 | DERIVED     | <derived3> | ALL  | NULL                                 | NULL                 | NULL    | NULL | 26775 | Using temporary              |
|  3 | DERIVED     | c0_        | ALL  | IDX_A0293FB812469DE2,secondary_index | IDX_A0293FB812469DE2 | 4       |      | 27944 | Using where; Using filesort  |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+

As shown, the query is not optimized at all! The use of subqueries prevents MySQL from using the covering index and requires a temporary table and a filesort to process the result; the unnecessary DISTINCT keyword also imposes some overhead.
On my machine this query took 1.04 seconds to execute, compared with 0.01 seconds for the simple count query above, although the two have exactly the same logic.

Fetching Data Query:

The query that retrieves a page of data is usually implemented by simply adding limit/offset clauses with the desired parameters. As I explained earlier, this becomes a performance problem for large page numbers, because it causes MySQL to process far too many rows and then discard most of them.
For example, here is the default query to retrieve page number 5,000 of my cms_content table:

SELECT 
  DISTINCT c0_.id AS id0, 
  c0_.created_at AS created_at1,
  c0_.slug as slug2
FROM 
  cms_content c0_ 
WHERE 
  c0_.is_published = ? 
  AND c0_.category_id = ? 
ORDER BY 
  c0_.created_at DESC 
LIMIT 
  5 OFFSET 24995

This query may look simple, but it causes MySQL to process a large number of rows and perform a filesort, and it leads to heavy random I/O, especially if you retrieve more data columns such as title, guid, etc. This cost keeps increasing as the page number grows.

One of the best ways to optimize such queries is a deferred join, that is, joining with a smaller result set, as follows:

select id, created_at, slug
from cms_content
inner join (
    select id
    from cms_content
    where category_id = 3 and is_published = 1
    order by created_at desc
    limit 5 offset 24995
) as small_tbl
using (id);

This method makes MySQL read as little data as possible: the sub-query uses the covering index and retrieves a small result set at very low cost (in I/O operations), and then the full data rows are fetched only for the small number of IDs returned by the sub-query.

Despite how this query is structured, it is a very efficient one because it minimizes the cost of I/O operations. You can verify that by checking last_query_cost after executing it and comparing it with the other approach:
show status like 'last_query_cost';

Doctrine Code

In order to apply the optimized pagination queries in my Symfony project, I had to write my own code and remove the knp_paginator calls.
First, I wrote a method that returns the result count used for paging, placing it in the ContentRepository class:

    public function getContentCountByCategory($cat_id){
        
        $qb = $this->createQueryBuilder('c')
                ->select('count(c.id)')                                
                ->where('c.categoryId = :category_id')
                ->andWhere('c.isPublished = :published')
                ->setParameter('category_id', $cat_id)
                ->setParameter('published', true)
                ;
        
        return $qb->getQuery()->getSingleScalarResult();                
    }

This method produces the optimized version of the count query.

Second, I had to write a method that retrieves the results using a deferred join. I tried to use the query builder, but unfortunately Doctrine doesn’t support subqueries in the “from” clause, so I wrote a native SQL query with a ResultSetMapping, as follows:

    // requires: use Doctrine\ORM\Query\ResultSetMapping;
    public function getContentQueryByCategory($cat_id, $lang, $page=1, $limit=10) {

        $offset = ($page-1)*$limit;

        // map the selected columns back to the Content entity
        $rsm = new ResultSetMapping();
        $rsm->addEntityResult('Webit\CMSBundle\Entity\Content', 'c');
        $rsm->addFieldResult('c', 'id', 'id');
        $rsm->addFieldResult('c', 'created_at', 'createdAt');
        $rsm->addFieldResult('c', 'slug', 'slug');

        $query = $this->getEntityManager()->createNativeQuery("select id, created_at, slug from cms_content inner join "
                . "(select id from cms_content where category_id = ? and is_published=? order by created_at desc limit ? offset ?) as small_tbl"
                . " using (id)", $rsm);
        $query->setParameter(1, $cat_id);
        $query->setParameter(2, true);
        $query->setParameter(3, $limit);
        $query->setParameter(4, $offset);

        return $query->getResult();
    }

This method produces an efficient query that reduces I/O operations inside MySQL, and the query cost stays minimal even as the number of pages increases.
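
To tie the two repository methods together, a controller action could look roughly like this (a hypothetical sketch: the action name, route parameters and template name are made up, and error handling is omitted):

    public function listAction($cat_id, $page)
    {
        $limit = 10;
        $repo = $this->getDoctrine()->getRepository('WebitCMSBundle:Content');

        $total = $repo->getContentCountByCategory($cat_id);                      // optimized COUNT query
        $items = $repo->getContentQueryByCategory($cat_id, 'en', $page, $limit); // deferred-join fetch

        return $this->render('WebitCMSBundle:Content:list.html.twig', array(
            'items' => $items,
            'page'  => $page,
            'pages' => (int) ceil($total / $limit), // total pages for the pager links
        ));
    }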

After implementing this solution, the queries used to produce page number 3,000 took about 46ms, compared to approximately 440ms with the original paginator code. More importantly, MySQL’s memory usage improved because the queries use the indexes in a much more efficient way.

Conclusion

In this post, I discussed the performance problems of pagination in a Doctrine & Symfony2 project and applied some techniques that handle limit/offset operations in pagination more efficiently. There are other workarounds that also give good performance, like the one described in this good post. This piece of code made my application perform better; I hope it helps others who face a similar situation.

Introduction to sphinx with PHP – Part 2

In Part 1, I explained how to install sphinx, configure it to index data from a MySQL source, and use the searchd daemon from the command line to retrieve data from the defined indexes.

In this post, I will walk through a PHP example of how to use the Sphinx API.

The following script is based on the database structure and sphinx config file I used in Part 1 of this sphinx introduction.

Example PHP Script

<?php

header('Content-type: text/html; charset=utf8');
include ( "sphinxapi.php" );

mysql_connect('localhost', 'root', 'root');
mysql_select_db('your_database_here');        
mysql_query('set names utf8');        

$phrase = @$_GET['phrase'];
$page = isset($_GET['page']) ? $_GET['page'] : 1;
$date_start = @$_GET['date_start'];
$date_end = @$_GET['date_end'];

$client = new SphinxClient();
$client->SetLimits(($page - 1) * 10, 10);
$client->SetSortMode(SPH_SORT_EXTENDED, '@weight desc, created_time desc');
$client->SetMatchMode(SPH_MATCH_ANY);
$client->SetFieldWeights(array('title'=>4, 'keywords'=>2, 'body'=>1 ));

if(isset($date_start) || isset($date_end)){
    // SetFilterRange expects numeric bounds, so fall back to sensible
    // defaults instead of passing null for a missing end of the range
    $start_time = isset($date_start) ? strtotime($date_start) : 0;
    $end_time   = isset($date_end)   ? strtotime($date_end)   : time();
    $client->SetFilterRange('created_time', $start_time, $end_time);
}

$res = $client->Query($phrase, 'content_index');


if (!$res) {
    echo 'error: ' . $client->GetLastError();
} else {

    if ($res['total'] == 0 || !isset($res['matches'])) {
        echo 'No results retrieved from Search engine';
    } else {
        echo "Displaying " . (($page - 1) * 10+1).'-'.(min($res['total'],$page * 10)) . " out of " . $res['total_found'] . ' total results';
                
        //var_dump($res);
        $ids_str = implode(', ', array_keys($res['matches']));
        $res_db = mysql_query('select id, title, created_at from content where id in  (' . $ids_str . ') order by field(id,'.$ids_str.')');
        if ($res_db === false) {
            echo "Error in mysql query #" . mysql_errno() . ' - ' . mysql_error();
        } else {
            echo '<ul>';
            while ($row = mysql_fetch_assoc($res_db)) {
                echo '<li>'
                . '<a href="show.php?id=' . $row['id'] . '&phrase='.$phrase.'">' . $row['title'] . '</a>'
                . '<br/> [relevancy: '.$res['matches'][$row['id']]['weight'].']'
                . '<br/> [created_at: '.$row['created_at'].']'
                . '</li>';
            }
            echo '</ul>';
        }

        echo '<br/><br/>Total Time: ' . $res['time'] . 's';
    }
}

This simple script takes parameters from the webpage, then issues a search request with the specified phrase and conditions to the searchd daemon.

In the first part of the script, I set up the database connection along with the parameters that I will use in the search; after that, I initialize the sphinx client and apply the main configuration to it, as explained in the next section.

Main SphinxClient Methods

Here is a list of the main methods used to configure SphinxClient:

1- SetSortMode:
Sphinx supports multiple flexible sort modes which control the ordering criteria of the retrieved results.
I will give brief information about each sort mode, since I consider them one of the most important features in sphinx:

a- SPH_SORT_RELEVANCE: the default sort mode, which sorts the results according to their relevancy to the search query passed.

$client->SetSortMode(SPH_SORT_RELEVANCE);

Sphinx ranks the results by default using phrase proximity, which takes into consideration the order of the phrase words along with their frequency. We can control the way sphinx computes relevancy by changing the ranking mode (using the SetRankingMode function).
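
For instance, a sketch of switching to the pure BM25 statistical ranker; note that in the classic API the ranking mode takes effect with the extended match mode:

$client->SetMatchMode(SPH_MATCH_EXTENDED2); // ranking modes apply to extended matching
$client->SetRankingMode(SPH_RANK_BM25);     // rank by BM25 only, ignoring phrase proximity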

b- SPH_SORT_ATTR_ASC / SPH_SORT_ATTR_DESC: sorts the results in ascending or descending order according to a predefined attribute; for example, you can change the SetSortMode call in the script to:

$client->SetSortMode(SPH_SORT_ATTR_DESC, 'created_time');
This way, the newest articles come first in the results page.

c- SPH_SORT_TIME_SEGMENTS: sorts by time segments (buckets of freshness such as hour, day, week, month) and then by relevancy within each segment:

$client->setSortMode(SPH_SORT_TIME_SEGMENTS, 'created_time');

d- SPH_SORT_EXTENDED: sorts by a combination of attributes, ascending or descending, in an SQL-like format, as I used in the script above:

$client->SetSortMode(SPH_SORT_EXTENDED, '@weight desc, created_time desc');
Here I sorted descending by relevancy (represented by the @weight computed attribute), then descending by creation time (in case two results have the same weight).

e- SPH_SORT_EXPR: sorts using an arithmetic expression; for example, you can use a combination of relevancy and popularity (represented by page_views):

$client->SetSortMode(SPH_SORT_EXPR, '@weight * page_views/100');

Unlike MySQL, putting an expression in the sort mode (analogous to an order by clause) won’t affect performance negatively.

2- SetMatchMode():
used to control how sphinx performs matching for the query phrase; here are the most important options:
a- SPH_MATCH_ALL: matches all keywords in the search query.
b- SPH_MATCH_ANY: matches any keyword.
c- SPH_MATCH_PHRASE: matches the whole phrase, which requires a perfect match.
All matching modes can be found here; a short example follows.
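
For example, requiring an exact phrase match instead of the SPH_MATCH_ANY used in the script is a one-liner:

$client->SetMatchMode(SPH_MATCH_PHRASE); // only documents containing the exact phrase will match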

3- SetFieldWeights():
Using this function, you can distribute the relevancy weight among the fields. In the script above, I used this line:

$client->SetFieldWeights(array('title'=>4, 'keywords'=>2, 'body'=>1 ));

in order to indicate that the “title” field is more important than the “keywords” field and the “body” field, so results that match the query phrase in the title appear before those that merely match it many times in the body. This option is very useful for controlling the relevancy of results.

4- SetFilterRange():
Here you can filter based on one of the attributes defined in the sphinx index (analogous to adding a where condition to an SQL statement). I used it to filter according to the creation time:

$client->SetFilterRange('created_time', $start_time, $end_time);

5- Query():
After configuring the sphinx search query, this method sends the request to the searchd daemon and gets the results from sphinx:

$res = $client->Query($phrase, 'content_index');

The Query() method takes the search phrase as the first parameter and the name of the index(es) to match against as the second parameter.
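
You can also match against several indexes at once by listing them in the second parameter; for example, to cover the delta index defined in Part 1 as well (a sketch):

$res = $client->Query($phrase, 'content_index delta');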

After calling the Query() method on the SphinxClient, a result array is returned containing information about the matching records. If we dump the “matches” index of the result array, we get data similar to this:

var_dump($res['matches']);
/*********/

  array(2) {
    [181916]=>
    array(2) {
      ["weight"]=>
      string(1) "1"
      ["attrs"]=>
      array(3) {
        ["status"]=>
        string(1) "1"
        ["category_id"]=>
        string(2) "11"
        ["created_time"]=>
        string(10) "1386946964"
      }
    }
    [181915]=>
    array(2) {
      ["weight"]=>
      string(1) "7"
      ["attrs"]=>
      array(3) {
        ["status"]=>
        string(1) "1"
        ["category_id"]=>
        string(2) "12"
        ["created_time"]=>
        string(10) "1386368157"
      }
    }

The data returned for each matched element is:
– the documentID (as the key of the array element)
– the weight (dynamically calculated according to the SetSortMode() and SetFieldWeights() calls we used earlier)
– the attribute values, in the “attrs” index (e.g. created_time, status, etc.), containing the sphinx attributes defined in the config file.

Note that sphinx does not return the textual data itself, because it only indexes textual content and doesn’t store it, so we have to fetch it from our MySQL database:

$ids_str = implode(', ', array_keys($res['matches']));
$res_db = mysql_query('select id, title, created_at from content where id in  (' . $ids_str . ') order by field(id,'.$ids_str.')');

In this line, I fetch the records from MySQL using the document IDs, keeping the same ordering as sphinx by using “FIELD(id, val1, val2, …)” in the order by clause.

Now I have the result IDs from sphinx, have fetched the associated textual data from MySQL, and have displayed it on the webpage.

Running the code

Now, I would like to query all records containing the word “syria” published in the last two weeks, and here are the results:
[Screenshot: search results for “syria” with the date filter applied]

You can see that articles with the word “syria” in the title got a higher rank than those where “syria” appeared only in the body, because of the field weights I used in the script above. Also, sphinx took about 0.015 seconds to find those results among 150,000 records, which is extremely fast.

Another execution, searching for the “syria” phrase without any additional filters:
[Screenshot: search results for “syria” without additional filters]
That one took about 0.109 seconds to execute!

Quick MySQL comparison

I just wanted to compare sphinx with MySQL in terms of performance here:
I executed a MySQL query with a condition similar to the one I ran on sphinx in the previous section, and here is the result:

mysql> select id from content where match(body) against('*syria*' in boolean mode) and status=1;
+--------+
| id     |
+--------+
| 145805 |
| 142579 |
| 133329 |
|  59778 |
|  95318 |
|  94979 |
|  83539 |
|  56858 |
| 181915 |
| 181916 |
| 181917 |
| 181918 |
+--------+
12 rows in set (10.74 sec)

MySQL took about 10 seconds to execute the same query, compared to roughly 0.1 seconds using sphinx.

Conclusion

Now the simple PHP script is running against sphinx and MySQL, and I have explained the main functions for controlling sphinx through the PHP API, including sorting, matching and filtering.
There are many other powerful sphinx features, like multi-query, MVAs (multi-valued attributes), grouping, etc., that I may write about in the future.

Introduction to sphinx with PHP – Part 1

I was using MySQL full-text indexing in one of my projects, but I noticed that once the data size grew beyond several gigabytes, MySQL no longer scaled well in using the index, and the queries became too slow for production environments, especially for high-traffic websites. So I read about the sphinx search engine and found it quite powerful for textual search functionality.

What is Sphinx?

As the official sphinx site defines it:

Sphinx is an open source full text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It’s written in C++ and works on Linux (RedHat, Ubuntu, etc), Windows, MacOS, Solaris, FreeBSD, and a few other systems.


Some sphinx advantages

Sphinx has many features that make it an excellent option for textual search over large data sizes. Some of these advantages (specifically compared to MySQL):

  1. Scalable over large data sizes (both horizontally and vertically, using features like distributed indexes)
  2. Advanced ranking algorithm that calculates the relevancy of data by analyzing the keywords (with the ability to set a customized weight, i.e. relevancy importance, for each field)
  3. Indexing data from multiple sources, including different database types (or different storage engines)
  4. Other important enhancements, like parallel result sets, batch queries, etc.

Install Sphinx

Sphinx is available on many platforms (including Linux, MacOS and Windows), and it is relatively easy to install.
I will cover compiling Sphinx from source code here (since I run Fedora and didn’t find a ready-made RPM package for my OS).

1- Go to the sphinx download page:
http://sphinxsearch.com/downloads/release/

and download the package suitable for your OS. As I mentioned earlier, I will go for the first option, downloading the sphinx source files.

If you find a package for your OS, install it; you can then skip the remaining installation steps and jump to the next section.

2- Compile Sphinx from source code:
After downloading the source files, extract them and run the usual build sequence:

./configure
make
make install

In case you want sphinx to work with MySQL, you need the MySQL development package installed on your system; you can run:
yum install mysql-devel

Once these steps complete successfully, Sphinx should be installed on your machine. The next step is to configure Sphinx sources and indexes to read from MySQL.

Sphinx Configuration

After installing sphinx, we have to configure the data source and indexes for sphinx.

First, let me introduce the sample MySQL table that I will run the sphinx indexer against. It is a table called “content” which stores news portal articles, containing standard news fields (like title, body, author, etc.):

CREATE TABLE `content` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
  `title` varchar(500) DEFAULT NULL,
  `author` varchar(255) DEFAULT NULL,
  `category_id` int(11) NOT NULL,  
  `status` int(11) NOT NULL, -- used to indicate if content is published, pending...etc  
  `body` longtext,
  `keywords` varchar(255) DEFAULT NULL,
  `is_deleted` int(11) DEFAULT NULL,
  `slug` varchar(255) DEFAULT NULL,
  `updated_at` datetime DEFAULT NULL,
  `created_at` datetime DEFAULT NULL,   
   FULLTEXT KEY `INDEX_TEXT` (`title`,`author`),
   FULLTEXT KEY `title` (`title`),
   FULLTEXT KEY `body` (`body`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

This table was created as a MyISAM table in order to benefit from MySQL Full-Text indexes prior to the use of Sphinx. We will use these indexes to compare sphinx and MySQL performance later.

Now let’s look at the sphinx configuration file.
It is named sphinx.conf, and you will usually find it in /usr/local/etc (in case you compiled sphinx from sources) or in /etc/sphinx (in case you installed an RPM package). In the same directory you will also find sphinx.conf.dist, a standard configuration file that contains all the configuration options.

Below are the contents of the sphinx.conf file I used with my database:

source content
{
        #data source type
        type                    = mysql
        sql_host                = localhost
        sql_user                = root
        sql_pass                = root
        sql_db                  = your_db
        sql_port                = 3306

        #main document fetch query
        sql_query_pre           =   SET NAMES utf8
        sql_query               = \
                                SELECT id, title, author, status, body, keywords, category_id, unix_timestamp(created_at) as created_time  \
                                from content \
                                where is_deleted=0 and status=1 \
                                ;
        #attribute declaration        
        sql_attr_uint           = status
        sql_attr_uint           = category_id
        sql_attr_timestamp      = created_time

}

#defining delta source
source delta: content {
       sql_query                = \
                                SELECT id, title, author, status, body, keywords, category_id, unix_timestamp(created_at) as created_time  \
                                from content \
                                where is_deleted=0 and status=1 and created_at >= CURRENT_TIMESTAMP() - INTERVAL 15 MINUTE\
                                ;
}

#defining indexes
index content_index
{
       source           = content
       path             = /home/zaid/sphinx/data/mdn_content
       docinfo          = extern
}
index delta: content_index
{
      source            = delta
      path              = /home/zaid/sphinx/data/mdn_content_delta
      docinfo           = extern
}

#searchd daemon configurations
searchd
{
     listen   = localhost:9312
     listen   = 9312
     log      = /var/log/searchd.log
     pid_file = /var/log/searchd.pid
}

Here is an explanation of the parts of the configuration file:

1- Main data source block (mandatory):

source content
{
....
}

this block defines the data source whose contents sphinx will index. In this block you define three main groups of information:

  •     Data source connection parameters: the information needed to connect to your database, including database type, username, password, port, database name, etc.
    source content
    {
            #data source type
            type                    = mysql
            sql_host                = localhost
            sql_user                = root
            sql_pass                = root
            sql_db                  = your_db
            sql_port                = 3306
    
            ....
    }
  • Query fetch configurations: containing the main query that fetches the data from your source to be indexed in Sphinx. I used
    sql_query_pre to set the UTF-8 encoding of the incoming data, and
    sql_query to fetch the main data to be indexed, which are, in my case, non-deleted, approved news articles.
    source content
    {
            ....
            #main document fetch query
            sql_query_pre           =   SET NAMES utf8
            sql_query               = \
                                    SELECT id, title, author, status, body, keywords, category_id, unix_timestamp(created_at) as created_time  \
                                    from content \
                                    where is_deleted=0 and status=1 \
            ....
    }
  • Attribute declaration: the data fetched into Sphinx is full-text indexed, but you can also define attributes to be used for filtering, ordering and grouping as non-text fields. Here I used
    sql_attr_uint to define the status and category_id columns as unsigned integer attributes, and
    sql_attr_timestamp to define created_time as a timestamp.
    source content
    {
            ....
            #attribute declaration        
            sql_attr_uint           = status
            sql_attr_uint           = category_id
            sql_attr_timestamp      = created_time
    }

You do not have to define any attributes if there is no need for them.


2- Delta Source Block:
This block defines the data that was recently added, so we don’t have to run the Sphinx indexer over all the data; we run the indexer periodically only on the recently added content (the delta) in order to add it to the index.

#defining delta source
source delta: content {
       sql_query                = \
                                SELECT id, title, author, status, body, keywords, category_id, unix_timestamp(created_at) as created_time  \
                                from content \
                                where is_deleted=0 and status=1 and created_at >= CURRENT_TIMESTAMP() - INTERVAL 15 MINUTE\
                                ;
}

3- Index Block(s):
These define the indexes associated with the data sources. I defined one index for the main source and another for the delta source; each block contains the path where the index will be stored.

index content_index
{
       source           = content
       path             = /home/zaid/sphinx/data/mdn_content
       docinfo          = extern
}
index delta: content_index
{
      source            = delta
      path              = /home/zaid/sphinx/data/mdn_content_delta
      docinfo           = extern
}

4- Searchd Daemon Block:
searchd is the daemon that serves the search queries issued by clients and retrieves the results. Here you define the ports to listen on, the log file path and the PID (process ID) file path.

#searchd daemon configurations
searchd
{
     listen   = localhost:9312
     listen   = 9312
     log      = /var/log/searchd.log
     pid_file = /var/log/searchd.pid
}

Running Sphinx

Once you have your sphinx config file in place, you are ready to start indexing your data and serving search queries from Sphinx.

To index your data source, run the indexer:

indexer --all

I found the indexer pretty fast: it indexed my data source (which is about 1.5GB) in about 90 seconds!

After the data has been indexed, start the searchd daemon by simply typing:

searchd

To make sure that your searchd daemon is running, you can type:
netstat -nlp | grep searchd
netstat -nlp | grep 9312

Now we need the delta index to pick up new data automatically and merge it into the main index; place these entries in your crontab:

#I added this cronjob to run each 15minutes according to my delta query
*/15 * * * * /usr/bin/indexer --rotate --config /usr/local/etc/sphinx.conf delta
*/15 * * * * /usr/bin/indexer --config /usr/local/etc/sphinx.conf --merge content_index delta --rotate

Now, you are ready to go!

Running sphinx from Command Line

Now you can run search queries against your sphinx indexes using the search command, in the following format:

 search -i NAME_OF_INDEX -l LIMIT -o OFFSET SEARCH_PHRASE

Here is an example; this command searches for “jordan” in the content_index that we have just defined:

[zaid@localhost tmp]$ search -i content_index -l 10 -o 20 jordan
Sphinx 2.2.1-id64-dev (r4310)
Copyright (c) 2001-2013, Andrew Aksyonoff
Copyright (c) 2008-2013, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/usr/local/etc/sphinx.conf'...
index 'content_index': query 'jordan ': returned 62 matches of 62 total in 0.000 sec

displaying matches:
21. document=136385, category_id=11, created_time=Sun Feb 26 14:27:32 2012
22. document=138933, category_id=11, created_time=Mon Mar 12 15:39:15 2012
23. document=142949, category_id=11, created_time=Wed Apr  4 13:23:04 2012
24. document=152446, category_id=19, created_time=Sun May 27 14:41:34 2012
25. document=156444, category_id=11, created_time=Sun Jun 17 00:40:47 2012
26. document=180436, category_id=11, created_time=Mon Oct 22 11:03:01 2012
27. document=57574, category_id=1, created_time=Sun Oct  3 18:05:58 2010
28. document=62989, category_id=53, created_time=Tue Nov 30 19:11:22 2010
29. document=76606, category_id=11, created_time=Sat Mar 12 11:29:13 2011
30. document=80203, category_id=17, created_time=Wed Apr  6 23:59:56 2011

words:
1. 'jordan': 62 documents, 164 hits

Note that the results return the DocumentID (which is analogous to the id column in our SQL table), along with the other attributes we defined in the config file, namely category_id and created_time. The search was pretty fast: 0.000 sec reported for this query.

I am aware that I didn’t write any PHP code in this article; I will leave that to part 2 🙂

In the next article, I will write a simple PHP script that queries results from the indexes we created, and talk a little bit about filtering, grouping and ranking results in sphinx.

MySQL Slave monitor in Bash

I have several slave setups for different sites that I use for various purposes, like taking backups from the slave, or serving reporting/read-only operations for masters that have high write query rates.

But last week I found that a couple of the slaves had been stopped for some time without me noticing (one of them had not been working for about a month), which means I don’t have valid backups for the last month! So I decided I must automate monitoring of the slaves, so that I receive a notification whenever a slave stops working for one reason or another.

Why does a slave stop working?
MySQL replication synchronizes master data to the slave database asynchronously, in near real time. The master’s data is copied as-is to the slave by executing the same write queries (stored in the master’s binlog) on the slave, so an error in any one of these queries may cause the slave to halt.

For example: one time a client of mine tried to submit a very large article to his news portal, so to get it working I increased the value of the max_allowed_packet directive without applying the same configuration to the slave, and that caused the slave’s SQL_THREAD to stop working.
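
When the SQL thread has stopped on a single bad statement like that, one common way to resume replication, used with care since the skipped event is simply lost, is to skip the offending event:

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1; -- skip the event that broke the SQL thread
START SLAVE;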

How to check whether the slave is working?
The basic command to check the status of your slave is:

SHOW SLAVE STATUS\G

this command gives you summary information about your slave; the most useful pieces of information here are (see the sample output after the list):

  1. Slave_SQL_Running: indicates whether the thread responsible for executing the replicated SQL queries is running normally (an error in any query it executes will stop the thread and set values for Last_SQL_Error and Last_SQL_Errno)
  2. Slave_IO_Running: indicates whether the thread responsible for receiving the events sent from the master and writing them to the relay log on the slave is working
  3. Seconds_Behind_Master: how far the slave is behind the master (usually this should be minimal; if this value keeps increasing, it indicates a problem even if both SQL_THREAD and IO_THREAD are running)
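
For reference, on a healthy slave the relevant lines of the output look roughly like this (the values are illustrative):

mysql> SHOW SLAVE STATUS\G
...
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0
...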

So I searched the internet to see if there were scripts that perform this check, and I found a couple of useful links, but none contained all the functionality I needed, so I wrote my own bash script that monitors the slave and sends an alert once there is a problem replicating data from the master.
The script below does the following:

  1. extracts the information about the slave using the SHOW SLAVE STATUS command in mysql
  2. checks for slave problems (using the three pieces of information discussed above) and writes them to a log file in the tmp directory
  3. if there is a problem, sends the contents of the log file to the DBA so that he can fix it
#!/bin/bash
server_name="Slave #1 - Test DB" # change this to indicate which slave you are monitoring
admin_email='zaid@wewebit.com'   # email of the database administrator who receives the notifications

# change the mysql credentials in the following commands if you run the monitor as a user other than root
sql_thread_running=$(mysql -e "show slave status\G" | awk -F":" '/Slave_SQL_Running/ { print $2 }' | tr -d ' ')
io_thread_running=$(mysql -e "show slave status\G" | awk -F":" '/Slave_IO_Running/ { print $2 }' | tr -d ' ')
seconds_late=$(mysql -e "show slave status\G" | awk -F":" '/Seconds_Behind_Master/ { print $2 }' | tr -d ' ')
seconds_late=$(($seconds_late+0)) # force a numeric value (Seconds_Behind_Master can be NULL)

if [ "$sql_thread_running" = "No" ] || [ "$io_thread_running" = "No" ] || [ $seconds_late -gt 3600 ]; then

    log_file="/tmp/log_slave_status_$(date +%m-%d-%Y-%H:%M)"
    echo "Slave status report on $(date +%m-%d-%Y-%H:%M)" >> $log_file
    echo "Error in slave on $server_name" >> $log_file

    if [ "$sql_thread_running" = "No" ]; then
        echo "SQL thread not running" >> $log_file
    fi

    if [ "$io_thread_running" = "No" ]; then
        echo "IO thread not running" >> $log_file
    fi

    if [ $seconds_late -gt 3600 ]; then # format how the latency behind the master is displayed
        display_late="$seconds_late seconds"
        if [ $seconds_late -gt 60 ]; then
            display_late="$display_late = $(($seconds_late/60)) minutes"
        fi
        if [ $seconds_late -gt 3600 ]; then
            display_late="$display_late = $(($seconds_late/3600)) hours"
        fi
        echo "slave is behind master by $display_late" >> $log_file
    fi

    # send the alert to the admin
    mail -s "Slave status on $server_name" $admin_email < $log_file
    echo "Slave not running, alerts sent to the admins..."
else
    echo "slave is running normally, no problem detected :)"
fi

you can download the bash script from here
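
Before relying on cron, you can give the script a quick manual test (assuming you saved it as /usr/local/sbin/monitor_slave.sh, the path used in the crontab below):

chmod +x /usr/local/sbin/monitor_slave.sh
/usr/local/sbin/monitor_slave.sh # prints either the alert notice or the all-clear message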


Put the script in crontab
The final step is to put this simple script in crontab so that the status check runs periodically.
Just open the crontab file using:

crontab -e

I added it to run every half hour, as follows:

# some environments require that you set the shell path inside crontab in order to run properly
SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
*/30 * * * * /usr/local/sbin/monitor_slave.sh


Now, whenever there is a problem in one of my slaves, a message is delivered to my email with summary information about the problem, so I am no longer worried about my slaves’ status.