
Pagination optimization for Symfony2 & Doctrine

Over the last week, one of my websites – recently launched – suffered from major performance problems, even though there is little traffic on it. The website is built on Symfony2 and acts as a directory for financial/forex companies. I started investigating the problem, and the cause was quickly identified: the pagination was the culprit!

One of the tables pulls RSS feeds from multiple financial news sources and saves them in our CMS content table, and in about one month the table grew to approximately 50,000 rows. That number of rows is not especially large by itself, but the queries used by the pagination functionality (the knp_paginator bundle) were causing me problems, especially since I am deploying the site on a VPS with 2GB of RAM.

Background on limit/offset performance

Many developers working with MySQL misunderstand limit/offset: a query does not necessarily become efficient just because you apply a LIMIT to it. MySQL processes the query conditions and applies the ordering before it retrieves the desired number of rows, so internally it may perform a full table scan and filesort the resultset in a file or temporary table, and only then slice out the requested rows; limit/offset is applied at a late stage of query execution.

If you have an index on your columns (especially the ones in the ORDER BY clause), you will get better performance, but with a large offset (say 100,000 rows with a limit of 10) MySQL still has to read through all 100,000 skipped rows and throw them away just to return your desired 10 rows!
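For illustration, even with a suitable index, a query like the following against the cms_content table used throughout this post forces MySQL to walk past 100,000 index entries before returning anything:

-- MySQL reads 100,010 rows in created_at order and discards
-- the first 100,000 of them to produce these 10 rows
SELECT id, slug
FROM cms_content
WHERE is_published = 1
  AND category_id = 3
ORDER BY created_at DESC
LIMIT 10 OFFSET 100000;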

That’s why applying pagination to large datasets in MySQL is problematic, and when you are using an ORM you have to make some extra effort to optimize your pagination queries.

Paging Problems in Symfony2 knp_paginator

Usually when we need to paginate one of our tables with Doctrine ORM in Symfony2, we reach for knp_paginator, since it is easy to use and provides simple paging functionality with just a couple of lines of code. However, when I looked more closely at the way it performs, I found some points that form performance bottlenecks, especially for large datasets.
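For context, a typical knp_paginator setup looks something like the following controller action (a minimal sketch; the entity alias, template name, and parameter values are illustrative, not taken from my actual code):

use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;

class ContentController extends Controller
{
    public function listAction(Request $request)
    {
        $query = $this->getDoctrine()->getManager()->createQuery(
            'SELECT c FROM WebitCMSBundle:Content c
             WHERE c.categoryId = :cat AND c.isPublished = true
             ORDER BY c.createdAt DESC'
        )->setParameter('cat', 3);

        // knp_paginator derives its own count and limit/offset queries from this
        $pagination = $this->get('knp_paginator')->paginate(
            $query,
            $request->query->getInt('page', 1), // current page number
            10                                  // items per page
        );

        return $this->render('WebitCMSBundle:Content:list.html.twig', array(
            'pagination' => $pagination,
        ));
    }
}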

For the purpose of clarification, I will be using a cms_content table that has the following structure:

mysql> desc cms_content;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int(11)      | NO   | PRI | NULL    | auto_increment |
| category_id  | int(11)      | NO   | MUL | NULL    |                |
| is_published | tinyint(1)   | NO   | MUL | NULL    |                |
| slug         | varchar(255) | NO   | MUL | NULL    |                |
| created_at   | datetime     | NO   | MUL | NULL    |                |
....
+--------------+--------------+------+-----+---------+----------------+

The columns that I frequently use in queries are is_published and category_id, and the ordering is usually based on the created_at column.

Counting Query:

In order to get the number of pages available, any paging library constructs a query that counts the results based on the parameters passed. The simplest counting query looks something like this:

SELECT COUNT(id)
FROM cms_content
WHERE category_id = 3
  AND is_published = true
ORDER BY created_at DESC;

When you EXPLAIN this query to check its performance, you will see:
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
| id | select_type | table       | type | possible_keys                        | key             | key_len | ref         | rows  | Extra       |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | cms_content | ref  | IDX_A0293FB812469DE2,secondary_index | secondary_index | 5       | const,const | 13972 | Using index |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+

As you can see, this query is fairly well optimized: it uses a covering index (note "Using index" in the Extra column of the EXPLAIN result), which means MySQL answers the query by looking at the index alone, without reading the full data rows of the table.

Here I have created an index on the frequently used columns – is_published, category_id, created_at – so that all of these queries can take full advantage of indexing. Here are the indexes applied:
mysql> show index from cms_content;
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table       | Non_unique | Key_name             | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| cms_content |          0 | PRIMARY              |            1 | id           | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | IDX_A0293FB812469DE2 |            1 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            1 | is_published | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            2 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            3 | created_at   | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (0.25 sec)
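For reference, the composite secondary_index shown above can be created with a statement along these lines:

ALTER TABLE cms_content
    ADD INDEX secondary_index (is_published, category_id, created_at);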

However, when I looked at the count query used by the knp_paginator bundle, I found this complex query:

SELECT 
  COUNT(*) AS dctrn_count 
FROM 
  (
    SELECT 
      DISTINCT id0 
    FROM 
      (
        SELECT 
          c0_.id AS id0, 
          c0_.is_published AS is_published1, 
          c0_.slug AS slug2, 
          c0_.guid AS guid3, 
          c0_.rss_link AS rss_link4, 
          c0_.category_id AS category_id5, 
          c0_.created_at AS created_at6 
        FROM 
          cms_content c0_ 
        WHERE 
          c0_.is_published = true 
          AND c0_.category_id = 3 
        ORDER BY 
          c0_.created_at DESC
      ) dctrn_result
  ) dctrn_table

And here is the EXPLAIN result for the previous query:
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
| id | select_type | table      | type | possible_keys                        | key                  | key_len | ref  | rows  | Extra                        |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
|  1 | PRIMARY     | NULL       | NULL | NULL                                 | NULL                 | NULL    | NULL |  NULL | Select tables optimized away |
|  2 | DERIVED     | <derived3> | ALL  | NULL                                 | NULL                 | NULL    | NULL | 26775 | Using temporary              |
|  3 | DERIVED     | c0_        | ALL  | IDX_A0293FB812469DE2,secondary_index | IDX_A0293FB812469DE2 | 4       |      | 27944 | Using where; Using filesort  |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+

As shown, the query is not optimized at all! The use of subqueries prevents MySQL from using the indexes effectively and requires a filesort plus a temporary table to process the result, and the unnecessary DISTINCT keyword imposes additional overhead.
On my machine this query took 1.04 seconds to execute, compared with 0.01 seconds for the previous count query, although the two have exactly the same logic.

Fetching Data Query:

The query that retrieves the pagination data is usually implemented by simply adding limit/offset clauses with the desired parameters. As I explained earlier, this becomes a performance problem for large page numbers, because it causes MySQL to process far too many rows and then discard most of them.
For example, here is the default query to retrieve the data for page number 5,000 of my cms_content table:

SELECT 
  DISTINCT c0_.id AS id0, 
  c0_.created_at AS created_at1,
  c0_.slug as slug2
FROM 
  cms_content c0_ 
WHERE 
  c0_.is_published = ? 
  AND c0_.category_id = ? 
ORDER BY 
  c0_.created_at DESC 
LIMIT 
  5 OFFSET 24995

This query may look simple, but it causes MySQL to process a large number of rows and perform a filesort, and it leads to heavy random I/O, especially if you retrieve more data columns such as title, guid, etc. This cost keeps increasing as the page number gets larger.

One of the best ways to optimize such queries is to use a deferred join, that is, joining with a smaller resultset, as follows:

SELECT id, created_at, slug
FROM cms_content
INNER JOIN (
    SELECT id
    FROM cms_content
    WHERE category_id = 3 AND is_published = 1
    ORDER BY created_at DESC
    LIMIT 5 OFFSET 24995
) AS small_tbl
USING (id);

This method makes MySQL read as little data as possible: the sub-query uses the covering index to retrieve a small resultset at very low cost (few I/O operations), and then the full data rows are fetched only for the small number of IDs returned by the sub-query.

Despite how this query is structured, it is a very efficient one, because it minimizes the cost of I/O operations. You can verify that by checking last_query_cost after executing it and comparing the value with the other approach:

SHOW STATUS LIKE 'last_query_cost';

Doctrine Code

In order to apply the optimized pagination queries in my Symfony project, I had to write my own code and remove the knp_paginator calls.
First, I wrote a method that gets the count of results used in paging, placing it in the ContentRepository class:

    public function getContentCountByCategory($cat_id){
        
        $qb = $this->createQueryBuilder('c')
                ->select('count(c.id)')                                
                ->where('c.categoryId = :category_id')
                ->andWhere('c.isPublished = :published')
                ->setParameter('category_id', $cat_id)
                ->setParameter('published', true)
                ;
        
        return $qb->getQuery()->getSingleScalarResult();                
    }

This method produces an optimized version of the count query.
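The SQL that Doctrine generates from this builder should look roughly like the following (the exact column aliases are Doctrine's own and may differ):

SELECT COUNT(c0_.id) AS sclr0
FROM cms_content c0_
WHERE c0_.category_id = ? AND c0_.is_published = ?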

Second, I had to write a method that retrieves the results with a deferred join. I tried to use the query builder, but unfortunately Doctrine doesn't support subqueries in the "from" clause, so I wrote a native SQL query with a ResultSetMapping, as follows:

    // Requires: use Doctrine\ORM\Query\ResultSetMapping;
    public function getContentQueryByCategory($cat_id, $lang, $page = 1, $limit = 10) {

        $offset = ($page - 1) * $limit;

        // Map the columns of the native query back onto the Content entity
        $rsm = new ResultSetMapping();
        $rsm->addEntityResult('Webit\CMSBundle\Entity\Content', 'c');
        $rsm->addFieldResult('c', 'id', 'id');
        $rsm->addFieldResult('c', 'created_at', 'createdAt');
        $rsm->addFieldResult('c', 'slug', 'slug');

        // Deferred join: the sub-query pages over the covering index,
        // then the full rows are fetched only for the matching IDs
        $query = $this->getEntityManager()->createNativeQuery("select id, created_at, slug from cms_content inner join "
                . "(select id from cms_content where category_id = ? and is_published = ? order by created_at desc limit ? offset ?) as small_tbl"
                . " using (id)", $rsm);
        $query->setParameter(1, $cat_id);
        $query->setParameter(2, true);
        $query->setParameter(3, (int) $limit);
        $query->setParameter(4, (int) $offset);

        return $query->getResult();
    }

This method produces an efficient query that reduces I/O operations inside MySQL, and its cost remains minimal even as the number of pages increases.
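To tie the two repository methods together, a controller action can compute the page count from the first and fetch the current page's rows with the second. A minimal sketch, assuming an illustrative bundle alias, template name, and language parameter:

use Symfony\Bundle\FrameworkBundle\Controller\Controller;

class ContentController extends Controller
{
    public function listAction($page = 1)
    {
        $limit = 10;
        $repo = $this->getDoctrine()->getRepository('WebitCMSBundle:Content');

        // Optimized count query, used for rendering the pager links
        $total = $repo->getContentCountByCategory(3);
        $pageCount = (int) ceil($total / $limit);

        // Deferred-join query for the current page's rows
        $items = $repo->getContentQueryByCategory(3, 'en', $page, $limit);

        return $this->render('WebitCMSBundle:Content:list.html.twig', array(
            'items'     => $items,
            'page'      => $page,
            'pageCount' => $pageCount,
        ));
    }
}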

After implementing this solution, the queries used to produce page number 3,000 took about 46ms, compared to approximately 440ms with the original paginator code. More importantly, MySQL's memory usage improved, because the queries use the indexes in a much more efficient way.

Conclusion

In this post, I discussed the performance problems of pagination in a Doctrine & Symfony2 project, and I applied some techniques that handle limit/offset operations in pagination more efficiently. There are other workarounds that also give good performance, like the one described in this good post. This piece of code made my application perform better; I hope it helps others who face a similar situation.