Pagination optimization for Symfony2 & Doctrine

In the last week, one of my websites, which was recently launched, has suffered from major performance problems, even though there is little traffic on it. The website is built on Symfony2 and acts as a directory for financial/forex companies. I started investigating the problem, and the culprit was quickly identified: the pagination!

One of the tables stored RSS feeds pulled from multiple financial news sources into our CMS content table, and in about one month it had grown to approximately 50,000 rows. That row count is not especially large, but the queries used by the pagination functionality (the knp_paginator bundle) caused me real problems, especially since I am deploying the site on a VPS with 2GB of RAM.

Background on limit/offset performance

Many developers who work with MySQL misunderstand limit/offset: a query does not necessarily become efficient just because you apply a limit to it. MySQL processes the query conditions, applies the ordering, and only then retrieves the desired number of rows. Internally, MySQL may perform a full table scan and filesort the result set in a file or temporary table before returning the rows you asked for, so limit/offset is applied at a late stage of query execution.

If you have an index on your columns (especially the ones in the "order by" clause) you will get better performance, but with a large offset (let's say offset 100,000 with limit 10) MySQL still has to step through 100,000 rows and throw them away just to return your desired 10 rows!

That's why pagination over large datasets is problematic in MySQL, and when you are using an ORM you have to make some extra effort to optimize your pagination queries, as illustrated below.
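To illustrate the point (the table name and numbers here are hypothetical, not from my site), a query like the following forces MySQL to step through offset + limit rows in order, only to discard the first 100,000 of them:

-- even with an index on created_at, MySQL must walk 100,010
-- index entries in order and throw away the first 100,000
select id, title
from articles
order by created_at desc
limit 10 offset 100000;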

Paging Problems in Symfony2 knp_paginator

Usually, when we need to paginate one of our tables in Doctrine ORM with Symfony2, we reach for knp_paginator: it's easy to use and provides simple paging functionality with just a couple of lines of code. However, when I looked more closely at how it performs, I found several points that create performance bottlenecks in the way it operates, especially on large data sets.
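For context, the baseline usage under discussion looks roughly like this in a Symfony2 controller (a sketch; $em is assumed to be the Doctrine entity manager, and the entity alias and field names are assumptions):

// typical Symfony2 + knp_paginator controller code (hypothetical entity alias)
$query = $em->createQuery(
    'SELECT c FROM WebitCMSBundle:Content c
     WHERE c.isPublished = true AND c.categoryId = :cat
     ORDER BY c.createdAt DESC'
)->setParameter('cat', 3);

$pagination = $this->get('knp_paginator')->paginate(
    $query,
    $this->getRequest()->query->get('page', 1), // current page number
    10                                          // items per page
);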

For the purpose of clarification, I will be using the cms_content table, which has the following structure:

mysql> desc cms_content;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int(11)      | NO   | PRI | NULL    | auto_increment |
| category_id  | int(11)      | NO   | MUL | NULL    |                |
| is_published | tinyint(1)   | NO   | MUL | NULL    |                |
| slug         | varchar(255) | NO   | MUL | NULL    |                |
| created_at   | datetime     | NO   | MUL | NULL    |                |
....
+--------------+--------------+------+-----+---------+----------------+

The columns I use most frequently in queries are is_published and category_id, and the ordering is usually based on the created_at column.

Counting Query:

In order to get the number of available pages, any paging library constructs a query that counts the results based on the parameters passed. The simplest counting query looks something like this:

SELECT COUNT(id)
FROM cms_content
WHERE category_id = 3
  AND is_published = true
ORDER BY created_at DESC;

When you EXPLAIN this query to check its performance, you will see:
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
| id | select_type | table       | type | possible_keys                        | key             | key_len | ref         | rows  | Extra       |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | cms_content | ref  | IDX_A0293FB812469DE2,secondary_index | secondary_index | 5       | const,const | 13972 | Using index |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+

As you can see, this query is fairly well optimized, because it uses a covering index (note "Using index" in the Extra column of the EXPLAIN result), which means MySQL answers the query by reading only the index, without touching the full data rows of the table.

Here I have created a composite index over the frequently used columns (is_published, category_id, created_at) to make the most of indexing across all these queries. Here are the indexes applied:
mysql> show index from cms_content;
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table       | Non_unique | Key_name             | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| cms_content |          0 | PRIMARY              |            1 | id           | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | IDX_A0293FB812469DE2 |            1 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            1 | is_published | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            2 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            3 | created_at   | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (0.25 sec)
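The exact DDL isn't shown above, but judging from the SHOW INDEX output, the composite index could be created with a statement along these lines (a reconstruction, not copied from the post):

-- equality columns first (is_published, category_id), then the order-by
-- column (created_at), so one index covers both filtering and sorting
alter table cms_content
  add index secondary_index (is_published, category_id, created_at);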

However, when I looked at the count query generated by the knp_paginator bundle, I found this complex query:

SELECT 
  COUNT(*) AS dctrn_count 
FROM 
  (
    SELECT 
      DISTINCT id0 
    FROM 
      (
        SELECT 
          c0_.id AS id0, 
          c0_.is_published AS is_published1, 
          c0_.slug AS slug2, 
          c0_.guid AS guid3, 
          c0_.rss_link AS rss_link4, 
          c0_.category_id AS category_id5, 
          c0_.created_at AS created_at6 
        FROM 
          cms_content c0_ 
        WHERE 
          c0_.is_published = true 
          AND c0_.category_id = 3 
        ORDER BY 
          c0_.created_at DESC
      ) dctrn_result
  ) dctrn_table

And here is the EXPLAIN result for that query:
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
| id | select_type | table      | type | possible_keys                        | key                  | key_len | ref  | rows  | Extra                        |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
|  1 | PRIMARY     | NULL       | NULL | NULL                                 | NULL                 | NULL    | NULL |  NULL | Select tables optimized away |
|  2 | DERIVED     | <derived3> | ALL  | NULL                                 | NULL                 | NULL    | NULL | 26775 | Using temporary              |
|  3 | DERIVED     | c0_        | ALL  | IDX_A0293FB812469DE2,secondary_index | IDX_A0293FB812469DE2 | 4       |      | 27944 | Using where; Using filesort  |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+

As shown, the query is not optimized at all! The subquery prevents MySQL from using the indexes effectively and requires a filesort to process its result, and the unnecessary DISTINCT keyword imposes additional overhead.
On my machine this query took 1.04 seconds to execute, compared with 0.01 seconds for the simple count query above, although the two have exactly the same logic.

Fetching Data Query:

The query that retrieves the page data is usually implemented by simply adding limit/offset clauses with the desired parameters. As I explained earlier, this becomes a performance problem for high page numbers, because it makes MySQL process far too many rows and then discard most of them.
For example, here is the default query to retrieve the data for page number 5,000 of my cms_content table:

SELECT 
  DISTINCT c0_.id AS id0, 
  c0_.created_at AS created_at1,
  c0_.slug as slug2
FROM 
  cms_content c0_ 
WHERE 
  c0_.is_published = ? 
  AND c0_.category_id = ? 
ORDER BY 
  c0_.created_at DESC 
LIMIT 
  5 OFFSET 24995

This may look like a simple query, but it forces MySQL to process a large number of rows and perform a filesort, and it leads to heavy random I/O, especially if you retrieve more data columns such as title, guid, etc. This cost keeps growing as the page number gets larger.

One of the best ways to optimize such queries is a deferred join, that is, joining against a smaller result set, as follows:

select id, created_at, slug
from cms_content
inner join (
    select id
    from cms_content
    where category_id = 3 and is_published = 1
    order by created_at desc
    limit 5 offset 24995
) as small_tbl
using (id)

This approach makes MySQL read as little data as possible. The subquery uses the covering index and retrieves a small result set at very low cost (few I/O operations); the full data rows are then fetched only for the handful of IDs the subquery returned.

Despite how this query is structured, it is a very efficient one, because it minimizes the cost of I/O operations. You can verify this by checking last_query_cost after executing it, and comparing it to the other approach:
show status like 'last_query_cost';

Doctrine Code

In order to apply the optimized pagination queries in my Symfony project, I had to write my own code and remove the knp_paginator calls.
First, I wrote a method that returns the result count used for paging, and placed it in the ContentRepository class:

    /**
     * Counts published content rows in a category (the optimized count query).
     */
    public function getContentCountByCategory($cat_id){

        $qb = $this->createQueryBuilder('c')
                ->select('count(c.id)')
                ->where('c.categoryId = :category_id')
                ->andWhere('c.isPublished = :published')
                ->setParameter('category_id', $cat_id)
                ->setParameter('published', true)
                ;

        // returns a single scalar: the number of matching rows
        return $qb->getQuery()->getSingleScalarResult();
    }

This method produces an optimized version of the count query.

Second, I had to write a method that retrieves the results using a deferred join. I tried to use the query builder, but unfortunately Doctrine doesn't support subqueries in the "from" clause, so I wrote a native SQL query with a ResultSetMapping, as follows:

    // requires: use Doctrine\ORM\Query\ResultSetMapping;
    public function getContentQueryByCategory($cat_id, $lang, $page = 1, $limit = 10) {

        $offset = ($page - 1) * $limit;

        // map the native SQL columns back onto the Content entity fields
        $rsm = new ResultSetMapping();
        $rsm->addEntityResult('Webit\CMSBundle\Entity\Content', 'c');
        $rsm->addFieldResult('c', 'id', 'id');
        $rsm->addFieldResult('c', 'created_at', 'createdAt');
        $rsm->addFieldResult('c', 'slug', 'slug');

        // deferred join: the subquery walks the covering index to find the
        // page's IDs, then full rows are fetched only for those IDs
        $query = $this->getEntityManager()->createNativeQuery("select id, created_at, slug from cms_content inner join "
                . "(select id from cms_content where category_id = ? and is_published = ? order by created_at desc limit ? offset ?) as small_tbl"
                . " using (id)", $rsm);
        $query->setParameter(1, $cat_id);
        $query->setParameter(2, true);
        $query->setParameter(3, $limit);
        $query->setParameter(4, $offset);

        return $query->getResult();
    }

This method produces an efficient query that keeps I/O operations inside MySQL to a minimum, and the query cost stays low even as the page number grows.
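Putting the two repository methods together in a controller might look like this (a sketch; the action name, repository alias, language value, and template are assumptions, not code from the post):

    // hypothetical Symfony2 controller action combining the two methods
    public function listAction($cat_id, $page)
    {
        $limit = 10;
        $repo  = $this->getDoctrine()->getRepository('WebitCMSBundle:Content');

        $total = $repo->getContentCountByCategory($cat_id);   // optimized COUNT
        $pages = (int) ceil($total / $limit);                 // number of pages
        $items = $repo->getContentQueryByCategory($cat_id, 'en', $page, $limit);

        return $this->render('WebitCMSBundle:Content:list.html.twig', array(
            'items' => $items,
            'page'  => $page,
            'pages' => $pages,
        ));
    }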

After implementing this solution, the queries used to produce page number 3,000 took about 46ms, compared to approximately 440ms with the original paginator code. More importantly, MySQL's memory is used far more effectively, because the queries exploit the indexes in a more efficient way.

Conclusion

In this post, I discussed the performance problems of pagination in a Doctrine & Symfony2 project and applied some techniques that handle limit/offset pagination more efficiently. There are other workarounds that also give good performance, like the one in this good post. This piece of code made my application perform noticeably better; I hope it helps others facing a similar situation.

20 thoughts on “Pagination optimization for Symfony2 & Doctrine”

  1. In my opinion it would be better to first execute and fetch:
    select id
    from cms_content
    where category_id = 3 and is_published=1
    order by created_at desc
    limit 5 OFFSET 24995
    and then use the IDs to fetch the full results with another query.
    Of course there is a small overhead due to the two queries, but this solution is much cleaner and easier to maintain than creating a native SQL query, which should only be used when there is no other way around it.

    1. The approach you suggested does the same thing in terms of query processing, and it will work well for small page sizes,
      but I don’t prefer it: it has limited flexibility and causes extra round trips to the database server, along with the intermediate data.

  2. knp_paginator implements events quite well; did you consider implementing custom paginate, sortable, or filtration subscribers?

    1. I don’t know whether the logic I discussed can be implemented via a custom subscriber;
      I will read more about it (and may update the article accordingly).

  3. Yes, the knp paginator creates a lot of overhead, but this native query solution works like a charm; the query execution time was significantly less than with the knp paginator. Thanks for this post !!! 🙂

  4. Although everything works just fine, as I reported in my previous comment, does the order by created_at desc really return objects in descending order? For me it didn’t, and always returned them in ascending order.
    I guess the order by clause should be outside the sub-query.
    Can you please verify this for your code?

    1. The query results are returned in descending order; make sure the sub-query that retrieves the IDs has the correct order by clause (and don’t remove the order by clause from the inner sub-query, because that will affect which rows are returned).

    1. Hello George,
      I looked into the Doctrine paginator, but it has similar performance issues: when getting the entity IDs for the current page, it performs a sub-query with a distinct identifier, which is not an optimized approach for large data sets.

  5. Hello Zaid,

    You can use KnpPaginator and just override the count function like this:

    $myCount = $em
        ->createQuery('SELECT COUNT(u) FROM Entity\User u WHERE u.id = :firstId AND u.guid = :secId')
        ->setParameters(array('firstId' => 4, 'secId' => 'something'))
        ->getSingleScalarResult()
    ;
    $userQuery = $em
        ->createQuery('SELECT u FROM Entity\User u')
        ->setHint('knp_paginator.count', $myCount)
    ;
    $pagination = $this->get('knp_paginator')->paginate($userQuery);

    1. Hello Gabriel,
      This approach is useful, and it improves performance nicely by overriding the count method; in fact, I use it when there isn’t much data being paged.
      The disadvantage relates to MySQL itself: as the page number increases (say, page 1,000), MySQL still scans the table until it reaches the required rows (throwing most of them away), which is really inefficient; in that case, using a covering index, as in the post, is the much better choice.

  6. Thanks Zaid! Kind of you to share the gist; next time you’re over in Scotland, let me know and I’ll buy you a pint! I’m currently dissecting KnpPaginatorBundle, so much fun. I want to build a reddit-like pagination bundle for one of my sites with only next and previous links; this will improve the performance dramatically, working on a dataset with 1.4m rows (phew!).

    1. You are welcome, Jonathan 🙂 The concept of the bundle you want to build sounds interesting; please let me know once it’s done, I would love to try it out.

  7. hi,

    how can I optimize my data handling if I query thousands of rows from my database? I am using Symfony 2.5.10. Can you help me?

    1. Hello,
      this is a general question, but as basic guidelines: try to reduce the data MySQL has to process by using the different types of indexes (a covering index can be quite useful), and reduce the data your application handles by rewriting your queries to return a reasonable number of results whenever possible.
