
Using Sonata admin bundle with multiple connections

Last month I participated in developing a solution that provides reporting features using Symfony and deals with data partitioned over multiple MySQL servers. The data was sharded between two reporting databases with identical structure; each database stores the trading history of a separate company branch/trading system.
So basically, there is a bundle that deals with multiple connections – and thus different entity managers – dynamically. Handling those connections in Symfony is straightforward, but for Sonata admin bundle it needed some work to get it working.

Project Structure

The project contains a bundle called WebitReportingBundle, which holds the entities related to the reporting database; I mainly use an entity called MT4Trades for this post (it holds the trades performed by clients).
This bundle uses two entity managers, depending on the application context. Here is the portion of config.yml containing the entity manager/connection mappings:

    orm:
        default_entity_manager: default
        entity_managers:
            default:
                connection: default
                mappings:
                    FOSUserBundle: ~
                    ApplicationSonataUserBundle: ~
                    SonataUserBundle: ~
                    AppBundle: ~                    
            reporting_real:
                connection: reporting_real
                mappings:
                    WebitReportingBundle: ~
            reporting_real2:
                connection: reporting_real2
                mappings:
                    WebitReportingBundle: ~

In the config file above, there are two extra entity managers (reporting_real & reporting_real2), each handling the connection to a separate database.

In Symfony, the entity manager can be specified when retrieving a Doctrine repository, simply as follows:

<?php

namespace Webit\ReportingBundle\Controller;

use Symfony\Bundle\FrameworkBundle\Controller\Controller;
use Symfony\Component\HttpFoundation\Request;

class DefaultController extends Controller {
    
    public function indexAction(Request $request) {
        //retrieve a trade from server 1
        $obj_server1 = $this->getDoctrine()
            ->getRepository('WebitReportingBundle:MT4Trades', 'reporting_real')
            ->findOneBy(['ticket'=>42342]);
                
        //retrieve a trade from server 2
        $obj_server2 = $this->getDoctrine()
            ->getRepository('WebitReportingBundle:MT4Trades', 'reporting_real2')
            ->findOneBy(['ticket'=>123123]);
    }
}

But when I wanted to do the same in the Sonata admin module, there was no easy way to handle it.

How Sonata Admin Bundle handles multiple connections

When you create a Sonata admin class for a certain entity with "SonataDoctrineORMAdminBundle", the "Admin" class determines the entity manager to use by calling the "getManagerForClass()" method on the associated "ModelManager" object.

As in the following code, which is taken from the ModelManager class inside SonataDoctrineORMAdminBundle:

<?php 
namespace Sonata\DoctrineORMAdminBundle\Model;

class ModelManager implements ModelManagerInterface, LockInterface
{
    public function getEntityManager($class)
    {
        if (is_object($class)) {
            $class = get_class($class);
        }

        if (!isset($this->cache[$class])) {
            //detect entity manager based on class name automatically
            $em = $this->registry->getManagerForClass($class);

            if (!$em) {
                throw new \RuntimeException(sprintf('No entity manager defined for class %s', $class));
            }

            $this->cache[$class] = $em;
        }

        return $this->cache[$class];
    }
}

However, this method becomes an issue when dealing with entities spread over sharded databases: in my case, the first matching entity manager will always be selected – according to the mappings defined in config.yml – and there is no direct way to specify the preferred entity manager for the admin class.

Solution

In order to work around this problem, I performed the following:

  1. Define a custom model manager extending the ModelManager class, and override the "getEntityManager" and "createQuery" methods to use a specific entity manager when one is defined in the class, as follows:
    <?php 
    namespace Webit\ReportingBundle\Model;
    
    use Sonata\DoctrineORMAdminBundle\Model\ModelManager as BaseModelManager;
    use Sonata\DoctrineORMAdminBundle\Datagrid\ProxyQuery;
    
    class CustomModelManager extends BaseModelManager
    {
        /** @var string $emName preferred entity manager name */
        protected $emName;
        
        /**
         * set preferred entity manager name to be used in ModelManager
         * @param string $name
         */
        public function setEntityManagerName($name){        
            $this->emName = $name;
        }
    
        /**
         * {@inheritdoc}
         */    
        public function createQuery($class, $alias = 'o') {
            //getEntityManager() below already resolves the preferred entity manager,
            //so the repository (and its query builder) is created from the correct connection
            $repository = $this->getEntityManager($class)
                               ->getRepository($class);
    
            return new ProxyQuery($repository->createQueryBuilder($alias));
        }
        
        /**
         * {@inheritdoc}
         */
        public function getEntityManager($class) {        
            if (is_object($class)) {
                $class = get_class($class);
            }                
            if (isset($this->cache[$class]) === false) {
                //return fixed value if preferred entity manager name specified
                if (isset($this->emName) === true) {
                    $this->cache[$class] = $this->registry->getEntityManager($this->emName);
                } else {
                    $this->cache[$class] = parent::getEntityManager($class);
                }
            }
    
            return $this->cache[$class];
        }
    }
    
  2. Add the new model manager as a service and associate it with the admin service entries of the entity:
    services:        
        reporting.model_manager:
             class: Webit\ReportingBundle\Model\CustomModelManager
             arguments:
                 - '@doctrine'
    
        reporting.admin.trades:
            class: Webit\ReportingBundle\Admin\MT4TradesAdmin
            tags:
               - { name: sonata.admin, manager_type: orm, group: webit.admin.group.reporting, label: "Trades (server 1)", pager_type: "simple" }
            arguments: [null,Webit\ReportingBundle\Entity\MT4Trades, WebitReportingBundle:Admin\MT4Trades]        
            calls:
               - [setModelManager, ['@reporting.model_manager'] ]  
               
    
        reporting.admin.trades2:
            class: Webit\ReportingBundle\Admin\MT4Trades2Admin
            tags:
               - { name: sonata.admin, manager_type: orm, group: webit.admin.group.reporting, label: "Trades (server 2)", pager_type: "simple" }
            arguments: [null,Webit\ReportingBundle\Entity\MT4Trades, WebitReportingBundle:Admin\MT4Trades]        
            calls:
               - [setModelManager, ['@reporting.model_manager'] ]
               
                       
    Here I have two admin classes for the same entity; each one will use a different connection.
  3. Define the entity manager name inside the admin classes by overriding the setModelManager() method, as follows:
    <?php
    
    namespace Webit\ReportingBundle\Admin;
    
    use Sonata\AdminBundle\Admin\Admin;
    
    /**
     * admin for MT4Trades (Server 1)
     */
    class MT4TradesAdmin extends Admin{
    
        protected $emName = 'reporting_real';
        protected $baseRouteName = 'admin_webit_reporting_mt4trades';    
        protected $baseRoutePattern = 'admin-reporting-mt4trades-server';
         
        public function setModelManager(\Sonata\AdminBundle\Model\ModelManagerInterface $modelManager) {
            parent::setModelManager($modelManager);
            //override setModelManager to specify the preferred entity manager for this admin class
            $modelManager->setEntityManagerName($this->emName); 
        } 
        
        //remaining admin class methods...
    }
       
        
       

    and here is the other admin class, which connects to the second database:
    <?php
    
    namespace Webit\ReportingBundle\Admin;
    
    /**
     * admin for MT4Trades (Server 2)
     */
    class MT4Trades2Admin extends MT4TradesAdmin{
    
        protected $emName = 'reporting_real2'; //second entity manager will be used
        
        //specify route name and pattern to get two admin classes work with same entity
        protected $baseRouteName = 'admin_webit_reporting_mt4trades2';    
        protected $baseRoutePattern = 'admin-reporting-mt4trades-server2';
         
        //...
    }

    I extended the second admin class from the first, so they have identical configuration and logic, with only the connection being different.

Conclusion

Now I have two identical admin classes for the same entity, each correctly using a different entity manager. If an additional server is deployed in the future, I just have to define a new admin class – along with a new connection/entity manager – and set the correct $emName value.
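
For example, if a hypothetical third reporting server were added, the extra admin class would only need to point at its entity manager (class, entity manager and route names below are illustrative):

<?php

namespace Webit\ReportingBundle\Admin;

/**
 * admin for MT4Trades (hypothetical Server 3)
 */
class MT4Trades3Admin extends MT4TradesAdmin{

    protected $emName = 'reporting_real3'; //entity manager of the newly deployed server

    //distinct route name/pattern so the admin classes don't collide
    protected $baseRouteName = 'admin_webit_reporting_mt4trades3';
    protected $baseRoutePattern = 'admin-reporting-mt4trades-server3';

    //...
}

along with a matching service entry (tagged with sonata.admin and calling setModelManager), like the ones shown above.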


Applying version stamp to Symfony & SonataAdmin

I spend a lot of time in my job developing and maintaining back-office systems that process workflow logic for Forex portals. Our systems are built with Symfony2 and depend heavily on SonataAdminBundle, with deep customization to impose the business rules of Forex companies.

Recently, some data inconsistency appeared in one of our clients' systems. After digging into the logs, I found that the cause of the problem was two users processing the same user application at almost the same time, which caused one edit operation to override the other, with several other undesired consequences following from that.

So, in order to fix this issue and prevent it from happening again, I worked on adding a "version stamp" to my entity to maintain offline data consistency within the application, and I would like to share here what I learned.

 

Version Stamps and offline consistency

A version stamp is a field that changes every time a write operation is performed on the data; it is used to ensure that nobody else has changed the row before your modification is applied.
There are several ways to implement version stamps. The simplest is an integer value that is noted on read, compared with the submitted data before write, and, once validated, incremented as the write operation takes place.
Let's say there is a form bound to an entity in a Symfony application. The version stamp column is present in the entity and added as a hidden field in the form; once submitted, the version stamp value is compared to the one currently in the database, to ensure that no other edit was performed on that entity in the time between displaying the form and submitting it. If the check fails, the update operation is rejected, by adding an error via a constraint.
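
To make the idea concrete, here is a minimal sketch of the check itself, in plain PHP with purely illustrative variable names and values (the actual Sonata/Doctrine implementation follows below):

<?php
$submittedVersionStamp = 5; //read from the hidden field that was rendered when the form was displayed
$currentVersionStamp   = 5; //value currently stored in the database row

if ($submittedVersionStamp != $currentVersionStamp) {
    //someone else saved the row after the form was rendered: reject the write
    throw new \RuntimeException('Record is outdated, please reload and try again.');
}

//otherwise apply the update and bump the stamp along with it, e.g.
//UPDATE ... SET version_stamp = version_stamp + 1 WHERE id = :id AND version_stamp = :submitted_stamp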

Implementation in SonataAdmin

The implementation of this concept is quite simple, as follows:
In the desired entity, I applied these modifications:

  1. Define a new field to hold the stamp value.
  2. Mark the field as a version stamp using the Doctrine @Version annotation; this causes Doctrine to update the versionStamp field every time the object is persisted to the database:
<?php

namespace Webit\ForexCoreBundle\Entity;

use Doctrine\ORM\Mapping as ORM;

/**
 * RealProfile
 *
 * @ORM\Table(name="forex_real_profile")
 * @ORM\Entity(repositoryClass="Webit\ForexCoreBundle\Repository\RealProfileRepository")
 * @ORM\HasLifecycleCallbacks()
 */
class RealProfile
{
    
    /**
     * @var integer
     *
     * @ORM\Column(name="id", type="integer", nullable=false)
     * @ORM\Id
     * @ORM\GeneratedValue(strategy="IDENTITY")
     */
    private $id;
    
    /**
     * @var integer
     *
     * @ORM\Version 
     * @ORM\Column(name="version_stamp", type="integer")
     */
    private $versionStamp;
    
    /* Other columns, getters, setters here */

    
}

In the Admin class, the following is added:

  1. In the configureFormFields() method, the version stamp is added as a hidden field; I also set the mapped option to false, to prevent its value from being persisted along with the form. The version stamp value must only be modified when the object itself is persisted (Doctrine handles this through the @Version annotation), never directly from the submitted form data.
    <?php
    
    namespace Webit\ForexCoreBundle\Admin;
    
    use Sonata\AdminBundle\Admin\Admin;
    use Sonata\AdminBundle\Datagrid\ListMapper;
    use Sonata\AdminBundle\Datagrid\DatagridMapper;
    use Webit\ForexCoreBundle\Entity\RealProfile;
    
    class RealAccountsAdmin extends Admin
    {
        protected function configureFormFields(\Sonata\AdminBundle\Form\FormMapper $formMapper)
        {
            $formMapper->add('versionStamp','hidden',array('attr'=>array("hidden" => true, 'mapped'=>false)));
            //other fields and groups...
        }
    }
  2. Here is the important point: validating the versionStamp posted from the form against the one already saved in the database. There are two ways to do that: one is using the Doctrine locking mechanism, and the other is using Sonata inline validation, implementing the validate() method to add an extra validation layer.

    option 1:

    class RealAccountsAdmin extends Admin
    {
        /**
         * {@inheritdoc}
         */    
        public function getObject($id)
        {
            $uniqid = $this->getRequest()->query->get('uniqid');
            $formData = $this->getRequest()->request->get($uniqid);        
            
            $object = $this->getModelManager()->find($this->getClass(), 
                    $id, 
                    \Doctrine\DBAL\LockMode::PESSIMISTIC_WRITE, 
                    $formData['versionStamp']);
            
            foreach ($this->getExtensions() as $extension) {
                $extension->alterObject($this, $object);
            }
    
            return $object;
        }
    
        /**
         * {@inheritdoc}
         */      
        public function update($object) {
            try{
                parent::update($object);
            }catch(\Doctrine\ORM\OptimisticLockException $e) {
                $this->getConfigurationPool()->getContainer()->get('session')
                        ->getFlashBag()->add('sonata_flash_error', 'someone modified the object in between');
            }
        }
    }

    This approach will take advantage of Doctrine's locking support. Here is a brief explanation:

    • I have overridden the getObject() method in the admin class to pass two extra parameters to the getModelManager()->find() method:
      the third parameter indicates the locking type (I used LockMode::PESSIMISTIC_WRITE here);
      the fourth parameter represents the expected version stamp value, to compare with the database value before flushing.
    • I have overridden the update($object) method so that I can catch the OptimisticLockException exception and add an error flash message to handle it.

    option 2:
    In this approach, I used Sonata inline validation to mark the form as invalid before even trying to persist and flush the object to the database:

        /**
         * {@inheritdoc}
         */
        public function validate(\Sonata\AdminBundle\Validator\ErrorElement $errorElement, $object) { 
            //get all submitted data (with non-mapped fields)
            $uniqid = $this->getRequest()->query->get('uniqid');
            $formData = $this->getRequest()->request->get($uniqid);
            $submitted_version_stamp = $formData['versionStamp'];        
            
            $em = $this->getConfigurationPool()
                  ->getContainer()->get('doctrine')->getManager();
            
            //get up-to-date version stamp value from the database
            $class_name = get_class($object);        
            $q = $em->createQuery("select partial o.{id,versionStamp} from $class_name o"
                                    . " where o.id=".$object->getId());        
            $saved_data = $q->getArrayResult();        
            $saved_version_stamp = $saved_data[0]['versionStamp'];
            
            //compare version stamps and add violation in case it didn't match
            if($saved_version_stamp != $submitted_version_stamp){
                $errorElement->addViolation('Record data seems outdated, probably someone else modified it, please refresh and try again.')->end();
            }
        }
        
    Here are more details about the operations performed inside this method:

    • To get the versionStamp value submitted via the form inside that method, I used:
      $uniqid = $this->getRequest()->query->get('uniqid');
      $formData = $this->getRequest()->request->get($uniqid);
      $submitted_version_stamp = $formData['versionStamp'];
    • To get the up-to-date value of versionStamp stored in the database, I used a Doctrine query that retrieves a partial object:
      $em = $this->getConfigurationPool()
                 ->getContainer()->get('doctrine')->getManager();
              
      $class_name = get_class($object);        
      $q = $em->createQuery("select partial o.{id,versionStamp} from $class_name o"
              . " where o.id=".$object->getId());        
      $saved_data = $q->getArrayResult();        
      $saved_version_stamp = $saved_data[0]['versionStamp'];

      *If you retrieve the whole object from the database again, it will cause many issues, especially if the Doctrine result cache is enabled.
    • Then the two values are compared with each other; if they are not equal, an error will appear to the user, which is done by calling the $errorElement->addViolation() method:
      if($saved_version_stamp != $submitted_version_stamp){
          $errorElement->addViolation('Record data seems outdated, probably someone else modified it, please refresh and try again.')->end();
      }

That's all; now I can perform some basic tests.

Test Solution

In order to verify that this mechanism solves the issue, I emulated the inconsistency behaviour by opening the Sonata admin edit page in two browsers, then modifying the data and submitting it in each browser in turn.
The browser whose form was submitted last will not save its data, and this error will appear:
"Record data seems outdated, probably someone else modified it, please refresh and try again."
In that way, the inconsistency is prevented by stopping the second user from overriding the information modified by the other user.

At last

The "version stamp" approach helped me prevent data inconsistency in my Symfony/SonataAdmin application; I hope it helps others who face a similar scenario. I would like to know if anyone else has another idea or a better way to handle this issue.

Pagination optimization for Symfony2 & Doctrine

Last week, one of my websites – which was launched recently – suffered from major performance problems, even though there is little traffic on it. The website is built on Symfony2 and acts as a directory of financial/forex companies. I started investigating the problem, and the reason was quickly identified: the pagination was the culprit!

One of the tables holds RSS feeds pulled from multiple financial news sources and saved into our CMS content table, and in about one month the table contained approximately 50,000 rows. That number of rows is relatively modest, but the queries used by the pagination functionality (the knp_paginator bundle) caused me some problems, especially since I am deploying the site on a VPS with 2GB of RAM.

Background on limit/offset performance

Many developers dealing with MySQL misunderstand limit/offset functionality: your query does not necessarily become efficient just because you applied a limit to it. MySQL processes the query conditions and applies the ordering, and only then retrieves the desired number of rows; internally, MySQL may perform a full table scan and filesort the result set in a file or temporary table before retrieving the desired rows, so applying limit/offset comes at the later stages of query execution.

If you are using an index on your columns (especially for the "order by" clause), you will get better performance, but still, when applying a large offset (let's say 100,000 rows with limit 10), MySQL will have to process 100,000 rows and then throw away the unneeded ones to return just your desired 10 rows!

That's why it's problematic to apply pagination to a large dataset in MySQL, and when you are using an ORM you have to make some extra effort to optimize your pagination queries.
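
For reference, the naive ORM approach looks something like this sketch (an illustrative repository method, assuming a Doctrine entity with categoryId, isPublished and createdAt fields); it translates directly into the LIMIT/OFFSET query discussed above:

    public function getContentByCategoryNaive($cat_id, $page = 1, $limit = 10){
        
        //straightforward QueryBuilder pagination: produces "... LIMIT 10 OFFSET n",
        //which gets slower and slower as the offset grows
        return $this->createQueryBuilder('c')
                ->where('c.categoryId = :category_id')
                ->andWhere('c.isPublished = :published')
                ->setParameter('category_id', $cat_id)
                ->setParameter('published', true)
                ->orderBy('c.createdAt', 'DESC')
                ->setFirstResult(($page - 1) * $limit)
                ->setMaxResults($limit)
                ->getQuery()
                ->getResult();
    }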

Paging Problems in Symfony2 knp_paginator

Usually, when we need to paginate one of our tables in Doctrine ORM with Symfony2, we would use knp_paginator, since it's easy to use and provides simple paging functionality with just a couple of lines of code. However, when I looked more closely at the way it performs, I found some points that form performance bottlenecks in the way it operates, especially for large data sets.

For the purpose of clarification, I will be using a cms_content table that has the following structure:

mysql> desc cms_content;
+--------------+--------------+------+-----+---------+----------------+
| Field        | Type         | Null | Key | Default | Extra          |
+--------------+--------------+------+-----+---------+----------------+
| id           | int(11)      | NO   | PRI | NULL    | auto_increment |
| category_id  | int(11)      | NO   | MUL | NULL    |                |
| is_published | tinyint(1)   | NO   | MUL | NULL    |                |
| slug         | varchar(255) | NO   | MUL | NULL    |                |
| created_at   | datetime     | NO   | MUL | NULL    |                |
....
+--------------+--------------+------+-----+---------+----------------+

The columns that I frequently use in the queries are is_published and category_id, and the ordering is usually based on the created_at column.

Counting Query:

In order to get the number of pages available, any paging library constructs a query that counts the results based on the parameters passed; the simplest counting query will look something like this:

SELECT COUNT(id)
FROM cms_content
where
category_id = 3 and
is_published = true
order by created_at desc;

So when you EXPLAIN this query in order to check its performance, you will see:
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
| id | select_type | table       | type | possible_keys                        | key             | key_len | ref         | rows  | Extra       |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+
|  1 | SIMPLE      | cms_content | ref  | IDX_A0293FB812469DE2,secondary_index | secondary_index | 5       | const,const | 13972 | Using index |
+----+-------------+-------------+------+--------------------------------------+-----------------+---------+-------------+-------+-------------+

As you can see, this query is fairly well optimized, because it uses a covering index (see "Using index" in the Extra cell of the EXPLAIN result), which means that MySQL performs this query by looking only at the index, without reading the full data rows of the table.

Here I have created an index on the frequently used columns – is_published, category_id, created_at – to utilize indexing and improve all the queries; see the indexes applied:
mysql> show index from cms_content;
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table       | Non_unique | Key_name             | Seq_in_index | Column_name  | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| cms_content |          0 | PRIMARY              |            1 | id           | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | IDX_A0293FB812469DE2 |            1 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            1 | is_published | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            2 | category_id  | A         |           1 |     NULL | NULL   |      | BTREE      |         |               |
| cms_content |          1 | secondary_index      |            3 | created_at   | A         |       27620 |     NULL | NULL   |      | BTREE      |         |               |
+-------------+------------+----------------------+--------------+--------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
6 rows in set (0.25 sec)

However, when I checked the count query used by the knp_paginator bundle, I got this complex query:

SELECT 
  COUNT(*) AS dctrn_count 
FROM 
  (
    SELECT 
      DISTINCT id0 
    FROM 
      (
        SELECT 
          c0_.id AS id0, 
          c0_.is_published AS is_published1, 
          c0_.slug AS slug2, 
          c0_.guid AS guid3, 
          c0_.rss_link AS rss_link4, 
          c0_.category_id AS category_id5, 
          c0_.created_at AS created_at6 
        FROM 
          cms_content c0_ 
        WHERE 
          c0_.is_published = true 
          AND c0_.category_id = 3 
        ORDER BY 
          c0_.created_at DESC
      ) dctrn_result
  ) dctrn_table

and here is the explain result for the previous query:
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
| id | select_type | table      | type | possible_keys                        | key                  | key_len | ref  | rows  | Extra                        |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+
|  1 | PRIMARY     | NULL       | NULL | NULL                                 | NULL                 | NULL    | NULL |  NULL | Select tables optimized away |
|  2 | DERIVED     | <derived3> | ALL  | NULL                                 | NULL                 | NULL    | NULL | 26775 | Using temporary              |
|  3 | DERIVED     | c0_        | ALL  | IDX_A0293FB812469DE2,secondary_index | IDX_A0293FB812469DE2 | 4       |      | 27944 | Using where; Using filesort  |
+----+-------------+------------+------+--------------------------------------+----------------------+---------+------+-------+------------------------------+

As shown, the query is not optimized at all! The use of a subquery eliminated the use of the indexes and required a filesort to process its result, and there is an unnecessary DISTINCT keyword that imposes some overhead on the query.
On my machine this query took 1.04 seconds to execute, compared with 0.01 seconds for the previous count query, although they have exactly the same logic.

Fetching Data Query:

The query that retrieves the pagination data is usually implemented by simply adding limit/offset clauses with the desired parameters. As I explained earlier, this can be a performance problem for large page numbers, because it causes MySQL to process too many rows and then discard most of them.
For example, here is the default query to retrieve the data of page number 5,000 of my cms_content table:

SELECT 
  DISTINCT c0_.id AS id0, 
  c0_.created_at AS created_at1,
  c0_.slug as slug2
FROM 
  cms_content c0_ 
WHERE 
  c0_.is_published = ? 
  AND c0_.category_id = ? 
ORDER BY 
  c0_.created_at DESC 
LIMIT 
  5 OFFSET 24995

This query may seem like a simple query, but it will cause MySQL to process a large number of rows and perform a filesort, and it will lead to heavy random I/O, especially if you want to retrieve more data columns such as title, guid, etc.; this cost will increase as the page number gets larger.

One of the best ways to optimize such queries is by using a deferred join, that is, joining with a smaller result set, as follows:

select id, created_at, slug 
from cms_content 
inner join 
(
    select id 
    from cms_content 
	where category_id = 3 and is_published=1
	order by created_at desc 
	limit 5 OFFSET 24995
)  as small_tbl 
using (id)

This method is used to make MySQL read as little data as possible. If you check the sub-query, it uses a covering index and retrieves a small result set at very low cost (few I/O operations); then the full data rows are fetched according to the small number of IDs retrieved by the sub-query.

Despite how this query is structured, it is a very efficient one, because it minimizes the cost of I/O operations. You can verify that by checking last_query_cost after executing it and comparing it to the other approach:
show status like 'last_query_cost';

Doctrine Code

In order to apply the optimized pagination queries in my Symfony project, I had to write my own code and remove the knp_paginator calls.
First, I wrote a method to get the count of results used in paging, placing it in the ContentRepository class:

    public function getContentCountByCategory($cat_id){
        
        $qb = $this->createQueryBuilder('c')
                ->select('count(c.id)')                                
                ->where('c.categoryId = :category_id')
                ->andWhere('c.isPublished = :published')
                ->setParameter('category_id', $cat_id)
                ->setParameter('published', true)
                ;
        
        return $qb->getQuery()->getSingleScalarResult();                
    }

This method produces an optimized version of the count query.

Second, I had to write a method to retrieve the results with a deferred join. I tried to use the query builder, but unfortunately Doctrine doesn't support subqueries in the "from" clause, so I wrote a native SQL query with a ResultSetMapping, as follows:

    public function getContentQueryPyCategory($cat_id, $lang, $page=1, $limit=10) {

        $offset = ($page-1)*$limit;
        
        $rsm = new ResultSetMapping();
        $rsm->addEntityResult('Webit\CMSBundle\Entity\Content', 'c');
        $rsm->addFieldResult('c', 'id', 'id');
        $rsm->addFieldResult('c', 'created_at', 'createdAt');
        $rsm->addFieldResult('c', 'slug', 'slug');
        
        $query = $this->getEntityManager()->createNativeQuery("select id, created_at, slug from cms_content inner join "
                . "(select id from cms_content where category_id = ? and is_published=? order by created_at desc limit ? offset ?) as small_tbl"
                . " using (id)", $rsm);
        $query->setParameter(1, $cat_id);
        $query->setParameter(2, true);
        $query->setParameter(3, $limit);
        $query->setParameter(4, $offset);
        
        
        return $query->getResult();
    }

This method will produce an efficient query that reduces I/O operations inside MySQL, and the query cost will stay minimal even when the number of pages increases.
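
For completeness, a minimal usage sketch in a controller could look like the following (the controller, bundle alias, template name and the 'en' language argument are illustrative assumptions):

<?php

namespace Webit\CMSBundle\Controller;

use Symfony\Bundle\FrameworkBundle\Controller\Controller;

class ContentController extends Controller
{
    public function listAction($cat_id, $page = 1)
    {
        $limit = 10;
        $repository = $this->getDoctrine()->getRepository('WebitCMSBundle:Content');

        //optimized count query, used to calculate the number of pages
        $total_count = $repository->getContentCountByCategory($cat_id);
        $page_count = (int) ceil($total_count / $limit);

        //optimized deferred-join query for the current page rows
        $contents = $repository->getContentQueryPyCategory($cat_id, 'en', $page, $limit);

        return $this->render('WebitCMSBundle:Content:list.html.twig', array(
            'contents' => $contents,
            'page' => $page,
            'page_count' => $page_count,
        ));
    }
}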

After implementing this solution, the queries used to produce page number 3,000 took about 46ms, compared to approximately 440ms with the original paginator code. More importantly, MySQL's memory usage improved, because the queries use the indexes in a more efficient way.

Conclusion

In this post, I discussed the performance problems of pagination in a Doctrine & Symfony2 project, and applied some techniques that handle limit/offset operations in pagination more efficiently. There are also other workarounds that give good performance, like this good post. This piece of code made my application perform better; I hope it helps others who face a similar situation.

Simple web spider with PHP Goutte

Last week we got an SEO analysis of one of our portals. The analysis included thorough statistics about website SEO measures, like missing and duplicate <title>, <h1> and meta tags, broken and invalid links, duplicate content percentage, etc. It appears that the SEO agency that prepared the analysis used some sort of crawler to extract this information.

I liked that crawler idea and wanted to implement it in PHP. After some reading about web scraping and Goutte, I was able to write a similar web spider that extracts the needed information, and I wanted to share it in this post.

About web scraping and Goutte

Web scraping is a technique for extracting information from websites. It is very close to web indexing, because the bot or web crawler that search engines use performs some sort of scraping of web documents: following links, analyzing keywords, meta tags and URLs, and ranking them according to relevancy, popularity, engagement, etc.

Goutte is a screen scraping and web crawling library for PHP. It provides an API to crawl websites and extract data from HTML/XML responses. Goutte is a wrapper around Guzzle and several Symfony components like BrowserKit, DomCrawler and CssSelector.

Here is a short description of some of the libraries that Goutte wraps:

    1. Guzzle: a framework for building RESTful web services; it provides a simple interface to perform cURL requests, along with other important features like persistent connections and streaming request and response bodies.
    2. BrowserKit: simulates the behaviour of a web browser, providing an abstract HTTP layer (request, response, cookies, etc.).
    3. DomCrawler: provides easy methods for DOM navigation and manipulation.
    4. CssSelector: provides an API to select elements using the same selectors used for CSS (it becomes extremely easy to select elements when it works with DomCrawler).
* These are the main components I am interested in for this post; however, other components like Finder and Process are also used in Goutte.

 

Basic usage

Once you download Goutte (from here), you should define a Client object; the client is used to send requests to a website and returns a crawler object, as in the snippet below:

require_once 'goutte.phar';
use Goutte\Client;

$url_to_traverse = 'http://zrashwani.com';

$client = new Client();
$crawler = $client->request('GET', $url_to_traverse);

Here I declared a client object and called request() to simulate a browser requesting the URL "http://zrashwani.com" using the "GET" HTTP method.
The request() method returns an object of type Symfony\Component\DomCrawler\Crawler, which can be used to select elements from the fetched HTML response.

But before processing the document, let's ensure that this URL is a valid link, meaning it returned a response code of 200, using:

$status_code = $client->getResponse()->getStatus();
if($status_code==200){
    //process the documents
}

The $client->getResponse() method returns a BrowserKit Response object that contains information about the response the client got, like the headers (including the status code I used here), the response content, etc.
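
For example, a small sketch using the same client from above (the values in the comments are illustrative):

$response = $client->getResponse();

$status_code = $response->getStatus();                // e.g. 200
$content_type = $response->getHeader('Content-Type'); // e.g. "text/html; charset=UTF-8"
$html = $response->getContent();                      // raw response body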

In order to extract the document title, you can filter either by XPath or by CSS selector to get your target HTML DOM element value:

$crawler->filterXPath('html/head/title')->text()
// $crawler->filter('title')->text()
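
Other SEO-related elements can be read the same way; for example, a small sketch for the meta description (assuming the page defines one):

$description = '';
$meta = $crawler->filter('meta[name="description"]');
if ($meta->count()) {
    $description = trim($meta->attr('content'));
}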

In order to get the number of <h1> tags, and the contents of those tags that exist in the page:

$h1_count = $crawler->filter('h1')->count();
$h1_contents = array();
if ($h1_count) {
    $crawler->filter('h1')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) use(&$h1_contents) { //pass by reference so the array is filled outside the closure
                $h1_contents[$i] = trim($node->text());
        });
}

For SEO purposes, there should be one h1 tag per page, and its content should contain the main keywords of the page. Here the each() function is quite useful: it can be used to loop over all matching elements. each() takes a closure as a parameter to perform some callback operation on each node.

PHP closures are anonymous functions that were introduced in PHP 5.3; they are very useful for callback functionality. You can refer to the PHP manual if you are new to closures.
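
Note that each() also returns an array built from the closure's return values, so the headings above could alternatively be collected like this:

$h1_contents = $crawler->filter('h1')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) {
    return trim($node->text()); //each() gathers these return values into an array
});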

Application goals

After this brief introduction, I can begin explaining the spider functionality. This crawler detects broken/invalid links in the website, along with extracting the <title> and <h1> tag values that are important for the SEO issue that I have.

My simple crawler implements a depth-limited search, in order to avoid crawling large amounts of data, and works as follows:

      1. Read the initial URL to crawl along with the depth of links to be visited.
      2. Crawl the URL and check the response code to determine that the link is not broken, then add it to an array containing the site links.
      3. Extract the <title> and <h1> tag contents in order to use their values later for reporting.
      4. Loop over all <a> tags inside the fetched document to extract their href attributes along with other data.
      5. Check that the depth limit is not reached, that the current href has not been visited before, and that the link URL does not belong to an external site.
      6. Crawl the child link by repeating steps 2-5.
      7. Stop when the link depth is reached.

 

These steps are implemented in the SimpleCrawler class that I wrote (it is still a basic version and should be optimized further):

<?php

require_once 'goutte.phar';

use Goutte\Client;

class simpleCrawler {

    private $base_url;
    private $site_links;
    private $max_depth;

    public function __construct($base_url, $max_depth = 10) {
        if (strpos($base_url, 'http') === false) { // http protocol not included, prepend it to the base url
            $base_url = 'http://' . $base_url;
        }

        $this->base_url = $base_url;
        $this->site_links = array();
        $this->max_depth = $max_depth;
    }

    /**
     * checks the uri if can be crawled or not
     * in order to prevent links like "javascript:void(0)" or "#something" from being crawled again
     * @param string $uri
     * @return boolean
     */
    protected function checkIfCrawlable($uri) {
        if (empty($uri)) {
            return false;
        }

        $stop_links = array(//returned deadlinks
            '@^javascript\:void\(0\)$@',
            '@^#.*@',
        );

        foreach ($stop_links as $ptrn) {
            if (preg_match($ptrn, $uri)) {
                return false;
            }
        }

        return true;
    }

    /**
     * normalize link before visiting it
     * currently just remove url hash from the string
     * @param string $uri
     * @return string
     */
    protected function normalizeLink($uri) {
        $uri = preg_replace('@#.*$@', '', $uri);

        return $uri;
    }

    /**
     * initiate the crawling mechanism on all links
     * @param string $url_to_traverse
     */
    public function traverse($url_to_traverse = null) {
        if (is_null($url_to_traverse)) {
            $url_to_traverse = $this->base_url;

            $this->site_links[$url_to_traverse] = array(//initialize first element in the site_links 
                'links_text' => array("BASE_URL"),
                'absolute_url' => $url_to_traverse,
                'frequency' => 1,
                'visited' => false,
                'external_link' => false,
                'original_urls' => array($url_to_traverse),
            );
        }

        $this->_traverseSingle($url_to_traverse, $this->max_depth);
    }

    /**
     * crawling single url after checking the depth value
     * @param string $url_to_traverse
     * @param int $depth
     */
    protected function _traverseSingle($url_to_traverse, $depth) {
        //echo $url_to_traverse . chr(10);

        try {
            $client = new Client();
            $crawler = $client->request('GET', $url_to_traverse);

            $status_code = $client->getResponse()->getStatus();
            $this->site_links[$url_to_traverse]['status_code'] = $status_code;

            if ($status_code == 200) { // valid url and not reached depth limit yet            
                $content_type = $client->getResponse()->getHeader('Content-Type');                
                if (strpos($content_type, 'text/html') !== false) { //traverse children in case the response in HTML document 
                   $this->extractTitleInfo($crawler, $url_to_traverse);

                   $current_links = array();
                   if (@$this->site_links[$url_to_traverse]['external_link'] == false) { // for internal uris, get all links inside
                      $current_links = $this->extractLinksInfo($crawler, $url_to_traverse);
                   }

                   $this->site_links[$url_to_traverse]['visited'] = true; // mark current url as visited
                   $this->traverseChildLinks($current_links, $depth - 1);
                }
            }
            
        } catch (Guzzle\Http\Exception\CurlException $ex) {
            error_log("CURL exception: " . $url_to_traverse);
            $this->site_links[$url_to_traverse]['status_code'] = '404';
        } catch (Exception $ex) {
            error_log("error retrieving data from link: " . $url_to_traverse);
            $this->site_links[$url_to_traverse]['status_code'] = '404';
        }
    }

    /**
     * after checking the depth limit of the links array passed
     * check if the link if the link is not visited/traversed yet, in order to traverse
     * @param array $current_links
     * @param int $depth     
     */
    protected function traverseChildLinks($current_links, $depth) {
        if ($depth == 0) {
            return;
        }

        foreach ($current_links as $uri => $info) {
            if (!isset($this->site_links[$uri])) {
                $this->site_links[$uri] = $info;
            } else{
                $this->site_links[$uri]['original_urls'] = isset($this->site_links[$uri]['original_urls'])?array_merge($this->site_links[$uri]['original_urls'], $info['original_urls']):$info['original_urls'];
                $this->site_links[$uri]['links_text'] = isset($this->site_links[$uri]['links_text'])?array_merge($this->site_links[$uri]['links_text'], $info['links_text']):$info['links_text'];
                if(@$this->site_links[$uri]['visited']) { //already visited link)
                    $this->site_links[$uri]['frequency'] = @$this->site_links[$uri]['frequency'] + @$info['frequency'];
                }
            }

            if (!empty($uri) && 
                !$this->site_links[$uri]['visited'] && 
                !isset($this->site_links[$uri]['dont_visit'])
                ) { //traverse those that not visited yet                
                $this->_traverseSingle($this->normalizeLink($current_links[$uri]['absolute_url']), $depth);
            }
        }
    }

    /**
     * extracting all <a> tags in the crawled document, 
     * and return an array containing information about links like: uri, absolute_url, frequency in document
     * @param Symfony\Component\DomCrawler\Crawler $crawler
     * @param string $url_to_traverse
     * @return array
     */
    protected function extractLinksInfo(Symfony\Component\DomCrawler\Crawler &$crawler, $url_to_traverse) {
        $current_links = array();
        $crawler->filter('a')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) use (&$current_links) {
                    $node_text = trim($node->text());
                    $node_url = $node->attr('href');
                    $hash = $this->normalizeLink($node_url);

                    if (!isset($this->site_links[$hash])) {  
                        $current_links[$hash]['original_urls'][$node_url] = $node_url;
                        $current_links[$hash]['links_text'][$node_text] = $node_text;
                        
                        if (!$this->checkIfCrawlable($node_url)) {
                            //dead link (e.g. javascript:void(0) or #hash): no absolute url to build
                        } elseif (!preg_match("@^http(s)?@", $node_url)) { //not absolute link
                            $current_links[$hash]['absolute_url'] = $this->base_url . $node_url;
                        } else {
                            $current_links[$hash]['absolute_url'] = $node_url;
                        }

                        if (!$this->checkIfCrawlable($node_url)) {
                            $current_links[$hash]['dont_visit'] = true;
                            $current_links[$hash]['external_link'] = false;
                        } elseif ($this->checkIfExternal($current_links[$hash]['absolute_url'])) { // mark external url as marked                            
                            $current_links[$hash]['external_link'] = true;
                        } else {
                            $current_links[$hash]['external_link'] = false;
                        }
                        $current_links[$hash]['visited'] = false;
                        
                        $current_links[$hash]['frequency'] = isset($current_links[$hash]['frequency']) ? $current_links[$hash]['frequency'] + 1 : 1; // increase the counter
                    }
                    
                });

        if (isset($current_links[$url_to_traverse])) { // if page is linked to itself, ex. homepage
            $current_links[$url_to_traverse]['visited'] = true; // avoid cyclic loop                
        }
        return $current_links;
    }

    /**
     * extract information about document title, and h1
     * @param Symfony\Component\DomCrawler\Crawler $crawler
     * @param string $uri
     */
    protected function extractTitleInfo(Symfony\Component\DomCrawler\Crawler &$crawler, $url) {
        $this->site_links[$url]['title'] = trim($crawler->filterXPath('html/head/title')->text());

        $h1_count = $crawler->filter('h1')->count();
        $this->site_links[$url]['h1_count'] = $h1_count;
        $this->site_links[$url]['h1_contents'] = array();

        if ($h1_count) {
            $crawler->filter('h1')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) use($url) {
                        $this->site_links[$url]['h1_contents'][$i] = trim($node->text());
                    });
        }
    }

    /**
     * getting information about links crawled
     * @return array
     */
    public function getLinksInfo() {
        return $this->site_links;
    }

    /**
     * check if the link leads to external site or not
     * @param string $url
     * @return boolean
     */
    public function checkIfExternal($url) {
        $base_url_trimmed = str_replace(array('http://', 'https://'), '', $this->base_url);

        if (preg_match("@http(s)?\://$base_url_trimmed@", $url)) { //base url is not the first portion of the url
            return false;
        } else {
            return true;
        }
    }

}

?>

and you can try this class as follows:

$simple_crawler = new simpleCrawler($url_to_crawl, $depth);    
$simple_crawler->traverse();    
$links_data = $simple_crawler->getLinksInfo();

The getLinksInfo() method returns an associative array containing information about each page crawled, such as the URL of the page, the <title> and <h1> tag contents, the status_code, etc. You can store these results in any way you like; I prefer MySQL for simplicity, so that I can get the desired results with a query, so I created a pages_crawled table as follows:

CREATE TABLE `pages_crawled` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(255) DEFAULT NULL,
  `frequency` int(11) unsigned DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  `status_code` int(11) DEFAULT NULL,
  `h1_count` int(11) unsigned DEFAULT NULL,
  `h1_content` text,
  `source_link_text` varchar(255) DEFAULT NULL,
  `original_urls` text,
  `is_external` tinyint(1) DEFAULT '0',
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=37 DEFAULT CHARSET=utf8

And here I store the traversed links into the MySQL table:

<?php 
error_reporting(E_ALL);
set_time_limit(300);
include_once ('../src/SimpleCrawler.php');

$url_to_crawl = $argv[1];
$depth = isset($argv[2])?$argv[2]:3;

if($url_to_crawl){
    
    echo "Begin crawling ".$url_to_crawl.' with links in depth '.$depth.chr(10);
    
    $start_time = time();    
    $simple_crawler = new simpleCrawler($url_to_crawl, $depth);    
    $simple_crawler->traverse();    
    $links_data = $simple_crawler->getLinksInfo();
       
    $end_time = time();
    
    $duration = $end_time - $start_time;
    echo 'crawling approximate duration, '.$duration.' seconds'.chr(10);
    echo count($links_data)." unique links found".chr(10);
    
    mysql_connect('localhost', 'root', 'root');
    mysql_select_db('crawler_database');
    foreach($links_data as $uri=>$info){
        
        if(!isset($info['status_code'])){
            $info['status_code']=000;//tmp
        }
        
        //escape string values so quotes in titles/links don't break the insert query
        $h1_contents = mysql_real_escape_string(implode("\n\r", isset($info['h1_contents'])?$info['h1_contents']:array() ));
        $original_urls = mysql_real_escape_string(implode("\n\r", isset($info['original_urls'])?$info['original_urls']:array() ));
        $links_text = mysql_real_escape_string(implode("\n\r",  isset($info['links_text'])?$info['links_text']:array() ));
        $is_external = $info['external_link']?'1':'0';
        $title = mysql_real_escape_string(@$info['title']);
        $h1_count = isset($info['h1_count'])?$info['h1_count']:0;
        
        $sql_query = "insert into pages_crawled(url, frequency, status_code, is_external, title, h1_count, h1_content, source_link_text, original_urls)
values('".mysql_real_escape_string($uri)."', {$info['frequency']}, {$info['status_code']}, {$is_external}, '{$title}', {$h1_count}, '$h1_contents', '$links_text', '$original_urls')";
        
        mysql_query($sql_query) or die($sql_query);
    }
}

 

Running the spider

Now let me try out the spider on my blog URL, with a link depth of 2:

C:\xampp\htdocs\Goutte\web>php -f test.php zrashwani.com 2

Now I can get the important information that I need using simple SQL queries on the pages_crawled table, as follows:

mysql> select count(*) from pages_crawled where h1_count >1;
+----------+
| count(*) |
+----------+
|       30 |
+----------+
1 row in set (0.01 sec)

mysql> select count(*) as c, title from pages_crawled group by title having c>1;

+---+----------------------------------------------------------+
| c | title                                                    |
+---+----------------------------------------------------------+
| 2 | Z.Rashwani Blog | I write here whatever comes to my mind |
+---+----------------------------------------------------------+
1 row in set (0.02 sec)

In the first query, I returned the number of pages with more than one h1 tag (I found a lot; I will consider changing the HTML structure of my blog a little bit);
in the second one, I returned the duplicated page titles.
Now we can get many other statistics on the traversed pages using the information we collected.

Conclusion

In this post I explained how to use Goutte for web scraping, using a real-world example that I encountered in my job. Goutte can easily be used to extract a great amount of information about any webpage through its easy API for requesting pages, analyzing the responses and extracting specific data from the DOM document.

I used Goutte to extract some information that can be used as SEO measures for the specified website, and stored it in a MySQL table in order to query any report or statistic derived from it.

Update

Thanks to Josh Lockhart, this code has been modified for Composer and Packagist and is now available on GitHub: https://github.com/codeguy/arachnid