Tag Archives: php

Laravel collections usage with crawler links

Recently I have been maintaining a package for crawling website internal links called Arachnid. This package crawls a website's links and extracts information about each page, including: title, meta tags, h1 tags, and status code, along with other info, and it returns the link information as an array. I was searching for a convenient way to extract meaningful summary information from the library output, and that's when I came across Laravel Collections, which turned out to be really useful in my case.

Basic Information

Arachnid library returns information about website links similar to the structure below:

 
  "/" => array:14 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://zrashwani.com/"
    "external_link" => false
    "visited" => true
    "frequency" => 1
    "source_link" => "http://zrashwani.com"
    "depth" => 1
    "status_code" => 200
    "title" => "Z.Rashwani Blog - I write here whatever comes to my mind"
    "meta_keywords" => "Zeid Rashwani, Zaid Rashwani, zrashwani, web development, LAMP, PHP, mysql, Linux, Symfony2, apache, DBA"
    "meta_description" => "Zeid Rashwani personal blog, mostly contains technical topics related to web development using, LAMP, Linux, Apache, MySQL, PHP, and other open source technologies"
    "h1_count" => 3
    "h1_contents" => array:3 [ …3]
  ]
  "" => array:7 [▼
    "original_urls" => array:3 [ …3]
    "links_text" => array:2 [ …2]
    "visited" => false
    "dont_visit" => true
    "external_link" => false
    "source_link" => "http://zrashwani.com"
    "depth" => 1
  ]
  //... other links

As shown, each element stores information about a link, including:
* original_urls: the original URLs before normalization.
* links_text: the text inside the <a> tags where the link appears.
* absolute_url: the absolute version of the URL.
* visited: whether this link has been visited/crawled or not.
* frequency: how many times the link appears in the website.
* source_link: the page from which this link was first crawled.
* depth: the level/depth at which the link was found.
* status_code: HTTP status code of the page (e.g. 200, 404, etc.).
* title, meta_description, meta_keywords: meta information about the page.
* h1_count, h1_contents: the number of <h1> tags and their contents.
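
For example, assuming the array above is stored in $links (as returned by the library), a single entry can be inspected directly:

<?php
$homepage = $links['/'];
echo $homepage['status_code'];      // 200
echo $homepage['title'];            // the page <title> contents
print_r($homepage['h1_contents']);  // contents of the <h1> tags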

Main Functions

Laravel collections offer a large set of functions; they are mainly wrappers over functional methods like filter, map, reduce, and each.
The main methods that I used in my case are the following:

  • filter: filter out any elements of an array that you don’t want.
  • map: transform each item in an array into something else.
  • each: loop over collection elements.

Those methods represent the main functional building blocks; they are wrappers over native PHP functions like array_walk, array_filter, and array_map, but with a cleaner OOP approach that can easily be used with method chaining. These functions are also higher-order functions that take a callback as a parameter.
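
For example, here is a small generic sketch (not tied to the crawler output yet) comparing the native functions with the collection wrapper; the collection version chains naturally from left to right:

<?php
$numbers = array(1, 2, 3, 4, 5, 6);

// native PHP functions, composed inside-out
$evenSquares = array_map(function($n) {
    return $n * $n;
}, array_filter($numbers, function($n) {
    return $n % 2 === 0;
}));

// Laravel collection: the same logic, chained left-to-right
$evenSquares = collect($numbers)
        ->filter(function($n) {
            return $n % 2 === 0;
        })
        ->map(function($n) {
            return $n * $n;
        })
        ->values()
        ->all();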

As a basic example, the links collection can be created as follows:

<?php 
$url = 'http://zrashwani.com/'; //website url to be crawled 
$crawler = new \Arachnid\Crawler($url, 3);

$crawler->traverse();
$links = $crawler->getLinks();
$collection = collect($links);

Manipulate using Collections

1. Arrange links by page source:
Using the groupBy function, the links can be arranged so that each parent page URL becomes a key, and the value is the collection of link info that the parent page contains:

<?php
$linksBySource = $collection->groupBy('source_link');

The output will be an associative array with each source page as a key and the collection of links that page contains as a value, similar to the result below:
LinksCollection {#276 ▼
  #items: array:18 [▼
    "" => LinksCollection {#261 ▶}
    "http://zrashwani.com/" => LinksCollection {#117 ▼
      #items: array:144 [▶]
    }
    "http://zrashwani.com/category/technical-topics/" => LinksCollection {#84 ▼
      #items: array:2 [▶]
    }
    "http://zrashwani.com/category/anime-reviews/" => LinksCollection {#124 ▶}
    "http://zrashwani.com/about-me/" => LinksCollection {#168 ▶}
    "http://zrashwani.com/wcf-ssl-service-with-php/" => LinksCollection {#207 ▶}
    "http://zrashwani.com/tag/php/" => LinksCollection {#212 ▶}
    "http://zrashwani.com/author/zaid/" => LinksCollection {#295 ▶}
    "http://zrashwani.com/sonata-admin-bundle-multiple-connection/" => LinksCollection {#213 ▶}
    "http://zrashwani.com/tag/database/" => LinksCollection {#52 ▶}
    "http://zrashwani.com/tag/symfony/" => LinksCollection {#140 ▶}
    "http://zrashwani.com/applying-version-stamp-symfony-sonataadmin/" => LinksCollection {#239 ▶}
    "http://zrashwani.com/server-sent-events-example-laravel/" => LinksCollection {#145 ▶}
    "http://zrashwani.com/pagination-optimization-symfony2-doctrine/" => LinksCollection {#162 ▶}
    "http://zrashwani.com/tag/mysql/" => LinksCollection {#282 ▶}
    "http://zrashwani.com/materialized-views-example-postgresql-9-3/" => LinksCollection {#219 ▶}
    "http://zrashwani.com/simple-web-spider-php-goutte/" => LinksCollection {#270 ▶}
    "http://zrashwani.com/page/2/" => LinksCollection {#218 ▶}
  ]
}

2. Get External Links:
The external links can be retrieved by using the filter function on the “external_link” key, as below:

<?php
$externalLinks = $collection->filter(function($link_info){
            return isset($link_info['external_link'])===true 
                       && $link_info['external_link']===true;
        });
        

The output will include all external links in the website as below:
LinksCollection {#64 ▼
  #items: array:112 [▼
    "http://www.wewebit.com" => array:14 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "http://www.wewebit.com"
      "external_link" => true
      "visited" => true
      "frequency" => 1
      "source_link" => "http://zrashwani.com/"
      "depth" => 1
      "status_code" => 200
      "title" => "WeWebit - Web Development specialists"
      "meta_keywords" => "Wewebit, Website, Mt4, Forex web design , Development, Forex website design ,Forex web development ,forex bo system , Forex Backoffice Systems ,forex CRM , FX toolskit, forex toolskit , Forex client area , forex client cabinet , members cabinet , forex IB system , ecommerce website development"
      "meta_description" => "Web Development Company in jordan providing development and design services, including Forex Solutions , News Portals , Custom Web Applications, online e-commerce solutions"
      "h1_count" => 2
      "h1_contents" => array:2 [▶]
    ]
    "http://jo.linkedin.com/pub/zaid-al-rashwani/14/996/180/" => array:10 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "http://jo.linkedin.com/pub/zaid-al-rashwani/14/996/180/"
      "external_link" => true
      "visited" => false
      "frequency" => 1
      "source_link" => "http://zrashwani.com/"
      "depth" => 1
      "status_code" => 999
      "error_message" => 999
    ]
    "https://twitter.com/zaid_86" => array:14 [▶]
    ...
  ]
}    

3. Filter and search links by depth:
This is useful for getting the links that first appear at the nth level of the website, for example getting the links at depth=3, as below:

<?php
$depth = 3;
$depthLinks = $collection->filter(function($link) use($depth){
            return isset($link['depth']) && $link['depth'] == $depth;
        });
        

The output will be similar to this:
LinksCollection {#734 ▼
  #items: array:141 [▼
    "/introduction-to-sphinx-with-php-part2/" => array:8 [▶]
    "/category/technical-topics/page/2/" => array:8 [▶]
    "/anime-watched-in-summer-2015/" => array:8 [▶]
    "/anime-fall-2015-shows/" => array:8 [▶]
    "/anime-winter-2015-watch-list/" => array:8 [▶]
    "/anime-winter-2014-watch-list/" => array:8 [▶]
    

Or you can get simple statistics about how many links exist at each level of the website by combining the groupBy function with the mapWithKeys function – which is like map but returns key/value pairs – as below:

<?php
$linksGroupedByDepth = $collection->groupBy('depth')
        ->mapWithKeys(function($depthGroup,$depth){
            return [$depth =>$depthGroup->count()];
        });
        

It will display how many links exist at each site level:
LinksCollection {#824 ▼
  #items: array:4 [▼
    0 => 1
    1 => 75
    2 => 300
    3 => 141
  ]
}

4. Get Broken links:
Broken links can be retrieved by filtering items according to status code; successful pages have a status code between 200 and 299, so anything else is considered a broken link, as below:

<?php 
$brokenLinks = $collection->filter(function($link){
    return isset($link['status_code']) && 
            ($link['status_code'] >= 300 || $link['status_code'] <200);
});

Or, better, the broken links can be grouped according to the page where each link appears, using the groupBy function, and only summary information extracted using the map function:
 <?php
        $brokenLinksBySource = $collection->filter(function($link){
            return isset($link['status_code']) && 
                    ($link['status_code'] >= 300 || $link['status_code'] <200);
        })->map(function($link){
           return [
                'source_page' => $link['source_link'],
                'link'        => $link['absolute_url'],
                'status_code' => $link['status_code'],
                'links_text'  => $link['links_text'],
               ];
        })
        ->unique('link')
        ->groupBy('source_page'); 
        

5. Getting pages that have no title or h1 tags:
This is useful for SEO purposes, and can be done using filter method:

<?php
$linksWithMissingTitle = $collection->filter(function($link_info){
           return empty($link_info['title']); 
        });

<?php
//getting pages with no <h1> tags
$missingH1Pages = $collection->filter(function($link_info){
           return $link_info['h1_count']==0; 
        });
        

The output will contain the matching pages, as below:
LinksCollection {#823 ▼
  #items: array:2 [▼
    "http://wordcomat.com" => array:14 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "http://wordcomat.com"
      "external_link" => true
      "visited" => true
      "frequency" => 2
      "source_link" => "http://zrashwani.com/simple-web-spider-php-goutte/"
      "depth" => 2
      "status_code" => 200
      "title" => ""
      "meta_keywords" => ""
      "meta_description" => ""
      "h1_count" => 1
      "h1_contents" => array:1 [▶]
    ]
    "http://matword.com" => array:14 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:2 [▶]
      "absolute_url" => "http://matword.com"
      "external_link" => true
      "visited" => true
      "frequency" => 4
      "source_link" => "http://zrashwani.com/simple-web-spider-php-goutte/"
      "depth" => 2
      "status_code" => 200
      "title" => ""
      "meta_keywords" => ""
      "meta_description" => ""
      "h1_count" => 0
      "h1_contents" => []
    ]
  ]
}

6. Getting duplicate titles between different URLs:
This is useful to see whether different pages share the same title – which may affect your SEO negatively – by combining several methods: filter, groupBy, unique, and map, as follows:

<?php
        $duplicateTitlePages = $collection
                   ->filter(function($linkInfo){
                       return $linkInfo['visited']===true;
                   })                   
                   ->groupBy('title',true)
                   ->unique('absolute_url')        
                   ->filter(function($links){
                       return count($links)>1;
                   })->map(function($linkGroup){
                       return $linkGroup->map(function($linkInfo,$uri){
                            return $uri;
                       })->values();
                   });

The output will be a collection with each duplicate title as a key, along with the URLs that share that title, as in the following output:
LinksCollection {#652 ▼
  #items: array:1 [▼
    "Z.Rashwani Blog - I write here whatever comes to my mind" => LinksCollection {#650 ▼
      #items: array:2 [▼
        0 => "http://zrashwani.com/"
        1 => "/"
      ]
    }
  ]
}

More

There are many other ways to combine Laravel collection methods to get useful information from such a link-info array, like getting links with internal server errors or links with long/short meta descriptions, etc. The more accustomed you get to using collections, the more useful information you can retrieve.
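
For instance, the two examples just mentioned could be sketched roughly like this (illustrative only; the 160-character threshold is an arbitrary choice):

<?php
// links that returned a server error (5xx) response
$serverErrorLinks = $collection->filter(function($link) {
    return isset($link['status_code']) && $link['status_code'] >= 500;
});

// pages whose meta description is longer than ~160 characters
$longMetaDescriptions = $collection->filter(function($link) {
    return isset($link['meta_description'])
            && strlen($link['meta_description']) > 160;
});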

WCF SSL Service with PHP

We had a task recently that required our team – me and my colleague Ahmad – to write PHP code to integrate with an existing WCF web service, including attaching SSL certificates to the requests. The application integrates with a third-party banking system through a form of B2B web service.
The following post describes the main steps we used to write PHP code compatible with WCF:

Existing .net application

The original code was written in C# using a WCF web service over SOAP; it attached an SSL certificate in PFX format – which includes the whole certificate chain – and a separate private key file (in .key format).

The PFX format (PKCS#12) is a binary format usually used on Windows to export/import SSL certificates; it stores the certificate, any intermediate certificates, and the private key in one file that can be encrypted and signed.

In the original C# code, they defined a class that inherits SoapHttpClientProtocol, which was used to add the SSL certificates to the SOAP request.

Generating SSL files

The existing code attached the SSL certificate as a .pfx file, so I converted it to .pem format (the standard format for openssl) and extracted the key into a separate file using the following openssl commands:

openssl pkcs12 -in certificate.pfx -nocerts -out key.pem -nodes
openssl pkcs12 -in certificate.pfx -nokeys -out certificate.pem
openssl rsa -in key.pem -out certificate.key 

Those commands will generate the public SSL certificate (*.pem) and the private SSL key (*.key) files.

To make sure the generated certificates are usable from PHP, I wrote this basic function to test them:
<?php
function validatePublicPrivateKeys($public_key_file, $private_key_file) {
    $public = openssl_pkey_get_public(file_get_contents($public_key_file)); 
    $public_error = openssl_error_string();
    if(!empty($public_error)){
        echo "Error in public key:".$public_error."\n";
    }else{
        echo "public key is valid\n";
    }

    $private = openssl_pkey_get_private(file_get_contents($private_key_file), 'passphrase-here');
    $private_error = openssl_error_string();
    if(!empty($private_error)){
        echo "Error in private key:".$private_error;
    }else{
        echo "private key is valid\n";
    }
}
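
Calling it with the files generated above (the passphrase hard-coded inside the function should match your key's passphrase) simply prints whether each file loads correctly:

<?php
validatePublicPrivateKeys('certificate.pem', 'certificate.key');
// expected output:
// public key is valid
// private key is valid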

Extending SoapClient Class

Normally an SSL certificate can be used in a PHP SOAP request by setting the `local_cert` parameter in the SoapClient constructor. However, I found this option somewhat limited, because there is no way to attach the private key as a separate file in the request.
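
For reference, the standard approach looks roughly like this (the WSDL URL and file path are placeholders); local_cert expects a single PEM file that contains both the certificate and the private key:

<?php
$client = new \SoapClient('https://example.com/service?wsdl', array(
    'local_cert'   => '/full/path/to/certificate-with-key.pem',
    'passphrase'   => 'your-passphrase',
    'soap_version' => SOAP_1_2,
));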
So what we did is extend SoapClient and override the __doRequest method to send the SOAP request as an HTTP message via curl, as follows:

class MySoapClient extends \SoapClient{

      public function __doRequest($request, $location, $action, $version, $one_way = FALSE) {
            $curl = curl_init($location);
            //setting curl options and data here...
            //...
            
    }
}      

Depending on the SOAP version, the curl header values need to be overridden; in my case the version used is SOAP 1.2, so the headers will be as follows:
    $curl = curl_init($location);
    $headers = array(
        "Content-type: test/xml;charset=\"utf-8\";action=\"" . $location . '/' . $action . "\"",
        "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
        "Cache-Control: no-cache",
        "Pragma: no-cache",
        "Content-length: " . strlen($request),
    ); 
        
    curl_setopt($curl, CURLOPT_HEADER, TRUE);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);        

Attaching SSL certificates

I set the SSL public certificate file, the key file – generated earlier – and the passphrase on the curl request as follows:

    curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);
        
    curl_setopt($curl, CURLOPT_SSLKEYPASSWD, 'passphrase');
    curl_setopt($curl, CURLOPT_SSLKEY, 'your-private-key.key');
    curl_setopt($curl, CURLOPT_SSLCERT, 'your-certificate.pem');

In case the curl error “Peer certificate cannot be authenticated with known CA certificates” appears – which usually happens on Windows – you should download the CA certificate bundle from Mozilla – or another trusted source – save it on your system, and set the CURLOPT_CAINFO option in curl:
curl_setopt($curl, CURLOPT_CAINFO, "C:\full-path-to\cacert.pem");
or, better, set it globally in your php.ini:
curl.cainfo=c:\full-path-to\cacert.pem

The final class implementation will be like the following snippet:

<?php
class MySoapClient extends \SoapClient{
      public $certificate_ssl_location = "/full/path/to/your-ssl-public-certificate";
      public $private_key_location = "/full/path/to/your-ssl-private-certificate";
      public $ssl_passphrase = "password";
      public $ca_cert_file = "/full/path/to/ca-file";
      
      public function __doRequest($request, $location, $action, $version, $one_way = FALSE) {

        // Call via Curl and use the timeout
        $curl = curl_init($location);

        $headers = array(
            "Content-type: test/xml;charset=\"utf-8\";action=\"" . $location . '/' . $action . "\"",
            "Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5",
            "Cache-Control: no-cache",
            "Pragma: no-cache",
            "Content-length: " . strlen($request),
        ); //SOAPAction: your op URL

        curl_setopt($curl, CURLOPT_VERBOSE, TRUE);
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($curl, CURLOPT_POST, TRUE);
        curl_setopt($curl, CURLOPT_POSTFIELDS, $request);
        curl_setopt($curl, CURLOPT_HEADER, TRUE);
        curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);

        curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, 2);
        
        curl_setopt($curl, CURLOPT_SSLKEYPASSWD, $this->ssl_passphrase);
        curl_setopt($curl, CURLOPT_SSLKEY, $this->private_key_location);
        curl_setopt($curl, CURLOPT_SSLCERT, $this->certificate_ssl_location);
        curl_setopt($curl, CURLOPT_CAINFO, $this->ca_cert_file);

        $response = curl_exec($curl);

        if (curl_errno($curl)) {
            throw new Exception(curl_error($curl));
        }
        curl_close($curl);

        $soap_start = strpos($response, "<soapenv:Envelope");
        $soap_response = substr($response, $soap_start);

        if (!$one_way) {
            return $soap_response;
        }
    }
}

You can issue SOAP requests normally by instantiating the MySoapClient class (instead of the native SoapClient class), and now the integration with the secured WCF web service works fine 🙂
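
For example (the WSDL URL and operation name below are placeholders for the real banking endpoint):

<?php
$client = new MySoapClient('https://example.com/BankingService.svc?wsdl', array(
    'soap_version' => SOAP_1_2,
    'trace'        => true,
));

// __doRequest() above handles the curl transport and SSL certificates transparently
$result = $client->__soapCall('SomeOperation', array($parameters));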

Applying version stamp to Symfony & SonataAdmin

I spend a lot of time in my job developing and maintaining back-office systems that process workflow logic for Forex portals. Our systems are developed using Symfony2 and depend heavily on SonataAdminBundle, with deep customization to impose the business rules of Forex companies.

Recently some data inconsistency appeared in one of our clients' systems, and after digging into the logs, I found that the cause of the problem was two users processing the same user application at almost the same time, which caused one edit operation to override the other, with many other undesired consequences following from that.

So in order to fix this issue, and prevent it from happening again, I worked on adding “Version Stamps” to my entity to maintain offline data consistency within the application, and I would like to share here what I learned.

 

Version Stamps and offline consistency

A version stamp is a field that changes every time a write operation is performed on the data; it is used to ensure that no one else has changed that row's data before your modification is applied.
There are several ways to implement version stamps; the simplest is an integer value that is noted on read, compared against the submitted data before the write, and, once validated, incremented as part of the write operation.
Let's say there is a form bound to an entity in a Symfony application. The version stamp column is present in the entity and added as a hidden field in the form; once the form is submitted, the submitted version stamp value is compared to the one currently in the database, to ensure that no other edit was performed on that entity in the time between displaying your form and submitting it. If the check fails, the update operation is rejected by adding an error via a constraint.
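
To make the idea concrete before the SonataAdmin implementation, here is a minimal conceptual sketch of the check-and-increment step, assuming a plain PDO connection $pdo and a hypothetical profiles table with a version_stamp column (illustration only, not the Doctrine-based code used below):

<?php
// conceptual sketch: the UPDATE succeeds only if the row still holds
// the version stamp that was read when the form was displayed
$stmt = $pdo->prepare(
    'UPDATE profiles
        SET name = :name, version_stamp = version_stamp + 1
      WHERE id = :id AND version_stamp = :expected_version'
);
$stmt->execute(array(
    'name'             => $newName,
    'id'               => $id,
    'expected_version' => $versionStampFromForm,
));

if ($stmt->rowCount() === 0) {
    // someone else modified the row in the meantime; reject this edit
}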

Implementation in SonataAdmin

The implementation of this concept is quite simple as following:
In the desired entity, I applied those modifications:

  1. Define new field to hold stamp value.
  2. Mark the field as a version stamp, using the Doctrine @Version annotation; this will cause Doctrine to update the versionStamp field every time the object is persisted to the database.
<?php

namespace Webit\ForexCoreBundle\Entity;

use Doctrine\ORM\Mapping as ORM;

/**
 * RealProfile
 *
 * @ORM\Table(name="forex_real_profile")
 * @ORM\Entity(repositoryClass="Webit\ForexCoreBundle\Repository\RealProfileRepository")
 * @ORM\HasLifecycleCallbacks()
 */
class RealProfile
{
    
    /**
     * @var integer
     *
     * @ORM\Column(name="id", type="integer", nullable=false)
     * @ORM\Id
     * @ORM\GeneratedValue(strategy="IDENTITY")
     */
    private $id;
    
    /**
     * @var integer
     *
     * @ORM\Version 
     * @ORM\Column(name="version_stamp", type="integer")
     */
    private $versionStamp;
    
    /* Other columns, getters, setters here */

    
}

In Admin class, the following is added:

  1. In the configureFormFields() method, the version stamp is added as a hidden field; I also set the mapped option to false, to prevent its value from being persisted along with the form. The version stamp value should only be changed by Doctrine itself when the entity is persisted.
    <?php
    
    namespace Webit\ForexCoreBundle\Admin;
    
    use Sonata\AdminBundle\Admin\Admin;
    use Sonata\AdminBundle\Datagrid\ListMapper;
    use Sonata\AdminBundle\Datagrid\DatagridMapper;
    use Webit\ForexCoreBundle\Entity\RealProfile;
    
    class RealAccountsAdmin extends Admin
    {
        protected function configureFormFields(\Sonata\AdminBundle\Form\FormMapper $formMapper)
        {
            $formMapper->add('versionStamp', 'hidden', array('attr' => array('hidden' => true, 'mapped' => false)));
            //other fields and groups...
        }
    }
  2. Here is the important point: validating the versionStamp posted from the form against the one already saved in the database. There are two ways to do that: one is using the Doctrine locking mechanism, and the other is using Sonata inline validation, implementing the validate() method to add an extra validation layer.

    option 1:

    class RealAccountsAdmin extends Admin
    {
        /**
         * {@inheritdoc}
         */    
        public function getObject($id)
        {
            $uniqid = $this->getRequest()->query->get('uniqid');
            $formData = $this->getRequest()->request->get($uniqid);        
            
            $object = $this->getModelManager()->find($this->getClass(), 
                    $id, 
                    \Doctrine\DBAL\LockMode::PESSIMISTIC_WRITE, 
                    $formData['versionStamp']);
            
            foreach ($this->getExtensions() as $extension) {
                $extension->alterObject($this, $object);
            }
    
            return $object;
        }
    
        /**
         * {@inheritdoc}
         */      
        public function update($object) {
            try{
                parent::update($object);
            }catch(\Doctrine\ORM\OptimisticLockException $e) {
                $this->getConfigurationPool()->getContainer()->get('session')
                        ->getFlashBag()->add('sonata_flash_error', 'someone modified the object in between');
            }
        }
    }

    This approach takes advantage of Doctrine's locking support. Here is a brief explanation:

    • I have overridden the getObject() method in the admin class to pass two extra parameters to the getModelManager()->find() method;
      the third parameter indicates the locking type (I used LockMode::PESSIMISTIC_WRITE here),
      and the fourth parameter is the expected version stamp value, to compare with the database value before flushing.
    • I have overridden the update($object) method so that I can catch the OptimisticLockException and handle it by adding an error flash message.

    option 2:
    In this approach, I used sonata inline validation to detect the form as invalid before even trying to persist and flush the object to the database:

        /**
         * {@inheritdoc}
         */
        public function validate(\Sonata\AdminBundle\Validator\ErrorElement $errorElement, $object) { 
            //get all submitted data (with non-mapped fields)
            $uniqid = $this->getRequest()->query->get('uniqid');
            $formData = $this->getRequest()->request->get($uniqid);
            $submitted_version_stamp = $formData['versionStamp'];        
            
            $em = $this->getConfigurationPool()
                  ->getContainer()->get('doctrine')->getManager();
            
            //get up-to-date version stamp value from the database
            $class_name = get_class($object);        
            $q = $em->createQuery("select partial o.{id,versionStamp} from $class_name o"
                                    . " where o.id=".$object->getId());        
            $saved_data = $q->getArrayResult();        
            $saved_version_stamp = $saved_data[0]['versionStamp'];
            
            //compare version stamps and add violation in case it didn't match
            if($saved_version_stamp != $submitted_version_stamp){
                $errorElement->addViolation('Record data seems outdated, probably someone else modified it, please refresh and try again.')->end();
            }
        }
        
    Here are more details about the operations performed inside this method:

    • To get versionStamp value submitted via form inside that method, I used:
      $uniqid = $this->getRequest()->query->get('uniqid');
      $formData = $this->getRequest()->request->get($uniqid);
      $submitted_version_stamp = $formData['versionStamp'];
    • To get the up-to-date value of versionStamp stored in the database, I used a Doctrine query that retrieves a partial object:
      $em = $this->getConfigurationPool()
                 ->getContainer()->get('doctrine')->getManager();
              
      $class_name = get_class($object);        
      $q = $em->createQuery("select partial o.{id,versionStamp} from $class_name o"
              . " where o.id=".$object->getId());        
      $saved_data = $q->getArrayResult();        
      $saved_version_stamp = $saved_data[0]['versionStamp'];

      *If you retrieve the whole object from the database again, it will cause many issues, especially if the Doctrine result cache is enabled.
    • Then compare the two values; if they are not equal, an error will be shown to the user, which is done by calling the $errorElement->addViolation() method
      if($saved_version_stamp != $submitted_version_stamp){
          $errorElement->addViolation('Record data seems outdated, probably someone else modified it, please refresh and try again.')->end();
      }

That’s all, now I can perform some basic test.

Test Solution

In order to verify that this mechanism solved the issue, I emulated the inconsistent behavior by opening the Sonata admin edit page in two browsers, then modifying the data and submitting it in each browser in turn.
The browser whose form was submitted last will not save its data, and this error will appear:
“Record data seems outdated, probably someone else modified it, please refresh and try again.”
[Screenshot: version stamp test]
In that way, the inconsistency is prevented by stopping the second user from overriding the information modified by the other user.

At last

The “Version Stamp” approach helped me prevent data inconsistency in my Symfony/SonataAdmin application; I hope it will help others who face a similar scenario. I would like to hear if anyone has another idea or a better way to handle this issue.

Server Sent Events example with laravel

Recently I read about HTML5 Server Sent Events, and liked the concept of establishing long-lived connections to the server, instead of performing frequent Ajax calls to pull updates. I wanted to put it into action by implementing a live currency rates widget with Laravel as the backend PHP application.

Basic Introduction

What are “Server Sent Events”?
As Wikipedia defines

Server-sent events (SSE) is a technology for a browser to get automatic updates from a server via HTTP connection. The Server-Sent Events EventSource API is standardized as part of HTML5 by the W3C.

Basically, it's an HTML5 technology that lets the web client receive data from the server over one connection that stays open for a long interval, sending a stream of data to the browser without closing the connection (basically the connection remains active until the browser closes it); such a technique is useful for pushing news updates, automatically sending updates in a social network, populating live prices components, etc.

The older approach is called Ajax long polling, which requests the updates from the web client by issuing frequent separate requests (initiating Ajax requests recursively with a timeout), like the following example:
(function poll(){
   setTimeout(function(){
      $.ajax({ url: "/path/to/url", success: function(data){
        console.log(data);  
        poll();
      }, dataType: "json"});
  }, 30000);
})();

To make the idea clearer, I will use a live currency rates widget as an example; this widget gets the rates to convert from one currency to another, displaying up and down arrows to indicate price changes.

Basic Usage

The following snippet shows the basic usage of SSE with javascript:

<script type="text/javascript">
var es = new EventSource("/path/to/url");
es.addEventListener("message", function(e) {
            console.log(e.data);
}, false);
</script>

This piece of JavaScript code initializes an EventSource object which listens to the specified URL and processes the data as the server sends it back to the browser. Each time the server sends new data, the event listener will be called and will process the information according to the callback function implementation.

The code

As I said, Laravel will be used to implement this example. I will implement two actions: one for rendering the whole page, and the other for sending only the modified data in JSON format to the EventSource, as follows:

First, I defined the routes in the routes.php

// in apps/routes.php 
Route::get('/prices-page', 'HomeController@pricesPage');
Route::get('/prices-values', 'HomeController@pricesValues');

Then, I will implement a method to retrieve the rate values (I used the Yahoo service as a free feed source):
    /**
     * retrieve rates of currencies from feed
     * @return array
     */
    protected function getCurrencyRates() {
        $pair_arr = array('EURUSD', 'GBPUSD', 'USDJPY', 'XAUUSD', 'XAGUSD', 'USDJOD');
        $currencies_arr = array();

        foreach ($pair_arr as $pair) {
            try {
                
                $price_csv = file_get_contents("http://finance.yahoo.com/d/quotes.csv?e=.csv&f=sl1d1t1&s=$pair=X");
                $price_data = explode(',', $price_csv);
                $currencies_arr[$pair]['price'] = $price_data[1];
                $currencies_arr[$pair]['status'] = '';
            } catch (Exception $ex) {
                $currencies_arr['error'] = $ex->getMessage();
            }
        }
        return $currencies_arr;
    }

It is not efficient to fetch a file from an external source inside a controller, but I use it here for the purpose of the example. Usually, I write a backend command to get the prices from the external source (usually a trading server), and the controller methods retrieve the data from the database.
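
For illustration, a database-backed version of getCurrencyRates() could look roughly like this (a sketch only: it assumes a hypothetical currency_rates table, kept up to date by a console command, with pair and price columns):

<?php
    /**
     * sketch: read the latest rates from the database instead of the remote feed
     * @return array
     */
    protected function getCurrencyRates() {
        $currencies_arr = array();

        $rows = DB::table('currency_rates')->get(); // Laravel query builder
        foreach ($rows as $row) {
            $currencies_arr[$row->pair]['price'] = $row->price;
            $currencies_arr[$row->pair]['status'] = '';
        }

        return $currencies_arr;
    }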

Second, I will implement the action to render the whole price block:

public function pricesPage() {
    $prices = $this->getCurrencyRates();
    return View::make('pricesPage', array('prices' => $prices));        
}

and here is the template:
<h1>Prices here</h1>
<table>
    <thead>
        <tr>
            <th>Currency</th>
            <th>Rate</th>
            <th>status</th>
        </tr>
    </thead>
    <tbody>
        <?php foreach($prices as $currency=>$price_info){?>
        <tr class="price-row">
            <td><?php echo $currency?></td>
            <td data-symbol-price="<?php echo $currency; ?>"><?php echo $price_info['price']; ?></td>
            <td data-symbol-status="<?php echo $currency; ?>"><?php echo $price_info['status']; ?></td>
        </tr>
        <?php }?>
    </tbody>
</table>

<script type="text/javascript">
        var es = new EventSource("<?php echo action('HomeController@pricesValues'); ?>");
        es.addEventListener("message", function(e) {
            arr = JSON.parse(e.data);
            
            for (x in arr) {    	
                $('[data-symbol-price="' + x + '"]').html(arr[x].price);
                $('[data-symbol-status="' + x + '"]').html(arr[x].status);
                //apply some effect on change, like blinking the color of modified cell...
            }
        }, false);
</script>    

And now I will implement the pricesValues() action that will push the data to the browser, as follows:

    /**
     * action to handle streamed response from laravel
     * @return \Symfony\Component\HttpFoundation\StreamedResponse
     */
    public function pricesValues() {

            $response = new Symfony\Component\HttpFoundation\StreamedResponse(function() {
            $old_prices = array();

            while (true) {
                $new_prices = $this->getCurrencyRates();
                $changed_data = $this->getChangedPrices($old_prices, $new_prices);

                if (count($changed_data)) {
                    echo 'data: ' . json_encode($changed_data) . "\n\n";
                    ob_flush();
                    flush();
                }
                sleep(3);
                $old_prices = $new_prices;
            }
        });

        $response->headers->set('Content-Type', 'text/event-stream');
        return $response;
    }
    

    /**
     * comparing old and new prices and return only changed currency rates
     * @param array $old_prices
     * @param array $new_prices
     * @return array
     */
    protected function getChangedPrices($old_prices, $new_prices) {
        $ret = array();
        foreach ($new_prices as $curr => $curr_info) {
            if (!isset($old_prices[$curr])) {
                $ret[$curr]['status'] = '';
                $ret[$curr]['price'] = $curr_info['price'];                
            } elseif ($old_prices[$curr]['price'] != $curr_info['price']) {
                $ret[$curr]['status'] = $old_prices[$curr]['price']>$curr_info['price']?'down':'up';
                $ret[$curr]['price'] = $curr_info['price']; 
            }
        }

        return $ret;
    }

As you notice, the action that pushes data to the event source has the following properties:

  1. The content type of the response is text/event-stream.
  2. The response returned here is of type “StreamedResponse”, which is part of the Symfony HttpFoundation component; this type of response enables the server to return data to the client in chunks. The StreamedResponse object accepts a callback function that outputs the transferred data chunks.
  3. Only the prices that have changed since the latest push are sent back to the browser (I can compare the old and new prices easily since they reside in the same action), so if the prices didn't change, nothing is sent to the browser.
  4. The data returned is prefixed with “data:” and has “\n\n” appended to the end.
  5. flush() and ob_flush() are called to trigger sending the data back to the browser.
For browsers that don't support this HTML5 feature, you can apply a simple fallback as follows:
<script type="text/javascript">
if(window.EventSource !== undefined){
    // EventSource is supported, go ahead...
} else {
    // EventSource not supported, 
    // apply ajax long poll fallback
    }
</script>

The final output

Now the live currency rates widget is ready; the widget will auto-refresh prices every 3 seconds, and the server will send only the rates that have changed, so the operation is optimized and does not exchange unnecessary requests/responses.

[Screenshot: SSE price rate widget – the final component]

postgreSQL quick start with php

Last week, I accidentally stumbled upon a blog post describing the features of the new version of PostgreSQL, and I found it pretty interesting; a couple of useful features that do not exist in MySQL are now implemented in PostgreSQL 9.3 (I am especially interested in materialized views), so I wanted to learn more about this database.

Unfortunately, I hadn't used Postgres before (although I have several years of experience as a MySQL developer and administrator), so I had to learn the basics about Postgres, and I wanted to share this experience:

Installing PostgreSQL

In order to get the latest version on my CentOS machine, I compiled Postgres from source as follows:

First, I got the source files of the desired version from the Postgres site (I used v9.3.2):

wget http://ftp.postgresql.org/pub/source/v9.3.2/postgresql-9.3.2.tar.bz2

then, uncompress the file:
tar xvjf postgresql-9.3.2.tar.bz2

then, compile the source files using this simple command inside the extracted folder:
./configure && make && make install

Now the Postgres files should be placed at /usr/local/pgsql.
Postgres operates by default under a user named postgres, so we should create that user, create the data directory, and assign the folder ownership to the created user:
adduser postgres 
mkdir /usr/local/pgsql/data 
chown postgres:postgres /usr/local/pgsql/data

Then we should initialize the data storage (the “database cluster”) for the server by calling initdb, but first I switched to the postgres user, because you cannot run this command as root:
[root@sub bin]# su - postgres
-bash-4.1$ /usr/local/pgsql/bin/initdb -D  /usr/local/pgsql/data/

A database cluster is the collection of databases that Postgres manages. By creating the database cluster, the data directory is filled with database files, and the default databases postgres, template0, and template1 are created.

now, I can start postgres server by typing:

/usr/local/pgsql/bin/postgres -D /usr/local/pgsql/data >logfile 2>&1 &
The -D parameter specifies the data directory location, which also contains the Postgres configuration file, named postgresql.conf by default (analogous to my.cnf in MySQL).

Now the Postgres server is running and we can begin working with SQL commands.

PostgreSQL Client

Now let us enter the Postgres client by executing the psql program, which is the interactive terminal for Postgres:

/usr/local/pgsql/bin/psql -hlocalhost -U postgres -w
Here I am connecting to the database as its superuser, “postgres”.
I will issue the \list command to see the installed databases:
postgres=# \list
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(3 rows)
As shown in the snippet, there are three databases here:

  • postgres database: the default database for Postgres (similar to the mysql database in MySQL)
  • template0 and template1: two template databases.
The template database is a very useful feature in Postgres; it enables the administrator to create a database by copying all the content from another (template) database. By default, any newly created database uses template1 as its template.

I created a new database:

postgres=# create database test_pg;
CREATE DATABASE

If you want to create a database using a template other than the default, you can add the template keyword at the end of the create command like this:
create database test_pg2 template template_database;

Now if you run the \list command, you will see the new database there:

postgres=# \list
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 |
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 test_pg   | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =Tc/postgres         +
           |          |          |             |             | postgres=CTc/postgres+
           |          |          |             |             | test_usr=CTc/postgres
(4 rows)

Then, I created a new user and granted it permission to use that database:

postgres=# create user test_usr with password 'test_pass';
CREATE ROLE
postgres=# grant all privileges on database test_pg to test_usr;
GRANT

Now, I will exit (by typing \q), then connect to the new database using the user I have created in the previous step:
-bash-4.1$ /usr/local/pgsql/bin/psql test_pg -Utest_usr -W
Password for user test_usr:
psql (9.3.2)
Type "help" for help.

test_pg=>

Next, create a table that I will use for testing:
test_pg=# create table test_tbl( id serial primary key, name varchar(255) );
CREATE TABLE

The serial keyword is similar to the auto-increment attribute in other databases, and is used to create a unique identifier for the table records.

Unlike MySQL, Postgres doesn't have different types of storage engines (like MyISAM or InnoDB); it has a single, unified storage engine.

And I will insert sample data into the table:

test_pg=# insert into test_tbl(name) values('test1'), ('test2'), ('test3');
INSERT 0 3

PHP Script

I will use PDO to test PHP connectivity to Postgres, but first the php-pgsql package must be installed:

yum install php-pgsql.x86_64

then I wrote this simple script:
<?php
try{
   $dbh = new \PDO('pgsql:host=localhost;dbname=test_pg', 'test_usr', 'test_pass');
}catch(Exception $ex){
   die('Error in connecting: '.$ex->getMessage());
}
$stmt = $dbh->prepare("select * from test_tbl");
$stmt->execute();

echo $stmt->rowCount(). " records fetched.".chr(10);
while($row = $stmt->fetch()){
    echo "id:".$row['id'].', name:'.$row['name'].chr(10);
}

now you can run the script to see the results:
[root@sub pg_test]# php -f test.php
3 records fetched.
id:1, name:test1
id:2, name:test2
id:3, name:test3

Now you can use Postgres as your data store in a very similar way to MySQL.

Conclusion

In this post, I gave a quick introduction to the Postgres database: from installation, to creating a database and roles, to writing a simple PHP script that retrieves data from Postgres. PostgreSQL has great features that I intend to learn more about, in order to get the most value out of it.

Simple web spider with PHP Goutte

Last week we got an SEO analysis of one of our portals; that analysis included thorough statistics about website SEO measures, like missing and duplicate <title>, <h1> and meta tags, broken and invalid links, duplicate content percentage, etc. It appears that the SEO agency that prepared the analysis used some sort of crawler to extract this information.

I liked that crawler idea and wanted to implement it in PHP. After some reading about web scraping and Goutte I was able to write a similar web spider that extracts the needed information, and I wanted to share it in this post.

About web scraping and Goutte

Web scraping is a technique to extract information from websites; it's very close to web indexing, because the bot or web crawler that search engines use performs some sort of scraping of web documents: following the links, analyzing keywords, meta tags, and URLs, and ranking pages according to relevancy, popularity, engagement, etc.

Goutte is a screen scraping and web crawling library for PHP; it provides an API to crawl websites and extract data from the HTML/XML responses. Goutte is a wrapper around Guzzle and several Symfony components like BrowserKit, DomCrawler, and CssSelector.

Here is a small description of some of the libraries that Goutte wraps:

    1. Guzzle: an HTTP client framework for consuming RESTful web services; it provides a simple interface to perform cURL requests, along with other important features like persistent connections and streaming request and response bodies.
    2. BrowserKit: simulates the behaviour of a web browser, providing an abstract HTTP layer of requests, responses, cookies, etc.
    3. DomCrawler: provides easy methods for DOM navigation and manipulation.
    4. CssSelector: provides an API to select elements using the same selectors used for CSS (it becomes extremely easy to select elements when it works with DomCrawler).
* These are the main components I'm interested in for this post; however, other components like Finder and Process are also used in Goutte.

 

Basic usage

Once you download Goutte (from here), you should define a Client object; the client is used to send requests to a website and returns a crawler object, as in the snippet below:

require_once 'goutte.phar';
use Goutte\Client;

$url_to_traverse = 'http://zrashwani.com';

$client = new Client();
$crawler = $client->request('GET', $url_to_traverse);

Here I declared a client object and called request() to simulate a browser requesting the URL “http://zrashwani.com” using the “GET” HTTP method.
The request() method returns an object of type Symfony\Component\DomCrawler\Crawler, which can be used to select elements from the fetched HTML response.

But before processing the document, let's ensure that this URL is a valid link, meaning it returned a response code of 200:

$status_code = $client->getResponse()->getStatus();
if($status_code==200){
    //process the documents
}

The $client->getResponse() method returns a BrowserKit Response object that contains information about the response the client got, like headers (including the status code I used here), response content, etc.

In order to extract the document title, you can filter either by XPath or by CSS selector to get your target HTML DOM element value:

$crawler->filterXPath('html/head/title')->text()
// $crawler->filter('title')->text()

In order to get the number of <h1> tags and the contents of the tags that exist in the page:

$h1_count = $crawler->filter('h1')->count();
$h1_contents = array();
if ($h1_count) {
    $crawler->filter('h1')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) use(&$h1_contents) {
                $h1_contents[$i] = trim($node->text());
        });
}

For SEO purposes, there should be one h1 tag per page, and its content should contain the main keywords of the page. Here the each() function is quite useful: it can be used to loop over all matching elements. The each() function takes a closure as a parameter to perform a callback operation on each node.

PHP closures are anonymous functions that were introduced in PHP 5.3; they are very useful for callback functionality. You can refer to the PHP manual if you are new to closures.
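
As a quick illustration (separate from the crawler code below), a closure can capture variables from the enclosing scope with the use keyword; note that the variable must be passed by reference when the callback needs to modify it:

<?php
$counts = array();

// without the &, the closure would work on a copy of $counts
$tally = function($word) use (&$counts) {
    $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
};

$tally('php');
$tally('php');
print_r($counts); // Array ( [php] => 2 )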

Application goals

After this brief introduction, I can begin explaining the spider functionality. This crawler will detect broken/invalid links in the website, along with extracting the <h1> and <title> tag values that are important for the SEO issue I have.

My simple crawler implements depth-limited search, in order to avoid crawling large amounts of data, and works as follows:

      1. Read the initial URL to crawl along with the depth of links to be visited.
      2. Crawl the URL and check the response code to determine that the link is not broken, then add it to an array containing the site links.
      3. Extract the <title> and <h1> tag contents in order to use their values later for reporting.
      4. Loop over all <a> tags inside the fetched document to extract their href attributes along with other data.
      5. Check that the depth limit is not reached, that the current href has not been visited before, and that the link URL does not belong to an external site.
      6. Crawl the child link by repeating steps 2-5.
      7. Stop when the link depth limit is reached.

 

These steps are implemented in the SimpleCrawler class that I wrote (it is still a basic version and should be optimized further):

<?php

require_once 'goutte.phar';

use Goutte\Client;

class simpleCrawler {

    private $base_url;
    private $site_links;
    private $max_depth;

    public function __construct($base_url, $max_depth = 10) {
        if (strpos($base_url, 'http') === false) { // http protocol not included, prepend it to the base url
            $base_url = 'http://' . $base_url;
        }

        $this->base_url = $base_url;
        $this->site_links = array();
        $this->max_depth = $max_depth;
    }

    /**
     * checks the uri if can be crawled or not
     * in order to prevent links like "javascript:void(0)" or "#something" from being crawled again
     * @param string $uri
     * @return boolean
     */
    protected function checkIfCrawlable($uri) {
        if (empty($uri)) {
            return false;
        }

        $stop_links = array(//returned deadlinks
            '@^javascript\:void\(0\)$@',
            '@^#.*@',
        );

        foreach ($stop_links as $ptrn) {
            if (preg_match($ptrn, $uri)) {
                return false;
            }
        }

        return true;
    }

    /**
     * normalize link before visiting it
     * currently just remove url hash from the string
     * @param string $uri
     * @return string
     */
    protected function normalizeLink($uri) {
        $uri = preg_replace('@#.*$@', '', $uri);

        return $uri;
    }

    /**
     * initiate the crawling mechanism on all links
     * @param string $url_to_traverse
     */
    public function traverse($url_to_traverse = null) {
        if (is_null($url_to_traverse)) {
            $url_to_traverse = $this->base_url;

            $this->site_links[$url_to_traverse] = array(//initialize first element in the site_links 
                'links_text' => array("BASE_URL"),
                'absolute_url' => $url_to_traverse,
                'frequency' => 1,
                'visited' => false,
                'external_link' => false,
                'original_urls' => array($url_to_traverse),
            );
        }

        $this->_traverseSingle($url_to_traverse, $this->max_depth);
    }

    /**
     * crawling single url after checking the depth value
     * @param string $url_to_traverse
     * @param int $depth
     */
    protected function _traverseSingle($url_to_traverse, $depth) {
        //echo $url_to_traverse . chr(10);

        try {
            $client = new Client();
            $crawler = $client->request('GET', $url_to_traverse);

            $status_code = $client->getResponse()->getStatus();
            $this->site_links[$url_to_traverse]['status_code'] = $status_code;

            if ($status_code == 200) { // valid url and not reached depth limit yet            
                $content_type = $client->getResponse()->getHeader('Content-Type');                
                if (strpos($content_type, 'text/html') !== false) { //traverse children in case the response in HTML document 
                   $this->extractTitleInfo($crawler, $url_to_traverse);

                   $current_links = array();
                   if (@$this->site_links[$url_to_traverse]['external_link'] == false) { // for internal uris, get all links inside
                      $current_links = $this->extractLinksInfo($crawler, $url_to_traverse);
                   }

                   $this->site_links[$url_to_traverse]['visited'] = true; // mark current url as visited
                   $this->traverseChildLinks($current_links, $depth - 1);
                }
            }
            
        } catch (Guzzle\Http\Exception\CurlException $ex) {
            error_log("CURL exception: " . $url_to_traverse);
            $this->site_links[$url_to_traverse]['status_code'] = '404';
        } catch (Exception $ex) {
            error_log("error retrieving data from link: " . $url_to_traverse);
            $this->site_links[$url_to_traverse]['status_code'] = '404';
        }
    }

    /**
     * after checking the depth limit of the links array passed
     * check if the link if the link is not visited/traversed yet, in order to traverse
     * @param array $current_links
     * @param int $depth     
     */
    protected function traverseChildLinks($current_links, $depth) {
        if ($depth == 0) {
            return;
        }

        foreach ($current_links as $uri => $info) {
            if (!isset($this->site_links[$uri])) {
                $this->site_links[$uri] = $info;
            } else{
                $this->site_links[$uri]['original_urls'] = isset($this->site_links[$uri]['original_urls'])?array_merge($this->site_links[$uri]['original_urls'], $info['original_urls']):$info['original_urls'];
                $this->site_links[$uri]['links_text'] = isset($this->site_links[$uri]['links_text'])?array_merge($this->site_links[$uri]['links_text'], $info['links_text']):$info['links_text'];
                if(@$this->site_links[$uri]['visited']) { //already visited link)
                    $this->site_links[$uri]['frequency'] = @$this->site_links[$uri]['frequency'] + @$info['frequency'];
                }
            }

            if (!empty($uri) && 
                !$this->site_links[$uri]['visited'] && 
                !isset($this->site_links[$uri]['dont_visit'])
                ) { //traverse those that not visited yet                
                $this->_traverseSingle($this->normalizeLink($current_links[$uri]['absolute_url']), $depth);
            }
        }
    }

    /**
     * extracting all <a> tags in the crawled document, 
     * and return an array containing information about links like: uri, absolute_url, frequency in document
     * @param Symfony\Component\DomCrawler\Crawler $crawler
     * @param string $url_to_traverse
     * @return array
     */
    protected function extractLinksInfo(Symfony\Component\DomCrawler\Crawler &$crawler, $url_to_traverse) {
        $current_links = array();
        $crawler->filter('a')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) use (&$current_links) {
                    $node_text = trim($node->text());
                    $node_url = $node->attr('href');
                    $hash = $this->normalizeLink($node_url);

                    if (!isset($this->site_links[$hash])) {  
                        $current_links[$hash]['original_urls'][$node_url] = $node_url;
                        $current_links[$hash]['links_text'][$node_text] = $node_text;
                        
                        if ($this->checkIfCrawlable($node_url)) {
                            if (!preg_match("@^http(s)?@", $node_url)) { // not an absolute link, prepend the base url
                                $current_links[$hash]['absolute_url'] = $this->base_url . $node_url;
                            } else {
                                $current_links[$hash]['absolute_url'] = $node_url;
                            }
                        }

                        if (!$this->checkIfCrawlable($node_url)) {
                            $current_links[$hash]['dont_visit'] = true;
                            $current_links[$hash]['external_link'] = false;
                        } elseif ($this->checkIfExternal($current_links[$hash]['absolute_url'])) { // mark external url as marked                            
                            $current_links[$hash]['external_link'] = true;
                        } else {
                            $current_links[$hash]['external_link'] = false;
                        }
                        $current_links[$hash]['visited'] = false;
                        
                        $current_links[$hash]['frequency'] = isset($current_links[$hash]['frequency']) ? $current_links[$hash]['frequency'] + 1 : 1; // increase the counter
                    }
                    
                });

        if (isset($current_links[$url_to_traverse])) { // if page is linked to itself, ex. homepage
            $current_links[$url_to_traverse]['visited'] = true; // avoid cyclic loop                
        }
        return $current_links;
    }

    /**
     * extract information about document title, and h1
     * @param Symfony\Component\DomCrawler\Crawler $crawler
     * @param string $url
     */
    protected function extractTitleInfo(Symfony\Component\DomCrawler\Crawler &$crawler, $url) {
        $this->site_links[$url]['title'] = trim($crawler->filterXPath('html/head/title')->text());

        $h1_count = $crawler->filter('h1')->count();
        $this->site_links[$url]['h1_count'] = $h1_count;
        $this->site_links[$url]['h1_contents'] = array();

        if ($h1_count) {
            $crawler->filter('h1')->each(function(Symfony\Component\DomCrawler\Crawler $node, $i) use($url) {
                        $this->site_links[$url]['h1_contents'][$i] = trim($node->text());
                    });
        }
    }

    /**
     * getting information about links crawled
     * @return array
     */
    public function getLinksInfo() {
        return $this->site_links;
    }

    /**
     * check if the link leads to external site or not
     * @param string $url
     * @return boolean
     */
    public function checkIfExternal($url) {
        $base_url_trimmed = str_replace(array('http://', 'https://'), '', $this->base_url);

        if (preg_match("@http(s)?\://$base_url_trimmed@", $url)) { //base url is not the first portion of the url
            return false;
        } else {
            return true;
        }
    }

}

?>

and you can try out this class as follows:

$simple_crawler = new simpleCrawler($url_to_crawl, $depth);    
$simple_crawler->traverse();    
$links_data = $simple_crawler->getLinksInfo();

The getLinksInfo() method returns an associative array containing information about each crawled page, such as the page URL, <title>, <h1> tag contents, status_code…etc. You can store these results any way you like; I prefer MySQL for simplicity, so that I can extract the desired results with queries, and I created the pages_crawled table as follows:

CREATE TABLE `pages_crawled` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `url` varchar(255) DEFAULT NULL,
  `frequency` int(11) unsigned DEFAULT NULL,
  `title` varchar(255) DEFAULT NULL,
  `status_code` int(11) DEFAULT NULL,
  `h1_count` int(11) unsigned DEFAULT NULL,
  `h1_content` text,
  `source_link_text` varchar(255) DEFAULT NULL,
  `original_urls` text,
  `is_external` tinyint(1) DEFAULT '0',
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=37 DEFAULT CHARSET=utf8

and here is the script that stores the traversed links into the MySQL table:

<?php 
error_reporting(E_ALL);
set_time_limit(300);
include_once ('../src/SimpleCrawler.php');

$url_to_crawl = $argv[1];
$depth = isset($argv[2])?$argv[2]:3;

if($url_to_crawl){
    
    echo "Begin crawling ".$url_to_crawl.' with links in depth '.$depth.chr(10);
    
    $start_time = time();    
    $simple_crawler = new simpleCrawler($url_to_crawl, $depth);    
    $simple_crawler->traverse();    
    $links_data = $simple_crawler->getLinksInfo();
       
    $end_time = time();
    
    $duration = $end_time - $start_time;
    echo 'crawling approximate duration, '.$duration.' seconds'.chr(10);
    echo count($links_data)." unique links found".chr(10);
    
    mysql_connect('localhost', 'root', 'root');
    mysql_select_db('crawler_database');
    foreach($links_data as $uri=>$info){
        
        if(!isset($info['status_code'])){
            $info['status_code']=000;//tmp
        }
        
        $h1_contents = mysql_real_escape_string(implode("\n\r", isset($info['h1_contents'])?$info['h1_contents']:array() ));
        $original_urls = mysql_real_escape_string(implode("\n\r", isset($info['original_urls'])?$info['original_urls']:array() ));
        $links_text = mysql_real_escape_string(implode("\n\r",  isset($info['links_text'])?$info['links_text']:array() ));
        $is_external = $info['external_link']?'1':'0';
        $title = mysql_real_escape_string(@$info['title']);
        $h1_count = isset($info['h1_count'])?$info['h1_count']:0;
        $uri = mysql_real_escape_string($uri);
        
        $sql_query = "insert into pages_crawled(url, frequency, status_code, is_external, title, h1_count, h1_content, source_link_text, original_urls)
values('$uri', {$info['frequency']}, {$info['status_code']}, {$is_external}, '{$title}', {$h1_count}, '$h1_contents', '$links_text', '$original_urls')";
        
        mysql_query($sql_query) or die($sql_query);
    }
}
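
As a side note, the mysql_* functions used above are deprecated (and removed in PHP 7), so the same insert loop could be written with PDO and a prepared statement, which also removes the need to escape values manually. Here is a minimal sketch, assuming the same crawler_database and pages_crawled table:

<?php
// sketch only: same insert loop, but using PDO with a prepared statement
$pdo = new PDO('mysql:host=localhost;dbname=crawler_database;charset=utf8', 'root', 'root');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $pdo->prepare(
    'INSERT INTO pages_crawled
        (url, frequency, status_code, is_external, title, h1_count, h1_content, source_link_text, original_urls)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)'
);

foreach ($links_data as $uri => $info) {
    $stmt->execute(array(
        $uri,
        isset($info['frequency']) ? $info['frequency'] : 0,
        isset($info['status_code']) ? $info['status_code'] : 0,
        !empty($info['external_link']) ? 1 : 0,
        isset($info['title']) ? $info['title'] : '',
        isset($info['h1_count']) ? $info['h1_count'] : 0,
        implode("\n\r", isset($info['h1_contents']) ? $info['h1_contents'] : array()),
        implode("\n\r", isset($info['links_text']) ? $info['links_text'] : array()),
        implode("\n\r", isset($info['original_urls']) ? $info['original_urls'] : array()),
    ));
}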

 

Running the spider

Now let me try out the spider on my blog URL, with the depth of links to be visited set to 2:

C:\xampp\htdocs\Goutte\web>php -f test.php zrashwani.com 2

Now I can get the important information I need using simple SQL queries on the pages_crawled table, as follows:

mysql> select count(*) from pages_crawled where h1_count >1;
+----------+
| count(*) |
+----------+
|       30 |
+----------+
1 row in set (0.01 sec)

mysql> select count(*) as c, title from pages_crawled group by title having c>1;

+---+----------------------------------------------------------+
| c | title                                                    |
+---+----------------------------------------------------------+
| 2 | Z.Rashwani Blog | I write here whatever comes to my mind |
+---+----------------------------------------------------------+
1 row in set (0.02 sec)

In the first query, I returned the number of pages with more than one h1 tag (I found a lot; I will consider changing the HTML structure of my blog a little bit).
In the second one, I returned the duplicated page titles.
Now we can get many other statistics on the traversed pages using the information we collected.

Conclusion

In this post I explained how to use Goutte for web scraping using a real-world example that I encountered in my job. Goutte can easily be used to extract a great amount of information about any webpage through its simple API for requesting pages, analyzing the response and extracting specific data from the DOM document.

I used Goutte to extract some information that can be used as SEO measures for the specified website, and stored it in a MySQL table so that any report or statistics derived from it can be queried.

Update

Thanks to Josh Lockhart, this code has been modified for Composer and Packagist and is now available on GitHub: https://github.com/codeguy/arachnid

Introduction to sphinx with PHP – part2

In Part 1, I explained how to install Sphinx and configure it to index data from a MySQL source, and how to use the searchd daemon from the command line to retrieve data from the defined indexes.

In this post, I will explain PHP examples of how to use the Sphinx API.

The following script is based on the database structure and sphinx config file I used in Part 1 of this sphinx introduction.

Example PHP Script

<?php

header('Content-type: text/html; charset=utf8');
include ( "sphinxapi.php" );

mysql_connect('localhost', 'root', 'root');
mysql_select_db('your_database_here');        
mysql_query('set names utf8');        

$phrase = @$_GET['phrase'];
$page = isset($_GET['page']) ? $_GET['page'] : 1;
$date_start = @$_GET['date_start'];
$date_end = @$_GET['date_end'];

$client = new SphinxClient();
$client->SetLimits(($page - 1) * 10, 10);
$client->SetSortMode(SPH_SORT_EXTENDED, '@weight desc, created_time desc');
$client->SetMatchMode(SPH_MATCH_ANY);
$client->SetFieldWeights(array('title'=>4, 'keywords'=>2, 'body'=>1 ));

if(isset($date_start) || isset($date_end)){    
    $start_time = isset($date_start)?strtotime($date_start):null;
    $end_time = isset($date_end)?strtotime($date_end):null;    
    $client->SetFilterRange('created_time', $start_time, $end_time);
}

$res = $client->Query($phrase, 'content_index');


if (!$res) {
    echo 'error: ' . $client->GetLastError();
} else {

    if ($res['total'] == 0 || !isset($res['matches'])) {
        echo 'No results retrieved from Search engine';
    } else {
        echo "Displaying " . (($page - 1) * 10+1).'-'.(min($res['total'],$page * 10)) . " out of " . $res['total_found'] . ' total results';
                
        //var_dump($res);
        $ids_str = implode(', ', array_keys($res['matches']));
        $res_db = mysql_query('select id, title, created_at from content where id in  (' . $ids_str . ') order by field(id,'.$ids_str.')');
        if ($res_db === false) {
            echo "Error in mysql query #" . mysql_errno() . ' - ' . mysql_error();
        } else {
            echo '<ul>';
            while ($row = mysql_fetch_assoc($res_db)) {
                echo '<li>'
                . '<a href="show.php?id=' . $row['id'] . '&phrase='.$phrase.'">' . $row['title'] . '<a>'
                . '<br/> [relevency: '.$res['matches'][$row['id']]['weight'].']'        
                . '<br/> [created_at: '.$row['created_at'].']'        
                . '</li>';
            }
            echo '</ul>';
        }

        echo '<br/><br/>Total Time: ' . $res['time'] . 's';
    }
}

This simple script takes parameters from the webpage, then issues a search request with the specified phrase and conditions to the searchd daemon.

In the first lines (1-13), I declared the database connection along with the parameters that I will use within the search; after that I initialized the Sphinx client and applied the main configurations to it, as explained in the next section.

Main SphinxClient Methods

Here are a list of main methods used to configure SphinxClient:

1- SetSortMode:
Sphinx supports multiple flexible sort modes which control the ordering criteria of the retrieved results.
I will mention brief information about each sort mode, since I consider them one of the most important features in Sphinx:

a- SPH_SORT_RELEVANCE: it's the default sort mode; it sorts the results according to their relevancy to the search query passed.

$client->SetSortMode(SPH_SORT_RELEVANCE);

By default Sphinx ranks the results using phrase proximity, which takes into consideration the order of the phrase words along with their frequency. We can control the way Sphinx computes relevancy by changing the ranking mode (using the SetRankingMode() function).
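
For example (not part of the script above), switching the ranker is a one-line change; a minimal sketch, assuming the extended matching mode, which is where ranking modes take effect:

// SPH_RANK_BM25 ranks by keyword frequency only, ignoring phrase proximity
$client->SetMatchMode(SPH_MATCH_EXTENDED2);
$client->SetRankingMode(SPH_RANK_BM25);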

b- SPH_SORT_ATTR_ASC / SPH_SORT_ATTR_DESC: sort the results in ascending or descending order according to a predefined attribute; for example, you can change line 17 to be:

$client->SetSortMode(SPH_SORT_ATTR_DESC, 'created_time');
This way, the newest articles will come first in the result page.

c- SPH_SORT_TIME_SEGMENTS: sorts by time segment (last hour/day/week/month, etc.) first, then by relevancy within each segment.

$client->setSortMode(SPH_SORT_TIME_SEGMENTS, 'created_time');

d- SPH_SORT_EXTENDED: sort by a combination of attributes, ascending or descending, in an SQL-like format, as I used in the script above:

$client->SetSortMode(SPH_SORT_EXTENDED, '@weight desc, created_time desc');
Here I sorted according to relevancy (represented by the @weight computed attribute), then descending according to creation time (in case two results have the same weight).

e- SPH_SORT_EXPR: sort using an arithmetic expression; for example, a combination of relevancy and popularity represented by page_views:

$client->SetSortMode(SPH_SORT_EXPR, '@weight * page_views/100');

Unlike MySQL, putting an expression in the sort mode (analogous to the ORDER BY clause) won't affect performance negatively.

2- SetMatchMode():
used to control how Sphinx performs matching for the query phrase; here are the most important options:
a- SPH_MATCH_ALL: matches all keywords in the search query.
b- SPH_MATCH_ANY: matches any keyword.
c- SPH_MATCH_PHRASE: matches the whole phrase, which requires a perfect match.
All matching modes can be found here
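
For example, switching the script above to exact phrase matching is a one-line change (the phrase below is just an illustration):

// only documents containing the exact phrase will match
$client->SetMatchMode(SPH_MATCH_PHRASE);
$res = $client->Query('web development', 'content_index');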

3- SetFieldWeights():
Using this function, you can distribute the relevancy weight among the fields. In the script above, I used this line:

$client->SetFieldWeights(array('title'=>4, 'keywords'=>2, 'body'=>1 ));

in order to indicate that the "title" field is more important than the "keywords" and "body" fields, so results matching the query phrase in the title will appear before those that only match it in the body. This option is very useful for controlling the relevancy of results.

4- SetFilterRange():
Here you can add a filter based on one of the attributes defined in the Sphinx index (analogous to adding a WHERE condition to an SQL statement). I used it to filter according to the creation time:

$client->SetFilterRange('created_time', $start_time, $end_time);
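
Besides range filters, plain value filters are available through SetFilter(); for example, assuming the category_id attribute defined in the index (the values below are illustrative):

// keep only results whose category_id attribute equals 11 or 12
$client->SetFilter('category_id', array(11, 12));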

5- Query():
after configuring the Sphinx search query, this method is used to send the request to the searchd daemon and get the results from Sphinx:

$res = $client->Query($phrase, 'content_index');

The Query() method takes the search phrase as the first parameter, and the name of the index(es) to match against as the second parameter.
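
As a side note (not used in this script), several searches can be batched into a single round trip to searchd using AddQuery()/RunQueries(), which is Sphinx's MultiQuery feature. A minimal sketch against the same content_index:

// each AddQuery() captures the current client settings (sort mode, filters, ...)
$client->AddQuery($phrase, 'content_index');               // most relevant first
$client->SetSortMode(SPH_SORT_ATTR_DESC, 'created_time');
$client->AddQuery($phrase, 'content_index');               // newest first
$results = $client->RunQueries();                          // one result set per AddQuery() call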

After calling the Query() method on the SphinxClient, a result array will be returned containing information about the matching records. If we dump the "matches" index of the result array, we will get data similar to this:

var_dump($res['matches']);
/*********/

  array(2) {
    [181916]=>
    array(2) {
      ["weight"]=>
      string(1) "1"
      ["attrs"]=>
      array(3) {
        ["status"]=>
        string(1) "1"
        ["category_id"]=>
        string(2) "11"
        ["created_time"]=>
        string(10) "1386946964"
      }
    }
    [181915]=>
    array(2) {
      ["weight"]=>
      string(1) "7"
      ["attrs"]=>
      array(3) {
        ["status"]=>
        string(1) "1"
        ["category_id"]=>
        string(2) "12"
        ["created_time"]=>
        string(10) "1386368157"
      }
    }
  }

The data returned for each matched element are:
– documentID (as the key of the array element)
– weight (dynamically calculated according to the SetSortMode() and SetFieldWeights() functions we used earlier)
– attribute values, in the "attrs" index (ex. created_time, status…etc.), containing the Sphinx attributes defined in the config file.

Note that Sphinx will not return the textual data itself, because it only indexes the text and doesn't store it, so we have to fetch it from our MySQL database:

$ids_str = implode(', ', array_keys($res['matches']));
$res_db = mysql_query('select id, title, created_at from content where id in  (' . $ids_str . ') order by field(id,'.$ids_str.')');

In this line, I got the records from MySQL using the document IDs, and kept the same ordering as Sphinx by using "FIELD(id, val1, val2, …)" in the ORDER BY clause.

Now I have the result IDs from Sphinx, fetched the associated textual data from MySQL and displayed them on the webpage.
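
If you also want to highlight the matched words in the snippets shown on the page, the Sphinx API provides a BuildExcerpts() helper (not used in the script above); a minimal sketch, with the documents array and options being purely illustrative:

// highlight the search phrase inside text snippets before displaying them
$docs = array('Some title fetched from MySQL', 'Another title fetched from MySQL');
$excerpts = $client->BuildExcerpts($docs, 'content_index', $phrase, array(
    'before_match' => '<b>',
    'after_match'  => '</b>',
));
if ($excerpts !== false) {
    foreach ($excerpts as $snippet) {
        echo '<li>' . $snippet . '</li>';
    }
}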

Running the code

Now, I would like to query all records containing the word "syria" published in the last two weeks, and here are the results:
[Screenshot: search results for "syria" limited to the last two weeks]

You can see that articles with the word "syria" in the title got a higher rank than those where "syria" appeared only in the body, because of the field weights I used in the script above. Also, Sphinx took about 0.015 seconds to get those results from among 150,000 records, which is extremely fast.

Another execution, searching for the "syria" phrase without any additional filters:
[Screenshot: search results for "syria" without additional filters]
and that took about 0.109 seconds to execute!

Quick MySQL comparison

I just wanted to compare Sphinx with MySQL in terms of performance here:
I executed a MySQL query with a condition similar to the one I executed on Sphinx in the previous section, and here is the result:

mysql> select id from content where match(body) against('*syria*' in boolean mode) and status=1;
+--------+
| id     |
+--------+
| 145805 |
| 142579 |
| 133329 |
|  59778 |
|  95318 |
|  94979 |
|  83539 |
|  56858 |
| 181915 |
| 181916 |
| 181917 |
| 181918 |
+--------+
12 rows in set (10.74 sec)

MySQL took about 10 seconds to execute the same query compared to about 0.1 second using sphinx.

Conclusion

Now the simple PHP script is running with Sphinx and MySQL, and I have explained the main functions to control Sphinx using the PHP API, including sorting, matching and filtering.
There are many other powerful features of Sphinx, like MultiQuery, MVA (multi-valued attributes), grouping…etc., that I may write about in the future.

Optimizing and Compiling js/css files in php

In the last month, my team in the company has been working on applying a new theme to an old project of ours; this project is more than 3 years old and is written in relatively old technology (symfony 1.4/Propel ORM).

I wanted to find an automated way to optimize the javascript and stylesheet files served by this project (similar to the functionality of Assetic in symfony2), so I wrote a couple of files to automate this optimization, which do the following:

  1. Scan the stylesheet folder and optimize its files using the CssMin project:
    which compresses the CSS files by removing whitespace and comments, then minifies them.
  2. Scan the javascript folder and optimize its files using the Google Closure Compiler:
    which parses the javascript files and converts them into a better optimized form, as the Closure page states:

    It parses your JavaScript, analyzes it, removes dead code and rewrites and minimizes what’s left. It also checks syntax, variable references, and types, and warns about common JavaScript pitfalls.

    note: I used CSSMin and the Google Closure Compiler since they have the least dependencies, so I can use them without installing additional packages; other options like Grunt or UglifyJS are really powerful but require npm to be installed.
  3. Create a new unique name for each file, using the md5 hash of the optimized file contents, and copy it to the destination folder under the new name.
    This will prevent the browser from serving stale cached versions of the modified files.

    note: another method for preventing browser caching is “cache busting”, where you append a changing query string to the resource file (a small sketch of such a helper follows this list), like:

    <link rel="stylesheet" type="text/css" media="screen" href="/css/style.css?v=8" />
  4. Add the association between the original file and the compiled file name to an array that will be used when rendering the resource path.
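
As mentioned in step 3 above, here is a minimal sketch of such a cache-busting helper (busted_path() is just a hypothetical name; it assumes the resources live under the document root):

<?php
// append the file modification time as a version query string,
// so the URL changes whenever the file contents change (sketch only)
function busted_path($file) {
    $absolute = $_SERVER['DOCUMENT_ROOT'] . $file;
    $version  = file_exists($absolute) ? filemtime($absolute) : 0;
    return $file . '?v=' . $version;
}

// usage in header.php: echo busted_path('/css/style.css');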

 

and here is the code of the task that performs the optimization:

<?php
include('CSSMinify.php'); //download CSS Min from http://code.google.com/p/cssmin/
class OptimizeResourcesTask{

    private static $RESOURCES_ASSOCIATION = array();  //array to hold  mapping between original files and compiled ones
    
    private $source_js_folder = 'js/'; //the relative path of the source files for javascript directory, it will be scanned and its individual js files will be optimized
    private $target_js_folder = 'js-compiled/';    //result js files will be stored in this directory
    
    //target and source CSS folders preferred to be on the same folder level, 
    //otherwise path rewrite should be handled in the contents of the css files
    private $source_css_folder = 'css/'; //the relative path of the source files for stylesheet directory, it will be scanned and its individual css files will be optimized
    private $target_css_folder = 'css-compiled/';  //result css files will be stored in this directory   
    
    //path of the file that will hold associative array containing mapping between original files and compiled ones
    private $resource_map_file = '_resource_map.php';
    


    public function run() {

        $css_dir = __DIR__ .'/'. $this->source_css_folder;
        $this->optimizeCSSResources($css_dir);

        $js_dir = __DIR__.'/'.$this->source_js_folder;
        $this->optimizeJSResources($js_dir);

        $this->writeMappingData();
        
        
        $this->cleanupOldData($this->target_css_folder, 'css');
        $this->cleanupOldData($this->target_js_folder, 'js');
    }

    /**
     * iterating over the CSS directory and optimizing all of its contents
     * every single CSS file found in this directory will be passed to the optimizeOneCSS() method in order to be optimized
     * @param string $dir
     */
    protected function optimizeCSSResources($dir = null) {
        if (is_null($dir)) {
            $dir = __DIR__ . '/'.$this->source_css_folder;
        }

        if ($handle = opendir($dir)) {
            while (false !== ($entry = readdir($handle))) {
                if ($entry != "." && $entry != "..") {
					
                    if (is_dir($dir . $entry)) {
                        $this->optimizeCSSResources($dir . $entry . '/');
                    } else {
                        $this->optimizeOneCSS($dir . $entry);
                    }
                }
            }
        }
    }


    /**
     * optimize one CSS file by using CSSMin library to minify the contents of the file
     * generate new file name using hash of its file contents
     * add the new file name association to $RESOURCES_ASSOCIATION static variable in order to write resource association array later
     * @link "http://code.google.com/p/cssmin/" CSSMin documentation
     * @param string $file css file absolute path to minify
     */
    protected function optimizeOneCSS($file) {
	
        print('trying to optimize css file ' . $file. chr(10));
        $info = pathinfo($file);
        if ($info['extension'] == 'css') {
            $optimized_css = CssMin::minify(file_get_contents($file));

            $target_css_dir_absolute = __DIR__ . '/' . $this->target_css_folder;
            if (!is_dir($target_css_dir_absolute)) {
                mkdir($target_css_dir_absolute);
                chmod($target_css_dir_absolute, 0777);
            }

            $new_name = md5($optimized_css) . '.css';
            file_put_contents($target_css_dir_absolute .  $new_name, $optimized_css);


            $file_relative_path = str_replace(__DIR__ , '', $file);
			
            self::$RESOURCES_ASSOCIATION[$file_relative_path] = '/' . $this->target_css_folder .  $new_name;

            print('CSS FILE: ' . $file . ' has been optimized to ' . $target_css_dir_absolute .  $new_name. chr(10));
			
        } else {
            print("skipping $file from optimization, not stylesheet file, just copying it". chr(10));
            
            $file_relative_path = str_replace(__DIR__ . $this->source_css_folder, '/', $file);
            
            $target_css_dir_absolute = __DIR__ . '/' . $this->target_css_folder .dirname($file_relative_path);
            
            if (!is_dir($target_css_dir_absolute)) {
                mkdir($target_css_dir_absolute);
                chmod($target_css_dir_absolute, 0777);
            }
            
            copy($file, $target_css_dir_absolute.'/'.basename($file));
        }
    }
	
	
    /**
     * iterating over the JS directory and optimizing all of its files' contents
     * every single JS file found in this directory will be passed to the optimizeOneJS() method in order to be optimized/minimized
     * @param string $dir
     */
    protected function optimizeJSResources($dir = null) {

        if (is_null($dir)) {
            $dir = __DIR__ . '/'.$this->source_js_folder;
        }
        print('getting JS inside ' . $dir. chr(10));

        if ($handle = opendir($dir)) {
            while (false !== ($entry = readdir($handle))) {
                
                if ($entry != "." && $entry != "..") {

                    if (is_dir($dir . $entry)) {
                        $this->optimizeJSResources($dir . $entry . '/');
                    } else {
                        $file_path = $dir . $entry;
                        $pathinfo = pathinfo($file_path);
                        if($pathinfo['extension']=='js'){
                            $this->optimizeOneJS($file_path);
                        }else{
                            print($file_path.' is not passed to optimization, its not a valid js file'. chr(10));
                        }
                    }
                }
            }
        }
    }

    /**
     * optimize one JS File using "Google Closure Compiler", 
     * store the optimized file in target directory named as hash of the file contents
     * add the new file name association to $RESOURCES_ASSOCIATION static variable in order to write resource association array later
     * @link  "https://developers.google.com/closure/compiler/docs/gettingstarted_api" "Google Closure Compiler API"
     * @param string $file js file absolute path to optimize/minify
     */
    protected function optimizeOneJS($file) {
	
        print("trying to optimize js ". $file. chr(10));

        $post_fields = array(
            'js_code' => file_get_contents($file),
            'compilation_level' => 'SIMPLE_OPTIMIZATIONS',
            'output_format' => 'text',
            'output_info' => 'compiled_code',
        );



        $ch = curl_init("http://closure-compiler.appspot.com/compile");
        curl_setopt($ch, CURLOPT_POST, count($post_fields));
        curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_fields));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

        $optimized_js = curl_exec($ch);
        
        if(strpos($optimized_js,'Error(22): Too many compiles performed recently.') !==false){ //Google Closure API returned error on too many compilation
            trigger_error($file.' is failed to be compiled, skipped...');
            return;
        }
        
        curl_close($ch);

        $target_js_dir_absolute = __DIR__ . '/' . $this->target_js_folder;
        if (!is_dir($target_js_dir_absolute)) {
            mkdir($target_js_dir_absolute);
            chmod($target_js_dir_absolute, 0777);
        }

        $new_name = md5($optimized_js) . '.js';
        file_put_contents($target_js_dir_absolute . '/' . $new_name, $optimized_js);


        $file_relative_path = str_replace(__DIR__ , '', $file);
        self::$RESOURCES_ASSOCIATION[$file_relative_path] = '/' . $this->target_js_folder . $new_name;

        print('JS FILE: ' . $file . ' has been optimized to ' . $new_name. chr(10));
    }

    /**
     * write $resources_map array stored in $RESOURCES_ASSOCIATION into $resource_map_file file that will be used in generating
     * the association between original JS/CSS files and the optimized/minimized ones
     */
    protected function writeMappingData() {
        $str = "<?php \$resources_map = array(";
        foreach (self::$RESOURCES_ASSOCIATION as $original_file => $optimized_file) {
            $str .= "'$original_file'=>'$optimized_file', " . chr(10);
        }
        $str .= "); ";

        $f = fopen(__DIR__ . '/' . $this->resource_map_file, 'w+');
        fwrite($f, $str);
        fclose($f);

        echo 'mapping data written to ' . $this->resource_map_file . chr(10);
    }

    /**
     * this function will remove any file that exists in $target_js_folder or $target_css_folder
     * but does not exist in the $RESOURCES_ASSOCIATION array; such files were most probably generated by old builds and are not used anymore
     * @param $dir the relative path of the directory to cleanup
     * @param $extension_to_filter the extension that is going to be cleaned (either css or js), the idea is to ignore cleaning static resources like font files, ex. woff, eot
     */
    protected function cleanupOldData($dir, $extension_to_filter){
        $dir_absolute = __DIR__.'/'.$dir;
		
        if ($handle = opendir($dir_absolute)) {
            while (false !== ($entry = readdir($handle))) {
                if ($entry != "." && $entry != "..") {
                    
                    if (is_dir($dir_absolute .  $entry)) {
                        $this->cleanupOldData($dir . $entry . '/', $extension_to_filter);
                    }else{
                        $file_path = $dir_absolute .  $entry;
                        $pathinfo = pathinfo($file_path);
                        print('examining   /'.$dir .  $entry. chr(10));
                        
                        //including deletion of backup files (ex. style.css~)
                        if(in_array($pathinfo['extension'], array($extension_to_filter, $extension_to_filter.'~')) && !in_array('/'.$dir . $entry, self::$RESOURCES_ASSOCIATION)){                            
                            unlink($file_path);
                            print($file_path.' is deleted....'. chr(10));
                        }
                    }
                }
            }
        }        
    }
}


$task = new OptimizeResourcesTask();
$task->run();
echo 'optimization done...';

After running this task, it will generate the "_resource_map.php" file, which contains an array mapping the original resources to the compiled ones; its contents will be similar to this:

<?php $resources_map = array('/css/main.css'=>'/css-compiled/d41d8cd98f00b204e9800998ecf8427e.css', 
'/css/redmondjquery-ui-1.8.14.custom.css'=>'/css-compiled/d41d8cd98f00b204e9800998ecf8427e.css', 
'/css/style.css'=>'/css-compiled/f866be09baee73d596cb578b02d37d29.css', 
'/js/jquery-1.5.1.min.js'=>'/js-compiled/6c1b3f8d121bfefdad82fb4854a8f254.js', 
'/js/jquery-ui-1.8.14.custom.min.js'=>'/js-compiled/e34d1750b1305e35327964b7f0ea6bb9.js', 
'/js/jquery.cookie.js'=>'/js-compiled/08bf7e471064522f8e45c382b2b93550.js', 
'/js/jquery.easing-1.3.pack.js'=>'/js-compiled/0301f5ff89729b3c0fc5622b7633f4b8.js', 
'/js/jquery.fancybox-1.3.4.js'=>'/js-compiled/cb707a9b340d624510e1fa27d3692f0e.js', 
'/js/jquery.fancybox-1.3.4.pack.js'=>'/js-compiled/f58ec8d752b6148925d6a3f14061c269.js', 
'/js/jquery.min.js'=>'/js-compiled/5ee7bdd2dbbdec528925cb61c3010598.js', 
'/js/jquery.validate.min.js'=>'/js-compiled/9d28b87b0ec7b4e3195665adbd6918be.js', 
); 

Now we need a function that returns the optimized version of a file (in the production environment only):

<?php 
function resource_path($file){
    global $config;
    if($config['env'] == 'prod'){ //serve compiled resource only on production environment
        include '_resource_map.php';
        if(isset($resources_map[$file])){
            return $resources_map[$file]; //return compiled version of the file
        }
    }
    return $file;
} ?>

and here is an example of using these CSS/JS files in "header.php":

<link href="<?php echo resource_path('/css/style.css') ?>" rel="stylesheet" type="text/css" />
<script src="<?php echo resource_path('/js/jquery.min.js') ?>" type="text/javascript" ></script>

Now once you render the page in the production environment, the optimized CSS/JS files will be served instead of the original ones, as follows:
<link href="/css-compiled/f866be09baee73d596cb578b02d37d29.css" rel="stylesheet" type="text/css" />
<script src="/js-compiled/5ee7bdd2dbbdec528925cb61c3010598.js" type="text/javascript" ></script>

Now everything works well, and you can serve optimized versions of your resource files with minimal effort upon each update to your website. Whenever there are amendments to the website theme, I only need to run OptimizeResourcesTask to optimize the files and serve them automatically in the production environment.

I used this code for my projects that are written in native PHP or an old Symfony version, but as I mentioned earlier, there are frameworks like symfony2's Assetic that provide similar functionality with a long list of optimizers available.