Tag Archives: laravel

Laravel collections usage with crawler links

Recently I have been maintaining Arachnid, a package for crawling a website’s internal links. It crawls the links and extracts information about each page, including the title, meta tags, h1 tags and status code, along with other info, and returns the link information as an array. I was searching for a convenient way to extract a meaningful summary from the library output, and that’s when I came across Laravel Collections, which turned out to be really useful in my case.

Basic Information

Arachnid library returns information about website links similar to the structure below:

 
  "/" => array:14 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://zrashwani.com/"
    "external_link" => false
    "visited" => true
    "frequency" => 1
    "source_link" => "http://zrashwani.com"
    "depth" => 1
    "status_code" => 200
    "title" => "Z.Rashwani Blog - I write here whatever comes to my mind"
    "meta_keywords" => "Zeid Rashwani, Zaid Rashwani, zrashwani, web development, LAMP, PHP, mysql, Linux, Symfony2, apache, DBA"
    "meta_description" => "Zeid Rashwani personal blog, mostly contains technical topics related to web development using, LAMP, Linux, Apache, MySQL, PHP, and other open source technologies"
    "h1_count" => 3
    "h1_contents" => array:3 [ …3]
  ]
  "" => array:7 [▼
    "original_urls" => array:3 [ …3]
    "links_text" => array:2 [ …2]
    "visited" => false
    "dont_visit" => true
    "external_link" => false
    "source_link" => "http://zrashwani.com"
    "depth" => 1
  ]
  //... other links

As shown, each element stores information about a link, including:
* original_urls: the original URLs before normalization.
* links_text: the text inside the <a> tags in which the link appears.
* absolute_url: the absolute version of the URL.
* visited: whether this link has been visited/crawled or not.
* frequency: how many times the link appears in the website.
* source_link: the page from which this link was first crawled.
* depth: the level/depth at which the link was found.
* status_code: the status code of the page (e.g. 200, 404, etc.).
* title, meta description, meta keywords: meta information about the page.
* h1_count, h1_contents: the number of <h1> tags, and their contents.

Main Functions

Laravel collections offer a large set of functions; mainly, they are wrappers over functional methods like filter, map, reduce and each.
The main methods that I used in my case are the following:

  • filter: filter out any elements of an array that you don’t want.
  • map: transform each item in an array into something else.
  • each: loop over collection elements.

These methods are the main functional building blocks. They wrap native PHP functions like array_walk, array_filter and array_map, but with a cleaner OOP approach that works well with method chaining. They are also higher-order functions: each one takes a callback function as a parameter.
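
To make the comparison concrete, here is a minimal sketch of the same three building blocks written with the native PHP functions only. The sample data is hypothetical and merely shaped like the crawler output; it is not real Arachnid output.

```php
<?php
// Hypothetical sample data, shaped like the crawler output shown earlier.
$links = [
    '/'                  => ['external_link' => false, 'title' => 'Home'],
    '/about'             => ['external_link' => false, 'title' => 'About'],
    'http://example.com' => ['external_link' => true,  'title' => 'Example'],
];

// filter: keep internal links only (array keys are preserved).
$internal = array_filter($links, function ($link) {
    return $link['external_link'] === false;
});

// map: reduce each link to its title (keys are preserved when a single array is passed).
$titles = array_map(function ($link) {
    return $link['title'];
}, $internal);

// each: visit every element for its side effect.
$lines = [];
array_walk($titles, function ($title, $uri) use (&$lines) {
    $lines[] = $uri . ' => ' . $title;
});

echo implode("\n", $lines), "\n";
```

With collections, the same pipeline reads left to right as collect($links)->filter(...)->map(...)->each(...), without the intermediate variables.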

As a basic example, the link collection can be created as follows:

<?php 
$url = 'http://zrashwani.com/'; //website url to be crawled 
$crawler = new \Arachnid\Crawler($url, 3);

$crawler->traverse();
$links = $crawler->getLinks();
$collection = collect($links);

Manipulate using Collections

1. Arrange links by page source:
Links can be grouped by their parent page using the groupBy function, so that each parent page URL is the key and the value holds the info of the links that page contains:

<?php
$linksBySource = $collection->groupBy('source_link');

The output will be a collection keyed by source page, where each value is a collection of the links that page contains, similar to the result below:
LinksCollection {#276 ▼
  #items: array:18 [▼
    "" => LinksCollection {#261 ▶}
    "http://zrashwani.com/" => LinksCollection {#117 ▼
      #items: array:144 [▶]
    }
    "http://zrashwani.com/category/technical-topics/" => LinksCollection {#84 ▼
      #items: array:2 [▶]
    }
    "http://zrashwani.com/category/anime-reviews/" => LinksCollection {#124 ▶}
    "http://zrashwani.com/about-me/" => LinksCollection {#168 ▶}
    "http://zrashwani.com/wcf-ssl-service-with-php/" => LinksCollection {#207 ▶}
    "http://zrashwani.com/tag/php/" => LinksCollection {#212 ▶}
    "http://zrashwani.com/author/zaid/" => LinksCollection {#295 ▶}
    "http://zrashwani.com/sonata-admin-bundle-multiple-connection/" => LinksCollection {#213 ▶}
    "http://zrashwani.com/tag/database/" => LinksCollection {#52 ▶}
    "http://zrashwani.com/tag/symfony/" => LinksCollection {#140 ▶}
    "http://zrashwani.com/applying-version-stamp-symfony-sonataadmin/" => LinksCollection {#239 ▶}
    "http://zrashwani.com/server-sent-events-example-laravel/" => LinksCollection {#145 ▶}
    "http://zrashwani.com/pagination-optimization-symfony2-doctrine/" => LinksCollection {#162 ▶}
    "http://zrashwani.com/tag/mysql/" => LinksCollection {#282 ▶}
    "http://zrashwani.com/materialized-views-example-postgresql-9-3/" => LinksCollection {#219 ▶}
    "http://zrashwani.com/simple-web-spider-php-goutte/" => LinksCollection {#270 ▶}
    "http://zrashwani.com/page/2/" => LinksCollection {#218 ▶}
  ]
}
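
For readers who want to see what the grouping does under the hood, the following is a rough plain-PHP sketch of grouping by a field. The helper name groupLinksBy is hypothetical, and unlike groupBy('source_link') without its second argument, this sketch keeps the original URI keys inside each group.

```php
<?php
/**
 * Hypothetical helper: group link info arrays by the value of one field.
 * Entries that lack the field end up under the empty-string key.
 */
function groupLinksBy(array $links, $field)
{
    $groups = [];
    foreach ($links as $uri => $link) {
        $key = isset($link[$field]) ? $link[$field] : '';
        $groups[$key][$uri] = $link;
    }
    return $groups;
}

// Hypothetical sample data.
$links = [
    '/about-me/' => ['source_link' => 'http://zrashwani.com/'],
    '/tag/php/'  => ['source_link' => 'http://zrashwani.com/'],
    '/page/2/'   => ['source_link' => 'http://zrashwani.com/tag/php/'],
    '#'          => [], // no source_link recorded
];

$bySource = groupLinksBy($links, 'source_link');
print_r(array_map('array_keys', $bySource));
```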

2. Get External Links:
External links can be retrieved by filtering on the “external_link” key, as below:

<?php
$externalLinks = $collection->filter(function($link_info){
            //keep only links flagged as external by the crawler
            return isset($link_info['external_link'])
                       && $link_info['external_link']===true;
        });
        

The output will include all external links in the website as below:
LinksCollection {#64 ▼
  #items: array:112 [▼
    "http://www.wewebit.com" => array:14 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "http://www.wewebit.com"
      "external_link" => true
      "visited" => true
      "frequency" => 1
      "source_link" => "http://zrashwani.com/"
      "depth" => 1
      "status_code" => 200
      "title" => "WeWebit - Web Development specialists"
      "meta_keywords" => "Wewebit, Website, Mt4, Forex web design , Development, Forex website design ,Forex web development ,forex bo system , Forex Backoffice Systems ,forex CRM , FX toolskit, forex toolskit , Forex client area , forex client cabinet , members cabinet , forex IB system , ecommerce website development"
      "meta_description" => "Web Development Company in jordan providing development and design services, including Forex Solutions , News Portals , Custom Web Applications, online e-commerce solutions"
      "h1_count" => 2
      "h1_contents" => array:2 [▶]
    ]
    "http://jo.linkedin.com/pub/zaid-al-rashwani/14/996/180/" => array:10 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "http://jo.linkedin.com/pub/zaid-al-rashwani/14/996/180/"
      "external_link" => true
      "visited" => false
      "frequency" => 1
      "source_link" => "http://zrashwani.com/"
      "depth" => 1
      "status_code" => 999
      "error_message" => 999
    ]
    "https://twitter.com/zaid_86" => array:14 [▶]
    ...
  ]
}    

3. Filter and search links by depth:
This is useful for getting the links that first appeared at the nth level of the website, for example getting the links at depth 3 as below:

<?php
$depth = 3;
$depthLinks = $collection->filter(function($link) use($depth){
            return isset($link['depth']) && $link['depth'] == $depth;
        });
        

The output will be similar to this:
LinksCollection {#734 ▼
  #items: array:141 [▼
    "/introduction-to-sphinx-with-php-part2/" => array:8 [▶]
    "/category/technical-topics/page/2/" => array:8 [▶]
    "/anime-watched-in-summer-2015/" => array:8 [▶]
    "/anime-fall-2015-shows/" => array:8 [▶]
    "/anime-winter-2015-watch-list/" => array:8 [▶]
    "/anime-winter-2014-watch-list/" => array:8 [▶]
    

Alternatively, you can get simple statistics about how many links exist at each level of the website by combining the groupBy function with the mapWithKeys function (which works like map, but its callback returns a key/value pair), as below:

<?php
$linksGroupedByDepth = $collection->groupBy('depth')
        ->mapWithKeys(function($depthGroup,$depth){
            return [$depth =>$depthGroup->count()];
        });
        

It will display how many links exist at each site level:
LinksCollection {#824 ▼
  #items: array:4 [▼
    0 => 1
    1 => 75
    2 => 300
    3 => 141
  ]
}
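
As a cross-check of the numbers, the same per-depth count can be sketched with native PHP functions only. The sample data is hypothetical. Entries without a depth key are skipped here, whereas groupBy would collect them under an empty key.

```php
<?php
// Hypothetical sample data; only the depth field matters for this count.
$links = [
    'http://zrashwani.com' => ['depth' => 0],
    '/'                    => ['depth' => 1],
    '/about-me/'           => ['depth' => 1],
    '/tag/php/'            => ['depth' => 2],
    '#'                    => [], // entry without a depth key
];

// array_column() pulls out the depth values, array_count_values() tallies them.
$countByDepth = array_count_values(array_column($links, 'depth'));
ksort($countByDepth);

print_r($countByDepth); // [0 => 1, 1 => 2, 2 => 1]
```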

4. Get Broken links:
Broken links can be retrieved by filtering items according to status code. Successful pages have a status code between 200 and 299, so anything else is considered a broken link, as below:

<?php 
$brokenLinks = $collection->filter(function($link){
    return isset($link['status_code']) && 
            ($link['status_code'] >= 300 || $link['status_code'] <200);
});

Better still, broken links can be grouped by the page where the link exists using the groupBy function, keeping only summary information with the map function:
<?php
$brokenLinksBySource = $collection->filter(function($link){
    return isset($link['status_code']) && 
            ($link['status_code'] >= 300 || $link['status_code'] <200);
})->map(function($link){
    return [
        'source_page' => $link['source_link'],
        'link'        => $link['absolute_url'],
        'status_code' => $link['status_code'],
        'links_text'  => $link['links_text'],
    ];
})
->unique('link')
->groupBy('source_page'); //group by the key produced in map() above
        

5. Getting pages that have no title or h1 tags:
This is useful for SEO purposes, and can be done using filter method:

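<?php
//getting visited pages with a missing or empty title
$linksWithMissingTitle = $collection->filter(function($link_info){
           return !empty($link_info['visited']) && empty($link_info['title']); 
        });

<?php
//getting pages with no <h1> tags
$missingH1Pages = $collection->filter(function($link_info){
           //unvisited links have no h1_count key, so they are skipped
           return isset($link_info['h1_count']) && $link_info['h1_count']==0; 
        });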
        

The output will be similar to the following (both sample pages have an empty title, and the second one also has no <h1> tags):
LinksCollection {#823 ▼
  #items: array:2 [▼
    "http://wordcomat.com" => array:14 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "http://wordcomat.com"
      "external_link" => true
      "visited" => true
      "frequency" => 2
      "source_link" => "http://zrashwani.com/simple-web-spider-php-goutte/"
      "depth" => 2
      "status_code" => 200
      "title" => ""
      "meta_keywords" => ""
      "meta_description" => ""
      "h1_count" => 1
      "h1_contents" => array:1 [▶]
    ]
    "http://matword.com" => array:14 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:2 [▶]
      "absolute_url" => "http://matword.com"
      "external_link" => true
      "visited" => true
      "frequency" => 4
      "source_link" => "http://zrashwani.com/simple-web-spider-php-goutte/"
      "depth" => 2
      "status_code" => 200
      "title" => ""
      "meta_keywords" => ""
      "meta_description" => ""
      "h1_count" => 0
      "h1_contents" => []
    ]
  ]
}

6. Getting duplicate titles between different URLs:
This is useful to see whether different pages share the same title, which may affect your SEO negatively. It can be done by combining the filter, unique, groupBy and map methods as follows:

<?php
        $duplicateTitlePages = $collection
                   ->filter(function($linkInfo){
                       return $linkInfo['visited']===true;
                   })
                   ->unique('absolute_url')  //drop entries pointing to the same page
                   ->groupBy('title',true)   //true keeps the URI keys in each group
                   ->filter(function($links){
                       return count($links)>1;
                   })->map(function($linkGroup){
                       return $linkGroup->keys();
                   });

The output will be a collection with each duplicated title as a key, along with the URLs that have that title, in the following form:
LinksCollection {#652 ▼
  #items: array:1 [▼
    "Z.Rashwani Blog - I write here whatever comes to my mind" => LinksCollection {#650 ▼
      #items: array:2 [▼
        0 => "http://zrashwani.com/"
        1 => "/"
      ]
    }
  ]
}

More

There are many other ways to combine Laravel collection methods to get useful information from such a link info array, like getting links with internal server errors or links with a long/short meta description, etc. The more you get used to collections, the more useful information you can retrieve.
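
The two ideas mentioned above can be sketched as reusable callbacks. This is only an illustration under assumptions: the 50 and 160 character limits are arbitrary example values, and the sample data is hypothetical. The closures are tested here with array_filter so the snippet runs without Laravel, but since each takes the link info as its first argument, they can be passed to $collection->filter() unchanged.

```php
<?php
// Predicate: the page answered with a 5xx (server error) status code.
$isServerError = function ($link) {
    return isset($link['status_code'])
        && $link['status_code'] >= 500 && $link['status_code'] < 600;
};

// Predicate factory: meta description shorter than $min or longer than $max.
// The default limits are illustrative assumptions, not fixed SEO rules.
$hasBadDescriptionLength = function ($min = 50, $max = 160) {
    return function ($link) use ($min, $max) {
        if (!isset($link['meta_description'])) {
            return false; // link was not crawled, nothing to judge
        }
        $length = strlen($link['meta_description']); // counts bytes, not characters
        return $length < $min || $length > $max;
    };
};

// Hypothetical sample data.
$links = [
    '/ok'        => ['status_code' => 200, 'meta_description' => str_repeat('a', 80)],
    '/short'     => ['status_code' => 200, 'meta_description' => 'Too short'],
    '/crash'     => ['status_code' => 500, 'meta_description' => ''],
    '/unvisited' => ['visited' => false],
];

$serverErrors    = array_filter($links, $isServerError);
$badDescriptions = array_filter($links, $hasBadDescriptionLength());

print_r(array_keys($serverErrors));    // ['/crash']
print_r(array_keys($badDescriptions)); // ['/short', '/crash']
```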

Server Sent Events example with Laravel

Recently I read about HTML5 Server Sent Events, and liked the concept of establishing a long-lived connection to the server instead of performing frequent Ajax calls to pull the updates. I wanted to put it into action by implementing a live currency rates widget with Laravel as the backend PHP application.

Basic Introduction

What are “Server Sent Events”?
As Wikipedia defines it:

Server-sent events (SSE) is a technology for a browser to get automatic updates from a server via HTTP connection. The Server-Sent Events EventSource API is standardized as part of HTML5 by the W3C.

Basically, it’s an HTML5 technology that lets a web client receive data from the server over a single long-lived connection: the server sends a stream of data to the browser without closing the connection (basically the connection remains active until the browser closes it). This technique is useful for pushing news updates, sending automatic updates in a social network, populating live price components, etc.

The older approach is Ajax polling (often called long polling), in which the web client requests the updates by issuing frequent separate requests (an Ajax request that re-issues itself after a timeout), like the following example:
(function poll(){
   setTimeout(function(){
      $.ajax({ url: "/path/to/url", success: function(data){
        console.log(data);  
        poll();
      }, dataType: "json"});
  }, 30000);
})();

To make the idea clearer, I will use a live currency rates widget as an example. This widget gets the rates for converting from one currency to another, and displays up and down arrows to indicate the change in price.

Basic Usage

The following snippet shows the basic usage of SSE with javascript:

<script type="text/javascript">
var es = new EventSource("/path/to/url");
es.addEventListener("message", function(e) {
            console.log(e.data);
}, false);
</script>

This piece of JavaScript code initializes an EventSource object that listens to the specified URL and processes the data as the server sends it to the browser. Each time the server sends new data, the event listener is called and processes the information according to the callback function implementation.

The code

As I said, Laravel will be used to implement this example. I will implement two actions: one renders the whole page, and the other sends only the modified data in JSON format to the EventSource, as follows:

First, I defined the routes in routes.php:

// in app/routes.php 
Route::get('/prices-page', 'HomeController@pricesPage');
Route::get('/prices-values', 'HomeController@pricesValues');

Then, I will implement a method to retrieve the rate values (I used the Yahoo service as a free feed source):
    /**
     * retrieve rates of currencies from feed
     * @return array
     */
    protected function getCurrencyRates() {
        $pair_arr = array('EURUSD', 'GBPUSD', 'USDJPY', 'XAUUSD', 'XAGUSD', 'USDJOD');
        $currencies_arr = array();

        foreach ($pair_arr as $pair) {
            try {
                
                $price_csv = file_get_contents("http://finance.yahoo.com/d/quotes.csv?e=.csv&f=sl1d1t1&s=$pair=X");
                $price_data = explode(',', $price_csv);
                $currencies_arr[$pair]['price'] = $price_data[1];
                $currencies_arr[$pair]['status'] = '';
            } catch (Exception $ex) {
                $currencies_arr['error'] = $ex->getMessage();
            }
        }
        return $currencies_arr;
    }

Fetching a file from an external source inside a controller is not efficient, but I do it here for the purpose of the example. Usually, I write a backend command to get the prices from the external source (usually a trading server), and the controller methods retrieve the data from the database.

Second, I will implement the action to render the whole price block:

public function pricesPage() {
    $prices = $this->getCurrencyRates();
    return View::make('pricesPage', array('prices' => $prices));        
}

and here is the template:
<h1>Prices here</h1>
<table>
    <thead>
        <tr>
            <th>Currency</th>
            <th>Rate</th>
            <th>status</th>
        </tr>
    </thead>
    <tbody>
        <?php foreach($prices as $currency=>$price_info){?>
        <tr class="price-row">
            <td><?php echo $currency?></td>
            <td data-symbol-price="<?php echo $currency; ?>"><?php echo $price_info['price']; ?></td>
            <td data-symbol-status="<?php echo $currency; ?>"><?php echo $price_info['status']; ?></td>
        </tr>
        <?php }?>
    </tbody>
</table>

<script type="text/javascript">
        var es = new EventSource("<?php echo action('HomeController@pricesValues'); ?>");
        es.addEventListener("message", function(e) {
            var arr = JSON.parse(e.data);

            for (var x in arr) {
                $('[data-symbol-price="' + x + '"]').html(arr[x].price);
                $('[data-symbol-status="' + x + '"]').html(arr[x].status);
                //apply some effect on change, like blinking the color of modified cell...
            }
        }, false);
</script>    

And now I will implement the pricesValues() action that will push the data to the browser, as follows:

    /**
     * action to handle streamed response from laravel
     * @return \Symfony\Component\HttpFoundation\StreamedResponse
     */
    public function pricesValues() {

            $response = new Symfony\Component\HttpFoundation\StreamedResponse(function() {
            $old_prices = array();

            while (true) {
                $new_prices = $this->getCurrencyRates();
                $changed_data = $this->getChangedPrices($old_prices, $new_prices);

                if (count($changed_data)) {
                    echo 'data: ' . json_encode($changed_data) . "\n\n";
                    ob_flush();
                    flush();
                }
                sleep(3);
                $old_prices = $new_prices;
            }
        });

        $response->headers->set('Content-Type', 'text/event-stream');
        return $response;
    }
    

    /**
     * comparing old and new prices and return only changed currency rates
     * @param array $old_prices
     * @param array $new_prices
     * @return array
     */
    protected function getChangedPrices($old_prices, $new_prices) {
        $ret = array();
        foreach ($new_prices as $curr => $curr_info) {
            if (!isset($old_prices[$curr])) {
                $ret[$curr]['status'] = '';
                $ret[$curr]['price'] = $curr_info['price'];                
            } elseif ($old_prices[$curr]['price'] != $curr_info['price']) {
                $ret[$curr]['status'] = $old_prices[$curr]['price']>$curr_info['price']?'down':'up';
                $ret[$curr]['price'] = $curr_info['price']; 
            }
        }

        return $ret;
    }

As you can see, the action that pushes data to the event source has the following properties:

  1. The content type of the response is text/event-stream.
  2. The response returned here is of type “StreamedResponse”, which is part of the Symfony HttpFoundation component. This type of response enables the server to return data to the client in chunks. The StreamedResponse object accepts a callback function that outputs the transferred data chunks.
  3. Only the prices that have changed since the latest push are sent to the browser (comparing old and new prices is easy since they reside in the same action), so if the prices didn’t change, nothing is sent.
  4. The returned data is prefixed with “data:” and has “\n\n” characters appended to the end.
  5. flush() and ob_flush() are called to trigger sending the data to the browser.
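
The framing described in point 4 can be wrapped in a small helper. The sketch below is not part of the example above: the function name formatSseMessage is hypothetical, and the optional event and id fields come from the Server-Sent Events format rather than from this widget. It also handles payloads that span several lines, since each line needs its own “data:” prefix.

```php
<?php
/**
 * Hypothetical helper: build one Server-Sent Events message.
 *
 * @param string      $data  payload, e.g. a JSON string
 * @param string|null $event optional event name (the client then listens for it by name)
 * @param string|null $id    optional id (browsers send it back as Last-Event-ID on reconnect)
 * @return string
 */
function formatSseMessage($data, $event = null, $id = null)
{
    $message = '';
    if ($id !== null) {
        $message .= 'id: ' . $id . "\n";
    }
    if ($event !== null) {
        $message .= 'event: ' . $event . "\n";
    }
    // Every line of the payload gets its own "data:" prefix.
    foreach (explode("\n", $data) as $line) {
        $message .= 'data: ' . $line . "\n";
    }
    return $message . "\n"; // the blank line terminates the message
}

$changed = array('EURUSD' => array('price' => '1.0850', 'status' => 'up'));
echo formatSseMessage(json_encode($changed));
```

Inside the streamed callback, the echo line would be followed by the same ob_flush() and flush() calls as before.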
For browsers that don’t support EventSource, you can apply a simple fallback as follows:
<script type="text/javascript">
if(window.EventSource !== undefined){
    // EventSource is supported, go ahead...
} else {
    // EventSource not supported, 
    // apply ajax long poll fallback
}
</script>

The final output

Now the live currency rates widget is ready. The server checks the prices every 3 seconds and sends only the rates that have changed, so the operation is optimized and no unnecessary requests or responses are exchanged.

SSE price rate
* screenshot of the final component.