Laravel collections usage with crawler links

Recently I have been maintaining a package for crawling website internal links called Arachnid. This package crawls a website's links and extracts information about each page, including: title, meta tags, h1 tags, and status code, along with other info, and it returns the link information as an array. I was searching for a convenient way to extract meaningful summary information from the library output, and that's when I came across Laravel Collections, which turned out to be really useful in my case.

Basic Information

The Arachnid library returns information about website links in a structure similar to the one below:

 
  "/" => array:14 [▼
    "original_urls" => array:1 [ …1]
    "links_text" => array:1 [ …1]
    "absolute_url" => "http://zrashwani.com/"
    "external_link" => false
    "visited" => true
    "frequency" => 1
    "source_link" => "http://zrashwani.com"
    "depth" => 1
    "status_code" => 200
    "title" => "Z.Rashwani Blog - I write here whatever comes to my mind"
    "meta_keywords" => "Zeid Rashwani, Zaid Rashwani, zrashwani, web development, LAMP, PHP, mysql, Linux, Symfony2, apache, DBA"
    "meta_description" => "Zeid Rashwani personal blog, mostly contains technical topics related to web development using, LAMP, Linux, Apache, MySQL, PHP, and other open source technologies"
    "h1_count" => 3
    "h1_contents" => array:3 [ …3]
  ]
  "" => array:7 [▼
    "original_urls" => array:3 [ …3]
    "links_text" => array:2 [ …2]
    "visited" => false
    "dont_visit" => true
    "external_link" => false
    "source_link" => "http://zrashwani.com"
    "depth" => 1
  ]
  //... other links

As shown, each element stores information about a link, including:
* original_urls: the original URLs before normalization.
* links_text: the anchor text(s) of the <a> tags the link appears in.
* absolute_url: the absolute version of the URL.
* visited: whether this link has been visited/crawled or not.
* frequency: how many times the link appears in the website.
* source_link: the page from which this link was first crawled.
* depth: the level/depth at which the link was found.
* status_code: the HTTP status code of the page (ex. 200, 404…etc.).
* title, meta description, meta keywords: meta information about the page.
* h1_count, h1_contents: the number of <h1> tags and their contents.

Main Functions

Laravel collections offer a large set of functions; mainly, they are wrappers over functional methods like filter, map, reduce, and each.
The main methods that I used in my case are the following:

  • filter: filter out any elements of an array that you don’t want.
  • map: transform each item in an array into something else.
  • each: loop over collection elements.

These methods are the main functional building blocks; they are wrappers over native PHP functions like array_walk, array_filter, and array_map, but with a cleaner OOP approach that can easily be used with method chaining. They are also higher-order functions that take a callback as a parameter.
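To make the filter/map/each trio concrete, here is a minimal, dependency-free sketch using the native PHP functions that the collection methods wrap. The $links sample below is made up for illustration (it only mimics the shape of the crawler output); once such an array is wrapped with collect(), the same three operations chain fluently as filter()/map()/each().

```php
<?php
// Hypothetical sample shaped like the crawler output (not real data).
$links = [
    '/about'   => ['status_code' => 200, 'depth' => 1],
    '/missing' => ['status_code' => 404, 'depth' => 2],
];

// filter -> array_filter: keep only broken links
$broken = array_filter($links, function ($link) {
    return $link['status_code'] >= 300;
});

// map -> array_map: transform each item into a short summary string
// (array_map preserves string keys when given a single array)
$summaries = array_map(function ($link) {
    return 'HTTP ' . $link['status_code'] . ' at depth ' . $link['depth'];
}, $broken);

// each -> array_walk: loop over the results
array_walk($summaries, function ($summary, $url) {
    echo $url, ': ', $summary, PHP_EOL; // prints "/missing: HTTP 404 at depth 2"
});
```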

As a basic example, the links collection can be built from the crawler output as follows:

<?php 
$url = 'http://zrashwani.com/'; //website url to be crawled 
$crawler = new \Arachnid\Crawler($url, 3);

$crawler->traverse();
$links = $crawler->getLinks();
$collection = collect($links);

Manipulate using Collections

1. Arrange links by source page:
Using the groupBy function, links can be arranged so that each parent page URL appears as a key, with the value being the info of the links that the parent page contains:

<?php
$linksBySource = $collection->groupBy('source_link');

The output will be an associative array, with each source page as a key and the collection of links that the page contains as the value, similar to the result below:
LinksCollection {#276 ▼
  #items: array:18 [▼
    "" => LinksCollection {#261 ▶}
    "http://zrashwani.com/" => LinksCollection {#117 ▼
      #items: array:144 [▶]
    }
    "http://zrashwani.com/category/technical-topics/" => LinksCollection {#84 ▼
      #items: array:2 [▶]
    }
    "http://zrashwani.com/category/anime-reviews/" => LinksCollection {#124 ▶}
    "http://zrashwani.com/about-me/" => LinksCollection {#168 ▶}
    "http://zrashwani.com/wcf-ssl-service-with-php/" => LinksCollection {#207 ▶}
    "http://zrashwani.com/tag/php/" => LinksCollection {#212 ▶}
    "http://zrashwani.com/author/zaid/" => LinksCollection {#295 ▶}
    "http://zrashwani.com/sonata-admin-bundle-multiple-connection/" => LinksCollection {#213 ▶}
    "http://zrashwani.com/tag/database/" => LinksCollection {#52 ▶}
    "http://zrashwani.com/tag/symfony/" => LinksCollection {#140 ▶}
    "http://zrashwani.com/applying-version-stamp-symfony-sonataadmin/" => LinksCollection {#239 ▶}
    "http://zrashwani.com/server-sent-events-example-laravel/" => LinksCollection {#145 ▶}
    "http://zrashwani.com/pagination-optimization-symfony2-doctrine/" => LinksCollection {#162 ▶}
    "http://zrashwani.com/tag/mysql/" => LinksCollection {#282 ▶}
    "http://zrashwani.com/materialized-views-example-postgresql-9-3/" => LinksCollection {#219 ▶}
    "http://zrashwani.com/simple-web-spider-php-goutte/" => LinksCollection {#270 ▶}
    "http://zrashwani.com/page/2/" => LinksCollection {#218 ▶}
  ]
}

2. Get external links:
External links can be retrieved by applying the filter function on the "external_link" key, as below:

<?php
$externalLinks = $collection->filter(function ($link_info) {
    return isset($link_info['external_link'])
        && $link_info['external_link'] === true;
});

The output will include all external links on the website, as below:
LinksCollection {#64 ▼
  #items: array:112 [▼
    "http://www.wewebit.com" => array:14 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "http://www.wewebit.com"
      "external_link" => true
      "visited" => true
      "frequency" => 1
      "source_link" => "http://zrashwani.com/"
      "depth" => 1
      "status_code" => 200
      "title" => "WeWebit - Web Development specialists"
      "meta_keywords" => "Wewebit, Website, Mt4, Forex web design , Development, Forex website design ,Forex web development ,forex bo system , Forex Backoffice Systems ,forex CRM , FX toolskit, forex toolskit , Forex client area , forex client cabinet , members cabinet , forex IB system , ecommerce website development"
      "meta_description" => "Web Development Company in jordan providing development and design services, including Forex Solutions , News Portals , Custom Web Applications, online e-commerce solutions"
      "h1_count" => 2
      "h1_contents" => array:2 [▶]
    ]
    "http://jo.linkedin.com/pub/zaid-al-rashwani/14/996/180/" => array:10 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "http://jo.linkedin.com/pub/zaid-al-rashwani/14/996/180/"
      "external_link" => true
      "visited" => false
      "frequency" => 1
      "source_link" => "http://zrashwani.com/"
      "depth" => 1
      "status_code" => 999
      "error_message" => 999
    ]
    "https://twitter.com/zaid_86" => array:14 [▶]
    ...
  ]
}    

3. Filter and search links by depth:
This is useful for getting the links that first appeared at the nth level of the website; for example, getting the links at depth = 3, as below:

<?php
$depth = 3;
$depthLinks = $collection->filter(function ($link) use ($depth) {
    return isset($link['depth']) && $link['depth'] == $depth;
});

The output will be similar to this:
LinksCollection {#734 ▼
  #items: array:141 [▼
    "/introduction-to-sphinx-with-php-part2/" => array:8 [▶]
    "/category/technical-topics/page/2/" => array:8 [▶]
    "/anime-watched-in-summer-2015/" => array:8 [▶]
    "/anime-fall-2015-shows/" => array:8 [▶]
    "/anime-winter-2015-watch-list/" => array:8 [▶]
    "/anime-winter-2014-watch-list/" => array:8 [▶]
    

Or you can get simple statistics about how many links exist at each level of the website by combining the groupBy function with the mapWithKeys function – which is the same as map but returns a key/value pair – as below:

<?php
$linksGroupedByDepth = $collection->groupBy('depth')
    ->mapWithKeys(function ($depthGroup, $depth) {
        return [$depth => $depthGroup->count()];
    });

It will display how many links exist at each site level:
LinksCollection {#824 ▼
  #items: array:4 [▼
    0 => 1
    1 => 75
    2 => 300
    3 => 141
  ]
}

4. Get broken links:
Broken links can be retrieved by filtering items according to their status code; successful pages have a status code between 200 and 299, so anything else is considered a broken link:

<?php
$brokenLinks = $collection->filter(function ($link) {
    return isset($link['status_code'])
        && ($link['status_code'] >= 300 || $link['status_code'] < 200);
});

Or, better, broken links can be grouped according to the page where each link appears, using the groupBy function and extracting only summary information using the map function:

<?php
$brokenLinksBySource = $collection->filter(function ($link) {
    return isset($link['status_code'])
        && ($link['status_code'] >= 300 || $link['status_code'] < 200);
})->map(function ($link) {
    return [
        'source_page' => $link['source_link'],
        'link'        => $link['absolute_url'],
        'status_code' => $link['status_code'],
        'links_text'  => $link['links_text'],
    ];
})
->unique('link')
->groupBy('source_page'); // group by the 'source_page' key produced by map() above


5. Get pages that have no title or h1 tags:
This is useful for SEO purposes and can be done using the filter method:

<?php
$linksWithMissingTitle = $collection->filter(function ($link_info) {
    // only consider pages that were actually crawled
    return !empty($link_info['visited']) && empty($link_info['title']);
});

<?php
// getting pages with no <h1> tags
$missingH1Pages = $collection->filter(function ($link_info) {
    return isset($link_info['h1_count']) && $link_info['h1_count'] == 0;
});

The output will contain all pages with an empty title, as below:
LinksCollection {#823 ▼
  #items: array:2 [▼
    "http://wordcomat.com" => array:14 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:1 [▶]
      "absolute_url" => "http://wordcomat.com"
      "external_link" => true
      "visited" => true
      "frequency" => 2
      "source_link" => "http://zrashwani.com/simple-web-spider-php-goutte/"
      "depth" => 2
      "status_code" => 200
      "title" => ""
      "meta_keywords" => ""
      "meta_description" => ""
      "h1_count" => 1
      "h1_contents" => array:1 [▶]
    ]
    "http://matword.com" => array:14 [▼
      "original_urls" => array:1 [▶]
      "links_text" => array:2 [▶]
      "absolute_url" => "http://matword.com"
      "external_link" => true
      "visited" => true
      "frequency" => 4
      "source_link" => "http://zrashwani.com/simple-web-spider-php-goutte/"
      "depth" => 2
      "status_code" => 200
      "title" => ""
      "meta_keywords" => ""
      "meta_description" => ""
      "h1_count" => 0
      "h1_contents" => []
    ]
  ]
}

6. Get duplicate titles across different URLs:
This is useful to see whether any different pages have the same title – which may negatively affect your SEO – by combining several methods: filter, groupBy, unique, and map, as follows:

<?php
$duplicateTitlePages = $collection
    ->filter(function ($linkInfo) {
        return $linkInfo['visited'] === true;
    })
    ->groupBy('title', true)
    ->unique('absolute_url')
    ->filter(function ($links) {
        return count($links) > 1;
    })
    ->map(function ($linkGroup) {
        return $linkGroup->map(function ($linkInfo, $uri) {
            return $uri;
        })->values();
    });

The output will be a collection with each duplicate title as a key, along with the URLs that share that title, as in the following output:
LinksCollection {#652 ▼
  #items: array:1 [▼
    "Z.Rashwani Blog - I write here whatever comes to my mind" => LinksCollection {#650 ▼
      #items: array:2 [▼
        0 => "http://zrashwani.com/"
        1 => "/"
      ]
    }
  ]
}

More

There are many other ways to combine Laravel collection methods to extract useful information from such a links array, like getting links with internal server errors, or links with a too long/short meta description…etc. The more you get used to collections, the more useful information you can retrieve.
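For instance, links returning internal server errors (any 5xx status code) can be pulled out with the same filter pattern as in step 4. A dependency-free sketch on a hypothetical sample (with collections, the same predicate would simply go into $collection->filter(...)):

```php
<?php
// Hypothetical sample shaped like Arachnid's output (not real crawl data).
$links = [
    '/ok'    => ['status_code' => 200, 'source_link' => '/'],
    '/boom'  => ['status_code' => 500, 'source_link' => '/'],
    '/flaky' => ['status_code' => 503, 'source_link' => '/page/2/'],
];

// Internal server errors: any 5xx status code.
$serverErrors = array_filter($links, function ($link) {
    return isset($link['status_code'])
        && $link['status_code'] >= 500 && $link['status_code'] < 600;
});

// array_filter preserves keys, so the broken URLs remain the array keys:
print_r(array_keys($serverErrors)); // '/boom' and '/flaky'
```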