Scraping JavaScript websites using the PHP Panther library

Recently I have been searching for a way to scrape JavaScript websites using PHP. Most tutorials I found used libraries from another language – specifically Node.js libraries like Puppeteer or PhantomJS – and connected their output to PHP. However, I wanted minimal dependencies in pure PHP.

I have also read about Facebook's php-webdriver, which allows you to control web browsers from PHP (including running browsers in headless mode), but the installation still requires extra steps like installing and configuring the Selenium webdriver.

Previously I used the Goutte library to develop a simple scraper; however, it is a plain HTTP client that doesn't run JavaScript.

Using Symfony Panther

Fortunately I found a library that makes scraping much easier: the Symfony Panther component. As mentioned on its GitHub page:

Panther is a convenient standalone library to scrape websites and to run end-to-end tests using real browsers.

There were two main reasons for me to choose Panther:
1- Panther automatically finds your Chrome binaries and launches them – using the Facebook php-webdriver under the hood – without any need to install other software like Selenium, which makes the setup pretty easy; the effort needed is minimal.
2- Panther implements the Symfony DomCrawler API, which provides powerful methods to get information from the DOM document using XPath or CSS selectors.

Example Scenario

What I want is to extract meta information about the links inside a page, including status code, title, meta description, canonical link, h1 tags, h2 tags, etc. I often use this example when testing scraping libraries, as it involves several selectors and types of information.
To do that, the full HTML page content should be retrieved and rendered in the headless browser, and then the desired elements should be extracted from the resulting DOM document.

Basic Usage

First, we need to require the Symfony Panther component via Composer:

composer require symfony/panther

Then you can create a Panther client and request a page, which produces an object of type Crawler:

<?php

use Symfony\Component\DomCrawler\Crawler as DomCrawler;
use Symfony\Component\Panther\Client as PantherClient;

// very basic usage
$client = PantherClient::createChromeClient();
$url = 'https://example.com'; // the page to scrape
$crawler = $client->request('GET', $url);

// extract any information from the page using XPath or a CSS selector
$crawler->filterXPath('//sample-xpath');
$crawler->filter('.css-based-selector');

Extracting Meta Information

The request method returns a Crawler object that encapsulates all DOM querying functionality; you can filter elements either by CSS selector or by XPath, and easily extract SEO-related information as below:

<?php

function extractMetaInfo(DomCrawler $crawler): array {
    $linkMetaInfo = [];

    // <title> is not a visible element, so use html() + strip_tags() instead of text()
    if ($crawler->filter('title')->count() > 0) {
        $linkMetaInfo['title'] = trim(strip_tags($crawler->filter('title')->html()));
    }

    $crawler->filterXPath('//meta[@name="description"]')->each(function (DomCrawler $node) use (&$linkMetaInfo) {
        $linkMetaInfo['metaDescription'] = trim(strip_tags($node->attr('content')));
    });
    $crawler->filterXPath('//meta[@name="keywords"]')->each(function (DomCrawler $node) use (&$linkMetaInfo) {
        $linkMetaInfo['metaKeywords'] = trim($node->attr('content'));
    });
    $crawler->filterXPath('//link[@rel="canonical"]')->each(function (DomCrawler $node) use (&$linkMetaInfo) {
        $linkMetaInfo['canonicalLink'] = trim($node->attr('href'));
    });

    // collect the text content of every <h1> and <h2> heading on the page
    $crawler->filter('h1')->each(function (DomCrawler $node) use (&$linkMetaInfo) {
        $linkMetaInfo['h1Contents'][] = trim($node->text());
    });
    $crawler->filter('h2')->each(function (DomCrawler $node) use (&$linkMetaInfo) {
        $linkMetaInfo['h2Contents'][] = trim($node->text());
    });

    return $linkMetaInfo;
}

This snippet extracts SEO-related metadata such as the title, meta description, keywords, canonical link, h1 tag contents, etc.

Note: the $node->text() method works only on visible elements, so if you use it on hidden text like the <title> tag, it will return an empty string; that's why I used the $node->html() method with strip_tags() to get the page title.
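
A quick illustration of the difference (this assumes the $crawler created in the basic-usage snippet above):

<?php

// per the note above: with a WebDriver-backed crawler, text() on a hidden
// element such as <title> yields an empty string, while html() does not
$titleNode = $crawler->filter('title');
var_dump($titleNode->text());                   // string(0) ""
var_dump(trim(strip_tags($titleNode->html()))); // e.g. string(13) "My Page Title"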

Scraping Child Links?

You may want to repeat the process on all child links of the page. To get the list of child links, you can filter on <a> tags as below:

<?php

function getChildLinks(DomCrawler $crawler): array {
    $childLinks = [];
    $crawler->filter('a')->each(function (DomCrawler $node) use (&$childLinks) {
        // extract() expects an array of attribute names and returns one entry per node
        $hrefVal = $node->extract(['href'])[0];
        $childLinks[] = is_array($hrefVal) ? current($hrefVal) : $hrefVal;
    });

    return $childLinks;
}
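
For example, reusing the $crawler from earlier, the function returns the href values exactly as they appear in the markup:

<?php

// assumes the $crawler created in the basic-usage snippet
$childLinks = getChildLinks($crawler);
print_r($childLinks); // raw href values, e.g. ["/docs", "about.html", "https://github.com/..."]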

This code gets the list of child links, but some (or many) of them will be relative links. In order to construct absolute links from them, I will use guzzlehttp/psr7, which provides a PSR-7 implementation including functionality to process HTTP messages and URIs.
It should be required via Composer:

composer require guzzlehttp/psr7

and then, to compute the absolute URL and detect whether it is an external link or not:

<?php

use GuzzleHttp\Psr7\Uri;
use GuzzleHttp\Psr7\UriResolver;

function getAbsoluteUrl(string $childUrl, string $fromUrl, bool &$isExternal): string {

    $childPageUri = new Uri($childUrl);
    $fromPageUri = new Uri($fromUrl);

    // a link is external when it carries its own host and that host differs from the page's
    $isExternal = $childPageUri->getHost() !== ''
        && $childPageUri->getHost() !== $fromPageUri->getHost();

    // resolve relative references (e.g. "/about" or "../page") against the page URL
    $newUri = UriResolver::resolve($fromPageUri, $childPageUri);

    // rebuild the URL, passing "" as the last argument to drop any #fragment
    return Uri::composeComponents(
        $newUri->getScheme(),
        $newUri->getAuthority(),
        $newUri->getPath(),
        $newUri->getQuery(),
        ''
    );
}
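
A couple of illustrative calls (the URLs are made up) show how it behaves:

<?php

$isExternal = false;

// a relative link is resolved against the current page and flagged as internal
echo getAbsoluteUrl('/docs/intro', 'https://example.com/blog/', $isExternal);
// -> "https://example.com/docs/intro" ($isExternal === false)

// a link to another host keeps its URL and is flagged as external
echo getAbsoluteUrl('https://other-site.com/page', 'https://example.com/', $isExternal);
// -> "https://other-site.com/page" ($isExternal === true)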

Limitations

Currently it is not possible to get the actual status code or content type from a WebDriver: $client->request('GET', $nonExistentHostOrUri) will still appear to return a 200 status code. Since a headless browser's mission is to reflect what people see in a browser, displaying a user-friendly error page still counts as a success for it.

Under the hood, this limitation means there is no HttpFoundation request/response in the Panther library, so we cannot invoke $client->getResponse()->getStatusCode() as we could in other scraping libraries like the Goutte client.

So in order to get that information, the get_headers function can be used as follows:

<?php

function getUrlHeaders($url): array {
    // override the default stream context to disable SSL certificate checking
    stream_context_set_default([
      'ssl' => [
        'verify_peer' => false,
        'verify_peer_name' => false,
      ],
    ]);

    $headersArrRaw = get_headers($url, 1);
    if ($headersArrRaw === false) {
        throw new \Exception("cannot get headers for {$url}");
    }

    $headersArr = array_change_key_case($headersArrRaw, CASE_LOWER);

    // the first entry holds the status line, e.g. "HTTP/1.1 200 OK"
    if (isset($headersArr[0]) === true && strpos($headersArr[0], 'HTTP/') !== false) {
        $statusStmt = $headersArr[0];
        $statusParts = explode(' ', $statusStmt);
        $headersArr['status-code'] = $statusParts[1];

        // everything after the status code is the reason phrase, e.g. "OK"
        $statusIndex = strrpos($statusStmt, $statusParts[1]) + strlen($statusParts[1]) + 1;
        $headersArr['status'] = trim(substr($statusStmt, $statusIndex));
    }

    // on redirects, content-type may be an array; keep the final response's value
    if (isset($headersArr['content-type']) === true && is_array($headersArr['content-type']) === true) {
        $headersArr['content-type'] = end($headersArr['content-type']);
    }

    return $headersArr;
}
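
For instance, calling it against a URL (the hostname here is illustrative) gives the status code and content type that Panther cannot expose:

<?php

// example usage of the function above
$headers = getUrlHeaders('https://example.com');
echo $headers['status-code'] . PHP_EOL;           // e.g. "200"
echo ($headers['content-type'] ?? '?') . PHP_EOL; // e.g. "text/html; charset=UTF-8"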

Putting the pieces together

By combining these two sources of information, we get a result set including the major information needed, similar to this:
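
The script below relies on a getUrlSeoInfo() helper that is not shown in this post; a minimal sketch of such a helper, assuming the functions defined above, could look like this:

<?php

// hypothetical helper tying the earlier pieces together (not part of the original post)
function getUrlSeoInfo(\Symfony\Component\Panther\Client $client, string $url): array {
    $info = getUrlHeaders($url);               // status code and content-type via get_headers()
    $crawler = $client->request('GET', $url);  // render the page in the headless browser

    $info = array_merge($info, extractMetaInfo($crawler));
    $info['childrenLinks'] = getChildLinks($crawler);

    return $info;
}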

<?php

$client = \Symfony\Component\Panther\Client::createChromeClient();

$resultSet = [];
$url = 'https://api-platform.com';
echo "Crawling {$url}" . PHP_EOL;
$resultSet[$url] = getUrlSeoInfo($client, $url);

$i = 0;
$countLinks = count($resultSet[$url]['childrenLinks']);
foreach ($resultSet[$url]['childrenLinks'] as $childUrl) {
    $isExternal = false;
    $childAbsoluteUrl = getAbsoluteUrl((string) $childUrl, $url, $isExternal);

    $i++;
    echo "[{$i}/{$countLinks}] Crawling {$childAbsoluteUrl}" . PHP_EOL;
    // only crawl internal links, keyed by their absolute URL
    if ($isExternal === false) {
        $resultSet[$childAbsoluteUrl] = getUrlSeoInfo($client, $childAbsoluteUrl);
    }
}

file_put_contents('test.json', json_encode($resultSet));

The output will look something like the following screenshot:
[Screenshot: JSON output of the Panther scraping script]

Finally

You can quite easily scrape JavaScript sites from a PHP script without resorting to other languages' libraries such as Node.js ones. If you need other information from your page, all you have to do is use different selectors or modify the implementation of the extractMetaInfo function in the code sample above, as sketched below.
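
For instance, a hypothetical extension to extractMetaInfo that also collects the Open Graph title would just add one more filter (og:title is only an example selector):

<?php

// hypothetical addition inside extractMetaInfo(): collect the Open Graph title too
$crawler->filterXPath('//meta[@property="og:title"]')->each(function (DomCrawler $node) use (&$linkMetaInfo) {
    $linkMetaInfo['ogTitle'] = trim($node->attr('content'));
});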

I maintain the Arachnid library, a PHP crawler for SEO purposes, and I have recently added JavaScript support to it using Symfony Panther. If this script is interesting to you, I would love for you to check out Arachnid and submit any bugs, issues, or suggestions.
