Recently I have been searching for a way to scrape JavaScript websites using PHP. Most tutorials I found relied on libraries from another language – specifically Node.js libraries such as Puppeteer or PhantomJS – and connected their output to PHP. However, I wanted minimal dependencies and pure PHP.
I had also read about the Facebook php-webdriver library, which allows you to control web browsers from PHP (including running them in headless mode), but its installation still required extra steps such as installing and configuring a Selenium webdriver.
Using Symfony Panther
Fortunately, I found a library that makes scraping much easier: the Symfony Panther component. As its GitHub page states:
Panther is a convenient standalone library to scrape websites and to run end-to-end tests using real browsers.
Two main points made me choose Panther:
1- Panther automatically finds your Chrome binary and launches it – using the Facebook php-webdriver – without any need to install other software such as Selenium, which makes setup very easy with minimal effort.
2- Panther implements the Symfony DomCrawler API, which provides powerful methods to extract information from the DOM using XPath or CSS selectors.
Example output
What I want is to extract meta information about the links inside a page, including: status code, title, meta description, canonical link, h1 tags, h2 tags, etc. I often use this example when testing scraping libraries, as it involves several selectors and types of information.
In order to do that, the full HTML page content should be retrieved and rendered in the headless browser, and then the desired elements should be extracted from the resulting DOM document.
Basic Usage
First, we need to require Symfony Panther component via composer:
composer require symfony/panther
Then you can create a Panther client and request a page, which produces an object of type “Crawler”:
<?php

use Symfony\Component\DomCrawler\Crawler as DomCrawler;
use Symfony\Component\Panther\Client as PantherClient;

//very basic usage
$client = PantherClient::createChromeClient();
$crawler = $client->request('GET', $url);

//extract any information from the page by XPath or CSS selector
$crawler->filterXPath('//sample-xpath');
$crawler->filter('.css-based-selector');
Extracting Meta Information
The request method returns a Crawler object that encapsulates all DOM querying functionality. You can filter elements either by CSS selector or by XPath,
and easily extract SEO-related information as below:
<?php

function extractMetaInfo(DomCrawler $crawler): array
{
    $linkMetaInfo = [];
    $linkMetaInfo['title'] = trim(strip_tags($crawler->filter('title')->html()));

    $crawler->filterXPath('//meta[@name="description"]')->each(function (DomCrawler $node) use (&$linkMetaInfo) {
        $linkMetaInfo['metaDescription'] = strip_tags($node->attr('content'));
    });

    $crawler->filterXPath('//meta[@name="keywords"]')->each(function (DomCrawler $node) use (&$linkMetaInfo) {
        $linkMetaInfo['metaKeywords'] = trim($node->attr('content'));
    });

    $crawler->filterXPath('//link[@rel="canonical"]')->each(function (DomCrawler $node) use (&$linkMetaInfo) {
        $linkMetaInfo['canonicalLink'] = trim($node->attr('href'));
    });

    $h1Count = $crawler->filter('h1')->count();
    if ($h1Count > 0) {
        $crawler->filter('h1')->each(function (DomCrawler $node, $i) use (&$linkMetaInfo) {
            $linkMetaInfo['h1Contents'][] = trim($node->text());
        });
    }

    $h2Count = $crawler->filter('h2')->count();
    if ($h2Count > 0) {
        $crawler->filter('h2')->each(function (DomCrawler $node, $i) use (&$linkMetaInfo) {
            $linkMetaInfo['h2Contents'][] = trim($node->text());
        });
    }

    return $linkMetaInfo;
}
This snippet extracts SEO-related metadata such as the title, meta description, keywords, canonical link, and h1/h2 tag contents.
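To give an idea of what this produces, here is a small usage sketch; the URL and the commented result shape are only illustrative, the actual values depend on the page being crawled:

<?php

use Symfony\Component\Panther\Client as PantherClient;

// illustrative usage of extractMetaInfo() defined above
$client = PantherClient::createChromeClient();
$crawler = $client->request('GET', 'https://api-platform.com');

$meta = extractMetaInfo($crawler);
print_r($meta);
// Example shape of the result (values depend on the page):
// [
//     'title'           => '...',
//     'metaDescription' => '...',
//     'canonicalLink'   => 'https://api-platform.com/',
//     'h1Contents'      => ['...'],
//     'h2Contents'      => ['...', '...'],
// ]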
Scraping Child Links?
You may want to repeat the process on all child links in the page. To get the list of child links, you can filter on <a> tags as below:
<?php

function getChildLinks(DomCrawler $crawler): array
{
    $childLinks = [];
    $crawler->filter('a')->each(function (DomCrawler $node, $i) use (&$childLinks) {
        // extract() expects an array of attribute names in recent DomCrawler versions
        $hrefVal = $node->extract(['href'])[0];
        $childLinks[] = is_array($hrefVal) ? current($hrefVal) : $hrefVal;
    });

    return $childLinks;
}
This code returns the list of child links, but some (or many) of them will be relative links. In order to construct absolute links from them, I will use guzzlehttp/psr7,
which provides a PSR-7 implementation that includes functionality for processing HTTP messages, including URIs.
It should also be required via composer:
composer require guzzlehttp/psr7
and in order to compute the absolute URL and detect whether it is an external link or not:
<?php

function getAbsoluteUrl($childUrl, $fromUrl, &$isExternal): string
{
    $childPageUri = new \GuzzleHttp\Psr7\Uri($childUrl);
    $fromPageUri = new \GuzzleHttp\Psr7\Uri($fromUrl);

    // a link is external only when it carries its own host and that host differs from the page's host
    if ($childPageUri->getHost() !== '' && $childPageUri->getHost() !== $fromPageUri->getHost()) {
        $isExternal = true;
    } else {
        $isExternal = false;
    }

    // resolve relative links against the page URL, then rebuild the URL without the fragment
    $newUri = \GuzzleHttp\Psr7\UriResolver::resolve($fromPageUri, $childPageUri);
    $absolutePath = \GuzzleHttp\Psr7\Uri::composeComponents(
        $newUri->getScheme(),
        $newUri->getAuthority(),
        $newUri->getPath(),
        $newUri->getQuery(),
        ''
    );

    return $absolutePath;
}
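For example, resolving a relative link against the page it was found on, and detecting an external link; the URLs below are only illustrative:

<?php

// illustrative usage of getAbsoluteUrl() defined above
$isExternal = false;

echo getAbsoluteUrl('/docs/distribution', 'https://api-platform.com', $isExternal) . PHP_EOL;
// https://api-platform.com/docs/distribution  ($isExternal stays false)

echo getAbsoluteUrl('https://github.com/api-platform', 'https://api-platform.com', $isExternal) . PHP_EOL;
// https://github.com/api-platform  ($isExternal becomes true)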
Limitations
Currently it is not possible to get the actual status code or content type from a webdriver;
$client->request('GET', $nonExistantHostOrUri)
will return a 200 status code. A headless browser's mission is to reflect what people see in a browser, so for it, displaying a user-friendly error page is still a success.
So, in order to get that information, PHP's get_headers function can be used as follows:
<?php

function getUrlHeaders($url): array
{
    // overriding the default stream context to disable ssl checking
    stream_context_set_default([
        'ssl' => [
            'verify_peer'      => false,
            'verify_peer_name' => false,
        ],
    ]);

    $headersArrRaw = get_headers($url, 1);
    if ($headersArrRaw === false) {
        throw new \Exception("cannot get headers for {$url}");
    }
    $headersArr = array_change_key_case($headersArrRaw, CASE_LOWER);

    // the numeric entry 0 holds the status line, e.g. "HTTP/1.1 200 OK"
    if (isset($headersArr[0]) === true && strpos($headersArr[0], 'HTTP/') !== false) {
        $statusStmt = $headersArr[0];
        $statusParts = explode(' ', $statusStmt);
        $headersArr['status-code'] = $statusParts[1];
        $statusIndex = strrpos($statusStmt, $statusParts[1]) + strlen($statusParts[1]) + 1;
        $headersArr['status'] = trim(substr($statusStmt, $statusIndex));
    }

    // in case of redirects, content-type may be an array; keep the last value
    if (isset($headersArr['content-type']) === true && is_array($headersArr['content-type']) === true) {
        $headersArr['content-type'] = end($headersArr['content-type']);
    }

    return $headersArr;
}
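A quick usage sketch; the URL and the commented values are only examples:

<?php

// illustrative usage of getUrlHeaders() defined above
$headers = getUrlHeaders('https://api-platform.com');

echo $headers['status-code'] . PHP_EOL;   // e.g. "200"
echo $headers['status'] . PHP_EOL;        // e.g. "OK"
echo $headers['content-type'] . PHP_EOL;  // e.g. "text/html; charset=UTF-8"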
Putting the pieces together
By combining these pieces of information, we get a result set containing the major information needed, similar to this:
<?php

$client = \Symfony\Component\Panther\Client::createChromeClient();
$resultSet = [];
$url = 'https://api-platform.com';

echo "Crawling $url" . PHP_EOL;
$resultSet[$url] = getUrlSeoInfo($client, $url);

$i = 0;
$countLinks = count($resultSet[$url]['childrenLinks']);
foreach ($resultSet[$url]['childrenLinks'] as $childUrl) {
    $isExternal = false;
    $childAbsoluteUrl = getAbsoluteUrl((string) $childUrl, $url, $isExternal);
    $i++;
    echo "[{$i}/{$countLinks}] Crawling $childAbsoluteUrl" . PHP_EOL;
    if ($isExternal === false) {
        $resultSet[$childUrl] = getUrlSeoInfo($client, $childAbsoluteUrl);
    }
}

file_put_contents('test.json', json_encode($resultSet));
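The script above relies on a getUrlSeoInfo() helper that is not shown in the earlier snippets; a minimal sketch of it, assuming it simply glues together the functions defined above (getUrlHeaders, extractMetaInfo and getChildLinks), could look like this:

<?php

use Symfony\Component\Panther\Client as PantherClient;

// minimal sketch of getUrlSeoInfo(): combines the header check, the meta
// extraction and the child-link collection from the previous sections
function getUrlSeoInfo(PantherClient $client, string $url): array
{
    $info = getUrlHeaders($url);              // status code, content-type, ...
    $crawler = $client->request('GET', $url); // render the page in the headless browser

    $info = array_merge($info, extractMetaInfo($crawler));
    $info['childrenLinks'] = getChildLinks($crawler);

    return $info;
}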
The output will be something like this screenshot:

Finally
You can pretty easily scrape JavaScript sites from a PHP script without using libraries from other languages such as Node.js; and if you need other information from your pages, all you have to do is use different selectors or modify the implementation of the extractMetaInfo
function in the code sample above.
I maintain the Arachnid library, a PHP crawler for SEO purposes, and I recently added JavaScript support to it using Symfony Panther. If this script is interesting to you, I would love it if you checked out Arachnid and submitted any bugs, issues, or suggestions.