Over the past week I have been working on a scraper with text analysis functionality that collects financial reports, market analysis and business news, and generates a useful summary from them.
I started by researching the best way to extract the required data and identify the most useful information in the target sources, so that I can index and analyze it and pull the required results out of it.
That led me to explore many resources on structured data and metadata formats such as Microdata, RDFa and the Schema.org vocabulary, so that I can write scripts for the job using PHP and the Goutte scraper.
Metadata between past and present
In the previous decade, a website's main metadata was implemented with the classic <meta> tag inside the <head> tag of the page, so typical page metadata looked something like this:
<head>
  <meta name="description" content="Statement about your page">
  <meta name="keywords" content="keyword1,keyword2,keyword3">
  <!-- other metadata values -->
</head>
However, this old metadata structure is very limited: it exposes only a few pieces of information, such as title, description and keywords, and sites can easily misuse it. Today those old meta tags have little effect on search engine behavior. Instead, newer technologies have emerged to make web pages more structured and turn them into rich entities, so that data can be extracted from them and used more efficiently.
In 2011, the schema.org initiative was launched by the main search engine companies (Google, Bing and Yahoo, later joined by Yandex) to create and maintain vocabularies and promote schemas for the structured data/metadata that can be included in web pages and parsed by search engine spiders and other applications.
Schema.org provides a wide range of vocabularies that are used to display rich details about web pages in search engine results, such as product ratings, prices and offers, or a movie's title, director, genre and actors.
Here is an example of laptop product metadata from an e-commerce site (bestbuy.com) that uses the schema.org vocabulary:
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">HP 15.6" Laptop - Silver (AMD Quad-Core A10-4655M / 1TB HDD / 8GB RAM / Windows 8.1)</h1>
  Model#: <span itemprop="model">15-p284ca</span>
  <img itemprop="image" src="/multimedia/Products/500x500/103/10362/10362700.jpg" alt="HP 15.6&quot; Laptop - Silver (AMD Quad-Core A10-4655M / 1TB HDD / 8GB RAM / Windows 8.1)" />
  Web Code: <span itemprop="productid">10362700</span>
  <div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <meta itemprop="priceCurrency" content="CAD">
    <div itemprop="price">
      <span class="amount">$549.99</span>
      <div class="clear"></div>
    </div>
  </div>
</div>
As you can see, it specifies detailed data about the product, including its name, image, model, price and offer.
Schema.org vocabularies can be embedded in web page content using different formats, including Microdata, RDFa and JSON-LD. I will give brief information about each of these formats.
Structured Data Formats
There are four metadata formats that I am going to cover here. One of them is Facebook Open Graph, which is not actually one of the standard schema.org structured data formats, but it is widespread among websites because Facebook depends on it (for sharing previews, for example) and it provides useful information about web pages.
Microdata
Microdata was introduced with HTML5 and is implemented by adding simple HTML attributes to your web page, mainly three: itemscope, itemtype and itemprop. I will use the previous snippet to briefly explain them:
<div class="container" itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">HP 15.6" Laptop - Silver (AMD Quad-Core A10-4655M / 1TB HDD / 8GB RAM / Windows 8.1)</h1>
  Model#: <span itemprop="model">15-p284ca</span>
  <img itemprop="image" src="/multimedia/Products/500x500/103/10362/10362700.jpg" alt="HP 15.6&quot; Laptop - Silver (AMD Quad-Core A10-4655M / 1TB HDD / 8GB RAM / Windows 8.1)" />
  Web Code: <span itemprop="productid">10362700</span>
  <!-- other data here -->
</div>
- The itemscope attribute indicates that the child metadata belongs to a single item or “thing”, as on the opening <div> of the snippet.
- The itemtype attribute defines the type of the “thing” you are providing metadata for. Its value looks like http://schema.org/TYPE_NAME, where TYPE_NAME can be Movie, Article, Person, etc. This attribute usually sits on the same element as itemscope. In the example above, the opening <div> indicates that the item is a product.
- The itemprop attribute names the property that the element's content provides. In the previous snippet, the text of the <span> marked itemprop="model" is the model of the product, while the <span> marked itemprop="productid" holds its product ID.
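To make the roles of the three attributes concrete, here is a minimal sketch that reads them back with PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than Goutte. The HTML fragment is a shortened, hypothetical version of the product snippet, and the variable names are mine.

```php
<?php
// Shortened, hypothetical version of the product snippet above
$html = <<<'HTML'
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">HP 15.6 Laptop</h1>
  Model#: <span itemprop="model">15-p284ca</span>
  Web Code: <span itemprop="productid">10362700</span>
</div>
HTML;

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// itemscope marks the item; itemtype says what kind of thing it is
$item = $xpath->query('//*[@itemscope]')->item(0);
$type = $item->getAttribute('itemtype');

// itemprop names each property; in this fragment the value is the element text
$properties = [];
foreach ($xpath->query('.//*[@itemprop]', $item) as $node) {
    $properties[$node->getAttribute('itemprop')] = trim($node->textContent);
}

print_r(['type' => $type, 'properties' => $properties]);
```

The sketch assumes a single, non-nested item. A page with nested items (such as the Offer inside the Product) would need the inner itemscope handled separately.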
RDFa
RDF stands for Resource Description Framework, which Wikipedia defines as follows:
The Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model.
RDFa (RDF in Attributes) is a W3C Recommendation that adds a set of attribute-level extensions to HTML for embedding RDF data. It is basically the same concept as Microdata, but with a different attribute structure and more capabilities.
Here is the HTML of the previous sample after conversion to RDFa:
<div vocab="http://schema.org/" typeof="Product">
  <h1 property="name">HP 15.6" Laptop - Silver (AMD Quad-Core A10-4655M / 1TB HDD / 8GB RAM / Windows 8.1)</h1>
  Model#: <span property="model">15-p284ca</span>
  <img property="image" src="/multimedia/Products/500x500/103/10362/10362700.jpg" alt="HP 15.6&quot; Laptop - Silver (AMD Quad-Core A10-4655M / 1TB HDD / 8GB RAM / Windows 8.1)" />
  Web Code: <span property="productid">10362700</span>
  <div property="offers" vocab="http://schema.org/" typeof="Offer">
    <meta property="priceCurrency" content="CAD">
    <div property="price">
      <span class="amount">$549.99</span>
      <div class="clear"></div>
    </div>
  </div>
</div>
As I said, it is almost the same thing with different attribute names:
- The vocab="http://schema.org/" attribute is placed on the element together with the typeof attribute; the pair corresponds to itemscope and itemtype in Microdata.
- property is the same as itemprop in Microdata.
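The attribute mapping can be sketched in a few lines of plain PHP DOM code. The fragment below is a shortened, hypothetical version of the RDFa sample. The sketch resolves each term by appending it to the vocab IRI, which is how RDFa expands terms and why the trailing slash in vocab matters. It deliberately ignores nesting, prefixes and the other RDFa features.

```php
<?php
// Shortened, hypothetical version of the RDFa sample above
$html = <<<'HTML'
<div vocab="http://schema.org/" typeof="Product">
  <h1 property="name">HP 15.6 Laptop</h1>
  <span property="model">15-p284ca</span>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$rdfa = [];
// vocab + typeof play the role of itemscope + itemtype
foreach ($xpath->query('//*[@vocab and @typeof]') as $item) {
    $vocab = $item->getAttribute('vocab');
    $entry = [
        'type'       => $vocab . $item->getAttribute('typeof'),
        'properties' => [],
    ];
    // property plays the role of itemprop; the full IRI is vocab + term
    foreach ($xpath->query('.//*[@property]', $item) as $node) {
        $iri = $vocab . $node->getAttribute('property');
        $entry['properties'][$iri] = trim($node->textContent);
    }
    $rdfa[] = $entry;
}

print_r($rdfa);
```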
JSON-LD
JSON-LD stands for JSON for Linking Data. It represents web page metadata in JSON, which can be kept separate from the HTML content of the page, as in the following example taken from worldcat.org:
<script type="application/ld+json">
{
  "@context": {
    "name": "http://schema.org/name",
    "isbn": "http://schema.org/isbn",
    "isSimilarTo": { "@id": "http://schema.org/isSimilarTo", "@type": "@id" },
    "label": "http://www.w3.org/2000/01/rdf-schema#label",
    "description": "http://schema.org/description",
    "location": { "@id": "http://schema.org/location", "@type": "@id" },
    "organizer": { "@id": "http://schema.org/organizer", "@type": "@id" },
    "about": { "@id": "http://schema.org/about", "@type": "@id" },
    "dateModified": "http://schema.org/dateModified",
    "inDataset": { "@id": "http://rdfs.org/ns/void#inDataset", "@type": "@id" },
    "familyName": "http://schema.org/familyName",
    "givenName": "http://schema.org/givenName",
    "birthDate": "http://schema.org/birthDate",
    "workExample": { "@id": "http://schema.org/workExample", "@type": "@id" },
    "publisher": { "@id": "http://schema.org/publisher", "@type": "@id" },
    "placeOfPublication": { "@id": "http://purl.org/library/placeOfPublication", "@type": "@id" },
    "exampleOfWork": { "@id": "http://schema.org/exampleOfWork", "@type": "@id" },
    "publication": { "@id": "http://schema.org/publication", "@type": "@id" },
    "describedby": { "@id": "http://www.w3.org/2007/05/powder-s#describedby", "@type": "@id" },
    "bookFormat": { "@id": "http://schema.org/bookFormat", "@type": "@id" },
    "productID": "http://schema.org/productID",
    "oclcnum": "http://purl.org/library/oclcnum",
    "isPartOf": { "@id": "http://schema.org/isPartOf", "@type": "@id" },
    "copyrightYear": "http://schema.org/copyrightYear",
    "inLanguage": "http://schema.org/inLanguage",
    "datePublished": "http://schema.org/datePublished",
    "creator": { "@id": "http://schema.org/creator", "@type": "@id" },
    "hasPart": { "@id": "http://schema.org/hasPart", "@type": "@id" },
    "identifier": "http://purl.org/dc/terms/identifier",
    "schema": "http://schema.org/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "genont": "http://www.w3.org/2006/gen/ont#",
    "wdrs": "http://www.w3.org/2007/05/powder-s#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "library": "http://purl.org/library/",
    "void": "http://rdfs.org/ns/void#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "bgn": "http://bibliograph.net/",
    "pto": "http://www.productontology.org/id/",
    "dcterms": "http://purl.org/dc/terms/"
  }
}
</script>
Many people find it cleaner to provide metadata as JSON because it requires no changes to the HTML markup of the page: you simply add a JSON-LD script to the page, usually in the <head>.
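Because the metadata lives in its own script element, reading it back needs no knowledge of the page markup. The sketch below uses PHP's bundled DOM extension and json_decode. The page and the Book record in it are hypothetical and far smaller than the WorldCat example.

```php
<?php
// Hypothetical page carrying one small JSON-LD block in its head
$html = <<<'HTML'
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Book", "name": "The Art of Computer Programming"}
</script>
</head><body></body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$blocks = [];
// A page may carry several JSON-LD blocks, so collect each one
foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
    $decoded = json_decode($script->textContent, true);
    // Skip blocks that are not valid JSON instead of failing
    if (json_last_error() === JSON_ERROR_NONE) {
        $blocks[] = $decoded;
    }
}

print_r($blocks);
```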
Facebook Open Graph
Open Graph is the protocol used by Facebook. Its definition, as given on the protocol's website:
The Open Graph protocol enables any web page to become a rich object in a social graph. For instance, this is used on Facebook to allow any web page to have the same functionality as any other object on Facebook.
It is implemented as meta tags whose property names carry the og: prefix, as follows:
<html prefix="og: http://ogp.me/ns#">
<head>
  <title>The Rock (1996)</title>
  <meta property="og:title" content="The Rock" />
  <meta property="og:type" content="video.movie" />
  <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
  <meta property="og:image" content="http://ia.media-imdb.com/images/rock.jpg" />
  ...
</head>
Open Graph doesn't support as many details as the schema.org formats, but it is widely used among websites, probably more than the other three formats described earlier.
Extracting structured metadata using PHP
Structured data formats make it easier to extract information about a web page. Search engines now use them to let the crawler understand much more about your entity and its semantics and to display rich snippets in search results, and they let other software easily understand web page content and get useful information out of it.
Below are some PHP examples that use the Goutte library and take advantage of structured data formats.
First make sure Goutte is included in your project through Composer (which therefore has to be installed):
composer require fabpot/goutte
And here is the configuration of the Goutte client:
<?php
require 'vendor/autoload.php';

use Goutte\Client as GoutteClient;

$client = new GoutteClient();
$client->followRedirects();

// Turn off TLS certificate checks on the underlying Guzzle client.
// This is insecure and only suitable for local experiments.
$client->getClient()->setDefaultOption('config/curl/' . CURLOPT_SSL_VERIFYHOST, false);
$client->getClient()->setDefaultOption('config/curl/' . CURLOPT_SSL_VERIFYPEER, false);
Scraping Microdata (movie details example)
Let's say that you want to extract movie information from one of the well-known movie sites, like IMDb or Rotten Tomatoes. Most well-known sites implement schema.org vocabularies, so you can write your script without needing to know the exact HTML structure of the document.
To extract specific Microdata from the web page you can use a suitable XPath expression. For example, to extract the movie's image, genre, director and actor names, you can do the following:
/**** Microdata example ****/
$url = "http://www.rottentomatoes.com/m/jurassic_world/";
$crawler = $client->request('GET', $url);
$microdata_arr = array();

// XPath expression that retrieves several properties at once. The list is
// padded with spaces so that only whole names match, and [@itemprop] skips
// elements without the attribute (contains() is true for an empty string).
$crawler->filterXPath("//*[@itemtype='http://www.schema.org/Movie']//*[@itemprop][contains(' image genre actors director ', concat(' ', @itemprop, ' '))]")
    ->each(function ($node) use (&$microdata_arr) {
        $ret = getNodeStructuredData($node, 'microdata');
        $microdata_arr[$ret['property']][] = $ret['value'];
    });

// dump() comes from symfony/var-dumper; use var_dump() if it is not installed
dump($microdata_arr);

/**
 * Extract structured data from a DomCrawler node.
 *
 * @param Symfony\Component\DomCrawler\Crawler $node
 * @param string $type either 'microdata' or 'rdfa'
 * @return array with the keys 'property' and 'value'
 */
function getNodeStructuredData($node, $type = 'microdata')
{
    // The value lives in a different place depending on the element
    $node_name = $node->nodeName();
    if ($node_name == 'link' || $node_name == 'a') {
        $value = $node->attr('href');
    } elseif ($node_name == 'img') {
        $value = $node->attr('src');
    } elseif ($node_name == 'meta') {
        $value = $node->attr('content');
    } else {
        $value = trim($node->text());
    }

    // The attribute holding the property name depends on the format
    $property = ($type == 'rdfa') ? $node->attr('property') : $node->attr('itemprop');

    return array(
        'property' => $property,
        'value'    => $value,
    );
}
The script prints the extracted values grouped by property name.

Note that I limited the search for itemprop elements to descendants of the item whose type is Movie:
[@itemtype='http://www.schema.org/Movie']
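When contains() is used to test an attribute against a space-separated list of names, two XPath 1.0 details are worth keeping in mind: it matches substrings (so actor matches inside actors), and it is true for an empty string, which is what a missing attribute evaluates to. The sketch below compares a loose and a padded form of the test on a hypothetical fragment, using PHP's bundled DOMXPath instead of Goutte.

```php
<?php
// Hypothetical fragment: three listed-looking properties, one unlisted, one plain element
$html = <<<'HTML'
<div itemscope itemtype="http://schema.org/Movie">
  <span itemprop="genre">Action</span>
  <span itemprop="actor">Someone</span>
  <span itemprop="actors">Someone Else</span>
  <span itemprop="name">Some Movie</span>
  <p>No itemprop here</p>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Loose test: also matches "actor" (substring of "actors") and the <p>
// element, because a missing @itemprop evaluates to the empty string.
$loose = $xpath->query(
    "//*[@itemscope]//*[contains('image genre actors director', @itemprop)]"
)->length;

// Padded test: require the attribute, then compare whole names only.
$strict = $xpath->query(
    "//*[@itemscope]//*[@itemprop][contains(' image genre actors director ', concat(' ', @itemprop, ' '))]"
)->length;

printf("loose: %d, strict: %d\n", $loose, $strict);
```

The padded form is the usual XPath 1.0 idiom for whole-word matching. It still assumes each itemprop attribute holds a single name.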
Let's say we want to extract all the Microdata on the page in the same script. We just have to make the XPath more generic:
<?php
// Extract all Microdata properties from a page
$crawler->filterXPath("//*[@itemprop and not(@itemscope)]")
    ->each(function ($node) use (&$microdata_arr) {
        // Reuse the helper defined in the previous example
        $ret = getNodeStructuredData($node, 'microdata');
        $microdata_arr[$ret['property']][] = $ret['value'];
    });
Scraping RDFa
You can apply the same concept to RDFa just by changing the XPath, as follows:
<?php
// Match the vocab with or without a trailing slash
$crawler->filterXPath('//*[starts-with(@vocab, "http://schema.org")]//*[@property]')
    ->each(function ($node) use (&$microdata_arr) {
        // Same helper as before, reading the "property" attribute this time
        $ret = getNodeStructuredData($node, 'rdfa');
        $microdata_arr[$ret['property']][] = $ret['value'];
    });
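Scraping JSON-LD
And here is an example of extracting JSON-LD:
/**** JSON-LD example ****/
$url = "http://www.worldcat.org/title/art-of-computer-programming/oclc/823849&referer=brief_results";
$crawler = $client->request('GET', $url);
$metadata_arr = array();

$crawler->filterXPath('//*[@type="application/ld+json"]')
    ->each(function ($node) use (&$metadata_arr) {
        // The JSON-LD may be linked (<link href> or <script src>) or inline
        $jsonld_lnk = '';
        if ($node->nodeName() == 'link') {
            $jsonld_lnk = $node->attr('href');
        } elseif ($node->nodeName() == 'script') {
            $jsonld_lnk = $node->attr('src');
        }

        $ret = null;
        if (!empty($jsonld_lnk)) {
            // Linked document: fetch it and decode the response.
            // json() is the Guzzle 5 API; Guzzle 6+ removed it, so decode
            // (string) $response->getBody() with json_decode() there.
            $guzzle_client = new GuzzleHttp\Client();
            $response = $guzzle_client->get($jsonld_lnk);
            $ret = $response->json();
        } else {
            // Inline block: decode to an array, like json() does
            $ret = json_decode(trim($node->text()), true);
        }

        // Keep every block instead of overwriting the previous one
        $metadata_arr[] = $ret;
    });

// dump() comes from symfony/var-dumper; use var_dump() if it is not installed
dump($metadata_arr);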
Scraping OG tags (URL social preview example)
I have worked on social posting and sharing components for a few projects that required sharing capabilities similar to Facebook's: once a URL is submitted, brief information about it is shown, such as title, author, image, site name and description.
This functionality is implemented by extracting the Facebook Open Graph tags, as follows:
/*** Example of extracting Facebook Open Graph information from a web page ***/
$url = "http://edition.cnn.com/2015/07/17/sport/china-sport-official-sacked/index.html";
$crawler = $client->request('GET', $url);
$og_data = array();

// Collect every <meta> tag in <head> whose property starts with "og:"
$crawler->filterXPath("//head/meta[starts-with(@property, 'og:')]")
    ->each(function ($node) use (&$og_data) {
        $og_data[$node->attr('property')] = $node->attr('content');
    });

// dump() comes from symfony/var-dumper; use var_dump() if it is not installed
dump($og_data);
The resulting array maps each og property to its content, and you can use it to display the post preview box.
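As a rough sketch of the last step, the extracted og values can be reduced to the handful of fields a preview box needs. The helper name buildPreview, its fallbacks and the sample array are assumptions of mine (the sample values come from the Open Graph snippet shown earlier). The og:description and og:site_name keys are optional properties of the protocol, so they are treated as possibly missing.

```php
<?php
/**
 * Hypothetical helper: pick the fields of a preview box from og data,
 * falling back to the URL itself when a tag is missing.
 */
function buildPreview(array $ogData, string $url): array
{
    return [
        'title'       => $ogData['og:title'] ?? $url,
        'description' => $ogData['og:description'] ?? '',
        'image'       => $ogData['og:image'] ?? null,
        'site_name'   => $ogData['og:site_name'] ?? parse_url($url, PHP_URL_HOST),
        'url'         => $ogData['og:url'] ?? $url,
    ];
}

// Sample input shaped like the $og_data array built by the scraper
$ogData = [
    'og:title' => 'The Rock',
    'og:type'  => 'video.movie',
    'og:url'   => 'http://www.imdb.com/title/tt0117500/',
    'og:image' => 'http://ia.media-imdb.com/images/rock.jpg',
];

$preview = buildPreview($ogData, 'http://www.imdb.com/title/tt0117500/');
print_r($preview);
```

Falling back to the host name for the site name is only one possible choice. A real component might instead read the page <title> or leave the field empty.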
Conclusion
In this post I explained some of the modern approaches to metadata, represented by schema.org vocabularies implemented in Microdata, RDFa and JSON-LD, along with Facebook Open Graph, and showed how to use them in simple practical use cases to scrape specific pieces of information such as movie data, book information and the social preview of a shared URL.
Beyond that, structured data on websites opens up great capabilities for many more complicated applications, like vertical search engines and text mining.
There are many resources about metadata, schema.org and their applications, especially in the semantic web; here are some that I have read or watched: