Scraping HTML from Websites with PHP
January 13, 2013 — 21:06

Here’s a simple and useful PHP script that I’ve used a number of times to pull data from an external site for use elsewhere. I first used it with my Cartell.ie Android app. To minimise data use on the phone, the entered reg number is sent to one of my web servers, which requests the cartell.ie page for that reg and returns just the vehicle information.

The phone requests server/file.php?registration=<reg number>, and the script first reads the registration number into the variable $reg.


$reg = $_REQUEST['registration'];
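Since the value comes straight from the request, it may be worth tidying it before use. The following is only a sketch: the helper name cleanRegistration() and the allowed character set (letters, digits and hyphens, up to 12 characters) are assumptions of mine, not something taken from the cartell.ie site.

```php
<?php
// Hypothetical helper: normalise a user-supplied registration number.
// The allowed characters and the length limit are assumptions.
function cleanRegistration(string $raw): ?string
{
    // Trim surrounding whitespace and upper-case the value.
    $reg = strtoupper(trim($raw));

    // Reject empty input or anything containing unexpected characters.
    if ($reg === '' || !preg_match('/^[A-Z0-9-]{1,12}$/', $reg)) {
        return null;
    }
    return $reg;
}

// Falls back to an empty string (and therefore null) when the parameter is missing.
$reg = cleanRegistration($_REQUEST['registration'] ?? '');
```

A null result can then be treated as "nothing to look up" before any request is made.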

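The page is then requested with file_get_contents() and the response is saved into $html. The registration is passed through urlencode() so that spaces or special characters in it cannot break the URL.


$html = file_get_contents('https://www.cartell.ie/ssl/servlet/beginStarLookup?registration='.urlencode($reg));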
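The request can fail (site down, timeout), in which case file_get_contents() returns false rather than a string. A minimal sketch of a more defensive fetch is below; the function names buildLookupUrl() and fetchPage() and the 10 second timeout are my own choices for illustration.

```php
<?php
// Hypothetical helper: build the lookup URL with the registration safely encoded.
function buildLookupUrl(string $reg): string
{
    return 'https://www.cartell.ie/ssl/servlet/beginStarLookup?registration=' . urlencode($reg);
}

// Hypothetical helper: fetch a page, returning null instead of false on failure.
function fetchPage(string $url): ?string
{
    // The 'http' context options also apply to https:// URLs.
    // The 10 second timeout is an arbitrary value for this sketch.
    $context = stream_context_create(['http' => ['timeout' => 10]]);

    // @ suppresses the warning; the failure is handled via the return value.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

Used together this would look like $html = fetchPage(buildLookupUrl($reg)); followed by a check for null before going any further.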

At this point the entire HTML response from the website is stored in $html (think of it as the “view source” content that you would see in a browser).

To extract the relevant information we need to find a string of text in the source which identifies the start of the code containing the information we want. It’s best to make sure this string appears only once in the page, otherwise strpos() may match an earlier occurrence and the wrong section gets extracted. In the case of the cartell script the start of the data table is easily and uniquely identified by its vehicle-description class. We mark this as our start point.

$start = strpos($html,'<table class="vehicle-description">');
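One way to check that a marker really is unique is to count its occurrences with substr_count(). This is just a sketch on a made-up sample string; the helper name isUniqueMarker() and the sample HTML are assumptions for illustration.

```php
<?php
// Hypothetical helper: true only when $marker occurs exactly once in $html.
function isUniqueMarker(string $html, string $marker): bool
{
    return substr_count($html, $marker) === 1;
}

// Made-up sample page containing two tables.
$sample = '<table class="nav"></table><table class="vehicle-description"></table>';

// The opening tag with its class appears once, so it is a safe start marker.
$uniqueStart = isUniqueMarker($sample, '<table class="vehicle-description">');

// The closing tag appears twice, so it only works as an end marker
// when searched for from the start position onwards.
$uniqueClose = isUniqueMarker($sample, '</table>');
```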

The end is simply the next closing table tag, </table>, found by searching from the start point onwards.

$end = strpos($html,'</table>',$start) + 8;

strpos() returns the position of the first character of the string it finds, so on its own the end point would sit just before </table> and the closing tag would be left out. To include the tag we move the position forward by the number of characters in the searched string, which for </table> is 8, hence the ‘+ 8’ in the code. This may or may not be necessary, depending on what you wish to do.
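To make the offset arithmetic concrete, here is a small sketch on a made-up HTML string (the sample content is not from the real site). It uses strlen() in place of the hard-coded 8 and compares the result with and without the closing tag.

```php
<?php
// Made-up sample standing in for the fetched page.
$sample = '<p>intro</p><table class="vehicle-description"><tr><td>Ford</td></tr></table><p>footer</p>';
$closing = '</table>';

// Position of the first character of the opening tag.
$start = strpos($sample, '<table class="vehicle-description">');

// Position of the first character of the closing tag.
$endWithoutTag = strpos($sample, $closing, $start);

// Moving forward by the tag length (8) places the end just after the tag.
$endWithTag = $endWithoutTag + strlen($closing);

// Stops just before </table>.
$without = substr($sample, $start, $endWithoutTag - $start);

// Includes </table>.
$with = substr($sample, $start, $endWithTag - $start);
```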

With the start and end positions in hand we can extract the code which we need.

$table = substr($html,$start,$end-$start);
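If the site changes its markup, strpos() will return false and the extraction quietly produces the wrong output. A hedged sketch of the same steps wrapped in a function with those checks added is below; extractBetween() is a name I made up for illustration.

```php
<?php
// Hypothetical helper: return the text from $open up to and including $close,
// or null when either marker cannot be found.
function extractBetween(string $html, string $open, string $close): ?string
{
    $start = strpos($html, $open);
    // Use === false: a match at position 0 is valid but also evaluates as falsy.
    if ($start === false) {
        return null;
    }

    $end = strpos($html, $close, $start);
    if ($end === false) {
        return null;
    }

    // Include the closing marker itself.
    $end += strlen($close);

    return substr($html, $start, $end - $start);
}
```

With this in place the script could call extractBetween($html, '<table class="vehicle-description">', '</table>') and print an error message of its own when the result is null.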

Finally we print out the resulting code using the echo command.

echo $table;

The complete code for this script is:


<?php

$reg = $_REQUEST['registration'];

$html = file_get_contents('https://www.cartell.ie/ssl/servlet/beginStarLookup?registration='.urlencode($reg));

$start = strpos($html,'<table class="vehicle-description">');

$end = strpos($html,'</table>',$start) + 8;

$table = substr($html,$start,$end-$start);

echo $table;

?>

Optionally we can run a find and replace on this to change anything we like by using the str_replace() function.

$table = str_replace('<replace this>','<with this>',$table);
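str_replace() also accepts arrays, so several replacements can be done in one call. The sketch below uses a made-up table, and the replacement class names (car, cell) are invented purely for illustration.

```php
<?php
// Made-up extracted table.
$table = '<table class="vehicle-description"><tr><td>Make</td><td>Ford</td></tr></table>';

// Each entry in the first array is replaced by the entry
// at the same index in the second array.
$table = str_replace(
    ['<table class="vehicle-description">', '<td>'],
    ['<table class="car">', '<td class="cell">'],
    $table
);
```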

So there you have it. Like I said, it’s simple, but it can prove very useful for extracting data for apps or other projects.

Comments:
  • […] The weather text for the main screen is parsed from the met eireann site. The icons and detailed weather are parsed from Meteo Europ although this is formatted and parsed by a script on this server as it was easier than doing it from within the android. Similarly the AA report comes from the AA Roadwatch site and is parsed on the server. These were done using PHP the same way as i posted about before. […]

    March 6, 2013 — 15:16