Foraging for data with Scraper Wiki

ScraperWiki is a service that helps you to gather data from websites that do not provide it as raw data. ScraperWiki provides a programming environment where you can write and share a scraper from your browser. ScraperWiki will run your scraper for you once a day, and will make the results available to download, and through an Application Programming Interface (API), for other web programs to use as well.

You will need

  • An account at www.scraperwiki.com (free)
  • Some programming experience
  • A website with structured information on it that you want to scrape

Step by step

Step 1: Find a site to scrape

In this example I’m looking at the location of Garages to Rent in Oxford City. First, viewing the page, I check that the elements I want to scrape are presented fairly uniformly (e.g. the same title is always used for the same thing), as lots of variation in the way similar things are presented makes for difficult scraping.

Secondly, I take a look at the source code of the web page to explore whether each ‘field’ I want to scrape (e.g. Postcode; Picture etc.) is contained neatly in its own HTML element. In this case, whilst each listing is in a <div> HTML element, a lot of the rest of the text is only separated by line-breaks.

I’ve used the FireBug plugin for the Firefox web browser to look at the structure of the page, as it allows me to explore in more detail than the standard ‘View Source’ feature in most browsers.

Step 2: Create a Scraper

I’m going to be creating a PHP scraper as this is the programming language I’m most comfortable with, but you can also create scrapers in Python or Ruby.

The PHP Startup Scraper will load with some basic code already in place for fetching a web page and starting to parse it. It makes use of the simple_html_dom library, which allows you to access elements of web pages using simple selectors.
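
In outline, the starter code does something like this (a sketch rather than the exact boilerplate ScraperWiki provides, with a placeholder URL):

require 'scraperwiki/simple_html_dom.php';

// Fetch the page and load it into a simple_html_dom object
$html = scraperwiki::scrape("http://example.com/some-page");
$dom = new simple_html_dom();
$dom->load($html);

// Loop over matching elements and print their text to the console
foreach ($dom->find('td') as $data) {
    print $data->plaintext . "\n";
}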

Change the default URL so ScraperWiki is fetching the page you are interested in. Then change the line

foreach ($dom->find('td') as $data)

to use a selector identified in your earlier exploration, and see if you can pick out the elements you want to scrape.

For example, each of the listings of Garages to Rent in Oxford is contained within a div with the class ‘pagewidget’, so I can use the selector

$dom->find('div.pagewidget')

to locate them. (This sort of selector will be familiar to anyone used to working with CSS, Cascading Style Sheets.)

Step 3: Test and refine

If you click ‘Run’ below your scraper you should now see a range of elements returned in the console. The default PHP template loops through all the elements that match the selector we just set and prints them out to the console.

My scraper returns quite a few elements I don’t want (there must be more than just the garage listings picked out by the div.pagewidget selector), so I look for something uniform about the elements I do want. In this case, their plaintext versions, as returned by

$data->plaintext

all start with ‘Site Location’.


I can now add some conditional code to my scraper so that it only carries on processing those elements that contain ‘Site Location’. I’ve chosen to use PHP’s ‘stristr’ function, which simply checks whether one string is contained in another and is case-insensitive, rather than checking the exact position of the phrase, so the scraper is tolerant of any variation in the way the data is presented that I’ve not spotted.
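
A minimal version of that check might look something like this (sketched from the approach described above, not copied verbatim from my scraper):

foreach ($dom->find('div.pagewidget') as $data) {
    // Only carry on with elements that mention 'Site Location' (case-insensitive)
    if (!stristr($data->plaintext, 'Site Location')) {
        continue;
    }
    // ... process this garage listing ...
}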

Step 4: Finding the right elements

The next steps will depend on how your data is formatted. You may have lots more nested selectors to work through to pick out the elements you want. You can use $data just like the $dom object earlier. So, for example, we can use

$data->find("img", 0)->src;

to return the ‘src’ attribute of the first (0) image element (img) we find in each garage listing.
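
In my scraper that value goes into the array of fields for the listing, along these lines (the ‘Picture’ key is just a name I’ve chosen for illustration):

$values['Picture'] = $data->find("img", 0)->src;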

Sometimes, you get down to text which isn’t nicely formatted in HTML, and then you will need to use different string processing to pull apart the bits you want. For example, in the Garage listings we can separate each line of plain text by splitting the text by <br> elements, and then splitting each line at the colon ‘:’ used to separate titles and values.

A check of the raw source shows the Oxford Garages page uses both <BR> and <br /> as elements so we can use a replace function to standardise these (or we could use regular expressions for splitting).
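
Putting those two steps together, the splitting might look roughly like this (a sketch; the field names simply come from whatever text appears before each colon):

// Standardise the <BR> / <br /> variants, then split the listing into lines
$text = str_ireplace(array('<br />', '<br/>', '<br>'), '|', $data->innertext);
$values = array();
foreach (explode('|', $text) as $line) {
    // Split each line at the first colon into a title and a value
    $parts = explode(':', strip_tags($line), 2);
    if (count($parts) == 2) {
        $values[trim($parts[0])] = trim($parts[1]); // e.g. $values['Postcode']
    }
}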

In the Oxford Garages case as well, our data is split across multiple pages, so once we have the scraper for a single page working right, we can nest it inside a scraper that grabs the list of pages and loops through those too. Scraper Wiki also includes useful helper code for working with forms, for sites where you have to submit searches or make selections in forms to view any data.
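
One way to arrange that outer loop looks like this (the index page URL and link selector here are made up purely for illustration):

// Hypothetical index page that links to each page of garage listings
$index = str_get_html(scraperwiki::scrape("http://example.gov.uk/garages/"));
foreach ($index->find('a.listing-page') as $link) {
    $page = str_get_html(scraperwiki::scrape($link->href));
    // ... run the single-page scraping code above against $page ...
}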

Step 5: Saving the data

Towards the end of each loop through the elements you are scraping (each row in your final dataset) you will need to call the

scraperwiki::save()

function. This takes four parameters:

  • First, an array naming the unique key(s) in your data, used to work out whether a record is new or an update to an existing record.
  • Second, an array of data values to save.
  • Third, the date of the record (for indexing); leave this as null to just use the date the scraper was run.
  • Fourth, an array of latitude and longitude, if you have geocoded your data.
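
So a bare-bones call, before any geocoding, might look like this (assuming $values holds one listing keyed by field name, with ‘Site location’ as the unique key):

scraperwiki::save(array('Site location'), $values, null, null);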

Run your scraper and check the ‘data’ tab to see what is being saved.

Step 6: Geocoding

If you have a UK postcode in your data then you can use the

scraperwiki::gb_postcode_to_latlng();

function to turn it into a latitude and longitude, and then save them into your generated dataset.

For example, we can use

$lat_lng = scraperwiki::gb_postcode_to_latlng($values['Postcode']);

and then, when we save our data, we pass the $lat_lng values as the final argument to the save function:

scraperwiki::save(array('Site location'), $values, null, $lat_lng);

Step 7: Run

You can now run your scraper. You will be able to access the results as a CSV file, through the ScraperWiki API, or by loading them into a Google Spreadsheet.

You can also create ‘Views’ onto your data, using pre-prepared templates to produce maps and other useful visualisations direct from within ScraperWiki.

ScraperWiki will run your scraper every 24 hours, meaning that as long as it keeps working, you can rely on it as an up-to-date data source.

Below is the map I produced, showing Garages to Rent around Oxford, with the number of garages, photos, and links off to the pages with details about them.


One of the best things about ScraperWiki overall, though, is that it is wiki-like. You can take a look at my Oxford Garages code at http://scraperwiki.com/scrapers/oxford-garages-to-rent/ and you can edit and improve it (and there are lots of potential improvements to be made).

You can also suggest scrapers you would like other people to create, or respond to requests for scrapers from others.

Health Warnings

Make sure the terms and conditions of the site you are gathering data from don't prohibit scraping. It's often worth contacting the site you are scraping to see if they will release the raw data to save you the effort!

Examples and variations
