ScraperWiki is a project that’s been on my radar for a while. Last week Aine McGuire and Richard Pope, two of the people behind the project, happened to be in town, and were nice enough to drop by Sunlight’s offices to talk about what they’ve been up to.
Let’s start with the basics: remedial screen scraping 101. “Screen scraping” refers to any technique for getting data off the web and into a well-structured format. There’s lots of information on web pages that isn’t available as a non-HTML download. Making this information useful typically involves writing a script to process one or more HTML files, then spit out a database of some kind.
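To make that concrete, here is a minimal sketch of the pattern using only Python's standard library. The sample HTML, the `releases` table and the `LinkExtractor` and `scrape_to_db` names are all made up for illustration; a real scraper would fetch its pages over HTTP and would need to cope with far messier markup.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this instead.
SAMPLE_HTML = """
<html><body>
  <ul id="releases">
    <li><a href="/press/1">First release</a></li>
    <li><a href="/press/2">Second release</a></li>
  </ul>
</body></html>
"""


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def scrape_to_db(html, conn):
    """Parse the HTML, store each link as a row, and return the row count."""
    parser = LinkExtractor()
    parser.feed(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS releases (url TEXT PRIMARY KEY, title TEXT)"
    )
    # INSERT OR REPLACE keeps re-runs from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO releases VALUES (?, ?)", parser.links)
    conn.commit()
    return len(parser.links)


conn = sqlite3.connect(":memory:")
count = scrape_to_db(SAMPLE_HTML, conn)
rows = conn.execute("SELECT url, title FROM releases ORDER BY url").fetchall()
print(count, rows)
```

Keying the table on the URL means the script can be run repeatedly against the same page without piling up duplicates, which matters for anything that gets scraped on a schedule.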
It’s not particularly glamorous work. People who know how to make nice web pages typically know how to properly release their data. Those who don’t tend to leave behind a mess of bad HTML. As a result, screen scrapers often contain less-than-lovely code. Pulling data often involves doing unsexy things like treating presentation information as though it had semantic value, or hard-coding kludges (“# ignore the second span… just because”). Scraper code is often ugly by necessity, and almost always of deliberately limited use. It consequently doesn’t get shared very often; having the open-sourced code languish sadly in someone’s GitHub account is normally the best you can hope for.
The ScraperWiki folks realized that the situation could be improved. A collaborative approach can help avoid repetition of work. And since scrapers often malfunction when changes are made to the web pages they examine, making a scraper editable by others might lead to scrapers that spend less time broken.
At this point you could be forgiven for rolling your eyes. “Great,” you may be thinking, “another soon-to-be-ignored MediaWiki installation, except this time for a type of content that it was never meant to host. Maybe they really went the extra mile and installed syntax highlighting.”
But that’s not it at all! Not only do you compose and store your scrapers within the ScraperWiki site, the project actually runs them. This lets ScraperWiki figure out which scrapers are having problems, and it lets the site store and distribute the resulting data in a unified manner. This may sound simple, but doing it in a usable and sustainable way actually requires some clever implementation choices.
It’s a great idea, and ScraperWiki has implemented it well, with support for Ruby, Python and PHP. I had a very pleasant time implementing a simple scraper for Rep. Cantor’s press releases — something that might come in handy for tracking the public pronouncements of the man who’s likely to be the next House Majority Leader. I wrote my scraper through the site’s slick in-browser interface, but Pythonistas with established dev environments might be interested in the fakerwiki project, which allows for local development with the ScraperWiki APIs.
I’d encourage you to give ScraperWiki a look. Maybe you could grab the press releases of some other legislator. Or you could have a look at the scrapers that have been requested or are in need of a fix. At the very least, check out what the project can do — it’s an exciting concept, and could prove to be a very useful tool.