Blog : Sunlight Foundation

ScraperWiki is Extremely Cool

by Tom Lee

technology

Nov 9, 2010 11:03 am

ScraperWiki is a project that's been on my radar for a while. Last week Aine McGuire and Richard Pope, two of the people behind the project, happened to be in town, and were nice enough to drop by Sunlight's offices to talk about what they've been up to.

Let's start with the basics: remedial screen scraping 101. "Screen scraping" refers to any technique for getting data off the web and into a well-structured format. There's lots of information on web pages that isn't available as a non-HTML download. Making this information useful typically involves writing a script to process one or more HTML files, then spit out a database of some kind.
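To make that concrete, here is a minimal sketch of the "process HTML, spit out a database" pattern, using only Python's standard library. The sample table, the `spending` table name, and the `TableScraper` and `scrape_to_db` names are all made up for illustration; a real scraper would fetch a live page and would almost certainly need more special cases.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical input: a small HTML table standing in for a real web page.
SAMPLE_HTML = """
<table>
  <tr><th>Agency</th><th>Amount</th></tr>
  <tr><td>Treasury</td><td>1,200</td></tr>
  <tr><td>Energy</td><td>850</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self._in_cell = False
            if self._row:      # header rows contain no <td>, so they are skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_to_db(html, conn):
    """Parse the table in `html` and store its rows in a SQLite table."""
    parser = TableScraper()
    parser.feed(html)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spending (agency TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO spending VALUES (?, ?)",
        [(agency, int(amount.replace(",", ""))) for agency, amount in parser.rows],
    )
    conn.commit()
    return parser.rows


conn = sqlite3.connect(":memory:")
scrape_to_db(SAMPLE_HTML, conn)
rows = conn.execute(
    "SELECT agency, amount FROM spending ORDER BY agency"
).fetchall()
print(rows)
```

Once the rows are in a database, questions like "which agency got the most?" become a one-line query instead of an afternoon of copying and pasting.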

It's not particularly glamorous work. People who know how to make nice web pages typically know how to properly release their data. Those who don't tend to leave behind a mess of bad HTML. As a result, screen scrapers often contain less-than-lovely code. Pulling data often involves doing unsexy things like treating presentation information as though it had semantic value, or hard-coding kludges ("# ignore the second span... just because"). Scraper code is often ugly by necessity, and almost always of deliberately limited use. It consequently doesn't get shared very often -- having the open-sourced code languish sadly in someone's GitHub account is normally the best you can hope for.
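For a flavor of what that ugliness looks like, here is a small, purely illustrative sketch. The markup and the `parse_row` function are invented for this example. Nothing in the page says which piece is the name or the amount, so the code leans on presentation: bold text is assumed to be the name, and the spans are picked out by position.

```python
import re

# Hypothetical markup: no semantic hints, so the scraper has to rely on
# presentation. Bold text is taken to be the name; spans are read by position.
SAMPLE_ROW = (
    "<p><b>Jane Doe</b> <span>Room 12</span> "
    "<span>&nbsp;</span> <span>$4,500</span></p>"
)


def parse_row(html):
    """Pull a name, office and amount out of one row of the sample markup."""
    name = re.search(r"<b>(.*?)</b>", html).group(1)
    spans = re.findall(r"<span>(.*?)</span>", html)
    # ignore the second span... just because (on this page it is a spacer)
    office, _spacer, amount = spans
    return {
        "name": name,
        "office": office,
        "amount": int(amount.lstrip("$").replace(",", "")),
    }


record = parse_row(SAMPLE_ROW)
print(record)
```

This works right up until someone redesigns the page and adds a fourth span, which is exactly the kind of breakage described here.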

The ScraperWiki folks realized that the situation could be improved. A collaborative approach can help avoid repetition of work. And since scrapers often malfunction when changes are made to the web pages they examine, making a scraper editable by others might lead to scrapers that spend less time broken.

Building a Better Partnership for Open Government: Right Here

by Ellen Miller Nov 8, 2010 5:28 pm

President Obama recently attended an Expo on Democracy and Open Government on his trip to India and announced the creation...

What’s Going on in the Labs

by Josh Ruihley

technology

Nov 8, 2010 2:05 pm

It's that time again...

Daily Disclosures

by Bill Allison Nov 8, 2010 1:20 pm

A roundup of what we’re noticing in the Reporting Group as we dig into government data and disclosures: The Daily...

Transparency and the Earmark Moratorium

by John Wonderlich Nov 8, 2010 12:33 pm

Presumptive Speaker Boehner has now come out in favor of a full earmark moratorium in the House. There will be...

Stories to start the week

by Paul Blumenthal Nov 8, 2010 11:39 am

1) House Democratic campaign chief Chris Van Hollen is calling for the Senate to pass a bill to bring transparency...

The Future of Earmark Transparency: Event Announcement

by Daniel Schuman Nov 8, 2010 10:53 am

Earmark transparency will be the subject of a panel discussion next Monday on Capitol Hill. The conversation is particularly timely...

New Congress provides a moment for transparency change

by Paul Blumenthal Nov 5, 2010 12:27 pm

Each new Congress begins with its own unique face. The ascendant Newt Gingrich in 1994, the first woman Speaker of...

Redaction and Technical Incompetence

by Tom Lee

technology

Nov 5, 2010 10:45 am

Felix Salmon, finance blogger extraordinaire, was inspired by some reporting by Bloomberg to have a look at Treasury's website. Apparently Tim Geithner visited Jon Stewart back in April, and Felix was understandably interested in seeing the evidence for himself. He went to the Treasury website, and then... well, things took a turn for the worse:

First, you go to the Treasury homepage. Then you ignore all of the links and navigation, and go straight down to the footer at the very bottom of the page, where there’s a link saying FOIA. Click on that, and then on the link saying Electronic Reading Room. Once you’re there, you want Other Records. Where, finally, you can see Secretary Geithner’s Calendar April – August 2010.

Be careful clicking on that last link, because it’s a 31.5 MB file, comprising Geithner’s scanned diary. Search for “Stewart” and you won’t find anything, because what we’re looking at is just a picture of his name as it’s printed out on a piece of paper.

In other words, these diaries, posted for transparency, are about as opaque as it can get. Finding the file is very hard, and then once you’ve found it, it’s even harder to, say, count up the number of phone calls between Geithner and Rahm Emanuel. You can’t just search for Rahm’s name; you have to go through each of the 52 pages yourself, counting every appearance manually.

Is this really how Obama’s web-savvy administration wants to behave? The Treasury website is still functionally identical to the dreadful one we had under Bush, and we’ve passed the midterm elections already. I realize that Treasury’s had a lot on its plate these past two years, but a much more transparent and usable website is long overdue.

This all sounds sadly familiar to me. I still remember when Treasury started posting TARP disbursement reports as CSVs instead of PDFs. I was working on Subsidyscope at the time, and had to load those reports on a weekly basis. It's more than a little sad how much better my life got when they made that change.

record of Timothy Geithner's meeting with Jon Stewart

But I think it's important to note that Felix's frustration isn't just the product of bad technology.

Swing State Confidential: Colorado–the Wild Wild West

by Nancy Watzman Nov 5, 2010 10:43 am

Over the last several weeks, the Colorado Senate race became the poster race of the outrageous amounts of money pouring...
