As stated in the note from the Sunlight Foundation's Board Chair, as of September 2020 the Sunlight Foundation is no longer active. This site is maintained as a static archive only.

ScraperWiki is Extremely Cool

ScraperWiki is a project that's been on my radar for a while. Last week Aine McGuire and Richard Pope, two of the people behind the project, happened to be in town, and were nice enough to drop by Sunlight's offices to talk about what they've been up to.

Let's start with the basics: remedial screen scraping 101. "Screen scraping" refers to any technique for getting data off the web and into a well-structured format. There's lots of information on web pages that isn't available as a non-HTML download. Making this information useful typically involves writing a script to process one or more HTML files, then spit out a database of some kind.
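To make that concrete, here is a minimal sketch of the "process some HTML, spit out structured data" step, using only Python's standard library. The class and function names (`TableScraper`, `scrape_table`, `rows_to_csv`) and the sample page are made up for illustration; this is not ScraperWiki's code, and a real scraper would almost certainly need site-specific kludges on top of it.

```python
import csv
import io
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Hypothetical helper: collect cell text from <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, the 'well-structured format' end of the job."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample page; note the presentational <b> tag wrapped around a value.
SAMPLE_HTML = """
<table>
  <tr><th>Item</th><th>Count</th></tr>
  <tr><td>Widgets</td><td><b>1,200</b></td></tr>
  <tr><td>Gadgets</td><td>850</td></tr>
</table>
"""

if __name__ == "__main__":
    print(rows_to_csv(scrape_table(SAMPLE_HTML)), end="")
```

Even this toy version assumes the page keeps its data in a single, tidy table. Real pages rarely cooperate, which is where the ugliness described next comes from.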

It's not particularly glamorous work. People who know how to make nice web pages typically know how to properly release their data. Those who don't tend to leave behind a mess of bad HTML. As a result, screen scrapers often contain less-than-lovely code. Pulling data often involves doing unsexy things like treating presentation information as though it had semantic value, or hard-coding kludges ("# ignore the second span... just because"). Scraper code is often ugly by necessity, and almost always of deliberately limited use. It consequently doesn't get shared very often -- having the open-sourced code languish sadly in someone's GitHub account is normally the best you can hope for.

The ScraperWiki folks realized that the situation could be improved. A collaborative approach can help avoid repetition of work. And since scrapers often malfunction when changes are made to the web pages they examine, making a scraper editable by others might lead to scrapers that spend less time broken.

Redaction and Technical Incompetence

Felix Salmon, finance blogger extraordinaire, was inspired by some reporting by Bloomberg to have a look at Treasury's website. Apparently Tim Geithner visited Jon Stewart back in April, and Felix was understandably interested in seeing the evidence for himself. He went to the Treasury website, and then... well, things took a turn for the worse:

First, you go to the Treasury homepage. Then you ignore all of the links and navigation, and go straight down to the footer at the very bottom of the page, where there’s a link saying FOIA. Click on that, and then on the link saying Electronic Reading Room. Once you’re there, you want Other Records. Where, finally, you can see Secretary Geithner’s Calendar April – August 2010.

Be careful clicking on that last link, because it’s a 31.5 MB file, comprising Geithner’s scanned diary. Search for “Stewart” and you won’t find anything, because what we’re looking at is just a picture of his name as it’s printed out on a piece of paper.

In other words, these diaries, posted for transparency, are about as opaque as it can get. Finding the file is very hard, and then once you’ve found it, it’s even harder to, say, count up the number of phone calls between Geithner and Rahm Emanuel. You can’t just search for Rahm’s name; you have to go through each of the 52 pages yourself, counting every appearance manually.

Is this really how Obama’s web-savvy administration wants to behave? The Treasury website is still functionally identical to the dreadful one we had under Bush, and we’ve passed the midterm elections already. I realize that Treasury’s had a lot on its plate these past two years, but a much more transparent and usable website is long overdue.

This all sounds sadly familiar to me. I still remember when Treasury started posting TARP disbursement reports as CSVs instead of PDFs. I was working on Subsidyscope at the time, and had to load those reports on a weekly basis. It's more than a little sad how much better my life got when they made that change.

[Image: record of Timothy Geithner's meeting with Jon Stewart]

But I think it's important to note that Felix's frustration isn't just the product of bad technology.
