Tom Lee : Sunlight Foundation

Better Tools Won’t Save Us

by Tom Lee

technology

Nov 15, 2010 1:47 pm

Sam Smith wrote a post reacting to what I had to say about the Geithner schedule. In it, he argues that pushing for data to be released in better formats may not be the best course of action: tools exist to sidestep the problem.

Sunlight, as an organisation which complains about this often enough, has much better tools at their disposal than complaining about it. As people using computers in 2010, we all have better tools to use on PDFs than we currently use. We often complain about how inaccessible PDFs are, without doing the basic, simple, automatable tasks which can make them readable.

Opening the PDF in acrobat, pressing the "Recognise text using OCR" [button] and then [you'll find that] it's searchable, and Sunlight could republish this for everyone to use (or put up a webservice which adds the OCR text in such a way that when you search, what you get highlighted is the relevant bits of the page where the OCRed text matches). That is possible now.

But, as a community, we prefer to stick to the notion that anything in PDF is utterly locked up in a way which no one can get at.

It's not (really).

It is far from ideal, it's a bugger to use, and it is not the best format for most things, but it's what we've got. And showing how valuable this data is will get us far further than complaining that we can't read a file that most people clearly can in the tools they use. It's the tools we choose to use that are letting us down. And, as a movement, open data has to get better at it, and then it'll be less of a problem for us, and we can spend more time doing what we claim to be wanting to do.

I appreciate the response, but I disagree. Nothing Sam says about what technology makes possible is wrong, per se. And better tools are of course useful and desirable. But the last thing I want is for government to begin thinking that OCR can make up for bad document workflows. It simply can't: even though it happens to work well on the Geithner schedule, OCR remains a fundamentally lossy technology.

A Quick Reminder

by Tom Lee

technology

Nov 12, 2010 3:59 pm

Are you following @sunlightlabs on Twitter? No? We feel strongly that that's a mistake. Similarly, you might want to take a second to join our mailing list if you haven't already. We do our best to use both of these outlets as ways of discussing interesting topics and highlighting the exciting work of others -- not (just) as a means of relentlessly hawking our own work (though of course we tend to find that work exciting, too).

ScraperWiki is Extremely Cool

by Tom Lee

technology

Nov 9, 2010 11:03 am

ScraperWiki is a project that's been on my radar for a while. Last week Aine McGuire and Richard Pope, two of the people behind the project, happened to be in town, and were nice enough to drop by Sunlight's offices to talk about what they've been up to.

Let's start with the basics: remedial screen scraping 101. "Screen scraping" refers to any technique for getting data off the web and into a well-structured format. There's lots of information on web pages that isn't available as a non-HTML download. Making this information useful typically involves writing a script to process one or more HTML files, then spit out a database of some kind.
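To make that concrete, here is a minimal sketch of the process-HTML-then-spit-out-a-database pattern, using only Python's standard library. Everything specific in it is an assumption made for illustration: the sample table, the `payments` schema, and the names `TableScraper`, `scrape` and `load` are hypothetical, and a real scraper would fetch its page over HTTP rather than read an inline string.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this instead.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Recipient</th><th>Amount</th></tr>
  <tr><td>2010-11-01</td><td>Example Bank</td><td>$1,500</td></tr>
  <tr><td>2010-11-08</td><td>Sample Credit Union</td><td>$250</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty
            # and are skipped here.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape(html):
    """Turn the HTML table into (date, recipient, amount) tuples."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return [
        (date, recipient, int(amount.replace("$", "").replace(",", "")))
        for date, recipient, amount in parser.rows
    ]


def load(records):
    """Store the scraped records in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payments (date TEXT, recipient TEXT, amount INTEGER)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", records)
    conn.commit()
    return conn


records = scrape(SAMPLE_HTML)
conn = load(records)
total = conn.execute("SELECT SUM(amount) FROM payments").fetchone()[0]
print(total)
```

Note how much of the logic depends on the page's exact markup: if the publisher renames a column or wraps the amounts in another tag, the scraper quietly breaks.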

It's not particularly glamorous work. People who know how to make nice web pages typically know how to properly release their data. Those who don't tend to leave behind a mess of bad HTML. As a result, screen scrapers often contain less-than-lovely code. Pulling data often involves doing unsexy things like treating presentation information as though it had semantic value, or hard-coding kludges ("# ignore the second span... just because"). Scraper code is often ugly by necessity, and almost always of deliberately limited use. It consequently doesn't get shared very often -- having the open-sourced code languish sadly in someone's Github account is normally the best you can hope for.

The ScraperWiki folks realized that the situation could be improved. A collaborative approach can help avoid repetition of work. And since scrapers often malfunction when changes are made to the web pages they examine, making a scraper editable by others might lead to scrapers that spend less time broken.

Redaction and Technical Incompetence

by Tom Lee

technology

Nov 5, 2010 10:45 am

Felix Salmon, finance blogger extraordinaire, was inspired by some reporting by Bloomberg to have a look at Treasury's website. Apparently Tim Geithner visited Jon Stewart back in April, and Felix was understandably interested in seeing the evidence for himself. He went to the Treasury website, and then... well, things took a turn for the worse:

First, you go to the Treasury homepage. Then you ignore all of the links and navigation, and go straight down to the footer at the very bottom of the page, where there’s a link saying FOIA. Click on that, and then on the link saying Electronic Reading Room. Once you’re there, you want Other Records. Where, finally, you can see Secretary Geithner’s Calendar April – August 2010.

Be careful clicking on that last link, because it’s a 31.5 MB file, comprising Geithner’s scanned diary. Search for “Stewart” and you won’t find anything, because what we’re looking at is just a picture of his name as it’s printed out on a piece of paper.

In other words, these diaries, posted for transparency, are about as opaque as it can get. Finding the file is very hard, and then once you’ve found it, it’s even harder to, say, count up the number of phone calls between Geithner and Rahm Emanuel. You can’t just search for Rahm’s name; you have to go through each of the 52 pages yourself, counting every appearance manually.

Is this really how Obama’s web-savvy administration wants to behave? The Treasury website is still functionally identical to the dreadful one we had under Bush, and we’ve passed the midterm elections already. I realize that Treasury’s had a lot on its plate these past two years, but a much more transparent and usable website is long overdue.

This all sounds sadly familiar to me. I still remember when Treasury started posting TARP disbursement reports as CSVs instead of PDFs. I was working on Subsidyscope at the time, and had to load those reports on a weekly basis. It's more than a little sad how much better my life got when they made that change.

[Image: record of Timothy Geithner's meeting with Jon Stewart]

But I think it's important to note that Felix's frustration isn't just the product of bad technology.

Technology Lock-In with the DC Metro

by Tom Lee

technology

Oct 15, 2010 1:15 pm

[Image: SmarTrip card]

I think that sometimes when technologists make the case for open standards it can seem like a purely theoretical exercise. For most people the downsides to publishing a document as, say, an MS Word file aren't readily apparent. Every computer they've used has had a Windows license built into its price. They've never had a reason to learn how to manipulate text programmatically. Everyone else with whom they exchange files has Word, and the program is pretty well-designed for most office work use cases*. The dire warnings issued by developers just don't seem plausible.

So it's worth taking a second to note an example of these problems happening in a different arena. Here in DC our primary transit agency, WMATA, issues an RFID card called the SmarTrip which works with nearly all of the area's various transit systems. It's quite handy: you don't have to take it out of your wallet to use it, the balance is supposedly loss- and theft-proof, and it automates things like bus transfers.

Unfortunately, this morning brought news that the SmarTrip has to be replaced. Why? Well, the vendor that our transit planners bought it from ~~has gone out of business is ceasing to support the card, and they're pulling SmarTrip into oblivion with them~~ is ceasing to support SmarTrip, and no one else can take their place: the card incorporates proprietary technology, so it's impossible to find a new supplier. WMATA has a stockpile of cards that'll last about two years, but after that it'll have to start using a new solution.

To Know the Name of a Thing is to Have Power Over It

by Tom Lee

technology

Oct 1, 2010 4:04 pm

[Image: Blackwater badge]

A flowery title for a blog post, I'll admit, but I hope that at least the Le Guin fans out there will forgive me. The problem of knowing something's true name is in the news, most particularly in this story from Wired's Spencer Ackerman:

Through a "joint venture," the notorious private security firm Blackwater has won a piece of a five-year State Department contract worth up to $10 billion, Danger Room has learned.

Apparently, there is no misdeed so big that it can keep guns-for-hire from working for the government. And this is despite a campaign pledge from Secretary of State Hillary Clinton to ban the company from federal contracts.

Eight private security firms have won State's giant Worldwide Protective Services contract, the big Foggy Bottom partnership to keep embassies and their inhabitants safe. Two of those firms are longtime State contract holders DynCorp and Triple Canopy. The others are newcomers to the big security contract: EOD Technology, SOC, Aegis Defense Services, Global Strategies Group, Torres International Services and International Development Solutions LLC.

Don't see any of Blackwater's myriad business names on there? That's apparently by design. Blackwater and the State Department tried their best to obscure their renewed relationship. As Danger Room reported on Wednesday, Blackwater did not appear on the vendors' list for Worldwide Protective Services. And the State Department confirms that the company, renamed Xe Services, didn't actually submit its own independent bid. Instead, they used a blandly-named cut out, "International Development Solutions," to retain a toehold into State's lucrative security business. No one who looks at the official announcement of the contract award would have any idea that firm is connected to Blackwater.

This is a troubling story. But for those of us who work with government data, it's an all-too-familiar one. Navigating the link between an entity's name and its identity is very, very difficult. Sunlight Reporting Group wrote about a similar problem back in January: a blacklist of contractors called the Excluded Party List System has been failing to do its job, partly because of difficulties in positively identifying the companies entered into it. People and even companies can have similar names, or the names entered into the system can contain typos. It's not uncommon to wind up with a fuzzy sort of match, and then to have to use whatever additional data is on hand -- an address, or a date, whatever -- to add confidence to the guess.
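Here's a rough sketch of what that fuzzy-match-then-corroborate process can look like, using the standard library's `difflib`. The records, the 0.85 threshold, the list of suffixes to ignore and the function names are all assumptions chosen for illustration; none of this reflects how the Excluded Party List System or any real matching system actually works.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of corporate suffixes to ignore when comparing names.
SUFFIXES = frozenset({"inc", "llc", "corp", "co", "ltd"})


def normalize(text, drop=frozenset()):
    """Lowercase, strip punctuation, and drop any words listed in `drop`."""
    words = re.sub(r"[^a-z0-9 ]", " ", text.lower()).split()
    return " ".join(w for w in words if w not in drop)


def name_similarity(a, b):
    """Return a 0..1 similarity score between two entity names."""
    return SequenceMatcher(
        None, normalize(a, SUFFIXES), normalize(b, SUFFIXES)
    ).ratio()


def match_confidence(candidate, listed, name_threshold=0.85):
    """Classify a candidate record against a listed one.

    The name comparison is only a guess; a matching address is the
    extra evidence that upgrades the guess to a likely match.
    """
    if name_similarity(candidate["name"], listed["name"]) < name_threshold:
        return "no match"
    if normalize(candidate["address"]) == normalize(listed["address"]):
        return "likely match"
    return "possible match"


# Hypothetical records: one listed entity and three candidates.
listed = {"name": "ACME Building Supply LLC",
          "address": "12 Main St., Springfield"}
typo = {"name": "Acme Buliding Supply, Inc.",
        "address": "12 Main St, Springfield"}
namesake = {"name": "Acme Building Supply",
            "address": "400 Oak Ave, Shelbyville"}
unrelated = {"name": "Zenith Aviation Services",
             "address": "12 Main St, Springfield"}

for record in (typo, namesake, unrelated):
    print(record["name"], "->", match_confidence(record, listed))
```

Even in this toy version, the limits are obvious: a company bidding under an entirely different name, with a different address, sails straight through.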

Open Source, Open Gov, Open House

by Tom Lee

technology

Sep 20, 2010 11:20 am

Hopefully all of you know this already (you are subscribed to the Labs Google Group, right? And following us on Twitter? And on our general Sunlight mailing list and maybe watching our office windows from across the street in case we write something important on a whiteboard? Good.). But if not: we're having an open house! It's this Thursday, it's happening at 6pm, and it would be great if you could make it. We'll have some drinks, some videogames, and some in-progress projects to demo. You just need to bring your charming self.

What: Sunlight Labs Open House
When: Starts at 6 pm on Thursday, September 23rd, 2010
Where: 1818 N Street NW, Suite 300, Washington, DC

If you can manage an RSVP, we'd appreciate it (but if you can't, that's fine, too).

Hope to see you Thursday!

Carrots and Sticks

by Tom Lee

technology

Sep 10, 2010 4:03 pm

The response to Clearspending has been overwhelmingly positive. People seem to care about government spending data quality to an extent I never would have anticipated. It's encouraging, and it makes me think we have a real shot at getting these problems fixed.

But there are some people with a different perspective. One of them is Gunnar Hellekson, who wrote a thoughtful blog post about why he disagrees with our approach. Naturally I don't plan to write responses to everyone who disagrees with us. But we really like and respect Gunnar, and he raised some important points in his post. To wit:

Announcing Clearspending — and Why It’s Important

by Tom Lee

technology

Sep 8, 2010 11:08 am

Today we're launching Clearspending -- a site devoted to our analysis of the data behind USASpending.gov. Ellen's already written about this project over on the main foundation blog, and you should certainly check out her post. But I wanted to talk about it a little bit here, too, because this project is near & dear to my heart, having grown out of work that Kaitlin, Kevin and I did together before I stepped into the role of Labs Director.

The three of us had been working with the USASpending database for a while, and in the course of that work we began to realize some discouraging things. The data clearly had some problems. We did some research and wrote some tests to quantify those problems -- that effort turned into Clearspending. The results were unequivocal: the data was bad -- really bad. Unusably bad, in fact. As things currently stand, USASpending.gov really can't be relied upon.

You can read all about it over at the Clearspending site, and I hope you will -- in addition to an analysis that looked at millions of rows of data and found over a trillion dollars' worth of messed-up spending reports, we spent a lot of time talking to officials at all levels of the reporting chain. I don't think you're likely to find a better discussion of these systems and their problems.

And make no mistake, these systems are important.

Preparing for the Worst

by Tom Lee

technology

Aug 27, 2010 12:23 am

I should say up front that Google's been a great friend to Sunlight: they've helped support our contests, they've sent us phones and Summer of Code students to help our Android development efforts, and when I visited their DC offices a couple of weeks ago they let me eat as much candy as I wanted.

Still, I'd be lying if I said the incredible scope of their success didn't make me a little uneasy. We use Google Apps for our work email, for instance, and YouTube is essential to our video production efforts. We're as dependent as anyone else on Google for search, both as a tool and a source of traffic. I know we're not the only ones to be a bit unnerved at being so reliant on the goodwill of a private enterprise -- and of course over the past few weeks, other voices expressing those concerns have become significantly louder.

So, while we're looking forward to continuing to work with Google, it would be irresponsible for us not to prepare for the unthinkable. I'm happy to say that we've taken the necessary precautions, and today the future seems a bit less uncertain:

Of course, what happens after we run through our 1000 free hours is anyone's guess.

(Many thanks to Pierre Huggins of Rox Chox & Blox Woodworking for lending his awesome fabrication capabilities to this ridiculous project, and to our own sysadmin extraordinaire, Tim, for finding Pierre via HacDC.)
