Screenscraping in the Former Soviet Bloc

A couple of weeks ago I had the chance to go to Georgia and hang out with the folks at the Tbilisi chapter of Transparency International. It was a great opportunity to learn about a part of the world that I was completely unfamiliar with, to share some technical knowledge, and — somewhat unexpectedly — to gain some perspective on the work we do at Sunlight.

First things first: is Sunlight expanding its mission to include the Caucasus? In a word, no. This trip came together thanks to the dumb luck of having friends with interesting international jobs, and TI Georgia’s generous willingness to play host in exchange for some help with their tech (including — though Clay will never forgive me for admitting that he let an employee work on it — Drupal).

Georgia has only been an independent country since 1991, and admittedly this means the challenges facing a transparency organization in Tbilisi are sometimes quite different from the ones facing those of us in Washington. Election monitoring isn’t part of Sunlight’s mission, for instance. But TI and Sunlight have broadly similar missions. In fact, the similarities go beyond a commitment to transparency: I found a surprising amount of overlap in the issues we cover. Both Sunlight and TI work on campaign finance and the influence economy, on tracking the use of public funds, and on finding ways to help the media work better.

And both organizations are excited about the potential that technology has to further our efforts. One example of this came early during my trip. TI’s been working on tracking media ownership in Georgia. Unfortunately, using the information disclosed by the government can be difficult: the search interface to the Georgian Public Register — which tracks basic information about businesses in the country — doesn’t work very well. The site also uses the Georgian alphabet, which, while beautiful and unique, can add another layer of inconvenience for non-native speakers and typists. But a little clicking around with Live HTTP Headers turned on revealed a simple AHAH process that passed an ID to a script and got back a couple of HTML tables. Further poking around showed that the IDs were probably just an autoincrementing primary key from the database — though there were hits both at the very bottom of the possible range and for implausibly high values (implausibly high unless every Georgian has started multiple businesses). I wrote a quick script to take samples of a few consecutive records at gaps of fifty thousand records or so, then tested them for emptiness and had a look at the resulting frequency distribution.
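For readers who want to try the same trick, here is a minimal Python sketch of that sampling idea. It is not the original script. The `fetch` callable stands in for whatever HTTP request returns the HTML for one ID, and the emptiness test (no table cells in the response) is an assumption about how the Public Register answers for unused IDs; both are hypothetical and would need adjusting against the real site.

```python
from typing import Callable, Dict, Iterator


def sample_ids(max_id: int, gap: int = 50_000, run: int = 3) -> Iterator[int]:
    """Yield `run` consecutive IDs at every `gap` step, from 1 up to max_id."""
    for start in range(1, max_id + 1, gap):
        for record_id in range(start, min(start + run, max_id + 1)):
            yield record_id


def is_empty(html: str) -> bool:
    """Assumption: an unused ID comes back without any table cells."""
    return "<td" not in html.lower()


def hit_counts(fetch: Callable[[int], str], max_id: int,
               gap: int = 50_000, run: int = 3) -> Dict[int, int]:
    """Map each sampled bucket's first ID to the number of non-empty records found."""
    counts: Dict[int, int] = {}
    for record_id in sample_ids(max_id, gap, run):
        # Every sampled ID belongs to the bucket that starts at its gap boundary.
        bucket = (record_id - 1) // gap * gap + 1
        counts.setdefault(bucket, 0)
        if not is_empty(fetch(record_id)):
            counts[bucket] += 1
    return counts
```

Passing `fetch` in as a parameter keeps the sampling logic testable without touching the network. Buckets whose count is zero mark stretches of the ID range that are probably safe to skip in a full crawl.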

To cut to the chase: there were active records at the top and bottom of the range, with a big gap in between, probably the result of a manual data import step and a subsequent boosting of the next ID (to leave plenty of space for revisions to the import, or maybe for some other reason). I fired up an EC2 instance and by the end of the weekend we had an Excel copy of the registry, transliterated into Latin characters and ready for analysis (and, if anyone wants it, redistribution).
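The transliteration step is easy to reproduce. The sketch below is one way to do it in Python, not necessarily how the registry copy was produced: the mapping follows my understanding of Georgia's national romanization system, so treat the table itself as an assumption to verify before relying on it. Characters outside the Georgian alphabet pass through unchanged.

```python
# Assumed mapping, based on Georgia's national romanization system.
# Ejective consonants are marked with an apostrophe.
GEORGIAN_TO_LATIN = {
    "ა": "a", "ბ": "b", "გ": "g", "დ": "d", "ე": "e", "ვ": "v", "ზ": "z",
    "თ": "t", "ი": "i", "კ": "k'", "ლ": "l", "მ": "m", "ნ": "n", "ო": "o",
    "პ": "p'", "ჟ": "zh", "რ": "r", "ს": "s", "ტ": "t'", "უ": "u", "ფ": "p",
    "ქ": "k", "ღ": "gh", "ყ": "q'", "შ": "sh", "ჩ": "ch", "ც": "ts",
    "ძ": "dz", "წ": "ts'", "ჭ": "ch'", "ხ": "kh", "ჯ": "j", "ჰ": "h",
}

# str.maketrans accepts single-character keys mapped to strings of any length.
_TABLE = str.maketrans(GEORGIAN_TO_LATIN)


def transliterate(text: str) -> str:
    """Replace Georgian letters with Latin equivalents; leave everything else alone."""
    return text.translate(_TABLE)


print(transliterate("თბილისი"))
```

Georgian script has no upper or lower case, so the output is all lowercase; capitalizing names for display would be a separate step.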

This probably sounds pretty familiar to my colleagues at Sunlight: we do this kind of thing all the time. The folks at TI — who I hasten to add were talented technologists in their own right — hadn’t run into the technique. It was rewarding to be able to share this approach.

And it was nice to realize that the work we’re doing today, in the U.S. and other wired, older democracies, isn’t just going to be used once. The tools and lessons that emerge from our efforts are going to be adapted and improved by waves of democracies as they establish and improve the online face of their governments. And that process is probably going to happen faster than any of us expect. In the same way that developing countries often don’t bother building landline infrastructure, citizens benefiting from the transparency movements of the future will probably enjoy systems with less of the historical bureaucratic cruft that we early adopters have to deal with — if we do our jobs well.