On January 30th, the House of Representatives held a public meeting on its efforts to release more legislative information to the public in ways that facilitate its reuse. This was the second meeting hosted by the Bulk Data Task Force where members of the public were included; the Task Force began meeting privately in September 2012. (Sunlight and others made a presentation at a meeting in October on providing bulk access to legislative data.)

This public meeting, organized by the Clerk's office, is a welcome manifestation of the consensus among political leaders of both parties in the House that now is the time to push Congress' legislative information sharing technology into the 21st century. In other words, it's time to open up Congress.

The meeting featured three presentations on ongoing initiatives, allowed for robust Q&A, and highlighted improvements expected to be rolled out over the next few months. In addition, the House recorded the presentations and has made the video available to the public. The ongoing initiatives are the release of bill text bulk data by GPO, the addition of committee information to docs.house.gov, and the release of floor summary bulk data. It's expected that these public meetings will continue at least as frequently as once per quarter, or more often when prompted by new releases of information.

As part of the introductory remarks, the House's Deputy Clerk explained that a report on bulk access to legislative data had been generated by the Task Force at the end of the 112th Congress and was submitted to the House Legislative Branch Appropriations Subcommittee. It's likely that the report's recommendations will become public as part of the committee's hearings on the FY 2014 Appropriations Bill, at which time the public should have an opportunity to comment.
ScraperWiki is a project that's been on my radar for a while. Last week Aine McGuire and Richard Pope, two of the people behind the project, happened to be in town, and were nice enough to drop by Sunlight's offices to talk about what they've been up to.
Let's start with the basics: remedial screen scraping 101. "Screen scraping" refers to any technique for getting data off the web and into a well-structured format. There's lots of information on web pages that isn't available as a non-HTML download. Making this information useful typically involves writing a script that processes one or more HTML files, then spits out a database of some kind.
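To make that concrete, here's a minimal sketch of what a scraper like this looks like in Python, using only the standard library's HTML parser. The page markup and the bill data in it are invented for illustration; a real scraper would fetch the page over HTTP and write the rows to a database rather than keeping them in memory.

```python
from html.parser import HTMLParser

# A stand-in for a fetched web page: data published only as an HTML table.
PAGE = """
<table>
  <tr><td>H.R. 1</td><td>Introduced</td></tr>
  <tr><td>S. 47</td><td>Passed Senate</td></tr>
</table>
"""

class BillScraper(HTMLParser):
    """Collect the text of each <td>, grouping cells by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, as tuples of cell text
        self._row = None      # cells of the row currently being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(tuple(self._row))
            self._row = None
        elif tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._in_cell:
            self._row.append(data.strip())

scraper = BillScraper()
scraper.feed(PAGE)
# scraper.rows is now [('H.R. 1', 'Introduced'), ('S. 47', 'Passed Senate')]
```

Even this toy version shows the fragility the ScraperWiki folks are tackling: the code assumes every `<tr>` holds exactly two `<td>`s in a fixed order, so a cosmetic redesign of the page silently breaks it.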
It's not particularly glamorous work. People who know how to make nice web pages typically know how to properly release their data. Those who don't tend to leave behind a mess of bad HTML. As a result, screen scrapers often contain less-than-lovely code. Pulling data often involves doing unsexy things like treating presentation information as though it had semantic value, or hard-coding kludges ("# ignore the second span... just because"). Scraper code is often ugly by necessity, and almost always of deliberately limited use. It consequently doesn't get shared very often -- having the open-sourced code languish sadly in someone's GitHub account is normally the best you can hope for.
The ScraperWiki folks realized that the situation could be improved. A collaborative approach can help avoid repetition of work. And since scrapers often malfunction when changes are made to the web pages they examine, making a scraper editable by others might lead to scrapers that spend less time broken.