It’s no secret that we’re dataphiles here at Sunlight, or that we want everyone to have access to the underlying...
Earlier today, we released an analysis of dark money spending in this year's election. For those who want to play around with the underlying data, here is the raw data: Dark Money Groups
When Mitt Romney claims, as he did in a private talk at a fundraiser for well-heeled donors, that 47 percent of Americans do not pay income taxes, where can one check his math? When President Barack Obama tells David Letterman and his audience that he doesn't know how much the national debt is, what's the best place to get the latest number? When a member of Congress claims federal spending has been cut to the bone, what's the best place to check that claim?
Activists expressed concern this week that several United Nations proposals to regulate the Internet would undermine freedom and give too much control over the World Wide Web.
Proposals to centralize Internet regulation will be discussed at the two upcoming U.N. winter conferences -- the Internet Governance Forum in November and the International Telecommunication Union (ITU) meeting in December.
But panelists at “Clear and Present Danger: Attempts to Change Internet Governance and Implications for Press Freedom,” a National Endowment for Democracy forum in Washington, argued for maintaining a more decentralized Internet.
“We see the potential to shift Internet governance away from ...
We're opening a new tool to the public today for beta testing, called Scout.
Scout is an alert system for the things you care about in state and national government. It covers Congress, regulations across the whole executive branch, and legislation in all 50 states.
You can set up notifications for new things that match keyword searches. Or, if you find a particular bill you want to keep up with, we can notify you whenever anything interesting happens to it -- or is about to.
Just to emphasize, this is a beta - it functions well and looks good, but we're really hoping to hear from the community on how we can make it stronger. You can give us feedback by using the Feedback link at the top of the site, or by writing directly to firstname.lastname@example.org.
I've put up a dataset on Github that maps popular search terms to bills in Congress. It's a simple, 5-column CSV designed to help people build better search engines that take user input and search for bills. The idea is that this will be useful to, and get contributions from, the community of people out there working with legislation and building tools around it.
It's humble - I started it out with a mere 7 rows, assigning the keywords "Obamacare", "SOPA", "PIPA", and "PPACA" to the appropriate bills. There are certainly more good candidates than that, so please contribute via pull request, or if you don't know how to do that, open an issue and talk about it with words.
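As a rough illustration of how a search tool might consume a term-to-bill mapping like this, here is a minimal Python sketch. The column names (`term`, `bill_id`), the bill ID format, and the sample rows are assumptions made for the example; the real file has five columns, which aren't listed in this post.

```python
import csv
import io

# Illustrative stand-in for the dataset. The column names and the
# "type+number-congress" ID format are assumptions, not the real schema.
SAMPLE_CSV = """term,bill_id
Obamacare,hr3590-111
PPACA,hr3590-111
SOPA,hr3261-112
PIPA,s968-112
"""


def load_term_index(csv_file):
    """Build a case-insensitive lookup from search term to bill IDs."""
    index = {}
    for row in csv.DictReader(csv_file):
        key = row["term"].strip().lower()
        index.setdefault(key, []).append(row["bill_id"].strip())
    return index


def bills_for_query(index, query):
    """Return the bill IDs mapped to a user's query, or an empty list."""
    return index.get(query.strip().lower(), [])


index = load_term_index(io.StringIO(SAMPLE_CSV))
print(bills_for_query(index, "obamacare"))
print(bills_for_query(index, "farm bill"))
```

A search engine could check this index first and fall back to ordinary full-text search when the query has no entry.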
Today we're launching 6° of Corporations, a new micro-site that provides some insight into the complicated area of corporate identity. It may sound trivial, but uniquely identifying a corporate entity is not easy. For federal contracting data (like in USASpending.gov), DUNS numbers are used to (supposedly) uniquely identify a contractor. However, there are problems not only with how DUNS numbers are issued and maintained, but also with the agency's use of DUNS numbers. To help illustrate this, we’ve created a visualization that shows the relationship between company names and company DUNS numbers in USASpending.gov.
Today at 4pm, the White House will host an online chat on how to improve the online experience with Federal...
Many of you have probably already seen that earlier today we stood up a copy of the Elena's Inbox code for the Sarah Palin email collection. You can find the site here. I think that by most reasonable standards, Sarah Palin is currently a less newsworthy figure than Justice Kagan was at the time of her confirmation. But there's no question that many people find her fascinating, and folks seem to really enjoy having this sort of interface available -- the response has been overwhelmingly positive, even in spite of its horrifying Gmail 1.0 look (for what it's worth, Sunlight's design team deserves absolutely none of the blame for this one!).
It's worth taking a moment to reflect on what it took to get this site online. The state of Alaska released Governor Palin's email records on paper. News organizations had to have people on the ground to collect, scan and OCR these documents. Our thanks go out to Crivella West, msnbc.com, Mother Jones and ProPublica, whose incredibly quick and high-quality work provided us with the baseline data that powers the site.
But it wasn't yet structured data. It was easy enough to convert the PDFs into text, though this introduced some errors -- dates from the year "20Q7", for instance. Then we had to parse the text into documents, each with recipients, a subject line, and a sender. This is trickier than it might seem. Consider the following recipient list:
To: Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson
It's parseable... sort of. It turns out that, in this case, "Andy Anderson" should be treated as a single entity. In this dataset, portions of names are delimited by semicolons, but so are names. It's a bit of a mess. Sunlight staff spent the better part of Monday performing a manual merge of the detected entities, collapsing over 6,000 automatically-captured people to fewer than half that number. I won't pretend that the dataset is now spotless, but it's considerably more structured than it used to be.
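To make the ambiguity concrete, here is a small Python sketch run against the recipient list above. It is not the parser we actually used; the merge rule (a lone word followed by another piece is treated as "surname; given name") is an assumed heuristic, and the function names are made up for the example.

```python
import re


def naive_split(recipients):
    """Split a recipient list on semicolons, the obvious first attempt."""
    return [piece.strip() for piece in recipients.split(";") if piece.strip()]


def merge_split_names(pieces):
    """Rejoin names that were themselves split by a semicolon.

    Assumed heuristic: a piece that is a single bare word (no space, no
    comma) is a surname whose given name is the next piece.
    """
    merged = []
    i = 0
    while i < len(pieces):
        piece = pieces[i]
        is_bare_word = " " not in piece and "," not in piece
        if is_bare_word and i + 1 < len(pieces):
            # Drop annotations such as "(GOV)" from the given-name piece.
            given = re.sub(r"\s*\([^)]*\)", "", pieces[i + 1]).strip()
            merged.append(f"{given} {piece}")
            i += 2
        else:
            merged.append(piece)
            i += 1
    return merged


line = "Smith, John; Jane Doe; Anderson; Andy (GOV); Paul Paulson"
pieces = naive_split(line)
print(pieces)                     # five pieces, one of them just "Anderson"
print(merge_split_names(pieces))  # four recipients
```

A rule like this guesses wrong whenever a lone surname really is a complete recipient, which is part of why a manual merge was still needed.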
And that structure makes possible not only novel interfaces like Sarah's Inbox, but also novel analyses. Consider this graph of how often the word "McCain" appears in the emails:
Interesting, right? More substantively, consider the efforts of Andree McCloud, who's raising questions about an apparent gap in the Palin emails near the beginning of the governor's term. With the data captured, it's easy to visualize this -- here's a graph of the total email volume in the system by week, beginning with the first week of December 2006, when Palin took office:
(To be clear, I don't think you can necessarily conclude from this graph that there's anything nefarious about that period's low email volume -- there are plenty of potential explanations. Still, it's useful to be able to understand the outlier period in the larger context of the document corpus.)
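For anyone who wants to reproduce a weekly-volume series like this one, here is a minimal Python sketch. The sample dates are invented for illustration and the function names are hypothetical; the point is that weeks with no mail are filled in with zeros, so a quiet stretch shows up as a dip instead of disappearing from the series.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def weekly_volume(sent_dates, first_day, last_day):
    """Count messages per week, including weeks with zero messages."""
    counts = Counter(week_start(d) for d in sent_dates)
    series = []
    week = week_start(first_day)
    while week <= last_day:
        series.append((week, counts.get(week, 0)))
        week += timedelta(weeks=1)
    return series


# Made-up sent dates, purely for illustration.
sample_dates = [date(2006, 12, 5), date(2006, 12, 6), date(2006, 12, 20)]
series = weekly_volume(sample_dates, date(2006, 12, 4), date(2006, 12, 24))
for week, count in series:
    print(week.isoformat(), count)
```

Filtering the messages to those containing a given word before counting yields a term-frequency series in the same shape.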
Of course, these analyses and interfaces could be even better if Alaska had just released the files digitally. In fact, if they had, we might be able to draw some more solid conclusions: as our sysadmin Tim pointed out, message headers' often-sequential IDs could conceivably show whether there actually are missing emails from those first few weeks.
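Tim's idea could be sketched roughly as follows, assuming each message carried a plain integer ID assigned in sequence. That is an assumption for illustration only; the paper release gives us no such headers, and real message IDs are often not simple counters.

```python
def missing_ranges(message_ids):
    """Return (first_missing, last_missing) pairs for gaps in the ID sequence."""
    ordered = sorted(set(message_ids))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > 1:
            gaps.append((previous + 1, current - 1))
    return gaps


# Hypothetical IDs; in practice they would come from the message headers.
observed_ids = [101, 102, 105, 106, 110]
print(missing_ranges(observed_ids))
```

A gap in the IDs would not prove that an email was withheld, since the same counter may also cover other accounts' mail, but it would show where to look.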
It's a shame that that didn't happen -- and not just because it meant my weekend was spent parsing PDFs. Releasing properly structured data ultimately allows everyone to do better work in less time. It's unfortunate that the authorities in Alaska introduced such a substantial and unnecessary roadblock.
But we at Sunlight can at least share what we've done to improve the situation. If you're interested in running your own analysis, you can find our code here, and the data to power it here (12M). At the moment it's in the form of a Django project -- if you need it in a different format, don't hesitate to ask on our mailing list. If you do something neat with it, please tell us!
Followers of this blog are probably already aware of two of the main sites developed by our Data Commons team: TransparencyData.com and InfluenceExplorer.com. Both sites present a variety of influence-related data sets, such as campaign finance, federal lobbying, earmarks and federal spending. Influence Explorer provides easy-to-use overview information about politicians, companies, industries and prominent individuals, while Transparency Data allows users to search and download detailed records from various influence data sets.
In this blog post I want to show how easy it can be to use the public APIs for both sites to integrate influence data into your own projects. I'll walk through a couple of examples and show how to use both the RESTful API and the new Python wrapper.