New York Times

 

Two principles to avoid common data mistakes

If David Brooks is correct, the “rising philosophy of the day” is “data-ism.” But you don’t have to believe David Brooks. Just look at the big data (e.g. Google Trends) on “big data.”

For the political junkies, data became sexy in 2012. First, the New York Times’ Nate Silver’s meta-analyses of polling data triumphed over the pundits’ “gut feelings.” Second, the Obama campaign successfully used data analytics to increase voter turnout. This caused people to pay attention (witness, for example, David Brooks’ new devotion to the subject as prime column-fodder).

Of course, for those of us in the transparency and accountability advocacy community, data has long been a prized commodity. And as governments around the world increasingly commit to open data promises, more and more data is becoming available.

At its best, data allows us to transcend our personal anecdotal experiences, giving us the big picture. It allows us to detect relationships and patterns that we wouldn’t otherwise see. Using data smartly can help us to make better decisions about both our own lives and our society.

But it’s important to understand that data and data analysis are merely tools. They can be used well, or they can be used poorly. It is remarkably easy both to mislead and to be misled by data. Hence the old adage: “There are three kinds of lies: lies, damned lies, and statistics.”

For many people, data can quickly overwhelm and confuse. It’s easy to misinterpret data, or to use it irresponsibly. We as humans are not particularly good at intuitively grasping large numbers, and our educational system generally does a poor job of helping us to counter this problem.

For that reason, I want to offer two basic principles that I think could prevent a majority of the data mistakes that I observe:

  1. Cherry-picking works better with fruit than data
  2. Correlation provokes questions better than it answers them

Let’s take these one at a time.


Cuomo's "Leave No Trace" Administration Casts Shadows Over NY Government

“Create Open NY” is the fourth item on a list of prominent issues New York Governor Andrew Cuomo highlighted as part of his agenda to “Clean Up Albany” -- “a comprehensive plan for how to fix the State government” that he released in June 2011, seven months after taking office. Although most of the "Open NY" section targets how to use technology to process and make public the “staggering amounts of valuable information” (page 65) the state possesses in a data catalog, other parts of “Clean Up Albany” also indicate ambitious updates to ethics and disclosure laws and their enforcement.

It almost makes you think that Cuomo cares about government transparency...at least, as long as it doesn’t apply to his office.

According to recent media reports, the Cuomo administration is doing everything in its power to reduce the amount of valuable information about its operations to zero. The reported abuses include limiting staff communications to telephone chats and untraceable Blackberry messages (rather than FOI-able email, text, or instant messages) and destroying records related to Cuomo’s service as state Attorney General. This kind of calculated, selective disclosure, if true, cannot be tolerated.

Last we checked, the materials generated by state executives and their senior staff are just as relevant to the public as the data about budget spending and contracts talked about in Cuomo’s “Open NY” report. Retaining these email communications and archival materials provides vital insight into the process of governance -- not just the approved outcomes. Although requiring disclosure for the communications of top officials can be complicated, we’ve put to rest whether or not the public right to access these records exists -- and we’ve created appropriate restrictions to allow for confidentiality and security exemptions. In this context, the need to go a step further -- to not (just) lock up public records but to prevent their existence in the first place -- is extreme.

To be fair, the letter of the law gives Governor Cuomo a long leash: According to New York law, outside a few specified documents, the governor's office is only legally responsible for retaining what he deems “of sufficient value for preservation.” And, to his credit, Cuomo didn’t accept all the slack: On July 2, 2012, he released a record-keeping policy outlining the various categories of records dealt with by his office and timelines for retention (when applicable).

But writing policy doesn’t create a clean slate, nor does it grant license to avoid the constraints of said policy (let alone to flout open records laws already on the books). The best reporting to date has covered Cuomo’s use of Blackberry PIN communication, a system that allows email-sized messages to pass from one Blackberry to another without leaving a traceable footprint. Actual email is reportedly left for nonsubstantive communications between staffers. Cuomo himself never touches the stuff.

Although Cuomo’s spokesman would prefer to pass these operations off as “normal, standard offices practices” to ensure confidentiality, let’s get real. It’s “normal” to use email. Email is subject to disclosure under public records laws and Blackberry PIN messages are not -- and neither are telephone conversations and other communication mediums that leave no trace or record of their existence. The media has speculated as to the motive behind the decision to operate this way -- lessons learned as NY Attorney General, a looming 2016 Presidential bid -- but Cuomo’s motivation is irrelevant. One doesn’t just stumble into conducting official business via recordless operations. The decision to do so is calculated and is an obvious attempt to evade standard disclosure requirements. (Remember: We created exemptions in our open records laws for a reason.)

History isn’t supposed to be flattering. It’s supposed to reflect, to the best extent possible, the events as they happened. Executive records are essential for understanding how and why decisions were made, and the context and working conditions in which discussions occurred. Sometimes these records reveal unsavory dealings. More often, they don’t. When emails from former Governor Sarah Palin were released to the public in June 2011, the wild scandals some people wanted to see just didn’t appear.

Palin’s emails were made public because, shortly after she was named a vice presidential candidate, various media organizations and individuals requested these records through Alaska’s public record law. Using those emails, Sunlight created a simple web tool -- Sarah’s Inbox -- that let you examine all the emails sent or received by Alaska’s 9th Governor in a familiar format.

It’s interesting to reflect that if the Cuomo administration continues to operate like a black hole, there will never be an Andrew’s Inbox.

“Open NY” is supposed to “use the power of digital information to bring about the beginnings of a new era of public participation in everyday governance” -- in other words, the opposite of the way the Cuomo administration operates. New Yorkers should demand more from their governor because, in his own words, “You can always have more transparency.”

The News Without Transparency - Region is Reshaped As Minorities Go to Suburbs

The American Community Survey is a project of the Census Bureau that collects demographic, economic, and other data from a random sampling of addresses in the United States and Puerto Rico on a regular basis. It informs decision making by government and business, and supports a variety of journalistic endeavors. It is also under attack by the United States House of Representatives.

Data from the survey helped inform a December 2010 New York Times analysis of population trends in the New York region. The piece, much of which would not have been possible to write without the data released by the Census Bureau, found that minority populations are rapidly expanding in the suburbs, while whites are moving back into denser urban neighborhoods.

The ACS, and the type of analysis that it enables, has wide appeal. In addition to journalists, the ACS is valuable to governments trying to allocate funds and provide essential services, as well as to businesses deciding where to locate, advertise and ship their products. ACS data is updated on a yearly basis, making it more dynamic and potentially more useful than normal census data. For example, Target uses ACS data to understand changing demographics at its urban, suburban and rural locations, and uses this analysis to stock its stores more efficiently and effectively. Meanwhile, academics in Portland, OR used ACS data to analyze and predict enrollment trends in Portland Public Schools. This sort of insight can help school systems allocate funding and other resources more effectively. The ACS has the support of business groups, like the U.S. Chamber of Commerce and the National Retail Federation, as well as community planners, librarians and a range of non-profit organizations.

However, the House of Representatives recently voted to defund the survey, arguing that it is an “intrusive...inappropriate use of taxpayer dollars.” Proponents of the survey contend that the data provides valuable insight about the state of the American economy and gives the US government and businesses a leg up over other nations that do not collect such detailed data.

Funding for the ACS is included in the appropriations bill for Commerce, Justice and Science programs. The Senate is expected to take up the bill soon. While it is unlikely to defund the ACS entirely, it may agree to a compromise with the House that will make the survey voluntary, a move experts say would increase its cost and lower the quality of its data. The Joint Economic Committee is holding a hearing on Tuesday, June 19th to explore the economic impact of ending or reducing funding for the ACS.

-----

"The News Without Transparency" shows you what the news would look like without public access to information. Laws and regulations that force the government to make the data it has publicly available are absolutely vital, along with services that take that raw data and make it easy for reporters to write sentences like the ones we've redacted in the piece above. If you have an article you'd like us to put through the redaction machine, please send us an email at rsibley@sunlightfoundation.com.

Announcing Sarah's Inbox

Today the Sunlight Foundation is proud to unveil Sarah's Inbox, our attempt to make Sarah Palin's recently released email records easier to use with a search function and an interface similar to Gmail. It builds on Elena's Inbox, our wildly popular project launched almost exactly one year ago that took the email data of Supreme Court Justice Elena Kagan released by the Clinton Library and made it more accessible online.

Sarah's Inbox allows users to view the more than 14,000 emails from Sarah Palin's tenure as Governor of Alaska with familiar sorting functions. You can go page by page starting from the most recent emails or, most importantly, search. To find interesting items, try some of our sample searches, star emails for later viewing, or view the emails most starred by all users.

The project started after we were again approached by folks on Twitter and the Sunlight Labs list (join!) to take this ugly data and add the Sunlight secret sauce to make it user friendly. Initially we were cautious because the cast of characters who directly obtained the data included the likes of the New York Times, ProPublica, Mother Jones and MSNBC.com. We spoke with ProPublica and they encouraged us to take a stab at fashioning our own tool, so we borrowed their data and went to work. Sarah's Inbox would not be possible without the great people at Crivella West, who gathered, lifted, scanned and paid for all this data.

Like Elena's Inbox, Sarah's Inbox faced staggering issues of data quality because government officials continue to release digital files as hideous printouts requiring a laborious and error-ridden optical character recognition (OCR) pass over. You will notice that many of the emails are garbled, incomplete or contain odd characters - please keep in mind that we did the best with what we had and are not responsible for the content. Due to the programmatic nature of the tools used to build this site, we recommend checking any research effort against the source files.

Disclaimers aside, please enjoy Sarah's Inbox and tweet interesting items you find with #sarahsinbox.

Is AFSCME or the Chamber the top political spender?

The Wall Street Journal brings an apple to the orange convention, writing that, "The American Federation of State, County and Municipal Employees is now the biggest outside spender of the 2010 elections, thanks to an 11th-hour effort to boost Democrats that has vaulted the public-sector union ahead of the U.S. Chamber of Commerce, the AFL-CIO and a flock of new Republican groups in campaign spending."

They may well end up being the top spender, but our data currently puts them at number 8, having spent a more modest $9.6 million--significantly less than the $87.5 million the Journal reports. The New York Times, meanwhile, reports that the top spender among non-party committees is the U.S. Chamber of Commerce at $21.1 million--very close to our own figure of $23.6 million (which, in fairness to the Times, is a constantly rising number). Why the discrepancy between the Journal's figures and those that the Times and we put out?

The Journal got its $87.5 million figure directly from AFSCME, and also got political spending totals from the U.S. Chamber of Commerce, Service Employees International Union and other groups. The Sunlight Foundation is totaling spending reported to the Federal Election Commission.

The disparities between what groups say they are planning to spend and what they've reported spending are troubling, to say the least. It's one of the reasons that we in the Reporting Group use formulations like "U.S. Chamber of Commerce reports spending $29.2 million on lobbying (which includes a wide range of political activities) in the third quarter of 2010," rather than saying "the Chamber of Commerce spent..."

Some time in 2011, we'll get more complete annual reports from the labor unions, which disclose their political spending; Forms 990 from groups like the U.S. Chamber of Commerce; and year-end reports from 527 organizations. From this information we will begin to piece together how much was spent on the mid-term elections. Even then, it can be daunting.

Let's take a look at one organization: The U.S. Chamber of Commerce disclosed, in the 2008 Form 990 it filed with the Internal Revenue Service, $23 million in election-related spending and another $4.7 million in lobbying. It reported to the Federal Election Commission spending $16.5 million on electioneering communications. The Chamber also disclosed to the House and Senate in 2008 that it spent $62 million on lobbying--defined as influencing legislation; participating in any political campaign, including state and local races; attempting to influence the public on political matters or elections; as well as contacts with certain high-ranking executive branch officials. Its affiliates spent about $31 million more. Which is the right number?

It might seem like we are mixing apples, oranges and xylophones here (FEC, IRS and lobbying disclosures), but remember, we're trying to get a handle on who spends the most on political activity. The IRS lobbying definition the Chamber uses when it files lobbying reports with Congress doesn't produce a number that matches the two separate numbers it reports to the Internal Revenue Service, or those two numbers added to the FEC number--16.5 + 4.7 + 23 comes to 44.2, which does not equal 62 (being a journalist, I've counted it out on my fingers twice to make sure). This doesn't mean that any of these individual numbers is wrong or inaccurate (although I suspect all of them are to some degree)--just that they report different things.
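For anyone who would rather not count on their fingers, here is a minimal Python sketch of that comparison. It assumes all four figures are in millions of dollars, as the surrounding text implies; the variable names are ours and purely illustrative.

```python
from decimal import Decimal

# Figures in millions of dollars, as quoted in the post (assumed units).
fec_electioneering = Decimal("16.5")    # reported to the FEC
irs_lobbying = Decimal("4.7")           # reported on the IRS Form 990
irs_election_related = Decimal("23")    # reported on the IRS Form 990
congress_lobbying = Decimal("62")       # disclosed to the House and Senate

# Add up the three itemized figures and compare with the lobbying disclosure.
itemized_total = fec_electioneering + irs_lobbying + irs_election_related
difference = congress_lobbying - itemized_total

print(f"FEC + IRS figures: ${itemized_total} million")
print(f"Difference from lobbying disclosure: ${difference} million")
```

Decimal is used instead of floats only so the sums print without rounding noise. The difference is not evidence of an error in any one filing; the disclosures simply measure different things.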

In the case of the Chamber, the lobbying disclosure form filed with the House and Senate comes closest to capturing total spending but offers no itemization. For labor unions, it's the annual report--form LM2--they file with the Labor Department, which, unlike lobbying disclosure forms, actually itemizes expenses.

So while one should treat the Wall Street Journal's numbers with a bit of skepticism--consider the source--it's not crazy to ask these groups what they're spending. I'm not sure I'd be comfortable saying that AFSCME has leapt into first place on their say-so, but one should also be careful to recognize that disclosure from the Federal Election Commission is by no means all-inclusive. For example, no U.S. Chamber of Commerce ad that ran more than 30 days before a primary or more than 60 days before the general election had to be disclosed anywhere--except to the local TV, radio or cable operator that ran it. But that's because Federal Communications Commission rules require that disclosure, not federal election law.

And to answer the question in the headline honestly, we'd have to say that at this point we just don't know. According to the FEC, it's the Chamber. According to the groups themselves, it's AFSCME. All we do know is what they are required to report to the FEC, which is the only tool we have right now for tracking outside spending.

Sunlight Live Recap: How We Did It

During the Health Care Summit on Thursday, Feb 25, Sunlight tried something new by connecting a live political event to the government data and information we work to make more accessible every day.

Dubbed "Sunlight Live," our pilot coverage of the joint Republican and Democratic health care summit was a smashing success, thanks to all of you.


New York Times' Represent Feature

The New York Times just launched a new interactive feature called Represent. Represent allows New York City residents to type in their address and receive a stream of political information for all of their elected representatives, from the City Council to the U.S. Senate. The information currently contained in Represent includes mentions in Times articles and congressional votes. It's very much like an EveryBlock for political coverage (and it wouldn't be a bad idea for EveryBlock to integrate this data into their local data streams). The Open blog at the Times explains:

Using your address as a starting point, Represent figures out which political districts you live in and who represents you at different levels of government. It draws maps that show how where you live fits into the political geography of the city. And using information collected from around the Web, it presents a customized activity stream that tracks what the people who represent you are doing. Represent crawls a collection of New York Times stories and City Room blog posts, looking for references to public officials. It also draws from official data sources — currently, Congressional roll-call votes, which we collect by parsing feeds and scraping government Web sites. It evaluates each article, blog post and vote to find the stories most relevant to you. (Both our article search and our Congressional votes database will soon be available to outside developers through free, open APIs.)
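The matching step the Open blog describes, scanning stories for references to the officials who represent you, might look roughly like the Python sketch below. The `Item` type, the `activity_stream` function and the official names are all hypothetical; this is not the Times' code, and it leaves out both the address-to-district lookup and the relevance ranking the real system performs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A story, blog post or vote record (hypothetical shape)."""
    title: str
    text: str


def activity_stream(officials, items):
    """Return (official, title) pairs for every item that mentions an official.

    Matching is a plain case-insensitive substring search, which is far
    cruder than whatever a production system would use.
    """
    stream = []
    for item in items:
        haystack = f"{item.title} {item.text}".lower()
        for name in officials:
            if name.lower() in haystack:
                stream.append((name, item.title))
    return stream


# Made-up officials and stories, purely for illustration.
my_officials = ["Jane Doe", "John Roe"]
stories = [
    Item("Council vote on zoning", "Council member Jane Doe backed the bill."),
    Item("Weekend weather", "Rain is expected on Saturday."),
]
for official, title in activity_stream(my_officials, stories):
    print(f"{official}: {title}")
```

A real version would need to handle nicknames, titles and officials who share a surname, which is presumably where the relevance evaluation the Times mentions comes in.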

The fact that the Times is launching something that serves not just as a supplement to coverage, but also as a public service, shows the direction that large, traditional media sources are heading as they shrink in print and expand online. Another example would be the Washington Post's congressional votes database. The Post is also experimenting with Apture to provide greater context in its political coverage. I can only imagine that we'll be seeing a lot more information integration from large traditional news organizations in the coming years.

New York Times Opens Archives Online

Update: For some reason it appears the Times has pulled this awesome research tool. I'll try to find out why.

The New York Times launched an amazing research tool, creating a great online browser for all their content from 1851-1922. The Times is also offering the data through an API so that, if you have the skills, you can create your own browser. The Times blog says:

"As part of eliminating TimeSelect, The New York Times has decided to make all the public domain articles from 1851-1922 available free of charge. These articles are all in the form of images scanned from the original paper. In fact from 1851-1980, all 11 million articles are available as images in PDF format. To generate a PDF version of the article takes quite a bit of work — each article is actually composed of numerous smaller TIFF images that need to be scaled and glued together in a coherent fashion."

If you do research - or are in any way in need of scanning the 1855 adverts for local New York haberdashers - this is not to be missed. Check out the TimesMachine. (There might be some kind of server problems right now.)

The article to the left references a large-scale congressional investigation into lobbyists' attempts to block President Woodrow Wilson's tariff bill, a key element of his New Freedom agenda. The investigation sought to discover whether Senators had been bribed or unduly influenced by these lobbyists, and ultimately required every sitting Senator to testify about their personal finances, campaign contributions, and relationships with lobbyists and other company agents. This amounted to the first full disclosure by members of Congress regarding their personal finances, their campaign contributors, and the nature of the lobby. A first for transparency in Congress.
