Two principles to avoid common data mistakes
If David Brooks is correct, the “rising philosophy of the day” is “data-ism.” But you don’t have to believe David Brooks. Just look at the big data (e.g. Google Trends) on “big data.”
For the political junkies, data became sexy in 2012. First, the New York Times’ Nate Silver’s meta-analyses of polling data triumphed over the pundits’ “gut feelings.” Second, the Obama campaign successfully used data analytics to increase voter turnout. This caused people to pay attention (witness, for example, David Brooks’ new devotion to the subject as prime column-fodder).
Of course, for those of us in the transparency and accountability advocacy community, data has long been a prized commodity. And as governments around the world increasingly commit to open data promises, more and more data is becoming available.
At its best, data allows us to transcend our personal anecdotal experiences, giving us the big picture. It allows us to detect relationships and patterns that we wouldn’t otherwise see. Using data smartly can help us to make better decisions about both our own lives and our society.
But it’s important to understand that data and data analysis are merely tools. They can be used well, or they can be used poorly. It is remarkably easy both to mislead and to be misled by data. Hence the old adage: “There are three kinds of lies: lies, damned lies, and statistics.”
For many people, data can quickly overwhelm and confuse. It’s easy to misinterpret data, or to use it irresponsibly. We as humans are not particularly good at intuitively grasping large numbers, and our educational system generally does a poor job of helping us to counter this problem.
For that reason, I want to offer two basic principles that I think could prevent a majority of the data mistakes that I observe:
- Cherry-picking works better with fruit than data
- Correlation provokes questions better than it answers them
Let’s take these one at a time.
Cherry-picking works better with fruit than data
It’s actually really easy to prove your point if you limit the cases to just those that prove your point. Problem is, it’s not really proving your point. It’s just selecting the cases that prove your point. Data scientists call this selection bias.
In this post, I’ll cover two common problems in selection bias: 1) Non-representativeness; and 2) Selecting on your outcome variable. Non-representativeness is the broader problem. Selecting on your outcome variable is a more specific type of non-representativeness. So let’s start with the general problem of non-representativeness.
To discuss representativeness, I’m going to use an extended example that will be familiar to many people: polling in the 2012 U.S. presidential election.
Say we wanted to know how likely Barack Obama was to defeat Mitt Romney before the election. We could either ask a bunch of pundits what they thought, or we could take a nationally representative survey of likely voters. I’ll take the nationally representative sample any time.
A typical poll will sample about 1,000 adults. These 1,000 adults are supposed to stand in for an entire country of voters, and the law of large numbers makes it a pretty good bet that if the sample is representative, 1,000 observations is good enough for the whole country. But being representative is the key. And pollsters try very hard to make sure that their samples are representative – that is, that the sample looks like the country at large on the key variables that might be relevant, such as age, gender, ideology, income, location, ethnicity, etc. Still, different polling agencies have had different ideas about what a representative sample should look like, which sometimes leads to different results.
The now-famous Nate Silver did the pollsters one better. He aggregated all the polling data into one super-poll, getting the biggest sample possible, and thus taking even more advantage of the law of large numbers. He also looked at how well different polling agencies had performed in the past, and gave extra points for those whose predictions more closely matched election-day results, while devaluing those polling agencies that were consistently off. The assumption here was that the polling agencies that did better probably used more representative samples.
It’s key here to understand that the default assumption of most statistics is that things are basically random, like a flip of a coin. It’s only when a result would be very unlikely under pure chance that modern statistical analysis will allow you to say that this doesn’t look like a random coin flip anymore: Something else is probably going on. By convention, “very unlikely” means something chance alone would produce less than 1 time in 20; a fair coin showing heads 15 or more times out of 20 flips clears that bar. And the more you flip the coin and it turns up heads, the more certain you can be that something other than randomness is at work.
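To put rough numbers on the coin example, here is a minimal sketch that computes the exact chance of seeing at least a given number of heads from a fair coin. The helper name `prob_at_least` is ours, chosen for illustration; it applies the standard binomial formula and nothing more.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more heads in n flips of a coin with P(heads) = p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance that a fair coin shows at least 15 heads in 20 flips (about 2 in 100).
print(prob_at_least(15, 20))

# Chance that a fair coin shows at least 19 heads in 20 flips (about 2 in 100,000).
print(prob_at_least(19, 20))
```

The smaller that probability, the harder it is to keep believing the coin is fair.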
That’s why it’s good to have many observations: the more you can observe something happening over and over again, the more likely it is that you are observing something that is really happening, and not just based on chance. It is much more likely to get 10 heads in a row in a coin toss than it is to get 1,000 heads in a row. That’s why it’s better to poll 1,000 people than 10 people, and even better to combine 10 polls to get 10,000 people.
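A small simulation can illustrate why bigger samples help. The sketch below assumes a made-up electorate in which 51% support a candidate (`true_support` is a hypothetical figure, not a real polling number), runs many simulated polls at each sample size, and reports how much the poll results bounce around.

```python
import random
import statistics

def poll_spread(sample_size, true_support=0.51, n_polls=200, seed=0):
    """Standard deviation of simulated poll results for a given sample size."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_polls):
        # Each respondent independently supports the candidate with
        # probability true_support (an idealized random sample).
        yes = sum(rng.random() < true_support for _ in range(sample_size))
        estimates.append(yes / sample_size)
    return statistics.pstdev(estimates)

for n in (10, 1000, 10000):
    print(n, round(poll_spread(n), 4))
```

The spread shrinks as the sample grows, but only if every simulated respondent is drawn fairly from the whole electorate, which is exactly the representativeness assumption.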
This goes for more than just polling. If you observe anything happen in just a few cases, you have no idea whether it was just a random occurrence. But the more you can document it happening, the surer you can be it’s not just a random occurrence.
Recall that Nate Silver’s outputs were all in terms of probability. In the final days, liberals were enthused as the chance of Obama’s victory rose to 90.9% on Election Day. How did Silver calculate this?
Silver knew that each poll taken was not perfect. Most polls reported a range of error. Look closely, for example, at the fine print in the final Gallup 2012 national tracking poll, showing Romney up 49% to 48%: “For results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of error is ±2 percentage points.” What Gallup is admitting is that even with 3,117 adults surveyed in this poll, the results might be off a little. Probably (95% chance) they are off by less than 2 percentage points. But there’s also a 5% chance they’re off by more than this.
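As a rough check on where a figure like ±2 points comes from, the sketch below uses the textbook margin-of-error formula for a simple random sample. That is an assumption on our part: Gallup's own calculation may include weighting or design adjustments not modeled here, and `margin_of_error` is an illustrative helper name.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p from a simple random sample of n."""
    return z * sqrt(p * (1 - p) / n)

# Sample size from the Gallup poll quoted above; result is in percentage points.
print(round(100 * margin_of_error(3117), 2))

# A typical 1,000-person poll, for comparison.
print(round(100 * margin_of_error(1000), 2))
```

With 3,117 respondents the formula gives a little under 2 points, consistent with the reported maximum of ±2.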
You can think of it this way: If Gallup ran this poll 100 times, they would get a range of results. Most common would be Romney up 49-48, but you’d also get a fair number of Obama 49-48 scores, and occasionally an even wider split (maybe a Romney 52-46 here and there, or an Obama 51-47).
What Silver did was to pay attention to these reported error ranges and then run a bunch of simulations to generate the likelihood of these different possible outcomes. What he asked was this: Given all the polls in all the states and their range of potential outcomes, what was the likelihood of Obama winning enough states to win the Electoral College? And even though most of the polls in the key swing states showed Obama ahead, on Election Day there was still a 9.1% chance that Romney would win based on polling data.
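The general idea of simulating outcomes from uncertain polls can be sketched in a few lines. This is a toy illustration only, not Silver's actual model: the states, electoral votes, polling leads, and error sizes below are all invented, and it treats each state's polling error as independent, which a serious forecast would not assume.

```python
import random

# Hypothetical swing states: (electoral votes, candidate A's polled lead in
# points, standard error of that lead in points). All values are invented.
SWING_STATES = {
    "State 1": (18, 3.0, 3.0),
    "State 2": (29, 0.5, 3.0),
    "State 3": (13, 1.5, 3.0),
    "State 4": (10, 4.0, 3.0),
}
SAFE_VOTES_A = 237   # hypothetical electoral votes A is assumed to have locked up
VOTES_TO_WIN = 270   # majority of the 538 electoral votes

def win_probability(n_sims=10000, seed=0):
    """Share of simulated elections in which candidate A reaches VOTES_TO_WIN."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_sims):
        votes = SAFE_VOTES_A
        for electoral_votes, lead, std_err in SWING_STATES.values():
            # Draw one plausible "true" margin consistent with the poll.
            if rng.gauss(lead, std_err) > 0:
                votes += electoral_votes
        if votes >= VOTES_TO_WIN:
            wins += 1
    return wins / n_sims

print(win_probability())
```

Even though candidate A leads in every hypothetical state, the simulated win probability stays well short of 100%, because small leads can vanish within the polls' error ranges.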
Hopefully this extended example helps to make one big point: the more data you have, the more likely it is to provide a true picture of the world. Of course, even with a very large set of data, it is still impossible to be 100% sure it is accurate, so if you plan to infer anything from it, you need to recognize that there is some chance you might be wrong. But that chance decreases as your sample size increases.
But it’s not just more data that we need. We also need representative data. For example, if you wanted to generate poll results showing Mitt Romney would handily defeat Barack Obama, you could simply sample visitors to FoxNews.com. Or if you wanted to prove Latinos support Romney, you could find 500 Latinos who were voting for Romney, and trot them out as proof. Both of these strategies would obviously fail the representativeness test.
Sometimes it’s not so obvious. Consider this example: In December 2011, Public Campaign put out a report entitled “For Hire: Lobbyists or the 99%: How Corporations Pay More for Lobbyists Than in Taxes.” It revealed 30 big corporations that had “paid more to lobby Congress than they paid in federal income taxes for the three years between 2008 and 2010, despite being profitable.”
If your mind makes the leap to think that these companies used their lobbying to pay lower taxes, you are following the natural path of intuition. Yet there is nothing in this study to help us to make such an inference in a statistically-defensible way.
It may indeed be an eye-popping fact that 30 companies pay more to lobby than in taxes over a given period of time. But how many companies that don’t lobby also paid very little in taxes? And how many companies that lobbied at equal or greater levels than the selected companies paid higher taxes? And was this a result of picking three years of a recession, when many corporations took many investment losses? This study does not tell us. We don’t know anything about the representativeness of these 30 companies, so we can’t say whether this is part of a larger pattern, or just a random occurrence.
Indeed, there could be many reasons why companies might pay little or no taxes. To know if lobbying is associated with lower taxes, you want to make sure you have a representative set of companies, not just those companies selected because they lobby and pay nothing in taxes. Selecting only these companies to prove that lobbying lowers taxes is the equivalent of selecting 500 Latinos supporting Mitt Romney and proclaiming that all Latinos support Mitt Romney.
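To see why a hand-picked list proves little, consider a simulated world in which lobbying and taxes are unrelated by construction. The sketch below is built entirely on invented numbers (the dollar amounts and distributions are assumptions, not data from the Public Campaign report), yet it still produces 30 eye-popping companies.

```python
import random
import statistics

def simulate_companies(n=5000, seed=0):
    """Return (lobbying, taxes) pairs in $ millions, drawn independently of each other."""
    rng = random.Random(seed)
    return [(rng.lognormvariate(0, 1), rng.lognormvariate(2, 1.5)) for _ in range(n)]

companies = simulate_companies()

# Cherry-pick: keep only firms that spent more on lobbying than on taxes,
# then take the 30 most striking cases.
cherry_picked = sorted(
    (c for c in companies if c[0] > c[1]),
    key=lambda c: c[0] - c[1],
    reverse=True,
)[:30]
print(len(cherry_picked), "companies 'paid more to lobby than in taxes'")

# Representative comparison: split ALL firms at the median lobbying level
# and compare the typical tax bill in each half.
median_lobbying = statistics.median(l for l, _ in companies)
taxes_heavy = statistics.median(t for l, t in companies if l > median_lobbying)
taxes_light = statistics.median(t for l, t in companies if l <= median_lobbying)
print(round(taxes_heavy, 2), round(taxes_light, 2))
```

The cherry-picked list exists even though lobbying has no effect here, while the comparison across all companies shows heavy and light lobbyists paying roughly similar taxes.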
If you want to make the case that lobbying lowers taxes, you would either want to take a pre-selected list of companies (say the Fortune 100) or a full sample of all companies. Studies that follow either of these approaches actually do find a relationship between more lobbying and lower tax rates.
In conclusion, if you want to be able to say something meaningful with your data, you need data that is actually representative. If you cherry-pick your data to only include the cases that prove your hypothesis, nobody should believe you. If you want to be able to infer something about how the world works, you need data that resembles the diversity of the actual world you care about. Otherwise, it’s just anecdotes.
Correlation provokes questions better than it answers them
Of course, just because two things are correlated does not prove that one causes the other. To build from the example above, companies that pay less in taxes could also be the companies that lobby the most, and yet there might be another explanation: perhaps they have more money to lobby if they pay less in taxes. Perhaps they are in certain industries with both lower taxes and higher regulations. And so on.
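The industry explanation can be illustrated with a simulation. In the sketch below, which uses two hypothetical industries and invented numbers, industry alone determines both how much a company lobbies and what tax rate it pays. Within each industry, lobbying and taxes are unrelated.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def simulate(n_per_industry=2000, seed=0):
    """Rows of (industry, lobbying in $M, tax rate in %); all values invented."""
    rng = random.Random(seed)
    rows = []
    # Industry A: heavily regulated, lobbies a lot, low tax rates.
    # Industry B: lightly regulated, lobbies little, higher tax rates.
    for industry, lobby_mean, tax_mean in (("A", 5.0, 15.0), ("B", 1.0, 30.0)):
        for _ in range(n_per_industry):
            rows.append((industry, rng.gauss(lobby_mean, 1.0), rng.gauss(tax_mean, 5.0)))
    return rows

rows = simulate()
overall = pearson([r[1] for r in rows], [r[2] for r in rows])
within = {
    ind: pearson([r[1] for r in rows if r[0] == ind], [r[2] for r in rows if r[0] == ind])
    for ind in ("A", "B")
}
print(round(overall, 2), {k: round(v, 2) for k, v in within.items()})
```

Pooled together, the companies show a strong negative correlation between lobbying and tax rates. Looked at one industry at a time, the correlation all but disappears, which is what a hidden common cause looks like.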
In data, describing the world accurately is the relatively easy part. It mainly involves making sure you have good data that is representative of a broader population. (Of course, this is not always that easy, given widespread problems of poor data quality.)
But the why is the really hard stuff, the stuff that social scientists scratch their heads and pull their hair out and argue endlessly over. Explaining the why requires first showing that two trends are indeed correlated – that is, that they occur together in a predictable and reliable pattern. But much more difficult, it requires that we eliminate all other possible explanations for the observed pattern. Modern statistics has developed a number of ways of doing this, but a discussion of them is far beyond the scope of this post. Safe to say, it’s not hard to find examples of important things that statistical analyses disagree on.
Perhaps the most oft-quoted statistics wisdom is that correlation is not causation. That is, just because two things occur together does not mean that they are causally related. But sometimes they are. You can’t have causation without correlation.
How can you get from correlation to causation? It’s not an easy road, but here are three basic questions to guide you along that road:
- Is it just random coincidence?
- Do I have a convincing story? If so, what else would be true?
- Are there other explanations I can rule out?
The first question is whether two things simply happened together at random. A great example of this is the supposed “Redskins Rule,” which states that the outcome of the U.S. presidential election can be predicted by the performance of the Washington Redskins the Sunday before election day: As the Redskins perform, so does the incumbent party. Between 1940 and 2000, it was correct in every election – a remarkable streak of 16 straight elections. Statistically improbable, but so are the other 31 “football” rules somebody was able to come up with to predict the Presidential outcome. None had any causal story. As it turned out, 18 of the 31 rules were wrong in 2012 – as was the “Redskins Rule.”
The simple explanation is that if you observe enough phenomena, you’re going to get things that happen together by chance (this is a problem in medical science). In fact, the more unlikely outcome is to not observe occasional random patterns. One way to tell if a series of coin flips is being faked is if there are not enough long runs of heads or tails.
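The point about long runs is easy to check by simulation. The sketch below flips a simulated fair coin 200 times, records the longest streak of identical results, and averages that over many trials. The function names and the choice of 200 flips are ours, for illustration.

```python
import random

def longest_run(flips):
    """Length of the longest streak of identical consecutive values."""
    if not flips:
        return 0
    best = current = 1
    for prev, nxt in zip(flips, flips[1:]):
        current = current + 1 if nxt == prev else 1
        best = max(best, current)
    return best

def average_longest_run(n_flips=200, n_trials=2000, seed=0):
    """Average longest streak across many simulated sequences of fair coin flips."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        flips = [rng.random() < 0.5 for _ in range(n_flips)]
        total += longest_run(flips)
    return total / n_trials

print(average_longest_run())
```

Genuinely random sequences of this length usually contain a streak of seven or so in a row, which is longer than most people would dare to write down when inventing "random" flips by hand.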
But here’s a quick rule of thumb: if there’s no reason you can think of why two things might occur together, it’s pretty likely you’re observing a random pattern.
Some things that happen together seem more likely than others. One reason we might assume that companies’ low tax rates reported above had to do with their lobbying is that we can believe that one reason companies lobby in Washington is to lower their taxes. Unlike the “Redskins Rule,” the lobbying-for-lower-taxes story makes sense. But just because something makes sense does not mean it is necessarily true.
A convincing story requires evidence, very much in the same way that convicting a defendant requires evidence. Correlation is only Exhibit A. A good data scientist should get a little further into the alphabet. For example, to go on the tax and lobbying example a little further, if lobbying caused lower tax rates, what else might we want to know about these companies?
For example, we might want to see if they indeed lobbied on tax-related legislation, and if so whether they were at all successful. We might want to look at their accounting and see how much their low taxes were due to taking advantage of the tax code as opposed to legitimate business losses. We might want to know if these companies lobbied more and paid lower taxes than similar companies in their industry. We might want to know how these companies’ tax rates have changed over time. Some of these things are easier than others to get data on. But as more data become available, there are more opportunities for adding more evidence, and making a stronger case.
Take the classic problem of money in politics. We at Sunlight run plenty of stories describing how much money different politicians take from different interests. These relationships presumably tell us something. But the causality of money in politics is complicated. Is money intended to shift votes? Is it intended to gain or maintain access? Is it intended to reward friends? Is it given to win elections? Is it given as part of an arms race in which increasing contributions on two sides cancel each other out? Do industries that have a harder case to make contribute more money? These are difficult econometric problems to solve, and it’s almost impossible in any instance to show that money caused an outcome.
With any correlation, it’s usually pretty easy to spout out a number of explanations if you know something about the subject. Some of these will be more convincing than others. But to the extent that there are convincing theories out there, often there are also data out there that can disprove them.
Correlation is important, because it provokes questions. It tells us that two things might be related, and that we might want to know more. There’s no causation without correlation. The responsible thing to do when we encounter correlation is to think through it honestly and reasonably. Admit that there are certain things we don’t know, offer our best educated guesses, and think about how we might know more. Ask good questions, rather than assume we have easy answers.
As data become more and more ubiquitous, it becomes more and more necessary to understand these basic principles. So, the next time you encounter data analysis you can ask: Is the data behind this analysis representative? Or were the cases selected just to prove a point? If somebody is claiming causation from correlation, do they have a convincing story? Have they tested out other elements of that story? Have they ruled out other explanations?
The important thing with data is to be humble. It’s easy to over-interpret, and to generate meaningless or even wrong conclusions. But there are also tremendous insights data can provide, especially if we are careful and responsible about it.
Ultimately, data analysis works best as an enterprise beyond our individual selves, a larger conversation where we share data and hypotheses. The more data we have and the more people who are looking at the data – visualizing it, testing hypotheses, finding correlations – the closer we will come to understanding what it really tells us.