Quantifying Data Quality


You've already heard me complain about data quality -- how it's a bigger problem than most people realize, and a harder problem than many people hope. But let's not leave it there! Perfect datasets mostly exist in textbooks and computer simulations. We need to figure out what we can do with what we have. In this and other posts, I hope to give the developers in our community some idea of how they can deal with less-than-perfect data.

The first step is to figure out how bad things actually are. To do that, we'll use some simple statistics -- those of you with a strong statistics background can skip to the next entry in your RSS reader (or better yet, correct my mistakes in the comments).

