
Henning

Think bigger about data quality

There is a lot of thinking and writing about data quality nowadays, but much of it treats data quality as something akin to a KPI.
Metadata quality score

In some respects, we have to look at it that way. Data contains multitudes, but we have very little capacity to understand, let alone communicate, all the nuances.

But quality is not a percentage. For all practical purposes, quality must be seen through the lens of what you use the data for.

Fortunately, there is a trove of prior work on this, hidden in plain sight: National statistics, survey data, the Total Survey Error framework, and recent (read: last 10-15 years) work to adapt thinking on survey errors to new register data.

Total Survey Error

Total survey error model

Surveys have been around for a long time, and there is a lot of academic work around how to measure errors. I'm sure someone would be able to find a semantic distinction between quality and errors, but for our purposes, we can think of this as a model for evaluating data quality in a survey.

I know, we don't do surveys (usually), and if you do there is a good chance you outsource the job to someone who knows what they are doing.

But surveys are, conceptually, as good a place as any to start thinking about data quality - because they force you to answer two questions: what are you asking, and who are you asking?

The Total Survey Error framework helps you think critically about what errors can be introduced as you go from formulating a question to surveying people to processing the results. Many of us are used to thinking about sampling errors, and they are beautiful: we can do the math, come up with a probability range, add error bars, and look really smart. But the other errors don't usually come with error bars.
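The sampling-error math is the one part that really is this easy. A minimal sketch using the standard normal-approximation formula for a sample proportion (the survey numbers are made up):

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a sample proportion, normal approximation, ~95% by default."""
    return z * math.sqrt(p * (1 - p) / n)

# 52% "yes" among 1,000 respondents: roughly +/- 3.1 percentage points
print(round(margin_of_error(0.52, 1000), 3))  # 0.031
```

This is where the error bars come from - and note that none of the other error types in the framework reduce to a formula like this.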

Adjacent to the sampling error is coverage error - basically, whether the people you can reach are the kind of people you want an answer from, or someone completely different.

Similarly, there is nonresponse error - basically, you need to correct for any pattern in who doesn't answer.

Then, there is the question itself. Is it understood correctly? Does the person know the answer? Misremember? Lie? And lastly, after you have gathered the responses, are they processed correctly? Does the OCR tend to register 1s as Is?

But what does any of this have to do with you?

From surveys to... that other thing

Most of the data we use today aren't surveys. Which means we probably don't have to deal with sampling errors. But most of the other errors have parallels in the business world.

There are still processing errors. The risk of processing errors may even be much higher in business, because survey data tends to have a straightforward structure, while business data can be organized in very complex data models optimized for something entirely different from analysis.

The parallel to validity is that the measures in the business data might differ from what you actually want to answer. Maybe you sell furniture and want to know the size of people's houses, but the gross square footage you have includes all areas covered by a roof - including garages, sheds etc. You will overestimate the potential sales for customers with big garages and sheds, but it's still valuable information.

Measurement errors still exist: someone could have jotted down the wrong number, there could have been an error when old physical records were digitized, or maybe a current owner is trying to evade taxes by reporting a much lower square footage.

It's similar for representation: the group of people you want to study might not be the group of people you have data on. If you want to know what proportion of a country's population has higher education, having graduation data from universities might be a really good start. But some people got their education abroad. And some people might have moved abroad after graduating. So you have not just a subset, not just a superset, but a largely overlapping set. For your purpose, this is a quality issue. For someone else, it might be perfect.
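The overlapping-set situation can be made concrete with plain set operations. A toy sketch - the names and groupings are invented for illustration:

```python
# Hypothetical toy data: who we have records on vs. who we want to study.
domestic_graduates = {"ana", "bo", "cy", "dee"}  # graduated from a domestic university
current_residents = {"ana", "bo", "eli", "fay"}  # the population we actually care about

covered = domestic_graduates & current_residents      # usable records
emigrated = domestic_graduates - current_residents    # graduated, then moved abroad
not_covered = current_residents - domestic_graduates  # educated abroad, or not at all

# Neither set contains the other: the data is a largely overlapping set,
# not a subset or a superset of the target population.
```

Whether `not_covered` and `emigrated` are acceptable depends entirely on what question you're answering - which is the point.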

Time is also a potential problem. Data only goes back so far, or maybe there are unfixable breaks in the data, rendering older data useless. This isn't coverage per se, and it isn't nonresponse, but it is a problem. Try coming up with a name for it.

There are more elaborate attempts at adapting the TSE framework to administrative data; see Zhang 2012 (paywalled) or a brief overview in this slide deck.

An illustration of Total Survey Error adapted for administrative data, from Zhang:

Total administrative data error framework

Note that we have gone through all of this without calculating a single percentage or KPI, and without trying to quantify anything. This is all just conceptual.

Turning the tables

But as a data producer, what do you do? Do you just throw your hands up when someone asks you about the data quality, because you don't know what they need the data for?

There are of course some things you can do. Measurement errors and quirks in the data collection can be described. Make sure to include special values and weird modes, like transactions of $0 or a negative house value. A negative house value can be a glaring data quality issue if it isn't explained, or a valuable feature of the data if you know what it means.
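One way to act on this as a producer is to document special values instead of silently dropping or "fixing" them. A sketch - the records, field names, and notes are all invented for illustration:

```python
# Annotate suspicious values rather than removing the rows.
houses = [
    {"id": 1, "value": 450_000},
    {"id": 2, "value": 0},         # does 0 encode "value unknown"?
    {"id": 3, "value": -120_000},  # a correction entry, or an error?
]

def annotate_special_values(records):
    """Attach a note to suspicious values; keep every record."""
    out = []
    for r in records:
        note = None
        if r["value"] == 0:
            note = "zero value - check whether this encodes 'missing'"
        elif r["value"] < 0:
            note = "negative value - glaring issue, or a documented feature?"
        out.append({**r, "note": note})
    return out
```

The point of returning annotated records rather than filtered ones is that the consumer, not the producer, decides whether a negative value is a defect or a feature.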

But... what about reality?

We like to say that high-quality data represents reality truthfully, but reality is often a red herring. Not always, of course - something like a person's age is fairly simple. If the data says someone is 58 trillion years old, or if their zip code contains emojis, you can assume there is a data quality issue. So yes, there are easy things that you can document and call data quality and be happy.
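The "easy" checks are easy to automate, too. A sketch, assuming integer ages and US-style five-digit zip codes (adjust the pattern for other countries):

```python
import re

# Sanity checks for the obvious issues: implausible ages, malformed zip codes.
ZIP_RE = re.compile(r"^\d{5}$")

def plausible_age(age):
    """An age should be a whole number of years in a humanly possible range."""
    return isinstance(age, int) and 0 <= age <= 120

def plausible_zip(zip_code):
    """A zip code should be exactly five digits - no letters, no emojis."""
    return bool(ZIP_RE.match(zip_code))

plausible_age(58)                   # True
plausible_age(58_000_000_000_000)   # False: 58 trillion is not a human age
plausible_zip("902🏠0")             # False: emojis are not digits
```

Checks like these are the part of data quality that can be documented once and applied mechanically - the rest of the article is about everything that can't.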

But data doesn't necessarily have poor quality just because it is more complex than someone thinks, or is intended for a different purpose.

One month a year, my salary is negative. I make a negative amount of money. Of course, I don't really - but it looks like that on my paycheck. I get paid for one month, but get deducted for five weeks of vacation - which is more than one month, and so I am paid negative money. This is, in a way, an artifact. Do I actually get a bill from work that month? No, of course not. Technically, I don't have paid vacation; instead of normal pay I get a vacation allowance, which is usually higher than my normal pay - but it isn't technically salary. The total amount of money paid to me that month is higher than normal, but the salary part of my paycheck is negative.

Is my salary that month a data quality issue? Or is the negative amount a truthful representation of reality?
