Congratulations, you’ve successfully implemented data testing in your pipeline! Whether that’s using an off-the-shelf tool or home-cooked validation code, you know that securing your data through data testing is absolutely crucial to ensuring high-quality reliable data insights, and you’ve taken the necessary steps to get there. All your data problems are now solved and you can sleep soundly knowing that your data pipelines will be delivering beautiful, high-quality data to your stakeholders! But wait… not so fast. There’s just one detail you may have missed: What happens when your tests actually fail? Do you know how you’re going to be alerted? Is anyone monitoring the alerts? Who is in charge of responding to them? How would you be able to tell what went wrong? And… how do you fix any data issues that arise?
As excited as data teams might be about implementing data validation in their pipelines, the real challenge (and art!) of data testing is not only how you detect data problems, but also how you respond to them. In this article, we’ll talk through some of the key stages of responding to data tests, and outline some of the important things to consider when developing a data quality strategy for your team. The diagram below shows the steps we will cover:
- System response to failure
- Logging and alerting
- Alert response
- Root cause identification
- Issue resolution
- Stakeholder communication (across several stages)
The first line of response to a failed data test, before any humans are notified, is the system’s automated response to the failure, which decides whether and how to continue the pipeline run. This could take one of the following forms:
- Do nothing. Continue to run the pipeline and simply log the failure or alert the team (more on that below).
- Isolate the “bad” data, e.g. move the rows that fail the tests to a separate table or file, but continue to run the pipeline for the remainder.
- Stop the pipeline.
The system response can also vary depending on the level of severity of the detected issue and the downstream use case: Maybe it’s okay to keep running the pipeline and only notify stakeholders for certain “warning” level problems, but it should absolutely not proceed for other, “critical”, errors.
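To make the three options above a bit more concrete, here is a minimal sketch in plain Python. The `Action` enum, the `PipelineHalted` exception, and the `respond_to_failures` function are hypothetical names made up for this illustration, not part of any particular tool, and rows are assumed to be simple dictionaries.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"  # do nothing beyond logging
    ISOLATE = "isolate"    # move failing rows aside, keep going
    STOP = "stop"          # halt the pipeline run


class PipelineHalted(Exception):
    """Raised when a failed check should stop the pipeline run."""


def respond_to_failures(rows, is_valid, action, log):
    """Apply the configured response and return (rows_to_process, quarantined_rows)."""
    bad = [row for row in rows if not is_valid(row)]
    if not bad:
        return rows, []

    # Every failure is logged, whatever the response is.
    log.append(f"{len(bad)} of {len(rows)} rows failed validation")

    if action is Action.STOP:
        raise PipelineHalted(log[-1])
    if action is Action.ISOLATE:
        good = [row for row in rows if is_valid(row)]
        return good, bad
    # Action.CONTINUE: keep all rows; the failure is only logged.
    return rows, []


rows = [{"id": 1, "age": 34}, {"id": 2, "age": -5}]
log = []
good, quarantined = respond_to_failures(
    rows, lambda row: row["age"] >= 0, Action.ISOLATE, log
)
```

Which action belongs to which check is a per-team decision. One plausible setup is to map “warning” level checks to `CONTINUE` or `ISOLATE` and “critical” ones to `STOP`.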
While it is absolutely possible for data validation results to be simply written to some form of log, we assume that at least some of your tests will be critical enough to require alerting. Some things to consider here are:
- Which errors need alerting, and which ones can be simply logged as a warning? Make sure to choose the correct level of severity for your alerts and only notify stakeholders when it’s absolutely necessary in order to avoid alert fatigue.
- Which medium do you choose for the alerts? Are you sending messages to a busy Slack channel or someone’s email inbox where they might go unnoticed? Do critical alerts get mixed in with daily status reports that might be less relevant to look at? Using a tool such as PagerDuty allows you to fine-tune your alerts to match the level of severity and responsiveness required.
- What is the timeliness of alerts? Do alerts get sent out at a fixed time, or do they just show up at some point during the day? This matters most when the alerting mechanism itself fails: if alerts arrive at unpredictable times, would anyone notice that one is missing?
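As a rough sketch of the first and last points, the snippet below routes alerts by severity and checks whether the validation job itself has gone quiet. The routing table, the destination names, and the 24-hour silence threshold are all assumptions chosen for illustration. In practice the destinations would be whatever your team actually uses.

```python
from datetime import datetime, timedelta

# Hypothetical routing table: severity level -> alert destination.
ROUTES = {"critical": "pager", "warning": "log"}


def route_alert(severity):
    """Pick a destination; unknown severities escalate rather than vanish."""
    return ROUTES.get(severity, "pager")


def heartbeat_missed(last_heartbeat, now, max_silence=timedelta(hours=24)):
    """Return True if the validation job has been silent for too long."""
    return now - last_heartbeat > max_silence


destination = route_alert("warning")
silent = heartbeat_missed(datetime(2024, 1, 1, 6, 0), datetime(2024, 1, 2, 9, 0))
```

The heartbeat check is there for the case where the alerting itself breaks: a scheduled “all tests ran” signal that fails to arrive is an alert in its own right.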
Now that your alerting is nicely set up, you’re onto the next hurdle: Who will actually see and respond to those notifications? Some factors to take into account are:
- Who gets notified and when? Upstream data producers, downstream data consumers, the team that owns the data pipelines, anyone else? Make sure you have a clear map of who touches your data and who needs to know if there are any issues.
- Who is actually responsible for acknowledging and investigating the alert? This is probably one of the most crucial factors to consider when setting up data testing: Someone actually needs to own the response. This might not always be the same person or team for all types of tests, but you’d better have a clear plan in order to avoid issues going unnoticed or ignored, which in turn can cause frustration with stakeholders. I’m not saying you need an on-call rotation, but maybe… maybe, you need an on-call rotation. Having said that, please see the previous paragraph on fine-tuning the severity of your alerts: On-call does not necessarily mean getting a PagerDuty call in the middle of the night. It just means that someone knows they’re responsible for those alerts, and their team and stakeholders know who is responsible.
- Are your notifications clear enough for your stakeholders to know what they imply? In particular, do your data consumers know how to interpret an alert and know what steps to take to get more information about the problem or a potential resolution? (Hint: Having a clear point of contact, such as an on-call engineer, often helps with this, too!)
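One lightweight way to encode ownership is a small map from test suite to owner that every alert message draws on. The sketch below is only an illustration: the `OWNERS` map, the team names, and the message wording are all made up.

```python
# Hypothetical ownership map: test suite -> responsible owner and who to inform.
OWNERS = {
    "ingestion": {"owner": "data-platform-oncall", "notify": ["analytics-team"]},
}
DEFAULT_ENTRY = {"owner": "data-platform-oncall", "notify": []}


def build_alert(suite, check, detail):
    """Build an alert that names the failing check, the owner, and a next step."""
    entry = OWNERS.get(suite, DEFAULT_ENTRY)
    return {
        "owner": entry["owner"],
        "notify": list(entry["notify"]),
        "message": (
            f"[{suite}] check '{check}' failed: {detail}. "
            f"Contact {entry['owner']} for status and next steps."
        ),
    }


alert = build_alert("ingestion", "row_count", "0 rows loaded")
```

The fallback entry means an alert from an unmapped suite still lands with someone instead of with nobody.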
While it’s easy to jump right into responding to a test failure and figure out what’s going on, you should probably stop for a moment to think about who else needs to know. Most importantly, in most cases you’ll want to let your data consumers know that “something is up with the data” before they notice. Of course, this is not specific to data pipelines, but it’s often harder for downstream data consumers to see that data is “off” compared to, say, a web app being down or buggy. Stakeholders could be notified either through automated alerting or through a playbook that includes notifying the right people or teams depending on the level of severity of your alerts. You’ll also want to keep an open line of communication with your stakeholders to give them updates on the issue resolution process and be available to answer any questions, or, if (and only if) absolutely necessary, make some quick fixes in case there are some urgent data needs.
At a high level, we think of root causes for data test failures as belonging to one of the following categories:
- The data is actually correct, but our tests need to be adjusted. This can happen, for example, when there are unusual, but correct, outliers.
- The data is indeed “broken”, but it can be fixed. A straightforward example for this is incorrect formatting of dates or phone numbers.
- The data is indeed corrupted, and it can’t be fixed, for example, when it is missing values.
One very common source of data issues at the data loading or ingestion stage is changes that are mostly out of the control of the data team. In my time working with third-party healthcare data, I’ve seen a variety of data problems that arose seemingly out of nowhere. Some common examples include data not being up-to-date due to delayed data deliveries, table properties such as column names and types changing unexpectedly, or values and ranges deviating from what’s expected due to changes in how the data is generated.
Another major cause of data ingestion issues is problems with the actual ingestion runs or orchestration, which often manifest themselves as “stale data”. This can happen when processes hang, crash, or get backed up due to long runtimes.
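Of the ingestion problems described above, unexpected changes to column names and types are among the easiest to check for explicitly. Here is a minimal sketch that compares an expected schema to the one that actually arrived. It assumes schemas are plain `{column: type_name}` dictionaries. The `schema_drift` function and the example columns are hypothetical.

```python
def schema_drift(expected, actual):
    """Compare two {column: type} mappings and report the differences."""
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "type_changed": sorted(col for col in shared if expected[col] != actual[col]),
    }


expected = {"id": "int", "dob": "date", "zip": "str"}
actual = {"id": "int", "dob": "str", "postal_code": "str"}
drift = schema_drift(expected, actual)
```

Here a renamed column shows up as one “missing” and one “unexpected” entry, which is usually enough of a hint to start a conversation with the data producer.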
Now, how do you approach identifying the root cause of data ingestion issues? The key here is to be methodical about
- Identifying the exact issue that’s actually happening and
- Identifying what causes the issue.
Regarding the former, my recommendation is to not take problems and test failures at face value. For example, a test for NULL values in a column could fail because some rows have actual NULL values - or because that column no longer exists. Make sure you look at all failures and identify what exactly the problem is. Once the problem is clear, it’s time to put on your detective hat and start investigating what could have caused it. Of course we can’t list all potential causes here, but some common ones you might want to check include:
- Recent changes to ingestion code (ask your teammates or go through your version control log)
- Crashed processes or interrupted connections (log files are usually helpful)
- Delays in data delivery (check if all your source data made it to where it’s ingested from in time)
- Upstream data changes (check in the source data and confirm with the data producers whether this was intentional or not)
And finally, while data ingestion failures are often outside of our control, test failures on the transformed data are usually caused by changes to the transformation code. One way to counteract these kinds of unexpected side effects is to enable data pipeline testing as part of your development process and CI/CD processes. Enabling engineers and data scientists to automatically test their code, e.g. against a golden data set, will make it less likely for unwanted side effects to actually go into production.
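A golden data set test can be as small as a fixed input, a known-good output, and one assertion that runs in CI. In the sketch below, `transform` is a made-up stand-in for your real transformation code, and the golden rows are invented for illustration.

```python
def transform(rows):
    """Stand-in transformation: drop rows without an id and normalize names."""
    return [
        {"id": row["id"], "name": row["name"].strip().title()}
        for row in rows
        if row.get("id") is not None
    ]


# Small, hand-checked input and the output we expect for it.
GOLDEN_INPUT = [
    {"id": 1, "name": "  ada lovelace "},
    {"id": None, "name": "ghost"},
    {"id": 2, "name": "GRACE HOPPER"},
]
GOLDEN_OUTPUT = [
    {"id": 1, "name": "Ada Lovelace"},
    {"id": 2, "name": "Grace Hopper"},
]


def test_transform_matches_golden():
    """Fails when a code change alters the output for the golden input."""
    assert transform(GOLDEN_INPUT) == GOLDEN_OUTPUT
```

When this test fails after a code change, the change either introduced a bug or made an intentional change that requires updating the golden output. Either way, someone looks at it before it reaches production.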
Now... how do I fix this? Of course, there is no single approach to fixing data issues, as the fix heavily depends on the actual cause of it - duh. Going back to our framework of the three types of root causes for test failures we laid out in the previous paragraph, we can consider the following three categories of “fixes” to make your tests go green again:
- If you determine that the data is indeed correct but your tests failed, you need to adjust your tests in order to take into account this new knowledge.
- If the data is fixable, some of the potential resolutions include re-running your pipelines, potentially with increased robustness towards disruptions such as connection timeouts or resource constraints, or fixing your pipeline code and ideally adding some mechanism to allow engineers to test their code to prevent the same issue from happening again.
- If the data is broken beyond your control, you might have to connect with the data producers to re-issue the data, if that’s at all possible. However, there may also be situations in which you need to isolate the “broken” records, data sets, or partitions, until the issue is resolved, or perhaps for good. Especially when you’re dealing with third party data, it sometimes happens that data is deleted, modified, or no longer updated, to the point where it’s simply no longer suitable for your use case.
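For the “re-run with increased robustness” case, one common pattern is to retry transient failures with exponential backoff. The sketch below assumes that connection drops and timeouts surface as Python’s built-in `ConnectionError` and `TimeoutError`. The `run_with_retries` helper and its defaults are hypothetical, so adjust the exception types to whatever your client library actually raises.

```python
import time


def run_with_retries(
    task,
    attempts=3,
    base_delay=1.0,
    retry_on=(ConnectionError, TimeoutError),
    sleep=time.sleep,
):
    """Run task(), retrying transient errors with exponentially growing delays."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except retry_on:
            if attempt == attempts:
                raise  # out of retries: surface the error to alerting
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the retry logic itself can be tested without actually waiting.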
Ha! You wish! Not to ruin your day here, but you might also want to consider that your data tests pass because you’re simply not testing for the right thing. And trust me, given that it’s almost impossible to write data tests for every single possible data problem before you encounter it the first time, you’ll likely be missing some cases, whether that’s small and very rare edge cases, or something glaringly obvious. I am happy to admit that I once managed a daily data ingestion pipeline that would alert if record counts dropped significantly from one day to the next, since that was usually our biggest concern. Little did I know that a bug in our pipeline would accidentally double the record counts, which, besides some “hmm, those pipelines are running very slow today” comments, aroused shockingly little suspicion - until a human actually looked at the resulting dashboards and noticed that our user count had skyrocketed that day.
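In hindsight, the fix for that particular blind spot is small: check the day-over-day change in both directions. A minimal sketch, where the `row_count_anomaly` function and the 30% threshold are arbitrary assumptions:

```python
def row_count_anomaly(previous, current, max_change=0.3):
    """Return 'drop', 'spike', or None for the relative day-over-day change."""
    if previous == 0:
        # No baseline to compare against; any rows at all count as a spike.
        return "spike" if current > 0 else None
    change = (current - previous) / previous
    if change < -max_change:
        return "drop"
    if change > max_change:
        return "spike"
    return None


doubled = row_count_anomaly(previous=1000, current=2000)
```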
So what do you do to make your tests more robust against these “unknown unknowns”? Well, to be honest, this is a yet-to-be-solved problem for us, too, but here are some ideas:
- Use an automated profiler to generate data tests in order to increase test coverage in areas that might not be totally obvious to you. For example, you might not even consider testing for the mean of a numeric column, but an automatically generated test could make your data more robust against unexpected shifts that are not caught by simply asserting the min and max of that column. One option to consider is putting these “secondary” tests into a separate test suite and reducing the alerting level, so you only get notified about actual meaningful changes.
- Make sure to socialize your data tests within the team and do code reviews of the tests whenever they are added or modified, just like you would with the actual pipeline code. This will make it easier to surface all the assumptions the team working on the pipeline has about the data and highlight any shortcomings in the tests.
- Do manual spot checks on your data, possibly also with the help of a profiler. Automated tests are great, but I would claim that familiarity with data is always an important factor in how quickly a team can spot when something is “off”, even when there is no test in place. One last step of your data quality strategy could be to implement a periodic “audit” of your data assets to ensure things still look the way they should and that tests are complete and accurate (and actually run).
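To illustrate the profiler idea from the first bullet, here is a toy sketch that derives checks from a reference sample and then applies them to a new batch. This is not how any particular profiler works. The function names, and the tolerance of half a standard deviation for the mean, are assumptions picked so the example is easy to follow.

```python
from statistics import mean, stdev


def profile_column(values):
    """Summarize a reference sample of a numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "stdev": stdev(values),
    }


def check_against_profile(values, profile, mean_tolerance=0.5):
    """Return the names of the profile-derived checks that a new batch fails."""
    failures = []
    if min(values) < profile["min"] or max(values) > profile["max"]:
        failures.append("range")
    # Flag a shift of the mean by more than mean_tolerance reference stdevs.
    if abs(mean(values) - profile["mean"]) > mean_tolerance * profile["stdev"]:
        failures.append("mean")
    return failures


reference = [10, 20, 30, 40, 50]
profile = profile_column(reference)
shifted_batch = [10, 50, 50, 50, 50]  # same min and max, very different mean
failures = check_against_profile(shifted_batch, profile)
```

The shifted batch stays within the observed min and max, so only the mean check catches it. That is exactly the kind of change a hand-written range test would miss.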
We really hope this post has given you a good idea of the different steps to consider when you’re implementing data validation for your pipelines. Keep in mind that developing and running tests in production is only one aspect of a data quality strategy. You’ll also need to factor in things like alerting, ownership of response, communication with stakeholders, root cause analysis, and issue resolution, which can take a considerable amount of time and effort if you want to do it well.
If you want some more concrete examples, check out our case studies on how some users of Great Expectations, such as Komodo Health, Calm, and Avanade integrate Great Expectations into their data workflows.