What is Data Lineage and How Can It Ensure Data Quality?

#database #architecture

Introduction

Are you spending too much time tracking down bugs in your C-level dashboards? Are different teams struggling to align on what data is needed throughout the organization? Or are you struggling to gauge the impact of a potential migration?

Data lineage could be the answer you need for data quality issues. By improving data traceability and visibility, a data lineage system can improve data quality across your whole data stack and simplify the task of communicating about the data that your organization depends on.

Hold on though—what exactly is data lineage?

What is Data Lineage?

Data lineage is a representation of the flow of data through different systems and transformations. In a modern data stack, data is not stored solely in application databases; this data flows from one application to another and from application databases to data warehouses, where it gets transformed and eventually consumed by any number of reporting tools and other downstream applications.
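One way to picture this representation is as a directed graph in which each node is a column and each edge is a transformation. The sketch below assumes that model; every table and column name in it is hypothetical and only there for illustration.

```python
from collections import defaultdict

# Each edge reads "source column -> derived column", labeled with the
# transformation that produced it. All names are hypothetical.
EDGES = [
    ("app_db.orders.amount", "warehouse.fct_orders.amount_usd", "currency conversion"),
    ("app_db.orders.created_at", "warehouse.fct_orders.order_date", "cast to date"),
    ("warehouse.fct_orders.amount_usd", "dashboard.revenue.total_revenue", "sum by month"),
    ("warehouse.fct_orders.order_date", "dashboard.revenue.month", "truncate to month"),
]


def build_upstream_index(edges):
    """Map each column to the (source, transformation) pairs that feed it."""
    upstream = defaultdict(list)
    for source, target, transformation in edges:
        upstream[target].append((source, transformation))
    return upstream


def trace_to_origin(column, upstream):
    """Walk upstream from a column; return hops as (source, transformation, target)."""
    hops = []
    stack = [column]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for source, transformation in upstream.get(current, []):
            hops.append((source, transformation, current))
            stack.append(source)
    return hops


upstream = build_upstream_index(EDGES)
for source, transformation, target in trace_to_origin(
    "dashboard.revenue.total_revenue", upstream
):
    print(f"{target} <- {source} ({transformation})")
```

Starting from a dashboard figure, the walk lists each hop back to the application database, which is the question a reporting team usually needs answered.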

This flow of data allows each system to access data in a format that makes sense for it. The source applications can optimize towards improving the performance of read-write transactions. At the same time, reporting clients can access denormalized data which is convenient for querying.

Unfortunately, this convenience comes at the cost of traceability and visibility. Once the data leaves the source database and undergoes any number of transformations, an additional layer has been added which can obscure the underlying data. Reporting teams often struggle to understand where their data comes from or to determine the right data to use for a given report. When they ask the application team, the team might tell them that the data doesn’t exist because, after going through the transformation process, the terms used to refer to a piece of data have changed.

Additionally, solving any bugs or problems takes longer and requires the involvement of three teams: the reporting team, the data warehouse team, and the application team. Typically, the burden of solving the issue falls onto the data team, who then need to dig through version control to try and understand why the problem arose in the first place. This slows down the development of new reports as well.

Data lineage solves these issues. Let’s discuss how.

Why use Data Lineage?

By providing a clear representation of the flow of data, a data lineage system essentially allows you to have your cake and eat it too. You can have both the separation of roles and the performance of a data warehouse while still having clear data understanding and traceability across all your systems and teams.

Clear data understanding and traceability allow you to trace important data across the system. For example, this can allow you to verify that no personally identifiable information (PII) is leaving the application systems and being consumed where it should not be. It also allows you to see what data is frequently consumed downstream, giving visibility into the impact of any potential changes or migrations. Similarly, you can identify any unused information, allowing simple cleanup of unused tables or columns.
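To make the PII and cleanup checks concrete, here is a minimal sketch that assumes lineage is already available as a list of column-to-column edges. The column names, the PII tags, and the notion of an "anonymizing edge" (for example, hashing an email) are all assumptions made for this example.

```python
from collections import defaultdict, deque

# Hypothetical column-level edges: (source column, downstream column).
EDGES = [
    ("app.users.email", "warehouse.dim_users.email_hash"),
    ("app.users.email", "warehouse.stg_users.email"),
    ("warehouse.stg_users.email", "reports.marketing.contact_email"),
    ("warehouse.dim_users.email_hash", "reports.marketing.unique_users"),
    ("app.users.signup_date", "warehouse.dim_users.signup_date"),
    ("warehouse.dim_users.signup_date", "reports.growth.signup_month"),
]
# Columns tagged as PII at the source, and edges assumed to anonymize the data.
PII_SOURCES = {"app.users.email"}
ANONYMIZING_EDGES = {("app.users.email", "warehouse.dim_users.email_hash")}
# Every column that exists in the warehouse, including ones nothing reads.
WAREHOUSE_COLUMNS = {
    "warehouse.dim_users.email_hash",
    "warehouse.stg_users.email",
    "warehouse.dim_users.signup_date",
    "warehouse.dim_users.legacy_flag",
}


def pii_exposure(pii_sources, edges, anonymizing_edges):
    """Columns reachable from a PII source without crossing an anonymizing edge."""
    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)
    exposed = set()
    queue = deque(pii_sources)
    while queue:
        column = queue.popleft()
        for target in downstream.get(column, []):
            if (column, target) in anonymizing_edges or target in exposed:
                continue
            exposed.add(target)
            queue.append(target)
    return exposed


def unused_columns(warehouse_columns, edges):
    """Warehouse columns that no downstream column reads from."""
    consumed = {source for source, _ in edges}
    return warehouse_columns - consumed


print("PII reaches:", sorted(pii_exposure(PII_SOURCES, EDGES, ANONYMIZING_EDGES)))
print("Unused:", sorted(unused_columns(WAREHOUSE_COLUMNS, EDGES)))
```

In this made-up graph the raw email reaches a marketing report through the staging table, while the hashed copy does not count as exposure, and `legacy_flag` shows up as a cleanup candidate.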

By increasing data understanding, data lineage systems reduce incident response times and improve team communication. Instead of confused discussions about where a piece of data in a report comes from, the data lineage system makes it clear to all parties where the data comes from and how it is consumed. This speeds up both resolution of any errors and new development.

Now that we know why data lineage is critical for the modern data stack, let’s look at the various types of data lineage systems.

Types of Data Lineage

There are two main categories of data lineage systems: active and passive.

An active data lineage system is “active” because you must create it yourself. This is done by programming the relevant source and transformation information into the system or tagging your data with the appropriate metadata. One example of an active system is Apache Atlas. A properly configured active data lineage system can provide traceability for your data to a very fine degree of detail. However, in order to gain those benefits, constant updating and maintenance are required. This adds complexity to your overall data infrastructure and can be time-consuming.
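As a rough illustration of what "programming the lineage in yourself" can look like, the sketch below uses a hypothetical decorator and an in-memory registry. It is not the Apache Atlas API; the names `declares_lineage` and `LINEAGE_REGISTRY`, and the columns listed, are invented for this example.

```python
# A hypothetical in-process registry. A real active system would persist this
# metadata in a dedicated service rather than a Python list.
LINEAGE_REGISTRY = []


def declares_lineage(inputs, outputs):
    """Decorator that records, by hand, what a transformation reads and writes."""
    def decorator(func):
        LINEAGE_REGISTRY.append(
            {"job": func.__name__, "inputs": list(inputs), "outputs": list(outputs)}
        )
        return func  # the function itself is left unchanged
    return decorator


@declares_lineage(
    inputs=["app_db.orders.amount", "app_db.orders.currency"],
    outputs=["warehouse.fct_orders.amount_usd"],
)
def convert_to_usd(amount, currency, rates):
    # The declaration above is maintained manually: if this function starts
    # reading another column, someone has to remember to update it.
    return amount * rates[currency]


for entry in LINEAGE_REGISTRY:
    print(entry["job"], entry["inputs"], "->", entry["outputs"])
```

The detail is as fine as you care to write down, but nothing checks that the declaration still matches the code, which is where the maintenance cost comes from.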

In contrast, a passive data lineage system attempts to understand your data on its own. Some passive systems look at the data coming from the data warehouse. Through pattern recognition, a passive system attempts to recognize where that data is coming from and how it is being transformed. While this can work well for simpler data sets and transformations, it is inexact and can generate inaccurate results.
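The following sketch shows one simple form such pattern recognition could take: comparing the distinct values of columns and guessing that a warehouse column was copied from the source column it overlaps with most. Value overlap (Jaccard similarity) is a stand-in chosen for this example, not a description of how any particular product works, and the column names and data are made up.

```python
def jaccard(a, b):
    """Share of distinct values two columns have in common."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def guess_sources(source_columns, warehouse_columns, threshold=0.8):
    """Guess, for each warehouse column, which source column it was copied from."""
    guesses = {}
    for target_name, target_values in warehouse_columns.items():
        best_name, best_score = None, 0.0
        for source_name, source_values in source_columns.items():
            score = jaccard(source_values, target_values)
            if score > best_score:  # ties keep the first candidate seen
                best_name, best_score = source_name, score
        if best_score >= threshold:
            guesses[target_name] = (best_name, best_score)
    return guesses


SOURCE = {
    "app.users.id": [1, 2, 3, 4, 5],
    "app.orders.quantity": [1, 2, 3, 4, 5],  # same values, unrelated meaning
    "app.users.country": ["DE", "FR", "US"],
}
WAREHOUSE = {
    "dw.dim_users.user_id": [1, 2, 3, 4, 5],
    "dw.dim_users.country_code": ["DE", "FR", "US"],
    "dw.dim_users.country_name": ["Germany", "France", "United States"],
}

for target, (source, score) in guess_sources(SOURCE, WAREHOUSE).items():
    print(f"{target} probably comes from {source} (overlap {score:.2f})")
```

The example also shows why this approach is inexact: `user_id` matches two unrelated source columns equally well and the tie is broken arbitrarily, while `country_name` is missed entirely because the transformation changed its values.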

A parsing-based system is another kind of passive data lineage system which generates lineage data by reverse-engineering your data warehouse. Rather than entering the lineage data manually (active systems) or guessing based on data patterns (pattern recognition), a parsing-based data lineage system can see exactly where the data came from and how it is being consumed. Datafold is an example of this type of system. Datafold analyzes all the SQL code in your data warehouse and generates column-level lineage graphs. This is significantly more detailed than table-level lineage and allows you to see exactly which column a given piece of data is sourced from and where it is consumed. This level of detail leads to improved outage response time, allows for faster troubleshooting, and decreases the frequency of breaking changes making it to production.

With numerous data warehouse integrations, Datafold is plug-and-play for many setups, and the generated lineage data is also accessible through the Datafold API. As long as it supports your data warehouse and related systems, a parsing-based data lineage system is the easy choice from an implementation and maintenance standpoint.
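To give a feel for the parsing-based idea, here is a deliberately tiny sketch that reads one `CREATE TABLE ... AS SELECT` statement and derives column-level lineage from it. It is a toy built on regular expressions and says nothing about how Datafold is implemented; the SQL, the table names, and the short list of ignored keywords are assumptions for this example, and a real tool would rely on a full SQL parser.

```python
import re

STATEMENT = re.compile(
    r"CREATE\s+(?:TABLE|VIEW)\s+(?P<target>[\w.]+)\s+AS\s+"
    r"SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<source>[\w.]+)",
    re.IGNORECASE | re.DOTALL,
)
# Words that can appear in an expression but are not column names.
NON_COLUMNS = {"sum", "count", "lower", "upper", "round"}


def column_lineage(sql):
    """Return {target column: sorted source columns} for one CREATE ... AS SELECT.

    Toy parser: one source table, no joins, no subqueries, and no commas
    inside function calls.
    """
    match = STATEMENT.search(sql)
    if match is None:
        raise ValueError("unsupported statement")
    target, source = match["target"], match["source"]
    lineage = {}
    for item in match["select"].split(","):
        parts = re.split(r"\s+AS\s+", item.strip(), flags=re.IGNORECASE)
        expression = parts[0]
        alias = parts[-1]  # a bare column is its own alias
        identifiers = re.findall(r"[A-Za-z_]\w*", expression)
        lineage[f"{target}.{alias}"] = sorted(
            {f"{source}.{name}" for name in identifiers if name.lower() not in NON_COLUMNS}
        )
    return lineage


SQL = """
CREATE TABLE dw.fct_orders AS
SELECT
    id,
    amount * exchange_rate AS amount_usd,
    LOWER(customer_email) AS customer_email
FROM app.orders
"""

for target_column, source_columns in column_lineage(SQL).items():
    print(target_column, "<-", source_columns)
```

Because the lineage is read from the code that actually builds the table, it needs no manual upkeep and involves no guessing, and it resolves `amount_usd` to the two columns it is computed from rather than just to the `app.orders` table.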

This is all great, but what difference does data lineage make to my day-to-day? Let’s look at that.

How can Data Lineage Ensure Day-to-day Data Quality?

The improved visibility and traceability from a data lineage system have (at least!) three clear effects on your operational day-to-day.

First, it improves your team response time. Investigating the cause of an error in a report no longer requires hours and the coordination of several separate teams. With full visibility into the flow of data across your entire data stack, errors can be investigated and resolved in record time.

Second, it allows for the creation and maintenance of a common data vocabulary. When the reporting team talks about views, it is clear to the application team what that means and where that data comes from. Similarly, the application team can now see exactly what data is being aggregated for the dashboard that informs company decisions and outlook. Over time, discrepancies in terminology can be mitigated or removed, allowing for smoother communication across the organization.

Finally, the data lineage system allows teams to easily and effectively anticipate the effects of any potential changes or migrations. Data schema changes and migrations can be planned out with certainty. Full traceability makes it easy to understand the downstream impact of any changes and to notify the appropriate parties.
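A change-impact check of this kind can be sketched as a breadth-first walk over lineage edges, followed by a lookup of who owns each affected column. The edges, the team names, and the convention that the first part of a column name identifies its system are all hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical column-level edges: (source column, downstream column).
EDGES = [
    ("app.orders.amount", "dw.fct_orders.amount_usd"),
    ("dw.fct_orders.amount_usd", "dw.agg_revenue.monthly_revenue"),
    ("dw.agg_revenue.monthly_revenue", "bi.exec_dashboard.revenue"),
    ("dw.fct_orders.amount_usd", "bi.finance_report.order_value"),
    ("app.orders.status", "dw.fct_orders.status"),
]
# Hypothetical ownership map keyed by the system prefix of each column.
OWNERS = {
    "app": "application-team",
    "dw": "data-warehouse-team",
    "bi": "reporting-team",
}


def impacted_columns(changed_column, edges):
    """Every column that directly or indirectly reads the changed one."""
    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)
    impacted = []
    seen = {changed_column}
    queue = deque([changed_column])
    while queue:
        column = queue.popleft()
        for target in downstream.get(column, []):
            if target not in seen:
                seen.add(target)
                impacted.append(target)
                queue.append(target)
    return impacted


def teams_to_notify(changed_column, edges, owners):
    """Group impacted columns by the team that owns their system."""
    teams = defaultdict(list)
    for column in impacted_columns(changed_column, edges):
        teams[owners[column.split(".")[0]]].append(column)
    return dict(teams)


for team, columns in teams_to_notify("app.orders.amount", EDGES, OWNERS).items():
    print(f"notify {team}: {columns}")
```

Run before a schema change, a check like this turns "who might this break?" into a list of columns and teams instead of a round of guesswork.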

Wrap-up

In this article, we covered the basics of what data lineage is, why you might want to use it, the different types of data lineage systems, and how data lineage can improve your data quality each day. The addition of a data lineage system to your data stack can increase transparency and prevent headaches for your entire organization.