John Vester

When Big Data Goes Bad: Rehabilitating Data Quality

We live in a data-driven world.

In the last ten years, the term Big Data came to the forefront of technology—despite the fact that the term was popularized by John Mashey over twenty years ago. The Big Data quest has prompted corporations to employ teams that use mathematical analysis and inductive statistics to reveal relationships and dependencies. The mission for this subset of Big Data technologists is to use data to predict outcomes and behaviors, leading to a corporate advantage.

In order to leverage data in this way, the data itself must be sound and reliable. Meaning: attempting to make decisions based upon bad data is actually worse than making a decision with absolutely no data.

“Good business decisions cannot be made with bad data.”

  • Uber Engineering

In this article, I reflect on a lesson I learned when a former employer attempted to leverage data that it later realized was bad. Based on that lesson, we'll fast-forward to modern engineering approaches that maintain data quality as part of the development lifecycle.

Reflecting Back on the Real Estate Industry

Before Big Data, there was an effort to employ data warehouse (DW) and business intelligence (BI) technologies to gain insight into the state of a corporation’s business. Even before that, information technologists were often reinventing the wheel (in silos) in hopes of using custom code to yield a competitive advantage.

It was at this time that I found myself working with a leader in the real estate industry. While the company was considered the frontrunner of its industry segment, maintaining distance from competitors had become a challenge.

One of the company's areas of interest became the amount of time needed to define, justify, and protect the rent charged to tenants. Instead of charging a simple base rate per square foot, additional data factors played a role in arriving at a price deemed fair by both parties.

Consider these five data points as an example:

  1. Quality of the property where the space exists
  2. Location of the space within the property
  3. Proximity to other tenants at the property
  4. Tenants’ existing relationship with the real estate company
  5. Stability of the tenant considering a new lease

The leasing team, accessing different systems, analyzed each of these factors to arrive at an answer.

Providing the Ideal Rent Solution

The IT division took on a self-funded initiative to solve this problem. The goal was to introduce an application—let’s call it Ideal Rent—which would ask the user for a series of inputs, similar to the following:

  • Property and location(s) of the desired space
  • Start and end date of the proposed lease
  • Tenant name and information about the usage

Using this information, the system would gather the relevant data and predict a rate that could be justified by factors providing equal value to the property and the tenant. At a high level, the Ideal Rent solution utilized the following design:

Ideal Rent Solution
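
To make the flow concrete, here is a minimal sketch in Python of how inputs like these might feed a rate calculation. Every name, weight, and formula below is hypothetical and exists purely for illustration; the real application's logic was far more involved.

```python
from dataclasses import dataclass

@dataclass
class LeaseRequest:
    """Hypothetical shape of the Ideal Rent inputs described above."""
    property_id: str
    space_id: str
    lease_start: str        # ISO date, e.g. "2005-01-01"
    lease_end: str
    tenant_name: str
    space_usage: str        # how the tenant intends to use the space

# Hypothetical weighting of the five factors listed earlier.
FACTOR_WEIGHTS = {
    "property_quality": 0.30,
    "space_location": 0.25,
    "tenant_proximity": 0.15,
    "existing_relationship": 0.15,
    "tenant_stability": 0.15,
}

def ideal_rent(base_rate_per_sq_ft: float, square_feet: int,
               factor_scores: dict[str, float]) -> float:
    """Scale a base rate by a weighted blend of normalized factor scores (0.0 to 1.0)."""
    blended = sum(FACTOR_WEIGHTS[name] * factor_scores.get(name, 0.0)
                  for name in FACTOR_WEIGHTS)
    # A blended score of 0.5 keeps the base rate; higher scores push the rate up.
    multiplier = 0.75 + 0.5 * blended
    return round(base_rate_per_sq_ft * square_feet * multiplier, 2)

print(ideal_rent(32.0, 1_200, {
    "property_quality": 0.9, "space_location": 0.8, "tenant_proximity": 0.6,
    "existing_relationship": 0.7, "tenant_stability": 0.8,
}))   # -> 43872.0
```

The important detail is that every factor score is derived from upstream data, so the quality of that data drives the quality of the recommendation.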

The effort to complete the logic behind the scenes was quite involved because data integration products were still in the Technology Trigger phase of the Gartner hype cycle.

Presenting the Ideal Rent Solution

When the leasing leadership reviewed the application for the first time, they were skeptical that a simple input form could produce a result that had formerly required a great deal of human-based analysis. Once they saw the application in action, the leasing teams quickly noticed that the resulting recommendations rested on assumptions which were not valid. Basically, the technology team had assumed they understood the leasing process better than its owners did.

The system never became the single source for arriving at an optimal, fair rate for a given lease. In fact, two key lessons were realized from this experience:

  1. The leasing team was not fully involved in the effort, which led to a lack of understanding of the data.
  2. The feature teams were not aware of upstream changes that were happening with the data. This impacted the quality of the data and the downstream results of the recommendations provided by the Ideal Rent application.

Data-Driven Decision-Making Requires Quality Data

The primary lesson learned from the leasing example is something I have discussed in prior articles on DZone.com. One of my favorites is “The Secret to a Superior Product Owner,” which I wrote back in 2017. It focuses on a guy named Michael Kinnaird, who is still the best product owner I have worked with during my 30+ years in information technology.

The Uber Engineering quote from earlier provides a summary of the second lesson we learned in the Ideal Rent example.

Just as quality control efforts are put into place to test and validate program code before it reaches the hands of end users, quality control around data is equally important. In the example noted above, changes in the design of the data were not known to the team utilizing that data for their application, which had a negative impact on the results provided.

I recall being surprised by this realization at the time because I felt like the data was sound. I also recognized the irony, as I had worked my entire career dealing with “change” as a primary driver for my feature design and development.

How Data Quality Should Be Done

As I thought back to the timing of the example use case, I realized something: had the Ideal Rent application been released before the show-stopping data changes came to light, the result would have been catastrophic. I can only imagine the impact non-ideal rent values would have had on the corporation's future valuations on Wall Street.

If we had been able to practice data observability and data quality back then the way it is done today, we would have caught our data issues much earlier. That would have saved embarrassment, headaches, and frustration, and it would have prevented a huge exposure to risk.

Recently, I came across Datafold, a data reliability platform that helps companies prevent data incidents. Its Data Diff feature is laser-focused on locating differences in the source data being utilized by applications and processes. The product is even designed to work on the order of billions (not thousands or even millions) of records.

To illustrate the benefit of identifying data quality issues, let's look at three simplified data quality challenges in the real estate industry that might otherwise be difficult to detect:

  1. Adoption of a custom standard industrial classification (SIC) code system
  2. Alterations to the tier structure of properties
  3. Revisions to the space quality rating structure

In each case, if the consumers of this data were unaware of the underlying change, the result would be a negative impact on data quality.

Adoption of Custom SIC Codes

The standard industrial classification (SIC) code system was established to give each industry a four-digit code. As an example, if you decided to open up a bicycle shop, it would fall under the 5941 SIC code (sporting goods stores and bicycle shops).

To simplify the example use case, consider the challenge where the SIC codes were too broad to reflect the true use of the spaces being occupied. In other words, stores focused on providing different entertainment options (e.g., video stores, music stores, and musical instrument stores) all received the same SIC code.

To address this shortcoming, let’s assume the real estate company took time to introduce additional SIC codes. This would help provide more details about the underlying business that occupied space at the properties.

However, the team attempting to provide an optimized rent suggestion didn't know about this change. As a result, cases where the new custom SIC code was not found fell back to an unknown state, resulting in a sub-par computation. Furthermore, cases where an existing SIC code was repurposed led to unfavorable proposed rent values. As an example, if a custom code reused the standard SIC code for a tire store but actually represented a custom jeweler, the monthly rent value would come in much lower than expected.
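
Here is a simplified sketch of that failure mode, using a hypothetical mapping from SIC codes to rent categories (the codes and categories below are illustrative):

```python
# Hypothetical mapping maintained by the feature team, built before the
# leasing organization introduced its custom SIC codes.
SIC_TO_RENT_CATEGORY = {
    "5941": "sporting_goods",   # sporting goods stores and bicycle shops
    "5944": "jewelry",          # jewelry stores
    "5531": "auto_and_tire",    # auto and home supply stores
}

UNKNOWN_CATEGORY = "unknown"

def rent_category(sic_code: str) -> str:
    """Resolve a tenant's SIC code to a rent category, falling back to 'unknown'."""
    return SIC_TO_RENT_CATEGORY.get(sic_code, UNKNOWN_CATEGORY)

# A custom code ("5944-01", a hypothetical custom jeweler code) was added upstream
# but never mapped here, so it silently degrades to the unknown bucket.
print(rent_category("5944-01"))   # -> unknown

# A repurposed code is worse: the lookup "succeeds" and returns the wrong category,
# and nothing in the pipeline flags the problem.
print(rent_category("5531"))      # -> auto_and_tire, even if 5531 now means custom jeweler
```

Neither case throws an error, which is exactly why the degraded recommendations went unnoticed.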

Alterations to the Tier Structure

The real estate company utilized a tiered structure to help identify the quality of its properties. Basically, Tier 1 was reserved for the properties considered the best. As the tier number increased, the property ranked lower on the list, based upon corporate-wide evaluations.

While the Tier 3 and Tier 4 properties were on the lower end of the spectrum, they were still quite profitable entities. However, the ideal rent for those spaces was lower than the same space at a Tier 1 or Tier 2 property.

Another surprise to the IT team could have occurred when evaluation metadata was introduced at the Tier 1 level. Let’s assume that sub-tiers had to be added in order to answer the question, “Why is this property considered one of our best?” Possible answers may include items like location and proximity, quality of tenants, and financial revenue produced.

The sub-tier would impact the ideal rent recommendation differently when location and proximity were the rationale behind the tier decision; in that case, the appropriate rent was typically closer to that of a Tier 2 or Tier 3 property.

Revisions to Space Quality

Changes to the business rules behind space quality could also impact the computation of ideal rent. Imagine if the original design for space quality rating was on a scale from 1 to 5, where a value of 5 indicated the top of the class. Then, that design was updated to reflect a four-point scale, where 4 is now the maximum value.

Unless the feature team was aware of this decision or fully monitoring production data, they would not realize that the definition had been refactored. This would mean that the space quality aspect of the computation would be off by at least 20%, which would negatively impact the ideal rent being suggested.
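
As a rough sketch of why this silent scale change matters, assume the computation normalizes the rating against what it believes is the maximum value (the names and guard below are hypothetical):

```python
EXPECTED_MAX_RATING = 5   # the scale the feature team originally designed against

def normalized_quality(rating: int, max_rating: int = EXPECTED_MAX_RATING) -> float:
    """Convert a space quality rating into a 0.0-1.0 factor."""
    return rating / max_rating

# After the upstream redesign, the best possible space is rated 4, not 5.
# Without a guard, a top-quality space now contributes 0.8 instead of 1.0:
print(normalized_quality(4))   # -> 0.8, a silent 20% haircut on this factor

def check_rating_scale(observed_ratings: list[int]) -> None:
    """Flag likely scale drift if production ratings never reach (or exceed) the expected max."""
    observed_max = max(observed_ratings)
    if observed_max != EXPECTED_MAX_RATING:
        raise ValueError(
            f"Space quality scale drift: expected max {EXPECTED_MAX_RATING}, "
            f"observed max {observed_max}"
        )

try:
    check_rating_scale([4, 3, 2, 4, 1])   # production sample under the new 1-4 scale
except ValueError as err:
    print(err)   # -> Space quality scale drift: expected max 5, observed max 4
```

A monitoring-style check like this is crude (a quiet stretch where nothing earns top marks would also trip it), but it turns an invisible definition change into a visible alert.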

Adding Data Diff Into the Development Lifecycle

The Ideal Rent application leveraged extract, transform, and load (ETL) services. In other words, it pulled the necessary data from the source systems and transformed it into a format the application recommending the ideal rent could consume. It was at this same level that the changes to the underlying data went unnoticed, leading to a negative impact on the decisions driven by that data.
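
For context, here is a minimal sketch of that kind of ETL step (table and column names are hypothetical). Notice how the transform bakes in assumptions about the source data, which is exactly where unnoticed upstream changes slip through:

```python
import sqlite3

def extract(conn: sqlite3.Connection) -> list[tuple]:
    """Pull raw leasing data from a source system (an in-memory table here)."""
    return conn.execute(
        "SELECT space_id, sic_code, quality_rating, tier FROM source_spaces"
    ).fetchall()

def transform(rows: list[tuple]) -> list[dict]:
    """Reshape rows into the format the rent computation consumes.

    Note the hard-coded assumptions: a 1-5 quality scale and recognizable SIC codes.
    """
    return [
        {
            "space_id": space_id,
            "sic_code": sic_code,
            "quality_factor": quality_rating / 5,
            "tier": tier,
        }
        for space_id, sic_code, quality_rating, tier in rows
    ]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_spaces (space_id TEXT, sic_code TEXT, quality_rating INTEGER, tier INTEGER)"
)
conn.execute("INSERT INTO source_spaces VALUES ('A-101', '5941', 4, 1)")
print(transform(extract(conn)))
# -> [{'space_id': 'A-101', 'sic_code': '5941', 'quality_factor': 0.8, 'tier': 1}]
```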

Introducing Data Diff into the process simply adds one new step to the continuous integration (CI) pipeline. After configuring the data sources related to your integration and adding Datafold to your dbt configuration, the results of a Data Diff test show up as part of your pull request review process.

Data Diff

As a result, all of those participating in the PR process have insight into data quality analysis performed by Datafold.
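
To give a feel for what a data diff surfaces, here is a homegrown, in-memory illustration of the concept; it is not Datafold's API, which operates against warehouse tables at a much larger scale:

```python
def data_diff(before: dict[str, tuple], after: dict[str, tuple]) -> dict[str, list[str]]:
    """Compare two snapshots of a table keyed by primary key."""
    before_keys, after_keys = set(before), set(after)
    return {
        "added":   sorted(after_keys - before_keys),
        "removed": sorted(before_keys - after_keys),
        "changed": sorted(k for k in before_keys & after_keys if before[k] != after[k]),
    }

# Snapshots of a hypothetical spaces table keyed by space_id: (sic_code, quality_rating)
production = {"A-101": ("5941", 4), "A-102": ("5944", 5)}
staging    = {"A-101": ("5941", 4), "A-102": ("5944-01", 4), "B-201": ("5531", 3)}

print(data_diff(production, staging))
# -> {'added': ['B-201'], 'removed': [], 'changed': ['A-102']}
```

The report attached to the pull request answers the same basic question at warehouse scale: what exactly changed in the data, row by row and column by column?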

But Wait, There’s More

At this point, you might be thinking that there’s still a gap here. Data quality steps can’t just be relegated to the CI/CD pipeline when there’s a code change and a pull request. What happens when the Ideal Rent application code hasn’t changed, but the rules behind the source data have?

This is where Datafold’s column-level lineage feature comes in. When the engineering team or the data team are just considering data rule changes, they might ask questions like, “How would the data used in our final calculations be affected if our query took into account values from the column in that table too?” Column-level lineage shows the team how data flows through the waterfall of queries and transformations. Make a change here, see how it will impact your data set there.

The team—whether that’s the data team or the engineering team—would use Datafold’s UI to visualize and understand how upstream changes to their data rules affect their downstream data. This analysis is done separately from the CI/CD pipeline, and separate from code changes.
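
As a back-of-the-napkin illustration of the idea (not Datafold's implementation), column-level lineage can be modeled as a directed graph from source columns to the columns derived from them; a traversal answers the question, "What does this change touch downstream?" The column names below are hypothetical:

```python
from collections import deque

# Hypothetical lineage: each column maps to the columns directly derived from it.
LINEAGE = {
    "source_spaces.quality_rating": ["staging.quality_factor"],
    "staging.quality_factor": ["marts.ideal_rent"],
    "source_spaces.sic_code": ["staging.rent_category"],
    "staging.rent_category": ["marts.ideal_rent"],
    "source_spaces.tier": ["marts.ideal_rent"],
}

def downstream_columns(column: str) -> set[str]:
    """Breadth-first walk of the lineage graph to find every affected downstream column."""
    affected, queue = set(), deque([column])
    while queue:
        for child in LINEAGE.get(queue.popleft(), []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected

# "If the space quality rules change, what does that touch downstream?"
print(sorted(downstream_columns("source_spaces.quality_rating")))
# -> ['marts.ideal_rent', 'staging.quality_factor']
```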

Remember, you must have the ability to find data quality issues without a corresponding code change. After all, the Ideal Rent development environments may not have all of the changes that match the source systems, so there needs to be a safeguard to protect production users who are making data-driven decisions.

This is why maintaining data quality is critical for any applications which rely on data to make informed decisions. Data lineage tools—like this column-level lineage analysis from Datafold—help with that.

Conclusion

Starting in 2021, I have been trying to live by the following mission statement, which I feel can apply to any IT professional:

“Focus your time on delivering features/functionality which extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.”

  • J. Vester

The experience I encountered earlier in my career highlights the importance of data quality. A lack of data quality can have a catastrophic impact on systems used for data-driven decision-making.

Corporations using data to make critical decisions should consider tooling focused on maintaining data quality, and that tooling should be part of the software development lifecycle.

Have a really great day!
