SeattleDataGuy

5 Mistakes That Are Ruining Your Data-Driven Strategy


Companies of all sizes have embraced using data to make decisions. However, according to a 2019 report from Goldman Sachs, it's actually quite difficult for businesses to use data to build a sustainable competitive advantage.

Our team has worked with and for companies across industries. We've seen the good, the bad, and the ugly of data strategy. We've seen teams implement successful data lifecycles, dashboards, machine learning models, and metrics. We've also had to come in and untangle, delete, migrate, and upgrade entire data systems.

Throughout these projects, we've seen several issues pop up repeatedly: a lack of data governance; bad data; overly complex Excel documents; a lack of alignment between data teams and the business; and an overabundance of dashboards, leading to confused decisions.

All of these data issues compound over time and slowly erode a team or company's ability to trust and use their data.

In this article, we'll discuss some of these issues as well as possible solutions your teams can implement to improve your overall data lifecycle.

Having Inconsistent Data and Sources of Truth

With all the various business workflows, it's inevitable that the same data gets entered in multiple places. One team might use Salesforce for one business process, while another might use Workday.

This, in turn, can lead to inconsistent data being entered at each step. Whether this is due to timing or human error isn't the point. Once these inconsistencies start entering a company's various data warehouses, they can wreak havoc on reporting.

Just ask any BI engineer or analyst who's delivered a report to a director, only to find it contradicted another report built on a slightly different slice of the data.

Companies of every size and level of data maturity suffer from this. Our team has worked at multiple companies and consulted across industries, and every one of them faces this issue.

Solution #1: Set up a data-governance strategy

This is usually solved by a data-governance strategy. Data governance involves managing the data's availability, usability, integrity, and security.

How you deploy this data-governance strategy depends on whether you want a tight, centralized data process or decentralized, separate processes whose owners occasionally meet to confirm that their core data models don't overlap.

Data governance is far from a sexy term. It's not data science or machine learning, and yet it's the foundation of your data science and machine learning departments. Not having a solid handle on what your core sources of truth are can lead to a lot of conflicting strategies.
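One lightweight way to make sources of truth explicit is to write them down as a registry that pipelines and reviewers can consult. The sketch below is a minimal, hypothetical example; the entity and system names are assumptions, not a prescription.

```python
# Hypothetical source-of-truth registry: each core entity is mapped to
# the one system allowed to own its records. Entity and system names
# here are illustrative assumptions.
SOURCES_OF_TRUTH = {
    "customer": "salesforce",
    "employee": "workday",
    "invoice": "erp",
}

def owning_system(entity):
    """Return the single system that owns this entity's records."""
    try:
        return SOURCES_OF_TRUTH[entity]
    except KeyError:
        raise ValueError(f"no registered source of truth for {entity!r}")
```

Even a dictionary like this forces the conversation that matters: when two teams both claim "customer," the registry can only name one owner.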

Solution #2: Create a central data team to make decisions quickly

The other strategy, less suitable for large companies that want to move quickly, is to develop a core data team whose only focus is to develop and manage data sets. This strategy works when your company is small, because the data team itself will also be small.

This means when decisions need to be made on sources of truth and data integrity, it'll happen quickly. There's no need to manage multiple teams or to have a centralized meeting every month.

Managing Complex Business Decisions In Excel


Excel and spreadsheets continue to drive billion-dollar decisions in companies across the globe. This reliance on Excel has also led to million- and billion-dollar mistakes at even the smartest companies.

For example, in 2008, Barclays agreed to purchase Lehman Brothers, but spreadsheet errors led it to eat losses on contracts it never intended to buy. The detailed spreadsheet of Lehman assets contained approximately 1,000 rows that needed to be converted into a PDF. The original Excel version, however, had hidden rows containing 179 contracts Barclays didn't want. The notes flagging those unwanted contracts weren't carried over to the PDF, but the hidden rows were. As a result, Barclays had to buy 179 contracts it didn't want.

And in 2012, JP Morgan lost nearly $6 billion due to what came down to mostly Excel errors.

An Excel document is a very versatile data tool that can help teams manage a lot of workflows. However, these documents are also very error-prone because of complex design, human error, and how Excel operates in general.

To avoid these errors, there are a few strategies your team can put into place.

Solution #1: Treat Excel like engineers treat code

If you're using Excel to make large decisions, then you should treat it the same way engineers treat code.

This means there should be Excel reviews and test cases. It may seem like overkill, but Excel is very much like code. In fact, some people even consider it a fourth-generation programming language. That means it's prone to errors caused by logic, process, and the inputs provided.

In turn, it should be treated as such.

Don't just trust that an analyst, regardless of how smart they are, will make a perfect Excel sheet. Brilliant programmers introduce bugs, and so do brilliant analysts.

Thus, a logic review, test cases, and sanity checks should be put into place to reduce these kinds of errors. This all might seem unnecessary until your company loses massive amounts of money due to bad information.
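What does a "test case" for a spreadsheet look like in practice? One approach is to re-implement a sheet's key formula in code and check it against a handful of exported rows. The discount rule and column layout below are hypothetical, just to illustrate the shape of such a check.

```python
# A minimal sketch of a test case for spreadsheet logic: re-implement
# the sheet's formula in code and compare against exported rows.
# The discount rule below is a made-up example, not a real policy.
def expected_discount(order_total):
    """The rule the spreadsheet is supposed to implement."""
    return 0.10 if order_total >= 1_000 else 0.0

# Rows exported from the sheet as (order_total, discount_in_sheet).
exported = [(500.0, 0.0), (1_500.0, 0.10), (999.0, 0.0)]

for total, sheet_value in exported:
    assert sheet_value == expected_discount(total), (
        f"sheet disagrees with the rule for order total {total}"
    )
```

If the sheet's formula drifts, say someone edits a cell range, the assertion fails before the number reaches a decision-maker.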

Solution #2: Automate Excel using Python and SQL

Automate and develop clean data workflows for business processes that are well defined and can be converted into SQL and code.

A lot of reporting requires copying and pasting data from a report or a database table. Instead of copying and pasting, there are methods your team can utilize to develop solutions that automatically provide the data outputs.

This can be done using a combination of SQL, code, and spreadsheets with limited manipulation. Code is still error-prone. However, it can often be written in ways that limit the number of errors and inputs. This isn't the case with Excel: Excel's flexibility is what makes it error-prone.
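As a small sketch of this workflow, the example below keeps the aggregation logic in SQL and writes the result straight to a file the business can open. It uses an in-memory SQLite database with made-up table and column names; in practice you'd point it at your own database and reporting query.

```python
import csv
import sqlite3

# Hypothetical setup: an in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (region TEXT, amount REAL);
    INSERT INTO orders VALUES ('EMEA', 120.0), ('EMEA', 80.0), ('APAC', 50.0);
""")

# The aggregation lives in SQL, not in the spreadsheet, so the logic
# is reviewable and repeatable.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total FROM orders "
    "GROUP BY region ORDER BY region"
).fetchall()

# Write the result directly to a file -- no copying and pasting
# from a database UI into a workbook.
with open("regional_totals.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["region", "total"])
    writer.writerows(rows)
```

Once a report is produced this way, rerunning it is a single command, and every number in the output traces back to one reviewable query.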

Which solution is best depends on the complexity of the process, how repetitive it is, and how large the decisions being made off the data are.

Too Many Dashboards and Not Enough Purpose


Dashboards can be very powerful tools to help distill information and provide insights and guidance.

However, the ease of development and popularization of dashboards has left companies overwhelmed by them. These dashboards don't always align with company priorities; they might not be concise enough to make decisions on, and sometimes they're just filled with vanity metrics.

Just because dashboards are easy to make doesn't mean teams should push them out quickly. Each dashboard should be meaningful and useful.

Dashboards should drive some sort of action as well as inform.

Solution #1: Ask why you're actually building the dashboards

To avoid developing too many dashboards, teams should develop a process for figuring out why they're actually building each one. Often we find that dashboards are built simply because a manager or director asked, without a thought-out plan.

These dashboards end up in the dashboard graveyard. They'll be used once --- or maybe for a few months. Then suddenly they're forgotten. In those cases, an Excel report might've been better.

This can be avoided by stopping and asking why your team is actually building a dashboard.

Trying to Analyze Too Much Data Without Enough Action

One of the more common patterns we've noticed is that most companies, past a certain size and complexity, have more data than they know what to do with.

This can lead to a couple of problems.

Some companies try to pull in all their data from every source at once in a massive data-warehouse project. This is very expensive, time-consuming, and can be discouraging.

Companies will end up spending hundreds of thousands to millions of dollars developing a data warehouse only to realize it'll take two years and twice the budget they thought.

Solution #1: Shoot for small, quick data victories

Instead of trying to bring in all of the data at once, figure out a few key areas your company wants to learn more about.

From there, your teams should develop a plan for how you'll answer those questions. Will it be a dashboard, a report, or something similar?

This will then drive a lot of the first versions of your data warehouse.

Companies should spend a good amount of time planning their data-warehouse design, their main entities, their data governance, and so on. However, if you spend too much time planning and only get the framework up, then eventually business owners will get restless.

Thus, having some little victories along the way can keep people motivated.

Using Low Quality and Incomplete Data

Garbage in, garbage out.

If you've worked in the data industry for a while, you'll be accustomed to hearing this saying.

A more verbose way of saying this: decisions based on bad data are bad decisions, no matter how sound the reasoning behind them.

Putting bad data into your dashboards, reports, and models will only provide invalid results.

We've seen companies spend months, and sometimes years, using data that everyone assumed was right and no one questioned. With a little digging, they'd have found the data was incomplete, inaccurate, and untrustworthy.

Every company has bad data. The only question is: Are they using it to make million-dollar decisions?

Solution #1: Create an automated QA system

Data is error-prone. It's often input by humans, and humans make mistakes.

In turn, your team will need to develop some form of automated-data QA system to manage those errors.

Data, in most systems, is too large to check manually. Thankfully, most of the quality checks a user performs by hand can be written down and placed into an automated QA system. This can save hundreds to thousands of hours company-wide and increase accuracy.

These checks can be straightforward, like making sure every country entered is a real one, or more dynamic, like testing that total spending each month stays within a specific range.

Overall, having these initial column-level and aggregate checks can be very helpful.
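To make the two kinds of checks concrete, here is a minimal sketch of a column-level check plus an aggregate check. The valid-country list, the spend bounds, and the row layout are all assumptions for illustration; a real system would load these from configuration or reference tables.

```python
# Minimal sketch of automated data QA: one column-level check and one
# aggregate check. The reference values below are hypothetical.
VALID_COUNTRIES = {"US", "DE", "JP"}   # assumed reference list
SPEND_RANGE = (10_000, 50_000)         # assumed monthly bounds

def check_rows(rows):
    """Return a list of error messages; an empty list means the data passed."""
    errors = []
    # Column-level check: every country must be a known value.
    for i, row in enumerate(rows):
        if row["country"] not in VALID_COUNTRIES:
            errors.append(f"row {i}: unknown country {row['country']!r}")
    # Aggregate check: total monthly spend must stay within range.
    total = sum(row["spend"] for row in rows)
    lo, hi = SPEND_RANGE
    if not lo <= total <= hi:
        errors.append(f"total spend {total} outside [{lo}, {hi}]")
    return errors
```

Running checks like these on every load, and failing the pipeline when the error list is non-empty, catches bad data before it reaches a dashboard or a model.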

Solution #2: Always use integration tests

When developing dashboards --- especially dashboards with calculated fields, blended data sets, and filters --- teams should create integration tests to make sure the logic inside the dashboard tool didn't change the initial data.

Personally, we feel you should keep as much of the logic as possible in one place, such as the query itself. Since this isn't what usually happens, teams should make sure the data they see in the dashboard matches the data they put in.

There are lots of simple integration tests you can use. For example, perhaps you know the data set should contain 1,000 people. Then have a basic integration check that counts how many people appear in the data set. This is especially important once you start using functionality like data blending in Tableau, which can quietly change your final data if you don't think it through. To avoid those problems, it's best to run some basic integration tests.
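The row-count idea above can be sketched in a few lines: join (or "blend") two tables the way a dashboard might, then assert that the blend still contains exactly the people you started with. The tables and values here are hypothetical stand-ins for your source data and dashboard extract.

```python
import sqlite3

# Hypothetical source data: people and their visits.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER, name TEXT);
    INSERT INTO people VALUES (1, 'Ann'), (2, 'Ben');
    CREATE TABLE visits (person_id INTEGER, day TEXT);
    INSERT INTO visits VALUES (1, 'Mon'), (1, 'Tue'), (2, 'Mon');
""")

# The "blended" data a dashboard would display: people joined to visits.
blended = conn.execute(
    "SELECT p.id, p.name, v.day FROM people p "
    "JOIN visits v ON v.person_id = p.id"
).fetchall()

# Integration test: the blend must still cover every person we started
# with -- a join can silently drop (or duplicate) rows.
distinct_people = {row[0] for row in blended}
(expected,) = conn.execute("SELECT COUNT(*) FROM people").fetchone()
assert len(distinct_people) == expected, "blend dropped people"
```

The same pattern works against a real warehouse: count the entities in the source query, count them again after the dashboard's joins and filters, and alert when the two disagree.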

Conclusion

Using data to make better decisions gives companies a competitive advantage. However, the quality of your data and the robustness of the processes around it play a large role in whether your data strategy succeeds.

Simply creating dashboards, data warehouses, and machine learning models isn't sufficient to make data-driven decisions. Teams need to consider their data lifecycles and the processes used to manage each step. This means creating test cases, clear goals, and processes that can help improve your team's performance and strategy.

No one wants to get bogged down with too many processes and bureaucracy, but having no plan or strategy for your team's data lifecycle will set you up for failure.

