We live in a data world that wants to move fast.
Just shove your data into a data lake and we'll figure it out later.
Create a table for this one dashboard that we will look at only once, and then it will join the hundreds (if not thousands) of other dashboards that are ignored.
All of which can work in the short term but often leaves behind a lot of technical debt. One project I recently worked on dumped all of its raw data into an S3 bucket with no breakdown by source or timing, and it was quite chaotic to grasp what was going on.
All of this is driven by executives who need their data yesterday and don't want to wait, as well as by software and analyst teams who are often driven by very different motivations.
With all this pressure to move fast coming from all sides, an interesting solution I have come across a few times is: let's just get rid of data engineering and governance.
A recent conversation I had with a head of data forced me to pause.
They brought up how several organizations they had worked for had made the decision to cut data engineering. This allowed their analysts and data scientists to use dbt and Cloud Composer to build tables as needed. After all, these end users knew what data they needed and were now empowered to get it.
Of course, the end result for these companies was that they did need to re-invest in data engineering.
But why cut data engineering in the first place?
Costs aside, data engineering is often viewed as a bottleneck.
If you're a software engineer, you're likely getting chased down by data engineers who want you to tell them before you update tables on the application side. Maybe they are even asking you to add in extra layers like data contracts.
Yikes. You don't want that. This means you won't be able to deliver that feature before your next review cycle.
If you're on the data science or analyst side, then you often have to wait for data engineering to find time to pull your data into "core tables". Some of these data engineers might even insist on implementing some level of standardization and data governance.
All of which means slower access to data.
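Those data contracts, for what it's worth, don't have to be heavyweight. Here is a minimal sketch, assuming a hypothetical orders table — the field names and types are mine, not a standard:

```python
# A minimal data contract: the schema a producer promises its downstream
# consumers. Field names and types here are illustrative assumptions.
EXPECTED_SCHEMA = {
    "order_id": str,
    "customer_id": str,
    "amount_cents": int,
}

def contract_violations(row: dict) -> list:
    """Return a list of ways `row` breaks the contract (empty means OK)."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in row:
            errors.append("missing field: %s" % field)
        elif not isinstance(row[field], expected_type):
            errors.append("wrong type for %s" % field)
    return errors
```

A check like this can run on the producer's side before a table change ships — exactly the "extra layer" that feels slow right up until a silent schema change breaks every downstream dashboard.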
There is a reason that data engineering and data governance exist. But if you really do want to get rid of your data engineering team, then here are some points you must consider.
On a few projects I completed earlier this year, I noticed a lack of historical data tracking. This will eventually lead to problems when your management asks:
How much revenue did we have per customer in NYC year over year?
Because all you will be reporting on is the current state of the data and not how it has changed over time.
The classic way to track changes in dimensions and entities is to use slowly changing dimensions (SCDs).
But there are other options. Some companies simply store daily snapshots in separate partitions of a table (not my favorite). The result of all of these approaches is the ability for an end user to track historical data over time, meaning that when an analyst is asked for a year-over-year comparison, they can answer accurately.
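To make the SCD idea concrete, here is a minimal Type 2 sketch in Python — the table and field names are mine. Each row carries a valid_from/valid_to range, and the current row has valid_to set to None:

```python
from datetime import date

def apply_change(history, key, new_attrs, change_date):
    """SCD Type 2: close the current row for `key`, append a new version."""
    for row in history:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = change_date  # old version ends here
    history.append({"key": key, "attrs": new_attrs,
                    "valid_from": change_date, "valid_to": None})

def as_of(history, key, on_date):
    """Return the attributes that were current for `key` on `on_date`."""
    for row in history:
        if (row["key"] == key and row["valid_from"] <= on_date
                and (row["valid_to"] is None or on_date < row["valid_to"])):
            return row["attrs"]
    return None

# A customer moves from NYC to LA; both facts survive.
customers = []
apply_change(customers, "cust_1", {"city": "NYC"}, date(2021, 1, 1))
apply_change(customers, "cust_1", {"city": "LA"}, date(2022, 6, 1))
```

With history kept this way, "revenue per customer in NYC year over year" becomes answerable: you join each fact to the version of the customer that was valid at the time, not to today's row.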
All that said, I would say there is a far more difficult problem that most companies need to deal with.
As pointed out by Bill Inmon earlier this year:
In order to be a data warehouse an organization must integrate data. If you don't do integration of data then you are not a data warehouse...
...But when it comes to integrating data I don't see that Snowflake understands that at all. If anything Snowflake is an embodiment of the Kimball model.
There are many cases where, in order to move fast, a team might load data from each source as-is without thinking about how it integrates with all the other sources.
While working at Facebook, yes, we were spoiled by the free lunches.
But truthfully, as a data engineer, I felt spoiled by how well integrated all the data was. Compare that to when I worked at a hospital and was trying to integrate a finance system with a project management system, where the project ID field was an open text field that allowed users to put multiple project IDs into it (as well as the occasional budget number). Facebook was amazing.
It probably didn't help that the field was called project number, not project ID...
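Untangling a field like that looks something like this — the ID formats below are invented for illustration, assuming project IDs follow a pattern like PRJ-1234:

```python
import re

# Hypothetical formats: project IDs look like "PRJ-1234"; budget
# numbers (say, "BGT-99") slip into the same cell and must be ignored.
PROJECT_ID = re.compile(r"\bPRJ-\d{4}\b")

def extract_project_ids(cell):
    """Pull every project-ID-shaped token out of a free-text cell."""
    return PROJECT_ID.findall(cell)
```

This kind of cleanup code is exactly what disappears when sources share real IDs at the application level.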
Facebook's data was well integrated because the data sources all interacted with each other at the application level, meaning there had to be IDs that the sources themselves shared.
In fact, instead of having to figure out how we would join data together, we often had to figure out which IDs we would remove to avoid confusing analysts on which ID is the core ID to join on.
Data integration is often skipped when your team is just ingesting data via EL. Why think about integration? There is a current report that needs to be answered, and it doesn't require bringing in another data source. The problem is that eventually it will. Eventually, someone will want to ask a question that spans multiple sources.
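Here is a tiny sketch of where that bites, with invented table and field names: two sources describe the same customer under different IDs, and a cross-source question only works once someone maintains the mapping between them.

```python
# Source 1: CRM. Source 2: billing. Neither knows the other's IDs.
crm_customers = [
    {"crm_id": "C-1", "name": "Acme"},
    {"crm_id": "C-2", "name": "Globex"},
]
billing_invoices = [
    {"billing_acct": "B-77", "amount": 1200},
    {"billing_acct": "B-77", "amount": 300},
]

# The conformed-ID mapping someone (usually data engineering) maintains.
crm_to_billing = {"C-1": "B-77"}

def revenue_by_customer():
    """Answer a cross-source question: total billing per CRM customer."""
    totals = {}
    for inv in billing_invoices:
        totals[inv["billing_acct"]] = totals.get(inv["billing_acct"], 0) + inv["amount"]
    return {c["name"]: totals.get(crm_to_billing.get(c["crm_id"]), 0)
            for c in crm_customers}
```

Without the mapping table, each source answers its own reports just fine; the moment someone asks a question that spans both, the join simply doesn't exist.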
In the end, you not only start running into integration issues but governance issues as well.
Finally, governance. At large companies, the necessity of data governance is usually pretty obvious. There are committees that spend hours deciding how they will implement the various processes and policies to protect, standardize, and better manage the use of data.
Entire departments are dedicated to these tasks. In these cases, the data engineering team is usually not involved in these decisions, or at the very least they are not the SMEs. They might be the ones who programmatically implement the policies made by data governance, but they are not the actual experts.
Of course, these companies are also dealing with tens if not hundreds of applications (many of them duplicates).
But there are a lot of data systems in the SMB and mid-market space too, and not having some form of data governance strategy in a world that is becoming more data-aware poses a lot of risks.
Moving fast can work out early on because much of the modeling and data pipeline debt isn't apparent (and I haven't even touched on data quality). However, as a company grows and its data needs mature, limitations will bubble to the surface.
An executive will ask a question and the data team won't be able to answer it.
The ML team will invest $500k into a model, only to figure out that all of the data was wrong or came from a table no one supports anymore.
All of which will expose the true cost of moving fast.