The last 10 to 15 years have seen an explosion in data volumes that traditional on-premise relational solutions have been unable to handle. No platform bound to a single machine could hope to do so, and a system distributed across multiple compute and storage nodes was necessary. Hadoop emerged as the premier choice to tackle the problem and became synonymous with the term Big Data. However, it was primarily an on-premise solution, and while organisations built their data stacks on Hadoop, companies like AWS and Snowflake began to rebuild the data warehouse to take advantage of the distributed compute and storage of the cloud. The rise of the cloud data warehouses has seen a swing back to a SQL experience built on a relational storage engine. As these data warehouses became popular, vendors were incentivised to build new tools and services to work with these warehouses. Vendors of existing tools and services also added support for these new data warehouses. In aggregate, this stack of offerings is called the Modern Data Stack. Every week, a new company with a data solution competes to become the latest member of the stack. It's easy to see this as a new era in data, but as I try to make sense of where it is going, I thought it was worthwhile looking at how we got here. Using Wardley maps, I will demonstrate how the landscape has changed in the last decade and how commoditisation of different components of the stack is allowing new patterns to evolve.
If you don't know what Wardley maps are, you should stop reading this article and go and study them here. You'll get more out of learning about maps than reading this article.
Before continuing, I want to point out these changes are not laid out in chronological order. They may have happened at the same time or evolved closely together. The point of this article is to show that they happened and I've tried to show how they are linked together.
Back in the ancient times before cloud, the data setup in most companies was very simple. The number of data sources was limited to an operational database or two that supported your company's bespoke applications. If you were lucky, you could pull data out of the backend of any purchased software packages that your company operated. However, a lot of these systems expressly forbade extracting data under their licensing terms. If you could access the data, it was pulled into another database called a data warehouse. This could be a standard Oracle or SQL Server cluster or something more targeted at being a data warehouse, like a Teradata or later a Vertica or Netezza appliance.
Data was moved around with bash scripts or with ETL tools like Informatica, IBM DataStage or Oracle ODI. Microsoft SQL Server Integration Services came later. Streaming technologies were rare and primitive.
Reports were built on top of the data warehouse with tools like Business Objects or MicroStrategy. These tools generally had their own data modelling layer that was used to define and compute KPIs and measures. These definitions were generally hard to access and couldn't be used by other applications. The customers for these reports and dashboards were senior or executive-level managers. Adoption and active users were generally low. These stacks were called Decision Support Systems (DSS) or Executive Support Systems (ESS), indicating that they were built for a small number of senior managers to support high-level decision making.
OLAP cubes could also be built to extract and model data from these data warehouses. This also served the purpose of removing complicated and CPU-intensive queries from the data warehouse. Excel was used extensively to interact with these OLAP cubes. Logic to calculate KPIs was generally duplicated between the OLAP cube and the data visualisation data models, leading to a lot of "why is this number different from that number?" questions.
Overall, your data architecture could look something like this.
CPU and storage were expensive and difficult to scale in an on-premise environment. DBAs were protective of their databases and who could run what on them. Therefore access was generally restricted to a few trusted individuals and systems.
The first major step in the Modern Data Stack was the cloud data warehouse. Amazon Redshift was the first but was followed closely by Snowflake. Snowflake was also hosted on AWS, although they have recently expanded to include an Azure and GCP offering. Azure Synapse and GCP BigQuery also compete in this space. The unifying factor for all of these is a fully ACID-compliant system with a SQL query engine on top of distributed data storage. More recently, Databricks has started to pivot from being the Spark company to being a Data Lakehouse company and is pushing its SQL interface too.
The ability to provision a data warehouse in the cloud was/is a game changer. Previously, it would take months to order and install the necessary hardware and licenses before you could write a SQL statement. Now, it could be done in minutes. With effectively unlimited CPU and storage, data teams didn't have to be so protective of their hardware. As a result, access to the data started to open up to more and more people and use cases.
Separately, data visualisation tools built in the cloud began to appear. Software vendors took two approaches to this. Newer players like Looker and Mode targeted the cloud data warehouses as their primary data source. Microsoft and Tableau looked to host their offerings in the cloud and allow customers to connect to a myriad of data sources, both cloud and on-premise. They treated cloud data warehouses as another data source to be supported. Power BI was built for the cloud and with its low cost entry pricing was a big step in democratising data analysis. Tableau chose a lift and shift of their existing on-premise product to the cloud. Others followed with Google Data Studio and smaller players like Sisense and Yellowfin. Tableau's offering has adapted and is still one of the largest players.
For all the advancements in the Modern Data Stack, this is still the one area that has not changed as much as others. We've seen no major evolutions in this space in years. Sisense and Domo will push augmented BI. Sisu will push their ML capabilities. I admit that I am cynical after seeing so many dashboards and reports gather cobwebs once they had been built for their initial use case. For the most part, 90% of what a data visualisation tool can do has been commoditised for a long time.
The fact that Excel is still the most popular data analysis tool shows the failure of such tools.
Outside of the data platform, the other big shift happening was the rise of SaaS. Salesforce, Zendesk, Google Analytics and Hubspot are all examples of this trend. There are services for every business process now, and they all hold an organisation's data outside of its network. These vendors sold their APIs and the ability to access data as a competitive advantage. They didn't try to block organisations from accessing their data and actively provided ways for them to do so. It was a new paradigm, and the need for a platform to consolidate data from multiple disparate sources became more urgent. Streaming data technology became more mature and consolidated around technologies such as Kafka.
With the commoditisation of services and business data into SaaS-based tools, and the number of data warehouses consolidating around a small number of players, the commoditisation of tools and connectors to extract and load data was inevitable. Tools like Fivetran, Stitch Data and Airbyte have made it easy for data teams to extract and load data from SaaS tools like Salesforce into the cloud data warehouse.
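The extract-and-load pattern these connectors commoditise can be sketched in a few lines. This is a toy illustration only: the paginated API and the list standing in for a warehouse table are hypothetical, not any vendor's actual interface. The key property is that records land unchanged; transformation is deferred to a later step.

```python
# Toy extract-and-load connector: page through a (hypothetical) SaaS
# API and land the records as-is in a warehouse table.

def fetch_page(source, cursor):
    # Stand-in for a paginated SaaS API call returning (records, next_cursor).
    page = source[cursor:cursor + 2]
    next_cursor = cursor + 2 if cursor + 2 < len(source) else None
    return page, next_cursor

def extract_and_load(source, warehouse_table):
    cursor = 0
    while cursor is not None:
        page, cursor = fetch_page(source, cursor)
        warehouse_table.extend(page)  # "load": append records unchanged

saas_records = [{"id": i, "name": f"record-{i}"} for i in range(5)]
landed = []
extract_and_load(saas_records, landed)
```

A real connector adds incremental cursors, schema drift handling and retries, but the shape of the loop is the same.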
Data sharing is another data integration capability that Redshift, Snowflake and Databricks now support. It's a bit of a confusing name, but it allows users to access their data without moving it. For example, if one of the SaaS providers is set up within your data warehouse vendor's marketplace, you can now subscribe to your own data from within your data warehouse and maintain an up-to-date view of your data without using any data integration tools. Imagine having your Amplitude data available as a table in your Snowflake instance, with this data automatically updating without a line of code for you to maintain. As this is a recent feature, I haven't used it this way or talked to anyone who has, but it sounds extremely powerful.
However the data is loaded into your data warehouse, it arrives in the same structure as it was extracted from the source system. This is useful, and the extra compute and storage available with the cloud data warehouses mean that it can be processed. However, having the ability to join data from these disparate sources together and transform it into a persisted table or view is still the very essence of a data warehouse. Especially so if you are paying for your compute by the minute. In that case, you'll want to run your transform once and persist the results for all your customers. This is where dbt comes in and changes the game. dbt has democratised this part of the data cycle so much that it has led to the rise of a new role called Analytics Engineer.
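The run-once-and-persist pattern that dbt formalises can be sketched with sqlite standing in for a cloud warehouse. Table and column names here are hypothetical; a real dbt project would express the `CREATE TABLE ... AS SELECT` as a versioned SQL model instead.

```python
import sqlite3

# In-memory database standing in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_salesforce_accounts (account_id TEXT, name TEXT);
    CREATE TABLE raw_zendesk_tickets (account_id TEXT, status TEXT);
    INSERT INTO raw_salesforce_accounts VALUES ('a1', 'Acme'), ('a2', 'Globex');
    INSERT INTO raw_zendesk_tickets VALUES
        ('a1', 'open'), ('a1', 'closed'), ('a2', 'open');
""")

# The transform: join disparate sources once and persist the result,
# so every downstream consumer reuses it rather than re-running the
# (potentially expensive) query.
conn.execute("""
    CREATE TABLE dim_account_support AS
    SELECT a.account_id,
           a.name,
           COUNT(CASE WHEN t.status = 'open' THEN 1 END) AS open_tickets
    FROM raw_salesforce_accounts a
    LEFT JOIN raw_zendesk_tickets t ON t.account_id = a.account_id
    GROUP BY a.account_id, a.name
""")

rows = dict(conn.execute(
    "SELECT name, open_tickets FROM dim_account_support").fetchall())
```

On pay-per-minute compute, persisting the joined result is what keeps the expensive join from being re-run by every dashboard that needs it.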
As the modern data platform has shifted from one all-encompassing ETL tool to separate EL and Transform, you may need to orchestrate between these. For example, once Fivetran or Airbyte finishes loading the source data, you would want to kick off a dbt job to transform it into the more valuable data model for your reporting and insights. You may have subsequent steps to send out emails and alerts or push data back to consumer tools via Reverse ETL. This is why we have seen the rise of tools like Airflow, Dagster, Prefect and others to orchestrate the running of pipelines end-to-end.
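The dependency ordering described above can be sketched as a toy DAG runner. The step names and functions are hypothetical placeholders; a real stack would hand this to Airflow, Dagster or Prefect, which add scheduling, retries and observability on top of the same idea.

```python
# A toy orchestrator sequencing EL -> Transform -> Reverse ETL.
log = []

def extract_and_load():  # e.g. trigger a Fivetran/Airbyte sync
    log.append("extract_and_load")

def transform():         # e.g. run dbt against the warehouse
    log.append("transform")

def reverse_etl():       # e.g. push modelled data back to SaaS tools
    log.append("reverse_etl")

# Each step declares the steps it waits on.
dag = {
    "extract_and_load": (extract_and_load, []),
    "transform": (transform, ["extract_and_load"]),
    "reverse_etl": (reverse_etl, ["transform"]),
}

def run(dag):
    done = set()
    while len(done) < len(dag):
        for name, (fn, deps) in dag.items():
            if name not in done and all(d in done for d in deps):
                fn()
                done.add(name)

run(dag)
```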
We are also starting to see standardisation of transforms, where a repeatable process can be applied to everyone's Salesforce or other SaaS data. Fivetran will now supply dbt models for customers to run once they have loaded their data. There are even models that join data from different SaaS products together to create a composite view of customer data.
Companies like Trifacta and Mozartdata are taking this a step further with integrated tools that combine the EL and Transform steps.
The transform stage is also where you see ML integrate into the Modern Data Stack. As part of your transform, you can call out to ML models to transform your data. This transformation could be sentiment analysis on surveys, translations, churn predictions, etc. This could be a bespoke model developed in house or an existing ML service like Google Translate or Amazon Forecast. Most of the cloud data warehouses will now allow you to call inference from within a SQL statement, passing features from each row in the query out to an externally hosted model. Or vendors like Continual AI will run your ML inference for you as data is landing in your cloud data warehouse.
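The per-row enrichment pattern can be sketched as follows. The `sentiment_model` function here is a deliberately trivial stand-in for a real inference call to a hosted model or service; in a warehouse this would typically be an external function invoked from SQL.

```python
# Sketch of calling model inference per row during a transform.

def sentiment_model(text: str) -> str:
    # Hypothetical model: a keyword rule standing in for a real
    # network call to a hosted inference endpoint.
    return "negative" if "bad" in text.lower() else "positive"

survey_rows = [
    {"id": 1, "comment": "Great product"},
    {"id": 2, "comment": "Bad experience with support"},
]

# The transform enriches each row with a model-derived column.
enriched = [
    {**row, "sentiment": sentiment_model(row["comment"])}
    for row in survey_rows
]
```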
In the past, data teams would typically have been split into developers and administrators (DBAs). DBAs would take the code from developers and get it running in production. DBAs would be kept busy trying to keep this code from bringing production down, making sure there was enough space for growth, maintaining indexes and archiving data if they had time. If your organisation was trying to implement a business continuity plan, your DBAs may have been occupied maintaining a disaster recovery copy of your database. Everything was on-premise, so DBAs had to deal with other operational teams for the OS, network and storage support. It was a very specialised role and not to be taken lightly. Every DBA and database instance was different, and therefore it was harder to commoditise. DBAs would often act as protectors and gatekeepers for the data system.
Cloud data warehouses have reduced the operational burden so much that the traditional DBA role is no longer necessary. In addition, the rise of the DevOps movement has seen developers take on the "you build it, you run it" mentality. As such, developers have moved to take over any remaining operational tasks still needed for a cloud data warehouse.
In addition, data teams embraced automation for CI/CD and monitoring, reducing or eliminating the need for manual intervention.
Simon Wardley contends that the commoditisation of existing technologies leads to further innovation as new solutions are built on top of these technologies. The commoditisation of data storage and compute, data integration and transformation and data visualisation tools are now leading to new means of extracting value from data. Up to this point, most of the successful parts of the data stack have been commoditised versions of older patterns. Snowflake, Fivetran and dbt are all just new, albeit better ways of doing the old thing faster. Therefore I contend that we are only at the start of this innovation cycle.
We are already starting to see new patterns emerge. Three examples that I can give are Reverse ETL, the metrics layer and a whole plethora of new tools dealing with data trust.
Similar to how Fivetran and Airbyte enabled the commoditisation of data ingestion from SaaS tools into the cloud data warehouse, Reverse ETL tools like Hightouch and Census have allowed organisations to push data from the cloud data warehouses back into their SaaS systems. I debated whether this was actually a new pattern or just another data integration. The primary reason I chose to see this as a new pattern is that these tools are getting data into the hands of a new set of customers. Customer service reps who use Salesforce throughout their day can now see details like customer revenue or NPS that can help them in their day-to-day communication with the organisation's customers. Direct marketers can see more details and metrics about their customers directly within the tools they use to target them. Personally, I think this is a great evolution. Organisations can get more return from their data, and data teams can see their products being used every day, contributing directly to the organisation's bottom line.
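The mechanics of a reverse-ETL sync can be sketched as below. The warehouse rows, CRM client and field names (such as `Annual_Revenue__c`) are all hypothetical stand-ins: the real tools query Snowflake or similar and upsert through each SaaS vendor's actual API.

```python
# Toy reverse-ETL sync: read a modelled table from the warehouse and
# push selected fields into a SaaS tool under its own field names.

warehouse_rows = [
    {"account_id": "a1", "annual_revenue": 120000, "nps": 42},
    {"account_id": "a2", "annual_revenue": 80000, "nps": 17},
]

class FakeCRMClient:
    """Stand-in for a real SaaS API client."""
    def __init__(self):
        self.records = {}

    def upsert(self, external_id, fields):
        self.records[external_id] = fields

def sync(rows, crm, field_mapping):
    # Map warehouse columns to CRM field names, then upsert each row.
    for row in rows:
        fields = {crm_field: row[col] for col, crm_field in field_mapping.items()}
        crm.upsert(row["account_id"], fields)

crm = FakeCRMClient()
sync(warehouse_rows, crm,
     {"annual_revenue": "Annual_Revenue__c", "nps": "NPS_Score__c"})
```

The interesting work in the real products is the field mapping and incremental diffing; the core loop is this simple.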
It has so quickly become an established part of the stack that I have shown it as commoditised already.
The metrics layer is getting a lot of press recently. A metrics layer can be defined as a centralised store of metric definitions that can be accessed via an API, and therefore by any tool within an organisation.
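The idea can be sketched in miniature: definitions live in one registry, and every consumer computes a metric through the same function rather than re-implementing it. The metric names, data and in-process "API" here are hypothetical; real metrics layers expose this over the network and compile definitions down to warehouse SQL.

```python
# Minimal sketch of a metrics layer: one central registry of
# definitions queried by every consumer (BI tool, notebook, reverse ETL).

orders = [
    {"order_id": 1, "amount": 100, "refunded": False},
    {"order_id": 2, "amount": 50, "refunded": True},
    {"order_id": 3, "amount": 200, "refunded": False},
]

# Central definitions: "net_revenue" means the same thing everywhere.
METRICS = {
    "net_revenue": lambda rows: sum(r["amount"] for r in rows if not r["refunded"]),
    "order_count": lambda rows: len(rows),
}

def query_metric(name, rows):
    # Stand-in for the API through which any tool asks for a metric.
    return METRICS[name](rows)
```

The value is not the computation but the single place where "net revenue excludes refunds" is written down.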
As I referenced earlier, these definitions were often hidden in a data visualisation tool's data model, within an OLAP cube or in Excel. Separating them out into their own accessible layer is a great idea. Transform.co, Metriql and Supergrain are early innovators in this space. LookML from Looker has been around for a number of years but wasn't seen as separate from the core Looker data visualisation tool. And dbt Labs have announced their own version as part of dbt v1.0 at the Coalesce 2021 conference.
These products still belong to a very nascent part of the stack and need to establish their value to data consumers.
There are a lot of new vendors working on solutions to tackle data discoverability, quality, observability, reliability or lineage. Regardless of what question these services propose to tackle, I see them all working to improve trust in an organisation's data. For as long as there have been data systems and insights being produced, data teams have been asked to defend their numbers and ensure they are correct. These questions are valid and necessary, as important decisions are often made based on those numbers. However, they do take time to answer, and issues are often caused by data changing rather than any software bug. As data volumes grew and all parts of the data stack expanded, it has become more difficult for teams to keep track of all the data in their platform. Companies like Monte Carlo, Datafold, Bigeye and Metaplane all have products to help data teams keep on top of the state of the data in their data warehouses.
These tools all operate by tracking where data is sourced, how it gets into the data warehouse and how it is transformed, and then profiling it at rest. Combined with open frameworks like Open Lineage and Open Metadata, these tools have the potential to improve organisations' confidence in their data.
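The profile-then-check step can be sketched as a toy data-quality check of the kind these tools automate. The table, column names and threshold below are all made up for illustration; real products learn expectations from history rather than hard-coding them.

```python
# Toy data-quality check: profile a table at rest and flag anomalies
# against a simple expectation.

table = [
    {"user_id": "u1", "email": "a@example.com"},
    {"user_id": "u2", "email": None},
    {"user_id": "u3", "email": "c@example.com"},
]

def profile(rows, column):
    # Compute basic statistics about one column of a table.
    values = [r[column] for r in rows]
    nulls = sum(v is None for v in values)
    return {"row_count": len(rows), "null_rate": nulls / len(rows)}

def check(stats, max_null_rate):
    # Compare the profile against an expectation and raise alerts.
    alerts = []
    if stats["null_rate"] > max_null_rate:
        alerts.append("null_rate above threshold")
    return alerts

stats = profile(table, "email")
alerts = check(stats, max_null_rate=0.25)
```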
Up to now, the vendors for these tools are targeting the data teams as their customer. Data teams understand the problem and are happy to outsource it if possible. I think the real potential for these tools will come when data customers are the ones using them. If data customers can bypass the data teams and check how much trust they can put in the insights being generated, that would be a game-changer. Self-service data trust if you will.
Taking this a step further, if our industry ever gets to the point where the modern data stack starts to drive customer facing applications, tools that automate and verify data trust will become essential.
Why are you ignoring Hadoop?
The main reason why Hadoop is excluded from the Modern Data Stack is that it hasn't enabled the new set of data tooling and processes that the cloud data warehouses have. We can draw a direct line from the introduction of Amazon Redshift and Snowflake to where the Modern Data Stack is now. Hadoop did not enable that. There definitely was some cross-pollination between the two. Spark and Presto both came from Hadoop and are used widely in data stacks. Even as Hadoop was commoditised for the cloud with AWS EMR, Azure HDInsight or GCP Dataproc, it hasn't had the same impact as Snowflake, Amazon Redshift and the other cloud data warehouses.
We have seen how the commoditisation of OLAP data warehouse technology has caused an explosion in new data tools and productivity. My wish for the future is for other data storage technologies to be converged under a single endpoint. We have already seen this with most of the cloud data warehouses consolidating data lake storage into the data warehouse by supporting semi-structured data types. This enables organisations to store this type of data alongside their relational data. Previously, you would have had to store this type of data on object storage, making it harder to join with relational data.
If a customer could use this endpoint for multiple data uses, that could drive the post-Modern Data Platform. Build a search index or graph endpoint all on the one platform.
TDWI have called out Multimodel platforms as their number 1 trend in their Data Management: A Look Back At 2021 article. They define multimodel as
Multimodel databases combine relational and nonrelational data and seamlessly execute analytics, transactions, and other workloads in a single platform with scalability, performance, high availability, and unified management.
They reference the NoSQL vendors driving into this space, but Snowflake is also driving at this, where you can seamlessly use SQL for search with their Search Optimization Service feature. TigerGraph provides a Snowflake Connector that allows customers to analyse their Snowflake data as a graph. Imagine now that you could run a graph workload all through the same endpoint.
AWS should be the obvious frontrunner in this space, given that they have a managed service for every database engine known to humanity. Their Amplify service allows application developers to model data for transactions and for search by specifying the @searchable directive as part of your GraphQL schema. This will automatically deploy an OpenSearch index fed from your DynamoDB table. However, their official approach is purpose-built databases, allowing their customers to choose the right database engine for their use case. Amplify is for application development, but if we could see something similar for data analytics, I think it could be a massive win. I believe developers should only have to deal with this complexity if they want to or need to, not because the platform demands it by design.
Any of the major cloud vendors could provide this type of solution. They have all the different database technologies, and a single abstraction layer over these could in theory work. For the likes of Snowflake or Firebolt to build this, they would need to stand up these different technologies themselves. Databricks also has a lot of experience building low-level technology and has shown that they are far more than the Spark company.
I see great potential in streaming but in reality most streaming systems have limited use cases. The inability of streaming systems to join to other state data is a severe limiting factor. Most streaming systems are built against a single event and have limited scope. An event may be decorated with extra information before it is published onto a stream but this can slow the publishing process down. The real power of a data warehouse is its ability to join state from multiple different sources and create new information and insights. If a streaming platform can be built that could join quickly and easily to all state within a data warehouse, that would be a game changer. Or it could happen the other way around. A process from within a data warehouse could read a stream of data in realtime and look up state within the wider data warehouse and take appropriate actions within milliseconds. I see Materialize, Upsolver and Clickhouse going in this direction and it could eventually bring a realtime data warehouse into existence.
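The pattern described above, where each in-flight event is joined against state held in the warehouse before an action is taken, can be sketched as follows. The customer data, the in-memory dict standing in for warehouse state and the priority rule are all hypothetical; systems like Materialize aim to make this kind of lookup a fast, incrementally maintained query rather than a dictionary read.

```python
# Sketch of a stream processor enriching events with warehouse state.

# State "in the warehouse": customer lifetime value by customer id.
customer_state = {"c1": {"ltv": 5000}, "c2": {"ltv": 120}}

def handle_event(event, state):
    # Join the in-flight event against warehouse state.
    customer = state.get(event["customer_id"], {"ltv": 0})
    enriched = {**event, "ltv": customer["ltv"]}
    # Take an action in the stream based on the joined state.
    enriched["priority"] = "high" if customer["ltv"] > 1000 else "normal"
    return enriched

stream = [
    {"customer_id": "c1", "type": "support_ticket"},
    {"customer_id": "c2", "type": "support_ticket"},
]
results = [handle_event(e, customer_state) for e in stream]
```

The hard part in practice is keeping that state lookup both fresh and fast at streaming rates, which is exactly what the systems mentioned above are working on.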
To be honest, before I started this exercise I was filled with a lot of excitement for the Modern Data Stack and the potential around it. I still am but mapping has made it clear to me that we are still only in the early days of seeing any new patterns emerge. Up to now, we have seen the commoditisation of existing patterns rather than new patterns. As more and more customers move onto the commoditised platforms, we'll see new patterns emerge and start to move from genesis to custom built to product and eventually commoditised themselves.