There is a lot of discussion on how you should implement your data stacks today. Should you bundle or unbundle, should you use a traditional data warehouse, lakes, lake houses, or mesh?
Or a hybrid of all of them?
With all these different ways you can put together your data stack, the question becomes: where do you even start? How can your team set up a successful data stack and a set of processes that ensure you can build reliable reporting quickly?
Well, it all starts with data. Data is the base layer for a company's analytics projects. Whether you're building data science models, standardized reporting, dynamic pricing API endpoints, or KPIs, these are the appliances that will run on the electricity that is data.
With all these different ways companies can set up their data, tools have emerged to answer nearly every problem in building that core data layer.
In this article, we will discuss some of the popular routes data solution providers are taking in terms of types of data layers.
Strategies for developing your data layer
Looking back a little over a decade ago, there were only one or two ways that were considered best practices in terms of storing data for analytics.
Most companies would use some version of a standardized EDW (enterprise data warehouse). However, many of those standard best practices are now being challenged. For example, data meshes have risen in popularity on the promise that companies can reduce the time from raw data to reportable data.
Besides data meshes, there are also streaming data warehouses such as Materialize and Vectorized, virtual data layers, and federated SQL engines like Denodo, Trino, and Varada.
Not to mention the concept of a data platform becoming more prominent as tools like Snowflake and Databricks fight for dominance in the data world.
When Snowflake first came out, it described itself as the cloud's first data warehouse. But that was a decade ago. Snowflake has since changed with the times and now describes itself as a Cloud Data Platform. Its data warehouse is still the core of its offering. However, its recent purchase of Streamlit is a great example of how it is trying to morph into a data platform. Snowflake doesn't just want to own the storage and management; it wants to own the value-add as well.
This is likely due to several large competitors coming into the market, Databricks being one of the major players.
Databricks, on the other hand, originated in academia and the open-source community; the company was founded in 2013. The original creators of Apache Spark™, Delta Lake, and MLflow wanted to create a unified, open platform for all your data.
The goal of both of these tools is to go beyond just a data warehouse or data lakehouse. Instead, they want to be the platform everyone else builds their applications and data stacks on, essentially the Microsoft or iPhone of data. Everything builds on their platform layer, whether it's internal company data products or third-party solutions like Fivetran and Rudderstack.
But neither Snowflake nor Databricks would be anywhere without the classic data warehouse.
Data warehouses remain a standard choice when picking how to store your company's data. In particular, companies like Firebolt have decided to skip the new buzzwords entirely and take a stand as a data warehouse.
In addition, there are plenty of other solutions that can be utilized in a data warehouse context. Some are more traditional databases, like Postgres; others are modern cloud data warehouses, like BigQuery, Teradata, or Redshift.
In general, all of these systems have their nuances, and some aren't strictly a "data warehouse," but they are often built like one, and for now none of them have ambitions to go much beyond the data warehouse. Data warehouses remain a popular design choice because we, as a community, have been building them for decades. Most of us understand concepts such as fact and dimension tables as well as slowly changing dimensions. Even though most modern data warehouses take a less strict approach to these design concepts, they still play a role in developing many companies' core data layers.
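To make the slowly changing dimension concept concrete, here is a minimal sketch of a Type 2 dimension table. All table, column, and key values are hypothetical, and the exact DDL types will vary by warehouse:

```sql
-- Hypothetical Type 2 slowly changing dimension: each customer row is
-- versioned with valid_from/valid_to dates instead of being overwritten.
CREATE TABLE dim_customer (
    customer_key   BIGINT,        -- surrogate key, unique per row version
    customer_id    BIGINT,        -- natural/business key
    customer_name  VARCHAR(100),
    city           VARCHAR(100),
    valid_from     DATE,
    valid_to       DATE,          -- NULL for the current version
    is_current     BOOLEAN
);

-- When a customer's city changes, close out the old row...
UPDATE dim_customer
SET valid_to = DATE '2022-06-01', is_current = FALSE
WHERE customer_id = 42 AND is_current;

-- ...and insert a new version rather than updating in place,
-- so historical facts still join to the attributes of their era.
INSERT INTO dim_customer
VALUES (1001, 42, 'Jane Doe', 'Berlin', DATE '2022-06-01', NULL, TRUE);
```

The payoff is that a fact table joined on `customer_key` reports each sale against the customer attributes that were true at the time of the sale.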
But these are usually built as a more classic data warehouse (even if BigQuery isn't a fan of standard data warehouses).
Streaming Data Warehouses
Another form of analytical data storage is the streaming database. For example, Materialize is a SQL streaming database startup built on top of the open-source Timely Dataflow project. It allows users to ask questions of live, streaming data, connecting directly to existing event streaming infrastructure, like Kafka, and to client applications.
Engineers can interact with Materialize through a standard PostgreSQL interface, enabling plug-and-play integration with existing tooling. From there, data professionals can build data products such as operational dashboards without ETLs.
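As a rough sketch of what that looks like: you point Materialize at a Kafka topic and define a materialized view over it, which is then kept incrementally up to date as events arrive. The broker address, topic, schema registry URL, and table/column names below are all hypothetical, and the exact `CREATE SOURCE` syntax depends on the Materialize version:

```sql
-- Hypothetical: attach Materialize to an existing Kafka topic of orders.
CREATE SOURCE orders_raw
FROM KAFKA BROKER 'kafka:9092' TOPIC 'orders'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081';

-- Keep an aggregate continuously maintained as new events stream in.
CREATE MATERIALIZED VIEW revenue_by_region AS
SELECT region, sum(amount) AS total_revenue
FROM orders_raw
GROUP BY region;

-- A dashboard can now read always-fresh results over the
-- standard PostgreSQL wire protocol:
SELECT * FROM revenue_by_region;
```

The point is that there is no batch ETL job in the loop; the view is the pipeline.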
In addition, Vectorized and Rockset are two other similar streaming database systems that allow developers to connect directly to data sources and run queries on top of them.
All of these options are attempting to approach data management and reporting in a very different way.
Data Virtualization/Federated SQL Engine
Some tools are working to directly query data across multiple data sources.
One such example is Varada. Varada indexes data in Trino in a way that reduces the CPU time spent scanning data and frees the CPU for other tasks, like fetching data quickly or handling concurrency. This lets SQL users run a variety of queries, from joins across dimension and fact tables to broader data lake analytics, over indexed, federated data sources. SQL aggregations and grouping are accelerated using nanoblock indexes as well, resulting in highly effective SQL analytics. Paired with Varada's Control Center, data engineers and system admins can more easily monitor their Presto/Trino clusters and make educated decisions on how to manage them.
Another example is Denodo, which allows you to connect to various data sources, such as Oracle, SQL Server, HBase, etc., from Denodo itself.
It also uses VQL, which stands for Virtual Query Language. VQL is similar to SQL and lets developers create views on top of all the databases that Denodo is connected to. This means you can join tables across different databases and create new data sets.
In a world that is pushing more toward a data mesh every day, tools such as Trino and Varada make a lot of sense. Instead of constantly requesting access to new sources of data in different data warehouses, why not just query all the siloed data warehouses directly? That's easier said than done.
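The appeal of the federated approach is easiest to see in a query. In Trino, each connected system is a catalog, so one statement can join across them. The catalog, schema, table, and column names here are purely illustrative:

```sql
-- Hypothetical Trino query joining two separate systems in one statement:
-- 'pg' is a catalog pointing at a PostgreSQL database, and 'lake' is a
-- catalog pointing at tables in a data lake.
SELECT c.customer_name,
       sum(o.amount) AS lifetime_value
FROM pg.public.customers AS c
JOIN lake.events.orders  AS o
  ON o.customer_id = c.customer_id
GROUP BY c.customer_name
ORDER BY lifetime_value DESC
LIMIT 10;
```

No data had to be copied into a central warehouse first, which is exactly the trade-off: you save the pipeline work but inherit the performance and governance questions of querying sources in place.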
What Will Be Your Core Data Layer?
There are many different ways companies can set up one of their most important assets, and how a company's data is set up is crucial to its success.
Whether you pick a data warehouse or utilize tools like Trino, the end goal is to create a robust data system that all end-users can trust.
Of course, every choice comes with challenges. If you decide to build a data warehouse, it will take time; if you pick a virtual data layer, there can be a lot of confusion about which data you should pull from.
Why is all this important? Setting up an easy-to-use data layer is crucial for a business to succeed. So do take your time and consider which data tool is right for you. In the end, it's about creating a sustainable, robust, scalable, and cost-effective data system.