Kostas Pardalis

Posted on May 2, 2024 • Originally published at cpard.xyz

A glimpse into the future of data processing infrastructure.

#database #bigdata #snowflake #spark

Three weeks ago, VeloxCon took place in San Jose. The event was a great opportunity for people who are interested in execution engines and data processing at scale to learn about the current state of the project.

Most importantly, though, it was an amazing opportunity to get a glimpse of what the future of data processing will be like. From what we saw at the event, this future is very exciting!

Let's get into more details of what happened at the event and why it's important.

First, what is Velox?

Velox is an open-source unified execution engine created and open-sourced by Meta, aiming to commoditize execution in data management systems.

You can think of the execution engine as the component of a data management system that is responsible for processing the data.

This part is usually one of the most time-consuming to build and also the one that is the most demanding regarding correctness. We might assume that whenever we run an SUM aggregation function, the result will always be correct. This is only possible because dedicated engineers invest countless hours guaranteeing that your database functions correctly.

Why is it that Velox could commoditize execution in data management systems? Because execution is such a hard problem to solve. If we had a library that always does the right thing and also does it in the best way possible, we could build the other services around it..

Why does this matter? Because if we can succeed in that, then systems and database engineers can design modular data management systems and iterate faster. As a result, users of these systems can enjoy better products and more innovation reaching them faster than ever before.

What happened at VeloxCon this year

The conference happened in the period of two days and it was packed with interesting, and as expected, very technical talks. The high-level structure of the event was the following:

Day 1 was all about updates on the current state of the project
- New features that have been delivered to Velox
- Updates on who is currently using it in production and how
- Open Source Project management and governance updates
Day 2 was all about the future of Velox and data management systems
- A shift into use cases and workloads for data management systems
- A lot of hardware, which might sound surprising for such a conference, however, it's one of the most interesting signals for the future of data management

Every single presentation that was presented at VeloxCon deserves its own post. Instead, I'm going to share the takeaways that I believe are the most important. I'd suggest you then go to the YouTube channel of the conference and watch all of the presentations.

Take #1: For analytical workloads, it's all about performance optimization

Today, we know how to manage these workloads. We also need to make them accessible to everyone. The market demands their commoditization. For analytical workloads at any scale, it's all about optimization.

Optimization comes in two forms. I would argue that these two forms are just different sides of the same coin; one is performance optimization and the other is cost optimization.

The good news is that there's huge opportunity for delivering value here.

My take is that in the coming years we will see more performance and auto-tuning coming from the systems themselves. As a result, practitioners will spend more time building than operating.

What's new in Velox - Jimmy Lu, Meta

Velox I/O optimizations - Deepak Majeti, IBM

Take #2: Defining and measuring performance is a very hard problem

This is not just an engineering problem. It's a market problem, too.

TPCH and TPCDS are useful for understanding the operators a data management system supports. However, they are not enough to decide if a system is the best for your workloads.

Just take a look at the presentation about what's new in Velox to see the optimizations that are being discussed and the improvements they are talking about.

The space for possible optimizations for data processing is just enormous. So how do you choose where to focus and what to go after?

My take is that although TPC-* benchmarks are important tools, there's a lot more to be done on this front. Unfortunately, bench-marketing has been a plague in this industry and it's quite easy for benchmarking suites to turn into marketing tools without much substance.

Because of that, a different approach to benchmarking is important. It should be more of defining frameworks and tooling to create custom benchmarks that fit the workloads of each user.

Now, the question of when to benchmark a data processing system will not be about whether it's complete and accurate. Systems like Velox will make sure of that. Instead, we'll focus on how well a particular system performs based on benchmarks that come from the user's actual work.

An update on the Apache Gluten project incubator and its use of Velox - Binwei Yang

Take #3: Databricks moats are holding strong

When I first learned about the Gluten project from Intel, I thought Databricks was going to be in trouble.

Photon was a great advantage for them when they split from Apache Spark. If Apache Spark now gets a similar execution engine, it'll be harder for people to switch to Databricks.

Especially if something like EMR Spark gets an execution engine comparable to Photon.

We are far from seeing widespread production use of WebAssembly. Although it is gaining momentum, and we have seen initial production deployments, there are still significant gaps that need to be addressed.

Initially, there were missing Spark functions in Velox. However, the gap is narrowing quickly.
Second, implementing a function is one thing. Guaranteeing that the implementation is semantically, and performance equivalent to Spark is another thing. I'll get back to this later because there are some very interesting insights from the event regarding workload migrations to new engines.
Third, PySpark and dataframe support is not there yet and that's a big issue as these two APIs have become an important driver of Spark adoption.
Finally, UDFs for Spark need to be figured out. A lot of business logic is done as User-Defined Functions on Spark. Moving these functions to a new system while making sure they still work and perform better isn't easy.

We are getting closer to having a similar to Photon execution engine on Apache Spark, however it will take more time before we have something that can threaten Databricks.

Accelerating Spark at Microsoft using Gluten & Velox - Swinky Mann and Zhen Li, Microsoft

Unlocking Data Query Performance at Pinterest - Zaheen Aziz

Take #4: Databricks might not be in trouble, but maybe Snowflake is?

One of the interesting things I learned at the event was that Presto's integration of Velox is much more mature than that of Spark.

Although I initially expected Databricks to be the primary focus, it turned out to be different. The implementation of Velox in Spark is not progressing as quickly as it is in Presto.

Currently, Prestissimo inside Meta has been fast replacing the old good Presto and it has reached a great level of maturity.

Traditionally, Presto is employed for conducting interactive analyses on a data lake. It operates similarly to a data warehouse such as Snowflake, except it uses a more flexible infrastructure and also offers query federation capabilities.

At the same time, the performance improvements that are being reported for Prestissimo are quite amazing. Being more performant means more flexibility for trading off cost over latency while enabling new workloads that were not possible before. Previously, it was uncommon to perform ETL on Presto, but now it is becoming increasingly common.

If Presto performs as well as Snowflake while having a more open design, what will stop people from using it instead of Snowflake?

I would argue here that the main obstacle for this to happen is the operational burden on Presto and Trino. Setting up and running these systems is often a harder task than doing the same for Spark.

Migrating away from Snowflake might not be a difficult decision. Systems like Athena and Starburst Galaxy are improving their developer experience, and the performance of systems like Velox is on par with Snowflake.

This makes me think that data warehousing will become a commodity much quicker than Spark and its workloads.

Prestissimo at IBM - Aditi Pandit, IBM

Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta

Take #5: We need more developer tooling around data infrastructure

In the past, data tools were built for data analysts, not data engineers. Data quality platforms were designed with attractive user interfaces, while catalogs were created to rival the user experience of top-notch SaaS applications.

However, the future of data infrastructure is going to be a little bit different.

Like in app development, the key to making data teams more productive lies not in UX, but in DX.

There's a great need for tooling for developers who are responsible for building and maintaining data platforms. One great example of that is the lack of good fuzzers for testing data platforms.

Jimmy Lu explained how Velox uses fuzzers to make sure Prestissimo-Velox performs as expected, compared to Presto's standard execution engine.

This type of tooling is important for any team that is trying to do a migration. Not just from one vendor to another, but even between different versions of the same piece of infrastructure.

Just ask anyone involved in migrating from Hive to anything else how long it took to properly migrate away from it with confidence.

Talks to check

An update on the Apache Gluten project incubator and its use of Velox - Binwei Yang

Prestissimo Batch Efficiency at Meta - Amit Dutta, Meta

Unlocking Data Query Performance at Pinterest - Zaheen Aziz

Take #6: There's a tectonic swift in the importance of data workloads

Nimble's presentation has over 7K views, while others have just over 1K.

Parquet has been the foundation of large-scale data processing for over 10 years. Now, people are starting to build something to replace it. This is significant for the industry.

ML workloads are becoming more and more important and mainstream than they used to be. That's the shift we are talking about here.

Analytical workloads will keep growing, but the market demands that ML workloads run more efficiently and reach more people.

To connect with what we've said earlier about analytical workloads, markets are looking for infrastructure that can deliver efficiencies for them. At the same time, it is looking for technologies that can enable ML workloads at scale, which are new types of workloads and use cases.

What is an interesting observation, though, is that some of the fundamental parts of data infrastructure do not have to change to serve both workloads. Velox is a great example of this!

Theseus: A composable, distributed, hardware-agnostic processing engine - Voltron Data

Nimble, A New Columnar File Format - Yoav Helfman, Meta

Take #7: Hardware is still the catalyst for innovation in data processing

Hardware has always been a catalyst for data processing technologies. (If you don't believe me, just listen to Dhruba Borthakur, the creator of RocksDB).

Over time, database systems have changed a lot. First, there was Hadoop, which used cheap hard disk drives (HDDs). Then, low-latency systems like rocksdb came along because of cheap solid-state drives (SSDs). Now, we have cloud warehouses, which are possible because of the massive and cheap block storage on the cloud.

The main difference between the above and what is happening today, though, is that today the main driver of innovation is not storage but processing. GPUs, TPUs, FPGAs, any sort of on-chip accelerator is what is going to drive the next wave of innovation in data management systems.

VeloxCon's second day focused on hardware accelerators. Talks covered what hardware vendors are bringing and what query engines need to support this new hardware.

Velox, Offloading work to accelerators - Sergei Lewis, Rivos

NeuroBlade's SPU Accelerates Velox by 10x - Krishna Maheshwari, Neuroblade

Outro

VeloxCon has allowed me to take a glimpse of a future that is fast-coming.

There are many challenges, but with them also many amazing opportunities for building new technologies and delivering immense amounts of value.

I personally cannot wait to see what will happen in the next couple of months with Velox and the industry. Exciting times ahead!

DEV Community

A glimpse into the future of data processing infrastructure.

First, what is Velox?

What happened at VeloxCon this year

Take #1: For analytical workloads, it's all about performance optimization

Related talks

Take #2: Defining and measuring performance is a very hard problem

Related Talks

Take #3: Databricks moats are holding strong

Related Talks

Take #4: Databricks might not be in trouble, but maybe Snowflake is?

Related Talks

Take #5: We need more developer tooling around data infrastructure

Talks to check

Take #6: There's a tectonic swift in the importance of data workloads

Related Talks

Take #7: Hardware is still the catalyst for innovation in data processing

Related Talks

Outro

Top comments (0)