Iceberg, Delta Lake, and Hudi, oh my!

#dataengineering #datascience #iceberg #deltalake

The rising popularity of the data lakehouse has led many to compare the merits of the open table formats underpinning this architecture: Apache Iceberg, Delta Lake, and Apache Hudi. If you read between the lines, the conversation is mostly driven by hype, making it hard to separate reality from marketing jargon.

This article isn’t going to solve that problem. Instead, the goal is to introduce you to a new way of thinking about table formats – as a use case-level choice rather than an organization-level decision.

Choosing a table format

When deciding between table formats, it’s important to understand the similarities and differences that may impact performance and scalability.

For example, Iceberg is currently the only table format with partition evolution support. This allows the partitioning scheme of a table to be changed without rewriting the table, and it lets queries take advantage of every partitioning scheme the table has used.
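As a rough sketch of what partition evolution looks like in practice, the statements below use the Trino-style SQL accepted by the Iceberg connector. The table name page_views and its columns are hypothetical, chosen only for illustration.

```sql
-- Hypothetical table, initially partitioned by month.
CREATE TABLE page_views (
  user_id bigint,
  view_time timestamp(6)
)
WITH (
  type = 'iceberg',
  partitioning = ARRAY['month(view_time)']
);

-- Later, switch to daily partitions. Existing data files are not rewritten;
-- only data written after this point uses the new scheme.
ALTER TABLE page_views
SET PROPERTIES partitioning = ARRAY['day(view_time)'];
```

Queries that filter on view_time should not need to change after the switch, since the partitioning is tracked in table metadata rather than in the query.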

On the other hand, Iceberg’s streaming support lags behind that of Delta Lake and Hudi. So the question when picking a table format becomes: which is more important to your business, partitioning or streaming?

Now, any seasoned data engineer knows that it’s not that simple. You don’t just have a single type of data in your systems or a single way you’re looking to interact with that data. Instead, you’re dealing with streaming pipelines, batch jobs, ad hoc queries, and more – all at the same time. And you don’t get to control what is added to that mix in the future.

All of these factors make the binary decision (partitioning or streaming, Iceberg or Delta Lake) almost impossible to get right at the organization level. But most vendors require you to do just that.

Starburst’s approach

With Starburst, everything is built with openness in mind. We designed Starburst Galaxy to be interoperable with nearly any data environment, including first-class support for all modern open table formats.

This means that you can use the table format that is right for each of your workloads and change it when new needs emerge. You don’t need to worry about limited support for external tables or being locked into an old table format when new ones come along (and they will).

How it works

We wanted to make it as easy as possible to write to and read from different table formats, so we built Great Lakes connectivity – an under-the-hood process that abstracts away the details of using different table formats and file types.

This connectivity is built into Starburst Galaxy and is available to all users who work with the following data sources:

Amazon S3
Azure Data Lake Storage
Google Cloud Storage

To create a table with one of these formats, you simply provide a “type” in the table DDL. Here is a simple example of creating an Iceberg table:

CREATE TABLE customer (
  name varchar,
  address varchar
)
WITH (type = 'iceberg');

That’s it! An Iceberg table has been created.
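Choosing a different format for another workload should only require changing the type value. The sketch below assumes that 'delta' is the accepted value for Delta Lake; the orders table and its columns are hypothetical.

```sql
-- Hypothetical table for a different workload, stored as Delta Lake.
-- Assumption: 'delta' is the type value for Delta Lake tables.
CREATE TABLE orders (
  order_id bigint,
  customer_name varchar,
  total double
)
WITH (type = 'delta');
```

Apart from the type value, the statement has the same shape as the Iceberg example.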

To read a table using Great Lakes connectivity, you simply issue a SQL select query against it:

SELECT * FROM customer;

Again… that’s it! End users shouldn’t need to worry about file types or table formats. They just want to query their data.
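To illustrate how the table format stays out of the way at query time, here is a sketch of a join between the Iceberg customer table and a hypothetical orders table stored in another format, such as Delta Lake. The orders table and its customer_name and total columns are assumptions made for this example.

```sql
-- customer is the Iceberg table created earlier; orders is a hypothetical
-- table in the same catalog that may use a different table format.
SELECT
  c.name,
  sum(o.total) AS lifetime_total
FROM customer AS c
JOIN orders AS o
  ON o.customer_name = c.name
GROUP BY c.name;
```

Nothing in the query names a table format; that detail lives in the table definition.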
