Iddo Avneri

Data Version Control vs. Open Table Formats: Differences And Use Cases

Disclaimer: I work for the open-source data version control tool lakeFS.

Following webinars or conference presentations, people often ask me: 

What's the difference between data version control tools like Git, lakeFS, Git LFS, or DVC and open table formats (OTFs) like Apache Iceberg, Apache Hudi, and Delta Lake?

The short answer is: They're two different technologies that solve entirely different use cases.

But if answering this question seems so easy, why does it keep coming up over and over again?

I guess it's time for a more detailed answer. Let's dive in!

TL;DR:
I describe the differences between Open Table Formats and data version control systems, dive into how they work, and show that the two approaches are compatible, potentially delivering fantastic results.


What is an Open Table Format (OTF)?

When you migrate structured data from a relational database to object storage, one thing is inevitable: 

You lose typical database guarantees.

Databases offer CRUD operations with assured transactionality. This sets them apart from object storage, which is immutable by design.

To put it simply: if you want to alter or expand a data file in object storage, you must rewrite it. 

And no transactionality is guaranteed here. Things get even more difficult when your table comprises several files on disk (think sharding and partitions).

Open table formats (OTFs) address this issue by providing a table abstraction that lets you create, insert, update, and delete records. They also help manage table schema evolution and open the door to some concurrency and transactionality guarantees.

How do Open Table Formats work?

Note: The description below is highly simplified for the sake of argument.

While each format operates somewhat differently, the basic premise is the same. In the following example, I'm going to refer to Delta Lake, "the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse Platform." (source)

The table's data is stored in immutable Parquet files, while changes are saved in extra data files called delta files. Instructions on how to apply the delta files are stored in log files (delta logs).

When you access the data, you can get the most recent status by reading data files and delta files - and then computing a version of the table using Spark. You can achieve this for any time window for which the delta files are still available.

All readers and writers must agree on the log order to prevent corruption or discrepancies.
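To make this mechanism concrete, here's a minimal toy sketch (an in-memory model for illustration, not any real OTF's implementation): the base data never mutates, every change is an appended log entry, and readers replay the log in the agreed order to compute the current state.

```python
from dataclasses import dataclass, field

@dataclass
class ToyTable:
    """Rows keyed by id. The base is never mutated; changes append to a log."""
    base: dict = field(default_factory=dict)   # stands in for immutable Parquet files
    log: list = field(default_factory=list)    # stands in for delta files + delta log

    def insert(self, key, value):
        self.log.append(("insert", key, value))

    def update(self, key, value):
        self.log.append(("update", key, value))

    def delete(self, key):
        self.log.append(("delete", key, None))

    def snapshot(self):
        """Compute the latest state by replaying the log, in order, over the base."""
        state = dict(self.base)
        for op, key, value in self.log:
            if op == "delete":
                state.pop(key, None)
            else:                              # insert or update
                state[key] = value
        return state

t = ToyTable(base={1: "a"})
t.insert(2, "b")
t.update(1, "a2")
t.delete(2)
print(t.snapshot())   # -> {1: 'a2'}; note that t.base is untouched
```

The key point: `insert`, `update`, and `delete` never touch the "immutable" base, which is exactly why agreeing on the log order matters so much.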

Time travel is a by-product of Open Table Formats

You can examine different data versions indicating changes over time by iterating through the sorted log, as shown below.

[Diagram: how open table formats work]

Moreover, you can switch between multiple table versions by reading up to a specific log point. This may result in a performance penalty or a restricted time period allowed for this activity, but it still works like time travel (though only for a single table).
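In the same toy spirit (assumed log format, not a real OTF's), iterating through a sorted change log and stopping at any point yields every historical version of the table:

```python
def versions(base, log):
    """Yield the table state as of each point in the ordered change log."""
    state = dict(base)
    yield dict(state)                  # version 0: base files only
    for op, key, value in log:
        if op == "delete":
            state.pop(key, None)
        else:                          # insert or update
            state[key] = value
        yield dict(state)              # state after this log entry

log = [("insert", 2, "b"), ("update", 1, "a2"), ("delete", 2, None)]
for i, v in enumerate(versions({1: "a"}, log)):
    print(i, v)
# 0 {1: 'a'}
# 1 {1: 'a', 2: 'b'}
# 2 {1: 'a2', 2: 'b'}
# 3 {1: 'a2'}
```

Note the performance trade-off the text mentions: reconstructing an old version means replaying the log from the start, and it only works while the relevant delta files are retained.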

OTFs come with a soft copy

OTFs support both hard and soft copies of a table.
A hard copy is a physical copy, while a soft copy is a metadata action that enables read-only access to the table at a specific point in time.

Apache Iceberg uses the name "branch" for its soft copies. Since the term is borrowed from the world of version control, you might assume it works like a branch in Git. In practice, it's a soft copy that can only be read.
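A toy sketch of the idea (hypothetical names, not Iceberg's actual metadata layout): a soft copy is just a name pointing at a position in the table's change log, so creating it copies no data at all.

```python
base = {1: "a"}
log = [("insert", 2, "b"), ("update", 1, "a2")]

# Creating the "branch" stores only a log position -- O(1), zero data copied.
branches = {"end-of-quarter": 1}   # hypothetical name; points just past entry 1

def read_branch(name):
    """Read-only view: replay the log only up to the branch's recorded position."""
    state = dict(base)
    for op, key, value in log[:branches[name]]:
        if op == "delete":
            state.pop(key, None)
        else:                      # insert or update
            state[key] = value
    return state

print(read_branch("end-of-quarter"))   # -> {1: 'a', 2: 'b'}
```

The later `update` never shows up in the branch's view, because the branch's pointer was recorded before that log entry existed.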

Conclusion

Open Table Formats provide two functionalities that resemble version control at first glance:

  • Per-table time travel,
  • Per-table branching (soft copies).

But data version control is a little different. Let me show you why.


What is data version control, and what problems does it solve?

A data version control system allows you to perform data lifecycle management from development to production. Yup, you can manage data just like software developers manage code.

Data version control opens the door to actions such as:

  • Isolating data pipeline development,
  • Testing pipeline modifications,
  • Testing data quality automatically,
  • Gatekeeping data promotion (data CI/CD),
  • Data set reproducibility,
  • Transactional functionality across several tables,
  • Resolving data quality challenges in production.

Many data version control systems operate over a repository of data sets. They also have a versioning mechanism. Tools such as Git LFS, DVC, Dolt, or lakeFS use some form of Git-like actions such as branching, committing, merging, etc.

This allows you to travel in time to any existing branch, commit, or merge recorded by the system. Such points in time represent a repository state rather than the state of a single table or data set - as in the case of Open Table Formats.

Many such systems implement Git-like actions through metadata, helping teams avoid all the trouble that comes with duplicating data. Unlike Open Table Formats, where the format itself records every change, versions here are created explicitly by the user's commit and branch operations.
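Here's a toy sketch of that metadata trick (loosely inspired by how metadata-based tools such as lakeFS describe themselves, but not any tool's actual design): a commit is a snapshot of pointers to immutable objects covering the whole repository, so branching copies a pointer, never the data.

```python
import hashlib

objects = {}    # content-addressed store of immutable data files

def put(data: bytes) -> str:
    """Store a file and return its content address."""
    digest = hashlib.sha256(data).hexdigest()[:8]
    objects[digest] = data
    return digest

commits = []    # each commit: {repo path -> object id}, spanning the whole repo
branches = {}   # branch name -> index into commits

def commit(tree: dict, branch: str):
    commits.append(dict(tree))
    branches[branch] = len(commits) - 1

# A commit captures every data set in the repository, whatever its format.
v1 = put(b"users-v1")
commit({"tables/users.parquet": v1, "images/logo.png": put(b"png-bytes")}, "main")

# Branching is a metadata operation: copy the pointer, not the data.
branches["dev"] = branches["main"]

# A change on dev produces a new commit; main still points at the old object.
dev_tree = dict(commits[branches["dev"]])
dev_tree["tables/users.parquet"] = put(b"users-v2")
commit(dev_tree, "dev")

print(commits[branches["main"]]["tables/users.parquet"] == v1)   # -> True
```

Notice that the repository tree happily mixes a Parquet table with an image file: the versioning layer doesn't care about the format of what it points at.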

[Diagram: Git-like actions on a database]

Conclusion

Data version control tools operate on a data repository containing data of any format. Compare that to the single table a given Open Table Format operates on.

They also provide Git-like operations and avoid data duplication, allowing teams to implement engineering best practices across the entire data pipeline: development, testing, staging, and production.

Can you combine Open Table Formats and data version control?

OTFs provide plenty of features besides "time travel" (or versioning), which data versioning solutions don't provide at all: per-table transactional consistency, optimized query performance, scalability, and parallelism.

OTFs automatically track changes, so there's no need to manually commit them, making the process transparent (at the level of a single table). 

Data version control systems, on the other hand, can provide the service for data in any format (including unstructured data like videos, images, etc.). 

Actually, they often support managing data set repositories saved in Open Table Formats. 

Since OTFs track changes at the level of a single table, a data version control system can take advantage of this to deliver detailed diff operations and smart merge capabilities.
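For instance, a toy row-level diff (illustrative only, not any tool's actual diff algorithm) shows what table-aware change tracking buys you compared to a plain "file changed" message:

```python
def table_diff(old: dict, new: dict) -> dict:
    """Row-level diff between two versions of a table whose rows are keyed by id."""
    return {
        "added":   {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "changed": {k: (old[k], new[k])
                    for k in old.keys() & new.keys() if old[k] != new[k]},
    }

prod = {1: "a", 2: "b"}
dev  = {1: "a2", 3: "c"}
print(table_diff(prod, dev))
# -> {'added': {3: 'c'}, 'removed': {2: 'b'}, 'changed': {1: ('a', 'a2')}}
```

A merge can then apply exactly the `added`/`removed`/`changed` sets instead of overwriting whole files, which is what makes "smart merge" possible.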

As I showed above, data version control is used operationally for different use cases (for example, CI/CD pipelines for data, similar to the CI/CD pipelines we run for code).

In conclusion, these two technologies address different use cases, and when applicable, combining them can be the most powerful option. For example, you can optimize your ETL with an OTF while developing and testing it against a zero-copy clone of your production data using a data version control system.
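To sketch the combination end to end (a self-contained toy with assumed names, not lakeFS's or Delta Lake's real on-disk formats): the repository versions a table's change-log file, so a repo branch is a zero-copy clone of the table that an ETL job can safely write to.

```python
import json

# "Production" branch: the repo maps paths to file contents; one file is a
# table's ordered change log (the OTF side of the picture).
repo = {"main": {"tables/events.log": json.dumps([["insert", 1, "a"]])}}

# Zero-copy clone for development: branching copies references, not data.
repo["dev"] = dict(repo["main"])

# The ETL under test appends a change to the table log on dev only.
dev_log = json.loads(repo["dev"]["tables/events.log"])
dev_log.append(["update", 1, "a2"])
repo["dev"]["tables/events.log"] = json.dumps(dev_log)

def table_state(branch):
    """Replay the table's change log as stored on the given repo branch."""
    state = {}
    for op, key, value in json.loads(repo[branch]["tables/events.log"]):
        if op == "delete":
            state.pop(key, None)
        else:                      # insert or update
            state[key] = value
    return state

print(table_state("main"))   # -> {1: 'a'}   production is untouched
print(table_state("dev"))    # -> {1: 'a2'}  the change is tested in isolation
```

Once the dev branch passes its quality checks, promoting it is again a metadata operation: point `main` at the new log, exactly the data-CI/CD flow described above.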
