If you've been following Apache Iceberg™ at all, you've no doubt heard whispers about "the small file problem". So what is it? And why does it matter when building the data lakehouse of your dreams?
You've come to the right place! Let's dive in!
Small Files, Big Problem
To start, the small file problem is exactly what it sounds like on the surface. We have some dataset. In the case of Iceberg, our dataset is a bunch of data files bound together through metadata as a single Iceberg table. The issue arises when that dataset is made up of many smaller files rather than fewer, bigger ones.
Having more small files might not sound like a big deal, but it actually has quite a few implications for Iceberg and can negatively impact performance, scalability, and efficiency in a number of ways:
🗄️ High Metadata Overhead: As we know already, an Iceberg table IS its metadata. So in Iceberg, we're constantly tracking every file in metadata for each table version. More small files mean larger metadata files and, in turn, a higher cost of maintaining table snapshots.
🐢 Inefficient Query Planning and Execution: When it comes time to interact with our data, query engines like Apache Spark, Trino, or Snowflake need to read those many small files, which results in higher I/O overhead, slower data scanning, and reduced parallelism.
💰 Costs of Object Storage Operations: We've all experienced the frustration of unexpected cloud bills! In cloud object stores like S3 or GCS, frequent API calls for listing or retrieving many small files incur significant latency and cost.
🔊 Write Amplification: If you're unfamiliar, write amplification just means that more data is written, touched, or modified than originally intended. So, for Iceberg, many small writes will eventually generate unnecessary work for compaction and cleanup processes down the line.
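Want to see whether your own table is suffering from this? Iceberg's `files` metadata table is a handy place to look. Here's a minimal sketch, assuming a PySpark session already wired up to an Iceberg catalog; the catalog name `my_catalog` and table `db.events` are placeholders for your own names:

```python
from pyspark.sql import SparkSession

# Assumes an existing Spark session configured with an Iceberg catalog named "my_catalog".
spark = SparkSession.builder.getOrCreate()

# Iceberg's "files" metadata table lists every data file in the current snapshot,
# including its size, so we can quickly gauge how small our files really are.
spark.sql("""
    SELECT count(*)                              AS file_count,
           avg(file_size_in_bytes) / 1024 / 1024 AS avg_file_size_mb
    FROM my_catalog.db.events.files
""").show()
```

If that average comes back in the single-digit megabytes across thousands of files, you're looking at the small file problem.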
Now that you know a bit more about it, you can see why the small file problem really is a problem. But what can we do about it? 🤷‍♀️
Taking Action
The good news is that the broader Iceberg community isn't just sitting on this issue. You just have to know what's out there and how to take advantage of it!
🤖 The biggest fix is to eliminate existing small files through compaction and snapshot expiration. Iceberg already has compaction built into Spark through the rewriteDataFiles action (exposed in Spark SQL as the rewrite_data_files procedure). The v2 Apache Flink Sink that was released as part of Apache Iceberg 1.7 includes support for small-file compaction, as well!
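As a rough sketch of what that looks like from Spark, again using the placeholder catalog `my_catalog` and table `db.events` (the target size option and the expiration timestamp are just example values):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Rewrite (compact) small data files into larger ones; target size here is 512 MB.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912')
    )
""")

# Expire older snapshots so the data files they reference can eventually be cleaned up.
spark.sql("""
    CALL my_catalog.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00'
    )
""")
```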
⚙️ Check your configs! You can set the target file size during writes in Iceberg with the `write.target-file-size-bytes` table property.
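For example, here's one way to set that property on an existing table (the table name is a placeholder, and the value shown is 512 MB expressed in bytes):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Set the target size for newly written data files to 512 MB (the value is in bytes).
spark.sql("""
    ALTER TABLE my_catalog.db.events
    SET TBLPROPERTIES ('write.target-file-size-bytes' = '536870912')
""")
```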
🔀 Leveraging the Merge-on-Read (MoR) paradigm, also controlled by a few Iceberg configurations, helps avoid write amplification and gets around some of the headaches of small files without needing frequent compaction.
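Here's a sketch of what flipping those configurations might look like, again on the placeholder table from above (merge-on-read requires Iceberg format version 2):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Switch row-level deletes, updates, and merges to merge-on-read so changes are
# written as delete files instead of rewriting whole data files on every operation.
spark.sql("""
    ALTER TABLE my_catalog.db.events SET TBLPROPERTIES (
        'format-version'    = '2',
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")
```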
📋 Query engines are also stepping up with smarter query planning that tolerates the existence of small files while optimizing how the data is accessed.
Conclusion
That was kind of a big post on what's otherwise a... small problem 😂. But now you have a better idea of what the small file problem is, why it matters for folks building out a data lakehouse with Apache Iceberg, and what your options are for tackling it.
If you're interested in more Apache Iceberg content, like, follow, and find me across social media.