DEV Community

bronifty

Data Is a Wild West

Data is a wild west, even more so than the web. Different engines from different vendors do different things, and even very sophisticated practitioners, people who do nothing but data all day, have difficulty categorizing them with abstractions that would make sense to a layperson.

Iceberg is metadata; Parquet is data. An Iceberg table is a set of metadata files that point to the physical data files (typically Parquet) in object storage. A catalog like Arctic organizes that metadata, and the metadata files give a query engine what it needs to plan a query: pointers to the files, and to the locations within those files, where the data lives.
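The relationship between catalog, metadata, and data files can be pictured with a toy model. This is only a sketch in plain Python, not the Iceberg or Arctic API: the class names, the `files_for_id` helper, the table name, and the `s3://example-bucket/...` paths are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataFile:
    """Hypothetical metadata entry describing one Parquet file."""
    path: str          # where the physical file lives in object storage
    record_count: int
    min_id: int        # simple column statistics for the "id" column
    max_id: int


@dataclass
class TableMetadata:
    """Hypothetical stand-in for a table's metadata: schema plus file pointers."""
    schema: dict
    data_files: list = field(default_factory=list)

    def files_for_id(self, wanted_id: int) -> list:
        # Use the per-file statistics to skip files that cannot hold the row.
        return [
            f.path
            for f in self.data_files
            if f.min_id <= wanted_id <= f.max_id
        ]


# The "catalog" here is just a mapping from table name to its metadata.
catalog = {
    "sales.orders": TableMetadata(
        schema={"id": "long", "amount": "double"},
        data_files=[
            DataFile("s3://example-bucket/orders/part-0.parquet", 100, 1, 100),
            DataFile("s3://example-bucket/orders/part-1.parquet", 100, 101, 200),
        ],
    )
}

# Look up the table in the catalog, then ask the metadata which files to read.
paths = catalog["sales.orders"].files_for_id(150)
print(paths)
```

The point of the sketch is that the lookup never opens a data file: the metadata alone narrows the search to the one file that can contain the row.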

Alex Merced shows us how to convert a CSV file into an Iceberg table. It is the same process we use in query tools like MicroStrategy to map a semantic layer to a physical layer, or a 'logical table' to a physical warehouse schema. In programming terms, we are creating a struct or tuple type and assigning it to the data, which helps categorize it for queries at each stage: the HTML form input, the API format over the wire, and ultimately the data column in the database.
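To make the "struct or tuple type" idea concrete, here is a minimal sketch using only the Python standard library. It is not the conversion Alex Merced demonstrates; the `Order` type, its columns, and the sample rows are assumptions chosen for illustration.

```python
import csv
import io
from typing import NamedTuple


class Order(NamedTuple):
    """Hypothetical schema: the 'logical table' we assign to raw CSV rows."""
    id: int
    customer: str
    amount: float


# Sample data invented for this example. Raw CSV carries no types at all.
RAW_CSV = "id,customer,amount\n1,alice,19.99\n2,bob,5.00\n"


def load_orders(text: str) -> list:
    """Parse CSV text and cast every row to the Order schema."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Order(id=int(row["id"]), customer=row["customer"], amount=float(row["amount"]))
        for row in reader
    ]


orders = load_orders(RAW_CSV)
print(orders[0])
```

Once the rows carry a type, the same definition can describe the data at every layer, which is roughly what a table schema does for the files underneath it.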

Here is a reference to Dremio Arctic, a catalog for the metadata. It provides organization and what the modern data stack presentation calls 'awareness'. Coupled with the Iceberg metadata files that map to Parquet, it brings version control to object storage, allowing in-place transformation and time-travel debugging.
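Version control and time travel over object storage can be pictured as a list of immutable snapshots, each recording which files made up the table at that moment. The following is a toy sketch of that idea, not how Arctic or Iceberg implement it; `VersionedTable`, `commit`, and `files_as_of` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical immutable record of the table's files at one commit."""
    snapshot_id: int
    files: tuple


@dataclass
class VersionedTable:
    snapshots: list = field(default_factory=list)

    def commit(self, files: list) -> int:
        # A commit never rewrites history; it appends a new snapshot.
        snapshot_id = len(self.snapshots) + 1
        self.snapshots.append(Snapshot(snapshot_id, tuple(files)))
        return snapshot_id

    def files_as_of(self, snapshot_id: Optional[int] = None) -> tuple:
        # No id means "current state"; an id means "time travel".
        if not self.snapshots:
            return ()
        if snapshot_id is None:
            return self.snapshots[-1].files
        for snapshot in self.snapshots:
            if snapshot.snapshot_id == snapshot_id:
                return snapshot.files
        raise KeyError(f"unknown snapshot {snapshot_id}")


table = VersionedTable()
v1 = table.commit(["part-0.parquet"])
v2 = table.commit(["part-0.parquet", "part-1.parquet"])

print(table.files_as_of(v1))  # the table as it was after the first commit
print(table.files_as_of())    # the table as it is now
```

Because old snapshots are kept rather than overwritten, debugging a bad transformation becomes a matter of reading the table as of an earlier snapshot.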

As a personal anecdote, I have installed Dremio both on AWS and in Docker (I have yet to try it in Kubernetes on the Okteto free tier). I do not know how to control my AWS bill when I fire up the cluster required to run this machinery. And the Docker instance is quite limited in terms of features, especially when it comes to integrations. But it is great for a demo.
