Three paradigm shifts when working with a Data Lake
There are several key conceptual differences between working with databases and Data Lakes.
In this post, let’s look at some of these differences, which may not be intuitive at first glance, especially for people with a strong relational database background.
The server is disposable. The data is in the Cloud.
Decoupled storage and compute: this is a classic theme when talking about Data Lakes.
In traditional database systems (and early Hadoop-based Data Lakes), storage is tightly coupled with the compute servers: the servers either have the storage built in or are directly attached to it.
In modern cloud-based Data Lake architectures, storage and compute are independent. Data is held in cloud object storage (e.g., AWS S3, Azure Storage), usually in an open format like Parquet, while compute servers are stateless and can be started or shut down whenever necessary, as sketched in the example after the list below.
Decoupling storage and compute enables:
- Lower compute costs: Servers run only when needed and can be shut down when idle.
- Scalability: You don’t have to acquire hardware sized for peak usage. The number of servers, CPUs, and memory can be scaled up or down dynamically according to current load.
- Sandboxing: The same data can be read simultaneously by multiple compute servers/clusters, so multiple teams can work in parallel on separate clusters, reading the same data without affecting each other.
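As a minimal sketch of this pattern, the PySpark snippet below spins up an ephemeral session, reads Parquet directly from object storage, writes a result back, and shuts down. The bucket, paths, and column names are hypothetical, and it assumes the cluster is already configured with an S3-compatible connector.

```python
from pyspark.sql import SparkSession

# Ephemeral compute: the session (and the cluster behind it) holds no data of
# its own; everything it reads and writes lives in object storage.
spark = SparkSession.builder.appName("ad-hoc-analysis").getOrCreate()

# Read Parquet straight from the Data Lake (hypothetical bucket/path).
events = spark.read.parquet("s3a://my-datalake/raw/events/")

# Do the work and persist the result back to storage.
daily_counts = events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("s3a://my-datalake/curated/daily_counts/")

# The cluster can now be shut down; the data remains available to any other
# cluster that needs it.
spark.stop()
```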
RAW data is king! Curated data is just derived.
In the database paradigm, once data from the source systems has been transformed and loaded into database tables, the original raw extracts are no longer considered useful. In the Data Lake paradigm, RAW data is kept as the source of truth, potentially forever, because it is the real asset.
RAW data, however, is typically unsuitable for consumption by business users, so it goes through a curation process that improves its quality, provides structure, and eases consumption. The curated data is then stored to feed data science teams, data warehouses, reporting systems, and general consumption by business users.
Typical Data Lake consumers see only the curated data, so they tend to value it much more than the RAW data it was derived from.
However, the true asset of the Data Lake is the RAW data (along with the curation pipeline); in a sense, curated data is similar to a materialized view that can be refreshed at any time.
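To make the “materialized view” analogy concrete, here is a hedged sketch of a curation job in PySpark: curated data is a pure function of RAW, so the whole table can be rebuilt whenever the curation logic improves. The paths, column names, and quality rules are illustrative, not taken from any real pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("curate-orders").getOrCreate()

# RAW is the source of truth (illustrative path and schema).
raw_orders = spark.read.json("s3a://my-datalake/raw/orders/")

curated_orders = (
    raw_orders
    .filter(F.col("status").isNotNull())                # quality rule
    .withColumn("order_date", F.to_date("created_at"))  # add structure
    .select("order_id", "customer_id", "order_date", "amount")
)

# Overwriting is safe: like a materialized view, the curated table can be
# regenerated from RAW at any time, or forked into other curated views.
curated_orders.write.mode("overwrite").parquet("s3a://my-datalake/curated/orders/")
```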
Key takeaways
- Curated data can be recreated from RAW at any time.
- It can be recreated with an improved curation process.
- Multiple curated views can coexist, each tailored to a specific analysis.
Schema decisions taken today don’t constrain future requirements
Information requirements often change, and some piece of information not originally collected from the source/operational system needs to be analyzed.
In a typical database scenario, if the original RAW data isn’t stored, that historical information is lost forever.
In a Data Lake architecture, however, today’s decision not to load a field into the curated schema can be reversed later: all the detailed information is safely stored in the RAW area of the Data Lake, so the historical curated data can be recreated with the additional fields.
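Continuing the illustrative curation job sketched earlier, the snippet below shows what reversing such a decision could look like: a field that was never loaded into the curated schema (here a hypothetical shipping_country) is simply added to the selection and the full history is reprocessed from RAW. All names are assumptions made for the example.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("backfill-shipping-country").getOrCreate()

# The field was always captured in RAW, even though it was never curated.
raw_orders = spark.read.json("s3a://my-datalake/raw/orders/")

curated_orders = (
    raw_orders
    .withColumn("order_date", F.to_date("created_at"))
    .select("order_id", "customer_id", "order_date", "amount",
            "shipping_country")  # newly required field, present in RAW all along
)

# Reprocess the full history so the new column is populated for past data too.
curated_orders.write.mode("overwrite").parquet("s3a://my-datalake/curated/orders/")
```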
Key takeaways
- Don’t spend a lot of time trying to create a generic one-size-fits-all curated schema if you don’t need it right now.
- Create the curated schema iteratively: start with the fields you need right now.
- When additional fields are required, add them to the curation process and reprocess.
Final Thoughts
Data Lakes are not a replacement for databases; each tool has its sweet spots and its Achilles’ heel.
Using a Data Lake for OLTP is probably as bad an idea as using a database to store terabytes of unstructured data.
I hope this post shed some light on the key design differences between the two systems.