This post was originally posted on my personal blog.
In this post we will explore the implementation difficulties of a data lake and how it fits into the organization, topics that other articles tend to overlook.
Data lakes: it’s all in the name. Not drops but lakes. A data lake usually holds millions to billions of structured and unstructured records (images, emails, JSON documents). It is seen as the source of truth, or the "master dataset", and has many benefits.
One of these benefits is schema on read, which boils down to storing the data as-is and only building a schema when you run analytics and read from it. You don’t need to know the schema up front and won’t be constrained by it when writing. This gives you a lot of flexibility to ask questions later and draw meaningful business value from the heaps of data that were collected. AWS describes most of the benefits in this post.
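To make schema on read concrete, here is a minimal sketch in plain Python. The sample events, the field names and the `read_with_schema` helper are all made up for illustration; in practice a query engine such as Athena plays this role.

```python
import json

# Raw events as they might land in the lake: nothing enforces a shared schema on write.
# (Hypothetical sample data.)
RAW_LINES = [
    '{"user_id": "u1", "page": "/home", "duration_ms": "1200"}',
    '{"user_id": "u2", "page": "/pricing"}',
    '{"user_id": "u3", "page": "/home", "duration_ms": 300, "referrer": "ad"}',
]

# The schema is chosen at read time, for one particular question only.
READ_SCHEMA = {"user_id": str, "duration_ms": int}


def read_with_schema(lines, schema):
    """Project each raw JSON record onto `schema`, casting present values
    and filling missing fields with None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
```

The raw lines stay untouched, so a different question next month can use a different `READ_SCHEMA` over the same data.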
Warning: This post is biased toward building data lakes on AWS and using serverless where possible.
Designing a data lake from scratch can be a daunting task. It looks easy on paper: just dump all the data into an S3 bucket under different keys and then use Athena to run SQL-like queries on it. Athena uses Presto for the SQL-like queries and Apache Hive for the schema.
Wait, now we are talking about schemas. This is where the line between data lakes and data warehouses gets blurry. Data warehouses are more structured because they are schema on write: they require you to know the relationships between data, the indexes and the query patterns for optimal usage.
So in order to run SQL queries on the data, it must be stored in some structured pattern. You can’t just dump a click-stream record, for example, with 10 columns in one row and 3 columns in the next. Data needs to be semi-consistent for that part of the lake, or else you will end up with a data swamp.
You don’t want to find yourself with millions of records, running a SQL query that selects 10 columns, only to have a single row with 3 columns prevent the query from running. Good luck finding that needle.
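One way to avoid hunting for that needle is to check records before they are written to the queryable part of the lake. The following is only a sketch under assumptions: the column names and the idea of a separate "quarantine" location are hypothetical, not something prescribed by AWS.

```python
# Hypothetical columns that every click-stream record in this part of the lake must have.
EXPECTED_COLUMNS = {"user_id", "page", "timestamp"}


def split_by_schema(records, expected=EXPECTED_COLUMNS):
    """Return (accepted, quarantined).

    `accepted` holds the records that contain every expected column;
    `quarantined` holds the rest, each annotated with the columns it lacks.
    """
    accepted, quarantined = [], []
    for record in records:
        missing = expected - set(record)
        if missing:
            quarantined.append({"record": record, "missing": sorted(missing)})
        else:
            accepted.append(record)
    return accepted, quarantined


records = [
    {"user_id": "u1", "page": "/home", "timestamp": "2020-01-05T10:00:00Z"},
    {"user_id": "u2"},  # the "3 columns instead of 10" kind of row
]
accepted, quarantined = split_by_schema(records)
```

Accepted records would go to the prefix that SQL queries read from, while quarantined ones could be kept elsewhere in raw form, so nothing is lost and the bad rows are already found.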
This is also where the distinction between data lakes and data warehouses is made. In a data lake you can store raw data independent of schema, then normalize it by processing it and loading it (ETL) into the data warehouse for further analysis.
Then there is tenancy. There might be several reasons for grouping data in the lake by tenant, one of them being data retention and deletion. Try explaining in court that you cannot delete a single tenant’s data after you have been found in violation of GDPR.
Also consider that you pay for what you store. If a client leaves, it might be beneficial to keep their data for a period of time, but after that the data loses its value. For certain companies, data loses its value almost exponentially.
Consider streaming data. For example, when you buy something, your bank records and analyzes the transaction. This data is most useful within the first few seconds, maybe minutes, for something like fraud detection. It loses value if you only get an SMS warning you about possible fraud the next day; your bank account could have been emptied by then.
So the point is that not all data is born equally important: some retains its value while other data does not. In a data lake you usually store almost everything for future analysis, but this gets expensive and you might be storing data that has no value. So having methods to delete parts of the data lake is important and must be kept in mind during the design process.
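One design choice that keeps deletion possible is putting the tenant into the object key, so that all of a tenant's objects share a prefix. Below is a minimal sketch of such a layout using Hive-style `name=value` partitions; the dataset name, the partition names and both helper functions are assumptions made for illustration.

```python
from datetime import date


def object_key(tenant_id, dataset, day, filename):
    """Build a Hive-style partitioned key with the tenant as the first partition."""
    return (
        f"{dataset}/tenant_id={tenant_id}/"
        f"year={day.year}/month={day.month:02d}/day={day.day:02d}/{filename}"
    )


def tenant_prefix(tenant_id, dataset):
    """Prefix covering every object of one tenant within one dataset.

    The trailing slash matters: without it, the prefix for tenant "acme"
    would also match the keys of tenant "acme2".
    """
    return f"{dataset}/tenant_id={tenant_id}/"


key = object_key("acme", "clicks", date(2020, 1, 5), "part-0001.json")
prefix = tenant_prefix("acme", "clicks")
```

With this layout, removing a tenant means listing and deleting everything under one prefix per dataset. Placing the tenant before the dataset would give a single prefix per tenant instead, at the cost of spreading each dataset across many locations.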
Your data lake also needs to be able to distinguish between hot and cold data. This is how you identify data that has lost importance and should be archived to optimize storage cost. There is no reason to keep years of user clicks in hot storage. As soon as your data retention policies are met, this data can be archived and stored for a further X amount of time before finally being deleted.
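On S3, this hot-to-cold-to-deleted flow can be expressed as a lifecycle rule. The sketch below only builds the configuration as a Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts; the prefix, the day counts and the `lifecycle_rule` helper are placeholder assumptions, and nothing is sent to AWS.

```python
def lifecycle_rule(prefix, archive_after_days, delete_after_days):
    """Build one S3 lifecycle rule: transition objects under `prefix` to
    Glacier after `archive_after_days`, then expire them after `delete_after_days`."""
    if delete_after_days <= archive_after_days:
        raise ValueError("delete_after_days must be greater than archive_after_days")
    return {
        "ID": f"archive-then-expire-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": delete_after_days},
    }


# Hypothetical policy: clicks stay hot for 90 days, are archived for a further
# 365 days, and are deleted once they are 455 days old.
config = {"Rules": [lifecycle_rule("clicks/", 90, 455)]}
```

Applying it would be a single call along the lines of `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)` on a boto3 S3 client. Because the rule is prefix-based, the key layout decides how finely you can archive.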
When handling personally identifiable information (name, mobile number, address, etc.) it gets even harder. The stored data needs to be fully compliant with the vast number of regulations out there. It is not enough to encrypt data at rest if anyone can still view it (think S3 bucket leaks). Personally identifiable information that is kept in long-term storage needs to be anonymized, or at least pseudonymized (partially anonymized), to be compliant. So you need a process in place to do this before storing the data.
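As an illustration of such a process, the sketch below replaces PII fields with keyed hashes (HMAC-SHA256) before a record is stored. This is one possible pseudonymization technique, not a compliance recipe; the field names are hypothetical and the secret key is assumed to live outside the lake, for example in a secrets manager.

```python
import hashlib
import hmac

# Hypothetical names of the fields that contain PII.
PII_FIELDS = ("name", "mobile_number", "address")


def pseudonymize(record, secret_key, pii_fields=PII_FIELDS):
    """Return a copy of `record` with PII fields replaced by HMAC-SHA256 digests.

    The same input and key always give the same digest, so records can still
    be joined or counted per person without storing the original value.
    """
    result = dict(record)
    for field in pii_fields:
        if result.get(field) is not None:
            digest = hmac.new(secret_key, str(result[field]).encode("utf-8"), hashlib.sha256)
            result[field] = digest.hexdigest()
    return result


# The key below is a placeholder; a real one must be kept secret and out of the lake.
safe = pseudonymize({"name": "Alice", "order_total": 10}, b"placeholder-secret-key")
```

Anyone holding the key can recompute the digests, so this is pseudonymization rather than full anonymization, and the key needs to be protected as carefully as the original data.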
Many posts talk about data lakes and the benefits thereof, but not many talk about the implementation difficulties and how a data lake fits into the organization. Here are some key things that make designing data lakes difficult:
- Tenancy. Many organizations are multi-tenant, meaning one application, one DB, one data lake and many clients. Isolating a single tenant’s data for, say, updating or deleting becomes difficult at scale.
- Consistency. Semi-structured data must be written consistently and with the retrieval patterns in mind, or else you will end up with a data swamp.
- Archiving. This overlaps with the tenancy problem: data must be stored in such a way that it can be moved between hot and cold storage.
- Privacy, regulations and compliance. Depending on the data you are storing, part of the lake might need to be anonymized.