A data warehouse is like a water producer that hands you bottled water in a particular size and shape of bottle. A data lake, in contrast, is a place into which many streams flow, and it is up to each consumer to take the water in whatever form they want.
The data lake is the new data warehouse. It shares the data warehouse's goal of supporting business insights beyond day-to-day transaction handling. The main factors driving this evolution are the following:
We are collecting more and more data in diverse forms (text, XML, JSON, audio, voice, sensor data...), so we need a better way to process it.
It is possible to transform data during the ETL process, but committing to a particular transformation can cost us the flexibility we need later for analysis. This applies to deeply nested JSON structures, where we do not want to distill out only a few elements, and to text or PDF documents that must be stored as blobs but are useless until processed to extract metrics.
HDFS (Hadoop Distributed File System) made it possible to store petabytes of data on commodity hardware, at a lower cost per TB than an MPP (Massively Parallel Processing) database such as Redshift.
Thanks to new processing tools like MapReduce or Spark, we can process data at scale on the same hardware used for storage.
Schema-on-read makes it possible to run analytics without first inserting the data into a predefined schema, and to process unstructured text.
As data is treated as the new oil, people want to extract the most value from it. A data scientist often needs to represent data sets and join them with external sources. For this, the clean, consistent, and performant model a data warehouse provides for business users does not work. The data lake needs to cope with the agile, ad-hoc nature of data exploration. Machine learning and natural language processing also need to access the data in a different form than, e.g., a star schema provides.
- ETL offloading: the same hardware serves storage and processing, one big data cluster. There is no longer a need for a special ETL grid or additional storage for a staging area.
- Dimensional modelling with conformed dimensions or data marts for high/known-value data
- Low cost per TB makes it possible to store low/unknown value data for analytics
With the big data tools in the Hadoop ecosystem, such as Spark, working with a file is as easy as working with a database, but without creating a database and inserting into it. This is called schema-on-read: the schema of a table is either inferred or specified, and instead of the data being inserted up front, it is checked against that schema at read time.
dfExample = spark.read.csv("data/example.csv", inferSchema=True, header=True, sep=";")
The schema is inferred, but we want to make sure each type is set correctly. For example, a date field should not end up as a string.
To better control types and malformed data, we can specify a schema (a StructType) to make sure everything is correct. It is still schema-on-read, though.
We can also specify what should happen to a row that does not conform to our schema: drop it (DROPMALFORMED), replace the offending fields with null (PERMISSIVE, the default), or fail (FAILFAST).
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

exampleSchema = StructType([
    StructField("id", IntegerType()),
    StructField("amount", IntegerType()),
    StructField("info", StringType()),
    StructField("date", DateType())
])
dfExample = spark.read.csv("data/example.csv", schema=exampleSchema, sep=";", mode="DROPMALFORMED")
With that, we can query directly on the fly, without any database insertions.
from pyspark.sql.functions import desc

dfExample.groupBy("info")\
    .sum("amount")\
    .orderBy(desc("sum(amount)"))\
    .show(3)
We can also write SQL by creating a temporary view. Nothing is written to a database here.
dfExample.createOrReplaceTempView("example")
spark.sql("""
    SELECT info, sum(amount) AS total_amount
    FROM example
    GROUP BY info
    ORDER BY total_amount DESC
""").show(3)
Spark can read and write files in
- text-based formats
- binary formats like Avro (saves space) and Parquet (columnar storage), and
- compressed formats like gzip and snappy
dfText = spark.read.text("text.gz")
dfSample = spark.read.csv("sample.csv")
Spark can read and write data on a variety of file systems (local, HDFS, S3...) and in a variety of databases (SQL, MongoDB, Cassandra, Neo4j...).
Everything is exposed through a single abstraction - the dataframe - and can be processed with SQL.
- A data lake can easily degenerate into a chaotic data garbage dump.
- Data governance is hard to implement, because a data lake mixes cross-department data and external data.
- Sometimes it is unclear in which cases a data lake should replace, offload, or run in parallel with a data warehouse or data marts. In all cases, dimensional modelling remains a valuable practice.
- All types of data are welcome.
- Data is stored "as-is"; transformations are done later: Extract-Load-Transform (ELT) instead of ETL.
- Data is processed with schema-on-read. There is no predefined star schema before the transformation.
- Massive parallelism and scalability come out of the box with all big data processing tools. We can use columnar storage (Parquet) without expensive MPP databases.
| | Data Warehouse | Data Lake |
|---|---|---|
| Data form | Tabular format | All formats |
| Data value | High only | High-value, medium-value and to-be-discovered |
| Data model | Star and snowflake schemas with conformed dimensions or data marts and OLAP cubes | All representations are possible |
| Schema | Schema-on-write (known before ingestion) | Schema-on-read (on the fly at the time of the analysis) |
| Technology | MPP databases, expensive with disks and connectivity | Commodity hardware with parallelism as a first principle |
| Data quality | High, with effort for consistency and clear rules for accessibility | Mixed; everything is possible |
| Users | Business analysts | Data scientists, business analysts & ML engineers |
| Analytics | Reports and business intelligence visualisations | Machine learning, graph analytics and data exploration |