A data warehouse is like a water producer that hands you bottled water in a particular size and shape of bottle. A data lake, in contrast, is a place into which many streams flow, and it is up to each consumer to take the water in whatever form they want.
The data lake is the new data warehouse. It shares the data warehouse's goal of supporting business insights beyond day-to-day transaction handling. The main factors driving this evolution are the following:
We are collecting more and more data in diverse forms (text, XML, JSON, audio, voice, sensor data...), so we need a better way to process it.
It is possible to transform data during the ETL process, but committing to a particular transformation can cost us the flexibility we need later for analysis. This applies to deeply nested JSON structures, where we do not want to distill out only a few elements, and to text or PDF documents that must be stored as blobs but are useless until processed to extract metrics.
HDFS (Hadoop Distributed File System) made it possible to store petabytes of data on commodity hardware, at a lower cost per TB than an MPP (Massively Parallel Processing) database such as Redshift.
Thanks to new processing tools like MapReduce or Spark, we can process data at scale on the same hardware used for storage.
Schema-on-read makes it possible to run analytics without first inserting the data into a predefined schema, and to process unstructured text.
As data is treated as the new oil, people want to extract the most value from it. A data scientist often needs to represent data sets and join them with external sources. For this, the clean, consistent, and performant model a data warehouse provides for business users does not work. The data lake needs to cope with the agile, ad-hoc nature of data exploration. Machine learning and natural language processing also need to access the data in a different form than, e.g., a star schema provides.
- ETL offloading: the same hardware serves storage and processing, one big data cluster. There is no longer a need for a special ETL grid or additional storage for a staging area.
- Dimensional modelling with conformed dimensions or data marts for high/known-value data
- Low cost per TB makes it possible to store low/unknown value data for analytics
With the big data tools in the Hadoop ecosystem, such as Spark, working with a file is as easy as working with a database, but without creating a database and inserting into it. This is called schema-on-read: the schema of a table is either inferred or specified, and instead of the data being inserted up front, it is checked against that schema at read time.
dfExample = spark.read.csv("data/example.csv", inferSchema=True, header=True, sep=";")
The schema is inferred, but we want to make sure each type is set correctly. For example, a date field should not end up as a string.
To better control types and malformed data, we can specify a schema (a StructType) to make sure everything is correct. It is still schema-on-read, though.
We can also specify what should happen to a row that does not conform to our schema: drop it (DROPMALFORMED), replace the offending fields with null (PERMISSIVE, the default), or fail (FAILFAST).
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DateType

exampleSchema = StructType([
    StructField("id", IntegerType()),
    StructField("amount", IntegerType()),
    StructField("info", StringType()),
    StructField("date", DateType())
])
dfExample = spark.read.csv("data/example.csv", schema=exampleSchema, sep=";", mode="DROPMALFORMED")
With that, we can query directly on the fly, without any database insertions.
from pyspark.sql.functions import desc

dfExample.groupBy("info")\
    .sum("amount")\
    .orderBy(desc("sum(amount)"))\
    .show(3)
We can also write SQL by creating a temporary view. Nothing is written to a database here.
dfExample.createOrReplaceTempView("example")
spark.sql("""
    SELECT info, sum(amount) AS total_amount
    FROM example
    GROUP BY info
    ORDER BY total_amount DESC
""").show(3)
Spark can read and write files in
- text-based formats
- binary formats like Avro (saves space) and Parquet (columnar storage), and
- compressed formats like gzip and snappy
dfText = spark.read.text("text.gz")
dfSample = spark.read.csv("sample.csv")
Spark can read and write data on a variety of file systems (local, HDFS, S3...) and in a variety of databases (SQL, MongoDB, Cassandra, Neo4j...).
Everything is exposed through a single abstraction - the dataframe - and can be processed with SQL.
- A data lake can easily degenerate into a chaotic data garbage dump.
- Data governance is hard to implement, because a data lake mixes cross-department data and external data.
- Sometimes it is unclear in which cases a data lake should replace, offload, or run in parallel with a data warehouse or data marts. In all cases, dimensional modelling remains a valuable practice.
- All types of data are welcome.
- Data is stored "as-is"; transformations are done later: Extract-Load-Transform (ELT) instead of ETL.
- Data is processed with schema-on-read. There is no predefined star schema before the transformation.
- Massive parallelism and scalability come out of the box with all big data processing tools. We can use columnar storage (Parquet) without expensive MPP databases.
| | Data Warehouse | Data Lake |
|---|---|---|
| Data form | Tabular format | All formats |
| Data value | High only | High-value, medium-value and to-be-discovered |
| Data model | Star and snowflake schemas with conformed dimensions or data marts and OLAP cubes | All representations are possible |
| Schema | Schema-on-write (known before ingestion) | Schema-on-read (on the fly at the time of the analysis) |
| Technology | MPP databases, expensive with disks and connectivity | Commodity hardware with parallelism as a first principle |
| Data quality | High, with effort for consistency and clear rules for accessibility | Mixed; everything is possible |
| Users | Business analysts | Data scientists, business analysts & ML engineers |
| Analytics | Reports and business intelligence visualisations | Machine learning, graph analytics and data exploration |