In the previous generation of data analytics, organizations relied on the data warehouse, and when they reached its limits they were unable to load all of their data. With the advent of the data lake, organizations gained the ability to store any amount of data and to use the best-fit analytics or big data system to analyze it. But data warehouses still provide the best way to solve complex SQL problems. Organizations found they still needed to use the data lake and the data warehouse together, so the Lake House approach combines the benefits of both.
The following image illustrates this Lake House approach in terms of customer data in the real world and the data movement required between all of the data analytics services and data stores: inside-out, outside-in, and around the perimeter.
From your data warehouse, you can query the data lake, and each of your personas can use the best-fit analytics engine without compromise. The Lake House approach allows you to eliminate data silos: you keep one copy of the data in your Amazon S3-based data lake and use Amazon Redshift or another service to query it. You can also combine it with data from your operational databases and apply machine learning and other analytics services.
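The idea of one SQL statement spanning the warehouse and the lake can be sketched without any AWS services. The snippet below is illustrative only: two in-memory SQLite databases stand in for the warehouse and the data lake, and the table names are invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")                   # stands in for the warehouse
con.execute("ATTACH DATABASE ':memory:' AS lake")   # stands in for the data lake

# Structured, modeled data lives in the "warehouse".
con.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [(1, 120.0), (2, 75.5), (1, 30.0)])

# Semi-structured event data lives in the "lake".
con.execute("CREATE TABLE lake.clicks (customer_id INTEGER, page TEXT)")
con.executemany("INSERT INTO lake.clicks VALUES (?, ?)",
                [(1, "home"), (1, "pricing"), (2, "docs")])

# One SQL statement joins warehouse and lake data without copying either.
rows = con.execute("""
    SELECT w.customer_id, w.spend, l.clicks
    FROM (SELECT customer_id, SUM(amount) AS spend
          FROM sales GROUP BY customer_id) AS w
    JOIN (SELECT customer_id, COUNT(*) AS clicks
          FROM lake.clicks GROUP BY customer_id) AS l
      ON l.customer_id = w.customer_id
    ORDER BY w.customer_id
""").fetchall()
print(rows)
```

In a real Lake House on AWS, Redshift Spectrum plays the role of the `ATTACH` here, exposing S3 data as external tables that standard SQL can join against warehouse tables.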
Organizations see their data continuing to grow; for some of them, doubling every year. If data doubles every year, 1 terabyte of data grows to roughly a petabyte in 10 years. The data warehouse approach alone will not scale to keep up with that pace, which is why you need the Lake House approach. Organizations are analyzing their data as soon as it lands; in fact, they want to run real-time analytics and analyze data without loading it first.
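The growth arithmetic behind that claim is simple to verify: ten doublings multiply the volume by 2^10 = 1,024, which is why 1 TB becomes about 1 PB.

```python
# Data doubling every year for 10 years: 1 TB -> ~1 PB.
tb = 1.0
for year in range(10):
    tb *= 2
print(tb)  # 1024.0 TB, i.e. roughly a petabyte
```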
Organizations also want to empower all their personas, not just data scientists or SQL experts but also ad hoc analysts, and to enable everyone to use the best-fit analytics engine without compromise. With the growth in analytics engines comes growth in open data formats; organizations want to use whichever format best suits each analytics service, while not being locked into a vendor's proprietary format. Organizations understand the downside of data silos and want a single source of truth and unified analytics. The data should be easy to secure across the data lake, and also while traveling from one analytics engine to another. Organizations want a unified security and governance approach so they can secure the data once and use it across analytics services.
You can organize this Lake House Architecture as a stack of five logical layers, where each layer is composed of multiple purpose-built components that address specific requirements.
We describe these five layers in this section, but let’s first talk about the sources that feed the Lake House Architecture.
The Lake House Architecture enables you to ingest and analyze data from a variety of sources. Many of these sources, such as line of business (LOB), ERP, and CRM applications, generate highly structured batches of data at fixed intervals. In addition to internal structured sources, you can receive data from modern sources such as web applications, mobile devices, sensors, video streams, and social media. These modern sources typically generate semi-structured and unstructured data, often as continuous streams.
The ingestion layer in the Lake House Architecture is responsible for ingesting data into the Lake House storage layer. It provides the ability to connect to internal and external data sources over a variety of protocols. It can ingest and deliver both batch and real-time streaming data into the data warehouse and data lake components of the Lake House storage layer.
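A batch ingestion step can be sketched as landing records in a raw zone of the storage layer, partitioned by source and ingestion date. This is a minimal stdlib sketch; the zone name, partitioning convention, and `ingest_batch` helper are all illustrative assumptions, not a specific service's API.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def ingest_batch(records, root, source, ts=None):
    """Land a batch of records in the raw zone, partitioned by
    source system and ingestion date (illustrative convention)."""
    ts = ts or datetime.now(timezone.utc)
    path = Path(root) / "raw" / source / ts.strftime("dt=%Y-%m-%d")
    path.mkdir(parents=True, exist_ok=True)
    out = path / f"batch-{ts.strftime('%H%M%S')}.json"
    # One JSON document per line, a common lake-friendly layout.
    out.write_text("\n".join(json.dumps(r) for r in records))
    return out

root = tempfile.mkdtemp()                     # stands in for S3 bucket root
f = ingest_batch([{"order_id": 1}, {"order_id": 2}], root, "erp")
print(f)                                      # .../raw/erp/dt=YYYY-MM-DD/batch-HHMMSS.json
```

A streaming ingestion component would follow the same contract but append continuously instead of writing fixed-interval batches.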
The data storage layer of the Lake House Architecture is responsible for providing durable, scalable, and cost-effective components to store and manage vast quantities of data. In a Lake House Architecture, the data warehouse and data lake natively integrate to provide an integrated cost-effective storage layer that supports unstructured as well as highly structured and modeled data. The storage layer can store data in different states of consumption readiness, including raw, trusted-conformed, enriched, and modeled.
The catalog layer is responsible for storing business and technical metadata about datasets hosted in the Lake House storage layer. In a Lake House Architecture, the catalog is shared by both the data lake and data warehouse, and enables writing queries that incorporate data stored in the data lake as well as the data warehouse in the same SQL. It allows you to track versioned schemas and granular partitioning information of datasets. As the number of datasets grows, this layer makes datasets in the Lake House discoverable by providing search capabilities.
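The catalog responsibilities named above — versioned schemas per dataset, plus search for discoverability — can be sketched with a small in-memory class. This is a toy stand-in for a real catalog service (such as the AWS Glue Data Catalog); the class and method names are assumptions made for the example.

```python
class Catalog:
    """Toy catalog: tracks versioned schemas and supports keyword search."""

    def __init__(self):
        self._datasets = {}

    def register(self, name, schema, description=""):
        entry = self._datasets.setdefault(
            name, {"versions": [], "description": description})
        entry["versions"].append(schema)   # each registration adds a version
        return len(entry["versions"])      # current version number

    def schema(self, name, version=-1):
        """Latest schema by default; older versions stay queryable."""
        return self._datasets[name]["versions"][version]

    def search(self, keyword):
        """Discovery: match dataset names and business descriptions."""
        return sorted(n for n, e in self._datasets.items()
                      if keyword in n or keyword in e["description"])

cat = Catalog()
cat.register("sales", {"customer_id": "int", "amount": "double"},
             description="warehouse fact table")
v = cat.register("sales", {"customer_id": "int", "amount": "double",
                           "currency": "string"})   # schema evolved
print(v, cat.search("warehouse"))
```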
Components in the data processing layer of the Lake House Architecture are responsible for transforming data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. The processing layer provides purpose-built components to perform a variety of transformations, including data warehouse style SQL, big data processing, and near-real-time ETL.
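A single processing-layer transform touching each of those steps — validation, cleanup, normalization, and enrichment — might look like the sketch below. The field names and business rule are invented for illustration.

```python
def transform(record):
    """Turn a raw record into a consumable one (illustrative fields)."""
    # Validation: reject records missing the join key.
    if "customer_id" not in record:
        raise ValueError("missing customer_id")
    # Cleanup + normalization: trim whitespace, lowercase the email.
    email = (record.get("email") or "").strip().lower()
    # Normalization: convert cents to dollars.
    amount_usd = round(record["amount_cents"] / 100, 2)
    # Enrichment: derive a flag downstream consumers can filter on.
    return {
        "customer_id": record["customer_id"],
        "email": email,
        "amount_usd": amount_usd,
        "is_large_order": amount_usd >= 100,
    }

clean = transform({"customer_id": 1,
                   "email": " Ada@Example.COM ",
                   "amount_cents": 12050})
print(clean)
```

In a Lake House, the same logical steps would run in whichever purpose-built engine fits the workload: SQL in the warehouse, Spark for big data processing, or a streaming job for near-real-time ETL.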
The data consumption layer of the Lake House Architecture is responsible for providing scalable and performant components that use unified Lake House interfaces to access all the data stored in Lake House storage and all the metadata stored in the Lake House catalog. It democratizes analytics across an organization by providing purpose-built components for every persona and analysis method, including interactive SQL queries, warehouse-style analytics, BI dashboards, and ML.
The following diagram illustrates our Lake House reference architecture on AWS.
The following is another example that is not AWS specific.
The concept of the Data Lakehouse is at an early stage, so there are some limitations to consider before depending entirely on the Data Lakehouse architecture, such as query compatibility and data cleaning complexity. But data engineers can contribute fixes for these issues and limitations in the open-source tools. Larger companies such as Facebook and Amazon have already laid the groundwork for the Data Lakehouse and are open-sourcing the tools they use.