DMetaSoul

What is the Lakehouse, the latest Direction of Big Data Architecture?

1. Explanation of terms
Because this article uses many terms, the key ones are briefly introduced here to make it easier to read.

Database:
In the broad sense of the word, databases have been used in computing since the 1960s. However, databases at that stage were mainly hierarchical or network databases, with strong coupling between data and programs, so their application was relatively limited.
Databases today commonly refer to relational databases. A relational database organizes data using the relational model, storing it in rows and columns, and has the advantages of a high degree of structure, strong data independence, and low redundancy. The relational model, proposed in 1970, truly separated data from programs, and relational databases became an integral part of mainstream computer systems. The relational database remains one of the most important database products: almost all new database products from database vendors support the relational model, and even many non-relational database products provide relational interfaces.
Relational databases are mainly used for Online Transaction Processing (OLTP). OLTP mainly processes basic and routine transactions, such as bank transactions.

Data warehouse:
With the large-scale application of databases, data in the information industry has grown explosively. To study the relationships between data and mine its hidden value, more and more people need Online Analytical Processing (OLAP) to analyze data and explore deeper relationships and information. However, sharing data between different databases is not easy, and data integration and analysis are also very challenging.
To solve the problem of enterprise data integration and analysis, Bill Inmon, the father of the data warehouse, proposed the data warehouse in 1990. The primary function of a data warehouse is to perform OLAP on the large amounts of data accumulated by OLTP systems over the years, using the data warehouse's dedicated storage architecture, helping decision-makers quickly and effectively extract valuable information from massive data and providing decision support. Since the emergence of the data warehouse, the information industry has evolved from operational systems based on relational databases toward decision support systems.

Compared with a database, the data warehouse has the following two characteristics:
A data warehouse is subject-oriented and integrated. It is built to support analysis across businesses, drawing on data scattered across operational systems. Therefore, the required data must be extracted from multiple heterogeneous sources, processed and integrated, reorganized by subject, and finally loaded into the data warehouse.
A data warehouse is mainly used to support enterprise decision analysis, and the data operations involved are mostly queries. Therefore, the data warehouse can improve query speed and reduce overhead by optimizing table structure and storage layout. Although warehouses are well suited to structured data, many modern enterprises must also deal with unstructured and semi-structured data of high diversity, velocity, and volume. Data warehousing is not well suited to many of these scenarios, nor is it the most cost-effective option.

Data lake:
The essence of a data lake is a solution composed of "data storage architecture + data processing tools." The data storage architecture must be scalable and reliable enough to store massive data of any type, including structured, semi-structured, and unstructured data. Data processing tools fall into two broad categories. The first type focuses on how to "move" data into the lake: defining data sources, formulating data synchronization policies, moving data, and compiling data catalogs. The second type focuses on how to analyze, mine, and utilize the data in the lake. A data lake needs complete data management capabilities, diversified data analysis capabilities, comprehensive data life cycle management, and secure data acquisition and publication capabilities. Without these management tools, metadata will be missing, the quality of the data in the lake cannot be guaranteed, and the data lake will eventually deteriorate into a data swamp.

It has become a common understanding within enterprises that data is an important asset. As enterprises develop, data keeps piling up. Enterprises hope to retain all data relevant to production and operations, manage it effectively with centralized governance, and mine and explore its value. Data lakes were created in this context. A data lake is a large repository that centrally stores structured and unstructured data. It can hold raw data from multiple sources and of various types, and data can be accessed, processed, analyzed, and transferred without first being structured. The data lake helps enterprises quickly complete federated analysis across heterogeneous data sources and mine and explore data value.

With the development of big data and AI, the value of the data in the data lake is rising and being redefined. The data lake can bring enterprises a variety of capabilities, such as centralized data management, which helps build more optimized operating models, as well as capabilities like predictive analytics and recommendation models that can drive the subsequent growth of enterprise capabilities.
The difference between a data warehouse and a data lake can be likened to the difference between a warehouse and a lake: a warehouse stores goods that come from specific sources, while lake water flows in from rivers, streams, and other sources and remains raw. Data lakes, while good at storing data, lack some key features: they do not support transactions, do not guarantee data quality, and lack consistency and isolation, making it almost impossible to mix appends and reads, or batch and streaming jobs, on the same data. For these reasons, many of the promised data lake capabilities go unrealized, and the benefits of a data lake are lost.

Data lakehouse:
Wikipedia does not give a specific definition of the lakehouse. The lakehouse combines the advantages of the data lake and the data warehouse: on low-cost cloud storage in open formats, it implements data structures and data management functions similar to those in a data warehouse. Its features include: concurrent data reads and writes, architectural support for data governance mechanisms, direct access to source data, separation of storage and computing resources, open storage formats, support for structured and semi-structured data as well as unstructured data such as audio and video, and end-to-end streaming.

2. Evolution directions of big data systems
In recent years, many new computing and storage frameworks have emerged in the big data field. On the computing side, general-purpose engines represented by Spark and Flink and OLAP systems represented by ClickHouse have appeared. On the storage side, object storage has become the new standard and an important foundation for data lake and lakehouse integration, and local cache acceleration layers such as Alluxio and JuiceFS have also emerged. Several key evolution directions in the big data field are:

  1. Cloud-native. Public and private clouds provide hardware abstraction for computing and storage, abstracting away traditional IaaS management and operations. An important feature of cloud-native is that both computing and storage provide elasticity. Making good use of this elasticity to reduce costs while improving resource utilization is an issue that both computing and storage frameworks need to consider.

  2. Real-time. Traditional Hive is an offline data warehouse that provides T+1 data processing and cannot meet new business requirements. The traditional Lambda architecture introduces complexity and data inconsistencies that also fail to meet business requirements. How to build an efficient real-time data warehouse and achieve real-time or quasi-real-time writes, updates, and analysis on low-cost cloud storage is a new challenge for computing and storage frameworks (a minimal streaming-write sketch follows at the end of this section).

  3. Computing engine diversification. Big data computing engines are flourishing: while MapReduce is fading out, Spark, Flink, and various OLAP frameworks are thriving. Each framework has its own design focus, some going deep into vertical scenarios and others converging in features, and the selection of big data frameworks is becoming more and more diverse.

In this context, the lakehouse and unified streaming-batch processing emerged.
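To make the real-time direction concrete, below is a minimal PySpark sketch of a structured streaming job that appends quasi-real-time data to low-cost cloud object storage. The rate source, bucket paths, and trigger interval are illustrative assumptions rather than details from this article, and a plain Parquet sink has no ACID guarantees, which is exactly the gap the table formats in section 4 fill.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("quasi-realtime-ingest").getOrCreate()

# Illustrative source: the built-in rate stream stands in for Kafka/CDC input.
events = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

# Continuously append micro-batches to cheap object storage (paths are hypothetical).
query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://my-bucket/bronze/events/")
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/events/")
    .trigger(processingTime="1 minute")  # quasi-real-time latency
    .start()
)
query.awaitTermination()
```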

3. What problems can the integrated lakehouse solve?
3.1 Connect data storage and computing
Companies' need for flexible, high-performance systems for a wide range of data applications, including SQL analytics, real-time monitoring, data science, and machine learning, has not diminished. Most of the latest advances in AI are based on models that better handle unstructured data (text, images, video, audio). The two-dimensional relational tables of a pure data warehouse cannot handle semi-structured and unstructured data, and AI engines cannot run on pure data warehouse models alone. A common solution is to combine the advantages of the data lake and the data warehouse and build a lakehouse, which solves the limitations of the data lake by implementing data structures and data management functions similar to those in a data warehouse directly on the data lake's low-cost storage.

The data warehouse platform grew out of big data demand, while the data lake platform grew out of AI demand. The two platforms are completely separate at the cluster level, and data and computation cannot flow freely between them. With the lakehouse, data can flow seamlessly between the data lake and the data warehouse, connecting the different levels of data storage and computation.

3.2 Flexibility and ecological richness
The lakehouse combines the flexibility and rich ecosystem of the data lake with the growth and enterprise-grade capabilities of the data warehouse. Its main advantages are as follows:
Data duplication: If an organization maintains a data lake and multiple data warehouses simultaneously, there is inevitably data redundancy. At best this leads to inefficient data processing; at worst it leads to inconsistent data. The lakehouse removes this duplication so that a single, authoritative copy of the data is maintained.
High storage costs: Data warehouses and data lakes both try to reduce the cost of data storage. Data warehouses often reduce costs by cutting redundancy and integrating heterogeneous data sources; data lakes, on the other hand, tend to use big data file systems and Spark to store and process data on inexpensive hardware. The goal of the integrated lakehouse architecture is to combine these technologies and reduce cost as much as possible.

Differences between reporting and analysis applications: Data scientists tend to work with data lakes, applying various analytical techniques to raw data. Reporting analysts, on the other hand, tend to use consolidated data such as data warehouses or data marts. In an organization there is often not much overlap between the two teams, yet a certain amount of duplication and contradiction exists between them. With an integrated architecture, both teams can work on the same data, avoiding unnecessary duplication.

Data stagnation: Data stagnation is one of the most serious problems of the data lake, which can quickly become a data swamp if it remains ungoverned. Data tends to be thrown into the lake easily but without effective governance, and over time the freshness of the data becomes increasingly difficult to trace. By managing massive data more effectively, the lakehouse helps improve the timeliness of analysis data.
Risk of potential incompatibilities: Data analytics is still an emerging field, and new tools and techniques emerge every year. Some may only be compatible with data lakes, others only with data warehouses. The lakehouse means being prepared for both.

Conclusion:
In general, the lakehouse has the following key characteristics:

  1. Transaction support: In an enterprise, data is often read and written concurrently by business systems. ACID transaction support ensures the consistency and correctness of concurrent data access, especially under SQL access.
  2. Data modeling and data governance: The lakehouse supports the realization and transformation of various data models and DW modeling approaches such as the star and snowflake schemas. The system should ensure data integrity and provide robust governance and audit mechanisms.
  3. BI support: The lakehouse lets BI tools work directly on the source data, which speeds up analysis and reduces data latency. It also avoids the cost of operating two separate copies of the data in a lake and a warehouse.
  4. Separation of storage and computing: Separating storage from computing allows the system to scale to higher concurrency and larger data volumes. (Some newer data warehouses have also adopted this architecture.)
  5. Openness: With open, standardized storage formats (such as Parquet) and rich API support, various tools and engines (including machine learning and Python/R libraries) can access the data directly and efficiently; see the sketch after this list.
  6. Support for multiple data types (structured and unstructured): The lakehouse provides storage, transformation, analysis, and access for many applications, with data types including images, video, audio, semi-structured data, and text.
  7. Support for various workloads: including data science, machine learning, and SQL queries and analysis. These workloads may require multiple tools, but they are all backed by the same data repository.
  8. End-to-end streaming: Real-time reporting has become a normal requirement in enterprises. With built-in support for streaming, there is no longer a need to build a separate dedicated system for real-time data services.
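To illustrate the openness point above, here is a minimal sketch showing the same open-format (Parquet) files being queried by a SQL engine (Spark) and read directly by a Python data science library. The bucket path and column names are hypothetical.

```python
import pandas as pd
from pyspark.sql import SparkSession

path = "s3a://my-bucket/lakehouse/sales/"  # assumed open-format (Parquet) table location

# Engine access: Spark SQL over the open files.
spark = SparkSession.builder.appName("openness-demo").getOrCreate()
spark.read.parquet(path).createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

# Direct library access: pandas (via pyarrow) reads the very same files,
# so ML and Python workloads do not need a separate export step.
df = pd.read_parquet("s3://my-bucket/lakehouse/sales/", engine="pyarrow")
print(df.head())
```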

4. Four best open-source data lakehouse projects
Hudi
Hudi is an open-source project providing tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency, all while keeping your data in open-source file formats.
Apache Hudi brings core warehouse and database functionality directly to the data lake, which is great for streaming workloads, and it lets users build efficient incremental batch pipelines. Hudi is also very compatible: it can be used on any cloud, and it supports Apache Spark, Flink, Presto, Trino, Hive, and many other query engines.
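As a hedged illustration of how a Hudi upsert works from Spark, here is a minimal PySpark sketch following the pattern of the Hudi quickstart. The table name, record key, precombine field, and paths are illustrative, exact options may vary across Hudi versions, and the Hudi Spark bundle must be on the classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-demo").getOrCreate()

# Hypothetical input rows keyed by `uuid`, with `ts` used to pick the latest version.
df = spark.createDataFrame(
    [("id-1", "2024-01-01", 10.0), ("id-2", "2024-01-01", 20.0)],
    ["uuid", "ts", "amount"],
)

hudi_options = {
    "hoodie.table.name": "demo_table",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

# Append mode + upsert operation: existing keys are updated, new keys inserted.
df.write.format("hudi").options(**hudi_options).mode("append").save(
    "s3a://my-bucket/hudi/demo_table/"
)

# Read the table back with the same engine.
spark.read.format("hudi").load("s3a://my-bucket/hudi/demo_table/").show()
```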

Iceberg
Iceberg is an open table format for huge analytic datasets, with schema evolution, hidden partitioning, partition layout evolution, time travel, version rollback, and more.
Iceberg was built for huge tables and is used in production where a single table can contain tens of petabytes of data; even tables of that size can be read without a distributed SQL engine. Iceberg is known for its fast scan planning, advanced filtering, compatibility with any cloud store, serializable isolation, support for multiple concurrent writers, and more.
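Below is a minimal, hedged sketch of Iceberg's table features through Spark SQL, following the catalog configuration style from the Iceberg quickstart. The catalog name, warehouse path, table, and snapshot id are assumptions for illustration, and the Iceberg Spark runtime jar must be available.

```python
from pyspark.sql import SparkSession

# A Hadoop-type Iceberg catalog pointed at an object-storage warehouse (names assumed).
spark = (
    SparkSession.builder.appName("iceberg-demo")
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "s3a://my-bucket/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table with hidden partitioning on days(ts).
spark.sql("""
    CREATE TABLE IF NOT EXISTS local.db.events (
        id BIGINT, ts TIMESTAMP, payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))
""")

spark.sql("INSERT INTO local.db.events VALUES (1, current_timestamp(), 'hello')")

# Time travel: query the table as of an earlier snapshot (snapshot id is illustrative).
spark.sql("SELECT * FROM local.db.events VERSION AS OF 4936019575029944353").show()
```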

LakeSoul
LakeSoul is a unified streaming and batch table storage solution built on the Apache Spark engine. It supports scalable metadata management, ACID transactions, efficient and flexible upsert operation, schema evolution, and streaming & batch unification.
LakeSoul specializes in row- and column-level incremental upserts, highly concurrent writes, and bulk scans of data on cloud storage. Its cloud-native architecture with separated computing and storage makes deployment very simple while supporting huge amounts of data at lower cost.
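Below is a deliberately hedged PySpark sketch of writing a LakeSoul table. The format name ("lakesoul") and the range/hash partition options follow the project's Spark quick-start at the time of writing, but option names can differ across versions, so treat them, along with the path and schema, as assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lakesoul-demo").getOrCreate()

# Hypothetical data with a hash key (`id`) and a range partition column (`date`).
df = spark.createDataFrame(
    [(1, "2024-01-01", "a"), (2, "2024-01-01", "b")],
    ["id", "date", "value"],
)

# Write a LakeSoul table: range-partitioned by date and hash-partitioned by id,
# the layout that enables its incremental upserts and concurrent writes.
(
    df.write.format("lakesoul")
    .mode("append")
    .option("rangePartitions", "date")  # assumed option name
    .option("hashPartitions", "id")     # assumed option name
    .option("hashBucketNum", "2")       # assumed option name
    .save("s3a://my-bucket/lakesoul/demo_table/")
)
```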

Delta Lake
Delta Lake is an open-source storage framework for building a lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, with APIs for Scala, Java, Rust, Ruby, and Python. It provides ACID transactions and scalable metadata handling and unifies streaming and batch data processing on top of existing data lakes such as S3, ADLS, GCS, and HDFS.
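As a hedged illustration of Delta Lake's ACID upsert, here is a minimal PySpark sketch using the delta-spark package's MERGE API; the session configuration follows the Delta Lake quickstart, while the path and schema are illustrative.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Delta Lake registers its SQL extension and catalog on the Spark session.
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "s3a://my-bucket/delta/events/"  # assumed table location

# Initial batch write creates the Delta table (Parquet files plus a transaction log).
spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]) \
    .write.format("delta").mode("overwrite").save(path)

# ACID upsert: merge new rows into the existing table by key.
updates = spark.createDataFrame([(2, "b2"), (3, "c")], ["id", "value"])
target = DeltaTable.forPath(spark, path)
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

spark.read.format("delta").load(path).show()
```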

Hudi focuses more on fast ingestion of streaming data and the correction of late-arriving data. Iceberg focuses on providing a unified operation API by shielding the differences between underlying data storage formats, forming a standard, open, and universal data organization format so that different engines can access the data through the API. LakeSoul, currently based on Spark, focuses more on building a standardized data lakehouse pipeline. Delta Lake, an open-source project from Databricks, tends to handle storage formats such as Parquet and ORC at the Spark level.
