Open Source first Anniversary Star 1.2K! Review on the anniversary of LakeSoul, the unique open-source Lakehouse

#database #datascience #programming #opensource

LakeSoul, the only Chinese open-source lakehouse framework, has been open source for one year since the end of December 2021. During this year, two versions, 1.0 and 2.0, have been released successively, and many surprising functions have attracted the attention of technology lovers worldwide, which obtained 1.2K Stars.

LakeSoul’s design concept is to create a simple, high-performance cloud-native data lake supporting BI and AI application scenarios. Version 1.0 is built around the Spark engine to realize the relatively useful function of the Lakehouse. Version 2.0 focuses on ecosystem construction and underlying framework reconstruction to further enrich functions, support more usage scenarios, and get closer to actual usage requirements. Our previous article briefly introduced LakeSoul’s design concept: “Lakehouse is the future of data Intelligence? Then you must know about the only open-source LakeSoul in China”. This article summarizes the open-source project construction process over the past year and gives some recent Benchmark comparison results for your reference.

Key features of LakeSoul version 1.0

Merge On Read (MOR), and copy-on-write (COW) write modes are supported. MOR is implemented using Upsert semantics for writing more than read scenarios. In contrast, COW is implemented using Update semantics for scenarios such as read more than write.
Compaction allows high write throughput in MOR mode and allows efficient read-time merges with multipath ordered merges, allowing Compaction to optimize read performance.
In MOR mode, a user-defined MergeOperator is supported. Users can customize the UDFs that combine multiple values of the same primary key when reading, which can flexibly implement some common merge logic that is not simply overwritten, such as summation, fetching the latest non-null value, and so on. For specific usage scenarios and methods, see: Case Sharing: LakeSoul’s unique MergeOperator function
It supports multi-stream concurrent updates, and multiple streams can be written into different columns, which only need to have the same primary key column. The write operation does not need to do any additional configuration. This function can quickly realize multi-stream wide table splicing (eliminating Join), machine learning sample splicing, etc. For specific application scenarios, please refer to our previous article: Using LakeSoul to build real-time machine learning sample library
Besides reading and writing data, it supports Delete of tables and partitions, DropTable, DropPartiton, multilevel partitioning, and primary key hash bucket splitting. It supports API interfaces such as SparkSql, DataFrame, and Streaming.
Extend Spark logical and physical plans to realize semantics such as CDC lake entry and Merge Into; A real-time lake entry system based on Debezium+Kafka+Structed Streaming+LakeSoul is introduced in CDC.

Interpretation of LakeSoul 2.0 version construction route

In version 1.0, LakeSoul implemented basic lakehouse read and write capabilities, with high write and throughput performance in MOR scenarios (see Benchmark below) and good read-time merge performance. In the 1.0 architecture, there was some unreasonable design of the metadata layer and IO layer and high coupling degree, which restricted the further expansion of new functions by LakeSoul. In the 2.0 version, we made a large reconstruction and added many practical function points.

The metadata layer of the Catalog is reconstructed and decoupled to become a single module that supports access to multiple engines. We replaced Cassandra with PostgresSQL. Postgres’s powerful transaction capability enables complex concurrency control, conflict detection, and two-phase commit conformance protocols. At the same time, through the well-designed metadata table structure, the primary key index can be used for the read-and-write operations of metadata to achieve high performance. The minimum Postgres instance of 2 core 4G on the cloud can reach thousands of QPS. The efficient performance metadata layer implementation means that LakeSoul can support large-scale LakeSoul file management and efficient concurrency control.
Support Flink stream and batch integrated engine, expand upstream lake access capacity, and focus on constructing multi-source real-time synchronization capability based on Flink CDC to meet enterprise-class thousand-meter real-time lake access requirements. LakeSoul Flink CDC supports synchronizing thousands of tables in the entire library. It only needs to configure the library name of the online library to synchronize all tables in the library automatically. LakeSoul realizes automatic new table awareness and automatic DDL change synchronization and is automatically compatible with old version data when reading, making real-time lake access more simple and fast. The two-phase commit protocol is implemented through metadata layer transactions and idempotent, thus ensuring the semantics of Exactly Once in the lake.
Automatically mount Hive partitions on downstream lakes. By configuring the LakeSoul partition field and the Hive external name, every Compaction allows you to automatically mount a partition to the Hive partition with the same name. In addition, users can define Hive partition fields to synchronize partitions with different names. Apache Kyuubi also supports direct access to LakeSoul lakehouse via Hive JDBC. It will be connected and exported to more data warehouses to create a richer upstream and downstream ecology.
More complete ecological functions, including snapshot-read, incremental read, rollback, and data clearing. The incremental read can use the Streaming mode, specify the start time stamp, and Trigger periodically through the trigger to continuously read the incremental data after the last read, including CDC data. In snapshot read mode, data entering the lake within the specified start timestamp is read. The rollback will roll back to the latest version before the specified timestamp. The data cleansing will clear all metadata and data content before the specified timestamp. These functions apply to scenarios such as operation and maintenance, troubleshooting, version rollback, and streaming pipeline.

From version 1.0 to 2.0, LakeSoul further opens up the upstream and downstream links, thus providing richer scenarios, including real-time lake entry of the multi-source online database, stream-and-batch integrated ETL, multi-stream integration large-width table construction (eliminating Join), real-time machine learning feature stream construction, etc., greatly improving the performance and usability.

Benchmark result

At the beginning of its design, the goal of LakeSoul was to build a cloud-native, high throughput, and efficient concurrency stream-batch integrated lakehouse storage framework and elaborate concurrent write transaction mechanism and efficient MOR mechanism. In some competitions and public data sets, performance has been impressive.

CCF BDCI 2022 — Data Lake Batch Integrated Performance Challenge Benchmark

Recently, in the Big Data and Intelligent Computing Competition (BDCI 2022) jointly held by DMetaSoul and CCF (China Computer Society), “Data Lake, Stream and Batch Integrated Performance Challenge,” 11 batches of data need to be Upsert into the lakehouse table. The first was 20 million, and the last ten were 5 million. When upserting, it needs to sum some fields or filter null values based on the primary key. Finally, the total Write and Read time is counted. Contestants can choose any data lake storage framework and use Spark as the computing engine. The final evaluation results of several open-source data lakehouse frameworks are as follows:

In contrast, LakeSoul has obvious performance advantages in the case of multiple updates of large batches of data. In MOR mode, high write throughput capacity can be obtained, and MOR read performance is near.

Review code reference: https://github.com/meta-soul/ccf-bdci2022-datalake-contest-examples

CHBenchmark: CDC enters the lake and queries Benchmark in real-time

CHBenchmark is a public benchmark that combines TPC-C and TPC-H. It uses 21 query SQL statements to measure query time. Use Flink CDC to synchronize the data of 12 tables, whose initial full data is 1600w, and the incremental data is over 700,000 after 30 minutes. The Checkpoint interval is 10 minutes. Record the query time after full-time and 30 minutes, respectively.

In contrast, LakeSoul shows advantages in reading performance under the continuous update of small batch incremental data, benefiting from an efficient metadata management mechanism and MOR high-performance design. But in terms of individual statements, there is room for optimization, mainly the time spent on IO.

Review code reference: https://github.com/meta-soul/LakeSoul/pull/115

In response, the LakeSoul community has started a new project, NativeIO, decoupling the IO layer to allow more computing frameworks to benefit from the Lakehouse. The NativeIO layer, implemented by Rust, adopted the Parquet IO layer in the Arrow-rs project and optimized the object storage in asynchronous parallel. In our preliminary test, the performance of accessing the object storage was more than doubled. At the same time, the NativeIO layer encapsulates IO, Merge, and other logic and provides C API up, which can be called through Java, Python, and other languages and can easily provide unified LakeSoul lakehouse access capability for a variety of computing engines.

Generally speaking, this year, through community feedback and three-party evaluation, LakeSoul continuously pursues more efficient and reasonable design, continuously expands upstream and downstream links gradually improves ecological system construction, and builds a BI+AI integrated lakehouse platform. Finally, here is our trial link: https://github.com/meta-soul/LakeSoul

Welcome to use LakeSoul and feedback to build a domestic lakehouse open-source community ecology!