Details of 4 best opensource projects about big data you should try out（Ⅰ）

#opensource #dataengineering #bigdata #spark

Two weeks ago, I published "4 best opensource projects about big data you should try out", in which I said I would go through each of the open-source projects in detail and compare them. Starting today, I’ll look at each of the four projects mentioned in that article. Since I’ve been using LakeSoul lately, I’ll introduce it first. Next week, I’ll introduce Iceberg.

1. Introduction
LakeSoul is a unified streaming and batch table storage framework built on the Apache Spark engine. It offers highly scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and unified streaming and batch processing.
LakeSoul specifically optimizes row- and column-level incremental updates, highly concurrent writes, and batch scan reads for data on data lake cloud storage. Its cloud-native architecture, which separates computing from storage, makes deployment very simple while supporting huge data volumes at very low cost. LakeSoul supports high write throughput in hash-partitioned primary key upsert scenarios through an LSM-tree structure, which can reach 30 MB/s/core on object storage systems such as S3. The highly optimized merge-on-read implementation also ensures read performance. LakeSoul manages metadata through Cassandra to achieve high metadata scalability.
LakeSoul’s main features are as follows:

Elastic framework:
Computing and storage are completely separated. With no fixed nodes or disks, computing and storage each scale elastically on their own. Many optimizations have been made for cloud storage, such as concurrency consistency on object storage and incremental updates. With LakeSoul, there is no need to maintain fixed storage nodes, and the cost of object storage on the cloud is only 1/10 that of local disk, which significantly reduces storage and operation costs.
Efficient and scalable metadata management:
LakeSoul uses Cassandra to manage metadata, which can efficiently handle metadata modifications and support multiple concurrent writers. This solves the problem of slow metadata parsing after long-running operation in data lake systems such as Delta Lake, which maintain metadata in files and can only be written from a single point.
ACID transactions:
An undo and redo mechanism ensures that commits are transactional and that users never see inconsistent data.
Multi-level partitioning and efficient upsert:
LakeSoul supports range and hash partitioning and flexible upsert operations at the row and column levels. Upserted data is stored as delta files, which significantly improves write efficiency and concurrency, and the optimized merge scan provides efficient merge-on-read performance.
Streaming and batch unification:
LakeSoul supports a streaming sink, so it can handle streaming data ingestion, batch backfilling of historical data, interactive queries, and other scenarios simultaneously.
Schema evolution:
Users can add new fields at any time and quickly populate them with historical data.
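The upsert and merge-on-read behavior described in the feature list can be pictured with a small, self-contained Python sketch. It is not LakeSoul code: the names `merge_on_read`, `base`, and `deltas` are hypothetical, and files are modeled as in-memory lists of dictionaries. It only illustrates the idea that upserts are appended as delta files and merged by primary key at read time, with later files winning column by column.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_on_read(base: List[Row], deltas: List[List[Row]], key: str) -> List[Row]:
    """Merge a base file with ordered delta files by primary key.

    Later deltas win. A delta row may carry only some columns, in which case
    the remaining columns keep their previous values (column-level upsert).
    """
    merged: Dict[Any, Row] = {}
    for row in base:
        merged[row[key]] = dict(row)  # copy so the base file is never modified
    for delta in deltas:  # oldest delta first, newest last
        for row in delta:
            current = merged.setdefault(row[key], {})
            current.update(row)  # overwrite only the columns present in the delta
    return [merged[k] for k in sorted(merged)]


# Hypothetical data: one base file and two delta files written later.
base = [
    {"id": 1, "name": "alice", "score": 10},
    {"id": 2, "name": "bob", "score": 20},
]
deltas = [
    [{"id": 2, "score": 25}],                   # column-level update of an existing row
    [{"id": 3, "name": "carol", "score": 30}],  # insert of a new row
]

for row in merge_on_read(base, deltas, key="id"):
    print(row)
```

Writes stay cheap in this model because each upsert only appends a small delta, while the cost of combining versions is paid when the data is read.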
LakeSoul mainly applies to the following scenarios:
Incremental data needs to be written efficiently, in large batches and in real time, with concurrent updates at the row or column level.
Detailed queries and updates span a long time range with a huge amount of historical data, and costs must stay low.
Queries are not fixed and resource consumption varies significantly, so computing resources are expected to scale flexibly and independently.
Highly concurrent writes are required, and the metadata is too large for Delta Lake to meet performance requirements.
Data is updated by primary key, and Hudi’s MergeOnRead does not meet update performance requirements.

2. Operating environment
Hardware (Recommended)
CPU: at least a quad-core 2.0 GHz CPU or equivalent
Memory: more than 4 GB
Hard disk: SAS hard disk, 300 GB or larger
Software
Operating system: Linux
Big-data system: Spark 3.x

3. Main functions of LakeSoul
LakeSoul implements a unified streaming and batch framework on Spark. Its main functions include batch write, streaming write, table creation and deletion, partition deletion, automatic file merging, upsert (insert or update), change data capture (CDC), and hash operations on primary keys.

Batch write
Log in to the system, specify the data storage path, and use Spark's write to save data to that path. In the write call, you need to specify lakesoul as the storage format, along with the primary key, the partition key, and the number of hash buckets.
Stream write
Use Spark's writeStream to write streaming data in the lakesoul format. You can also set the trigger interval, range partition key, hashBucketNum, and checkpoint path.
Read data
To read data, you need to specify the path to read from. There are two read modes. One is through Spark: select lakesoul as the read format and use load to load data from the given path. The other is to load the data at the specified path directly with LakeSoulTable's forPath.
Insert or update operation (Upsert)
Upsert (update or insert) updates a row when it already exists and inserts it when it does not. To perform an upsert, call upsert on a LakeSoulTable and pass in the source data.
Update
To update the contents of a LakeSoul table, import the LakeSoulTable package and call its update function, passing in the data to be updated.
Drop table
To delete a table in LakeSoul, import the LakeSoulTable package and use its drop table function.
Delete table data
To delete data from a specified table in LakeSoul, use the delete function in LakeSoulTable. Pass in a delete condition to delete only the matching data, or omit the condition to delete all data.
Delete the partition
To drop a specified partition of a LakeSoul table, use the drop partition function in LakeSoulTable and pass in the partition key.
File merge
While the system is running, a large number of incremental files accumulate, especially when writes are frequent. LakeSoul provides file merge capabilities, which can merge a specified partition, merge all partitions, or merge automatically.
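As a rough illustration of what a file merge does, the sketch below compacts several sorted files into one. It is a simplified model, not LakeSoul's implementation: each file is assumed to be a list of `(key, value)` pairs sorted by key, files are ordered from oldest to newest, and `compact` is a hypothetical name.

```python
import heapq
from typing import Any, List, Tuple

Record = Tuple[int, Any]  # (primary key, value)


def compact(files: List[List[Record]]) -> List[Record]:
    """Merge sorted files (oldest first) into one sorted file.

    When the same key appears in several files, the value from the newest
    file is kept.
    """
    # Tag every record with the index of its file so that, for equal keys,
    # records from newer files come after records from older files.
    tagged = (
        [(key, file_no, value) for key, value in records]
        for file_no, records in enumerate(files)
    )
    result: List[Record] = []
    for key, _file_no, value in heapq.merge(*tagged, key=lambda r: (r[0], r[1])):
        if result and result[-1][0] == key:
            result[-1] = (key, value)  # newer version replaces the older one
        else:
            result.append((key, value))
    return result


# Hypothetical incremental files, oldest first.
files = [
    [(1, "a"), (2, "b"), (4, "d")],
    [(2, "b2"), (3, "c")],
    [(1, "a3"), (4, "d3")],
]
print(compact(files))
```

After compaction a reader scans one file instead of three, and the result is the same as merging the three files at read time.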
Spark SQL
To query data in a LakeSoul table using Spark SQL, set the schema to lakesoul. For example, select * from lakesoul.table.
Hash primary key operation
If data is partitioned and sorted by hash key, LakeSoul optimizes Join, Intersect, and Except operations to reduce shuffle and sort time and improve efficiency. To use this, specify the hash key and hashBucketNum.
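Why hash bucketing helps joins can be sketched in plain Python. The example below is a conceptual model, not LakeSoul's implementation: `bucket_of`, `bucketize`, and `bucketed_join` are hypothetical names, `hash_bucket_num` stands in for hashBucketNum, and only the bucketing part is modeled, not the sorting. When both tables are bucketed by the same key with the same bucket count, matching keys always land in the same bucket, so each pair of buckets can be joined locally without redistributing data.

```python
import zlib
from typing import Any, Dict, List

Row = Dict[str, Any]


def bucket_of(key: Any, hash_bucket_num: int) -> int:
    """Deterministic bucket index for a key (crc32 is stable across runs)."""
    return zlib.crc32(str(key).encode("utf-8")) % hash_bucket_num


def bucketize(rows: List[Row], key: str, hash_bucket_num: int) -> List[List[Row]]:
    """Distribute rows into hash_bucket_num buckets by the hash of their key."""
    buckets: List[List[Row]] = [[] for _ in range(hash_bucket_num)]
    for row in rows:
        buckets[bucket_of(row[key], hash_bucket_num)].append(row)
    return buckets


def bucketed_join(left: List[List[Row]], right: List[List[Row]], key: str) -> List[Row]:
    """Inner-join two tables bucket by bucket; no row ever crosses buckets."""
    joined: List[Row] = []
    for left_bucket, right_bucket in zip(left, right):
        lookup: Dict[Any, List[Row]] = {}
        for row in right_bucket:
            lookup.setdefault(row[key], []).append(row)
        for row in left_bucket:
            for match in lookup.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical tables, both bucketed on "id" with the same bucket count.
users = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}, {"id": 3, "name": "carol"}]
orders = [{"id": 1, "amount": 5}, {"id": 3, "amount": 7}, {"id": 3, "amount": 9}, {"id": 4, "amount": 1}]

joined = bucketed_join(bucketize(users, "id", 4), bucketize(orders, "id", 4), "id")
for row in sorted(joined, key=lambda r: (r["id"], r["amount"])):
    print(row)
```

The sketch only works because both sides use the same key and the same bucket count. If either differs, matching rows can end up in different buckets and a shuffle is needed again.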
Automatic merging of metadata
LakeSoul supports automatic metadata merging. To enable it, set spark.dmetasoul.lakesoul.schema.autoMerge.enabled = true; without it, the user specifies the metadata content manually when writing data.

CDC
Operations (insert, delete, update) on relational databases such as MySQL and Oracle can be captured through change data capture (CDC) and stored in LakeSoul in real time. A typical pipeline is MySQL -> Debezium -> Kafka -> Spark Streaming -> LakeSoul. Once the complete pipeline is built, the system applies inserts, deletes, and updates in real time, and queries return the latest data. This relies on the upsert function.
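To make the role of upsert in this pipeline concrete, here is a minimal Python sketch that replays change events into a table keyed by primary key. The event layout (`op`, `key`, `row`) is a simplified, hypothetical stand-in for what a CDC tool emits, not Debezium's actual message format, and `apply_cdc` is an illustrative name.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]


def apply_cdc(table: Dict[Any, Row], events: Iterable[Dict[str, Any]]) -> Dict[Any, Row]:
    """Apply change events in order and return the updated table."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: merge the new values into the existing row, or create it.
            table.setdefault(key, {}).update(event["row"])
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return table


# Hypothetical change stream from a source database.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": "alice"}},
    {"op": "insert", "key": 2, "row": {"id": 2, "name": "bob"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "alicia"}},
    {"op": "delete", "key": 2},
]

table = apply_cdc({}, events)
print(table)  # {1: {'id': 1, 'name': 'alicia'}}
```

Because inserts and updates go through the same upsert path, the events must be applied in order for the table to reflect the latest state of the source.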

That covers LakeSoul in detail; more information is available on its GitHub homepage. In the next post, I will introduce Iceberg in detail and compare the two, which helps me learn about data lakes better. If this is helpful to you, please share it. I also welcome your guidance and suggestions for my study. Thank you.

DEV Community
