DEV Community

DMetaSoul

A new unified streaming and batch table storage solution similar to Iceberg/Hudi/Delta Lake but with several new functions

I came across a new unified streaming and batch table storage solution on GitHub called LakeSoul. It is similar to Iceberg, Hudi, and Delta Lake but adds several features, such as upsert and metadata management. It also has some shortcomings: for example, Flink is not supported yet, although the roadmap shows Flink integration in progress. Have any of you used LakeSoul before?

I tried upsert, which combines update and insert in a single operation and is a great time saver. LakeSoul supports range and hash partitions, and with upsert you can add new rows and modify existing rows or individual columns in one call.
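To make "update plus insert" concrete, here is a minimal sketch in plain Scala collections. It is not the LakeSoul API and says nothing about how LakeSoul implements upsert. The `UpsertSketch` object, the `Row` type, and the `price` column are hypothetical names made up for illustration, and the table is assumed to be keyed by `(date, id)`.

```scala
// Conceptual sketch only: plain Scala collections, NOT the LakeSoul API.
object UpsertSketch {
  // A row keyed by (date, id). `price` is a hypothetical extra column.
  // None means "this column was not supplied in the incoming batch".
  final case class Row(date: String, id: Int, name: Option[String], price: Option[Double])

  type Key = (String, Int)

  // Upsert: if the key already exists, overwrite only the supplied columns
  // (column-level update); otherwise insert the row as new (row-level insert).
  def upsert(table: Map[Key, Row], incoming: Seq[Row]): Map[Key, Row] =
    incoming.foldLeft(table) { (acc, row) =>
      val key = (row.date, row.id)
      val merged = acc.get(key) match {
        case Some(old) =>
          old.copy(name = row.name.orElse(old.name), price = row.price.orElse(old.price))
        case None => row
      }
      acc.updated(key, merged)
    }

  def main(args: Array[String]): Unit = {
    val base: Map[Key, Row] = Map(
      ("2021-01-01", 1) -> Row("2021-01-01", 1, Some("rice"), Some(1.5)),
      ("2021-01-01", 2) -> Row("2021-01-01", 2, Some("bread"), Some(2.0))
    )
    val incoming = Seq(
      Row("2021-01-01", 2, None, Some(2.5)),      // existing key: only price changes
      Row("2021-01-01", 3, Some("chicken"), None) // new key: inserted
    )
    upsert(base, incoming).values.toSeq.sortBy(_.id).foreach(println)
  }
}
```

In this sketch the row with `id` 2 keeps its name and gets a new price, while the row with `id` 3 is inserted, so one call covers both cases.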

Here's the official description of upsert from LakeSoul:
Multi-level partitioning and efficient upsert: LakeSoul supports range and hash partitioning, and a flexible upsert operation at row and column level. The upsert data are stored as delta files, which greatly improves the efficiency and concurrency of writing data, and the optimized merge scan provides efficient MergeOnRead performance.
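The description mentions delta files and MergeOnRead, so here is a rough sketch of that idea as I understand it: every write is appended as its own batch, and reads merge the batches so the newest record per key wins. This is a toy model in plain Scala, not LakeSoul's file format or merge scan. `MergeOnReadSketch`, `Table`, `Record`, and `bucketOf` are hypothetical names, and the modulo-based bucketing is an assumption about how hash partitioning typically works.

```scala
// Toy model of delta files + merge-on-read. NOT LakeSoul's actual implementation.
object MergeOnReadSketch {
  type Key = (String, Int) // (range partition value, hash key)

  final case class Record(date: String, id: Int, name: String)

  // Each write is kept as its own immutable "delta file"; nothing is rewritten.
  final case class Table(deltas: Vector[Vector[Record]] = Vector.empty) {
    // Writing only appends a new delta, which keeps writes cheap.
    def write(batch: Seq[Record]): Table = copy(deltas = deltas :+ batch.toVector)

    // Reading replays deltas from oldest to newest, so the latest record
    // for each key overwrites earlier ones.
    def read: Seq[Record] = {
      val merged = deltas.flatten.foldLeft(Map.empty[Key, Record]) { (acc, r) =>
        acc.updated((r.date, r.id), r)
      }
      merged.values.toSeq.sortBy(r => (r.date, r.id))
    }
  }

  // Assumed hash bucketing: the same id always maps to the same bucket.
  def bucketOf(id: Int, bucketNum: Int): Int =
    java.lang.Math.floorMod(id.hashCode, bucketNum)

  def main(args: Array[String]): Unit = {
    val table = Table()
      .write(Seq(Record("2021-01-01", 1, "rice"), Record("2021-01-01", 2, "bread")))
      .write(Seq(Record("2021-01-01", 2, "toast"), Record("2021-01-01", 3, "chicken")))
    println(s"delta files: ${table.deltas.size}")
    table.read.foreach(println)
  }
}
```

The trade-off in this model is that writes never touch existing data, while reads pay the cost of merging, which is why an optimized merge scan matters.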

Code Examples

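import com.dmetasoul.lakesoul.tables.LakeSoulTable
import org.apache.spark.sql._

// Local Spark session with the LakeSoul SQL extension enabled.
val spark = SparkSession.builder.master("local")
  .config("spark.dmetasoul.lakesoul.meta.host", "cassandra_host") // placeholder metadata host
  .config("spark.sql.extensions", "com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension")
  .getOrCreate()
import spark.implicits._ // needed for Seq(...).toDF

// Load the table by its path (this assumes a table already exists at this path).
val tablePath = "s3a://bucket-name/table/path/is/also/table/name"
val lakeSoulTable = LakeSoulTable.forPath(tablePath)

// Upsert one row: it updates the matching row if there is one, otherwise inserts it.
val extraDF = Seq(("2021-01-01",3,"chicken")).toDF("date","id","name")
lakeSoulTable.upsert(extraDF)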

According to DMetaSoul, the team behind the project, LakeSoul is a unified streaming and batch table storage solution built on top of the Apache Spark engine. It supports scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and streaming and batch unification.

I want to learn more about LakeSoul and become a contributor. If you want to learn more about LakeSoul too, please visit its GitHub page, or follow me for my next post, where I'll also share what I learned about the data lake warehouse.
