DEV Community

DMetaSoul


A new unified streaming and batch table storage solution similar to Iceberg/Hudi/Delta Lake

I came across a new unified streaming and batch table storage solution on GitHub called LakeSoul. It is similar to Iceberg, Hudi, and Delta Lake, but adds several features, such as upsert and metadata management. It also has some shortcomings: for example, Flink is not supported yet, although the roadmap shows Flink integration in progress. Have any of you used LakeSoul before?

I used upsert, which combines update and insert: rows that already exist are updated and new rows are inserted in a single call, which is a great time saver. LakeSoul supports range and hash partitions, and upsert can add and modify data at both the row and column level in one operation.
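To make the update-or-insert idea concrete, here is a minimal sketch in plain Scala. It is only a conceptual model, not LakeSoul's implementation: the `UpsertSketch` object, the `Row` alias, and the use of an integer `id` as the key are all assumptions made for illustration, and no Spark or LakeSoul dependency is involved.

```scala
// Conceptual model of upsert using plain Scala collections.
// Nothing here is LakeSoul API; all names are hypothetical.
object UpsertSketch {
  // A row is modeled as column name -> value.
  type Row = Map[String, String]

  /** Upsert `incoming` rows into `table`, keyed by an integer id.
    * If the id exists, the incoming columns overwrite or extend the stored row
    * (columns that are not mentioned keep their old values).
    * If the id is new, the row is inserted.
    */
  def upsert(table: Map[Int, Row], incoming: Seq[(Int, Row)]): Map[Int, Row] =
    incoming.foldLeft(table) { case (acc, (id, cols)) =>
      acc.updated(id, acc.getOrElse(id, Map.empty[String, String]) ++ cols)
    }

  def main(args: Array[String]): Unit = {
    val base: Map[Int, Row] = Map(
      1 -> Map("date" -> "2021-01-01", "name" -> "rice"),
      3 -> Map("date" -> "2021-01-01", "name" -> "beef")
    )
    val changes: Seq[(Int, Row)] = Seq(
      3 -> Map("name" -> "chicken"),                       // updates one column of an existing row
      4 -> Map("date" -> "2021-01-02", "name" -> "fish")   // inserts a new row
    )
    upsert(base, changes).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In this model, row 3 keeps its `date` and only its `name` changes, which is what "column level" is taken to mean here, while row 4 is simply added.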

Here's [the official description of upsert from LakeSoul](https://github.com/meta-soul/LakeSoul/wiki/03.-Usage-Doc#3-upsert-lakesoultable):
Multi-level partitioning and efficient upsert: LakeSoul supports range and hash partitioning, and a flexible upsert operation at row and column level. The upsert data are stored as delta files, which greatly improves the efficiency and concurrency of writing data, and the optimized merge scan provides efficient MergeOnRead performance.
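The description above mentions delta files and MergeOnRead. As I understand it, the general pattern is that a write only appends a new file, and the merge happens when the table is read. The following is a rough sketch of that pattern in plain Scala, under my own assumptions: `MergeOnReadSketch`, `DataFile`, `write`, and `mergeScan` are hypothetical names, and LakeSoul's real file format and merge logic are not shown in the post.

```scala
// Conceptual model of delta files plus merge-on-read.
// Not LakeSoul code; all names and the in-memory "files" are assumptions.
object MergeOnReadSketch {
  type Row = Map[String, String]
  // A "file" is just a batch of (primary key, columns) pairs.
  type DataFile = Seq[(Int, Row)]

  /** Writing never rewrites earlier files: each upsert appends one delta file. */
  def write(files: Vector[DataFile], delta: DataFile): Vector[DataFile] =
    files :+ delta

  /** Reading folds all files in write order, so later values win per column. */
  def mergeScan(files: Vector[DataFile]): Map[Int, Row] =
    files.flatten.foldLeft(Map.empty[Int, Row]) { case (acc, (id, cols)) =>
      acc.updated(id, acc.getOrElse(id, Map.empty[String, String]) ++ cols)
    }

  def main(args: Array[String]): Unit = {
    val baseFile: DataFile = Seq(3 -> Map("date" -> "2021-01-01", "name" -> "beef"))
    val deltaFile: DataFile = Seq(3 -> Map("name" -> "chicken"))

    // The write path is cheap: the base file is left untouched.
    val files = write(Vector(baseFile), deltaFile)

    // The merged view is computed only when the table is scanned.
    println(mergeScan(files))
  }
}
```

The trade-off in this pattern is that writes stay cheap and can proceed concurrently because nothing is rewritten, while reads pay for the merge, which is presumably why the description highlights an optimized merge scan.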

Code Examples

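import com.dmetasoul.lakesoul.tables.LakeSoulTable
import org.apache.spark.sql._

// Build a local Spark session with the LakeSoul SQL extension enabled.
val spark = SparkSession.builder.master("local")
  // Host of the metadata store ("cassandra_host" is a placeholder value)
  .config("spark.dmetasoul.lakesoul.meta.host", "cassandra_host")
  .config("spark.sql.extensions", "com.dmetasoul.lakesoul.sql.LakeSoulSparkSessionExtension")
  .getOrCreate()
import spark.implicits._ // needed for toDF on local collections

// The table path doubles as the table name.
val tablePath = "s3a://bucket-name/table/path/is/also/table/name"

// Open the existing table and prepare the rows to upsert.
val lakeSoulTable = LakeSoulTable.forPath(tablePath)
val extraDF = Seq(("2021-01-01",3,"chicken")).toDF("date","id","name")

// Update matching rows and insert the rest in one call.
lakeSoulTable.upsert(extraDF)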

According to DMetaSoul, the team behind it, LakeSoul is a unified streaming and batch table storage solution built on top of the Apache Spark engine. It supports scalable metadata management, ACID transactions, efficient and flexible upsert operations, schema evolution, and streaming and batch unification.
