Paola Pardo


Datasource enabling multidimensional indexing and sampling pushdown

Do you wonder what a multidimensional index would look like in Spark?

Recently we launched the Qbeast Open Source Format, a Data Lakehouse enhancement to speed up your queries!

Based on Delta Lake and available for Apache Spark, it lets you index your data by multiple columns and read a representative sample directly from storage 🔥
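To make that concrete, here is a minimal PySpark sketch of writing an indexed table and reading a sample back. It assumes the Qbeast-Spark package is on the Spark classpath and that the `qbeast` format name and the `columnsToIndex` option match the version you are using. The helper names `write_indexed` and `read_sample` are made up for illustration and are not part of the library.

```python
def write_indexed(df, path, columns_to_index):
    """Write a Spark DataFrame in Qbeast format, indexed by several columns.

    Assumes the writer accepts a comma-separated "columnsToIndex" option.
    """
    (
        df.write
        .mode("overwrite")
        .format("qbeast")
        .option("columnsToIndex", ",".join(columns_to_index))
        .save(path)
    )


def read_sample(spark, path, fraction):
    """Load a Qbeast table and request only a fraction of its rows.

    The sample is expressed with the regular DataFrame.sample call;
    the datasource is what allows it to be pushed down to storage.
    """
    return spark.read.format("qbeast").load(path).sample(fraction)
```

With a live `SparkSession`, a call such as `read_sample(spark, "/tmp/qbeast-table", 0.01)` would request a 1% sample. The path is a placeholder.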

Here is a quick example of how you can boost your query performance using Qbeast:

This is a normal query with Spark and the Delta format.
Normal query

This is the same query, but with a Qbeast sample of 1%.
Qbeast Sample Query

The gifs are cool, right? Let's compare both executions:

Format | Execution Time | Result
Delta | ~2.5 min | 37.869383
Qbeast | ~6.6 sec | 37.856333

As you can see, the 1% sample returns the result about 22 times faster than the full query on the Delta format, with a relative error of 0.034%.
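For anyone who wants to check the arithmetic, the short snippet below recomputes both figures from the numbers in the comparison. The variable names are only illustrative, and the timings are the approximate values reported above.

```python
# Approximate timings and results taken from the comparison above.
delta_seconds = 2.5 * 60      # ~2.5 min
qbeast_seconds = 6.6          # ~6.6 sec
delta_result = 37.869383
qbeast_result = 37.856333

# How many times faster the sampled query ran.
speedup = delta_seconds / qbeast_seconds

# Relative error of the sampled result against the full result, in percent.
relative_error_pct = abs(delta_result - qbeast_result) / delta_result * 100

print(f"speedup: {speedup:.1f}x")
print(f"relative error: {relative_error_pct:.3f}%")
```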

If you want to play with it, check out the Qbeast-Spark repository on GitHub.

And don't forget to give us a star!

Your support means a lot ❤️
