Datasource enabling multidimensional indexing and sampling pushdown

#opensource #datascience #news #github

Have you ever wondered what a multidimensional index would look like in Spark?

Recently we launched the Qbeast Open Source Format, a Data Lakehouse enhancement to speed up your queries!

Based on Delta Lake and available for Apache Spark, it lets you index your data by multiple columns and read a representative sample directly from storage 🔥
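To make that a bit more concrete, here is a minimal PySpark sketch of writing an indexed table and reading back a sample. It assumes the `qbeast` data source name and the `columnsToIndex` write option as described in the Qbeast-Spark README, so check the repository for the options your version supports. The helper names, path, and column list are hypothetical, and the Spark session must already have the Qbeast-Spark and Delta Lake packages available.

```python
def write_qbeast(df, path, columns_to_index):
    """Write a Spark DataFrame in Qbeast format, indexed by the given columns.

    Assumption: the "qbeast" format name and the "columnsToIndex" option
    follow the Qbeast-Spark README; they are not defined in this post.
    """
    (
        df.write.mode("overwrite")
        .format("qbeast")
        # Columns are passed as one comma-separated string.
        .option("columnsToIndex", ",".join(columns_to_index))
        .save(path)
    )


def read_qbeast_sample(spark, path, fraction):
    """Load a Qbeast table and keep roughly `fraction` of its rows.

    `sample` is the standard Spark DataFrame method; with Qbeast the
    sampling is meant to be pushed down to storage.
    """
    return spark.read.format("qbeast").load(path).sample(fraction=fraction)
```

With a real session, something like `read_qbeast_sample(spark, "/tmp/qbeast_table", 0.01)` would return a DataFrame holding about 1% of the rows, which you can then aggregate as usual.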

Here's a quick example of how you can boost your query performance using Qbeast:

This is a normal query with Spark and the Delta format.

This is the same query, but with Qbeast sampling at 1%.

The gifs are cool, right? Let's compare both executions:

| Format | Execution time | Result    |
|--------|----------------|-----------|
| Delta  | ~2.5 min       | 37.869383 |
| Qbeast | ~6.6 s         | 37.856333 |

As you can see, the 1% sample returns the result about 22 times faster than the full query on the Delta format, with a relative error of roughly 0.034%.
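If you want to check those two figures yourself, the arithmetic is short. This sketch only uses the approximate timings and results from the comparison above; the variable names are ours.

```python
# Approximate timings and results taken from the comparison above.
delta_seconds = 2.5 * 60   # ~2.5 min for the full Delta query
qbeast_seconds = 6.6       # ~6.6 s for the 1% Qbeast sample
delta_result = 37.869383
qbeast_result = 37.856333

# How many times faster the sampled query finished.
speedup = delta_seconds / qbeast_seconds

# Relative error of the sampled result, as a percentage of the full result.
relative_error_pct = abs(delta_result - qbeast_result) / delta_result * 100

print(f"speedup: {speedup:.1f}x")                    # speedup: 22.7x
print(f"relative error: {relative_error_pct:.3f}%")  # relative error: 0.034%
```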

If you want to play with it, check out the Qbeast-Spark repository on GitHub.

And don't forget to give us a star!

Your support means a lot ❤️
