Apache Doris

Posted on Feb 3, 2023

Apache Doris Roadmap 2023

#blockchain #web3 #ethereum #cryptocurrency

Hey guys, the 2023 roadmap has arrived, with a clear view of our release schedule and the features in progress. It is open for discussion now, and you can comment on GitHub. Don't forget to star us; we are close to 7k stars!

https://github.com/apache/doris/issues/16392

This is the Apache Doris Roadmap for 2023.

Our Main Focus

We plan to optimize Apache Doris for these scenarios:

Blazing fast OLAP
- Internal reporting
- Ad-hoc query
- Customer- or user-facing analytics (high concurrency)
Blazing fast query engine for data lake and lakehouse
- Query acceleration for Hive
- Query acceleration for open table formats (Iceberg, Hudi, Delta Lake)
Semi-structured data storage and analysis
- Log storage, retrieval, and analysis
- Time series data storage, retrieval, and analysis
High-speed data processing (data engineering)
- ETL/ELT acceleration
- Streaming data warehouse

Release Schedule

We plan to upgrade Apache Doris at the following pace:

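| Month | V 1.2.x | V 2.0.x | V 2.1.x | V 2.2.x |
| --- | --- | --- | --- | --- |
| Jan. | 1.2.1 | | | |
| Feb. | 1.2.2 | 2.0.0 preview | | |
| Mar. | 1.2.3 | 2.0.0 | | |
| Apr. | 1.2.4 | 2.0.1 | | |
| May | 1.2.5 | 2.0.2 | 2.1.0 preview | |
| Jun. | | 2.0.3 | 2.1.0 | |
| Jul. | | 2.0.4 | 2.1.1 | |
| Aug. | | 2.0.5 | 2.1.2 | 2.2.0 preview |
| Sept. | | | 2.1.3 | 2.2.0 |
| Oct. | | | 2.1.4 | 2.2.1 |
| Nov. | | | 2.1.5 | 2.2.2 |
| Dec. | | | | 2.2.3 |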

Features

We plan to develop or continuously optimize these features:

Hybrid Workloads

Query execution engine
- [ ] Pipeline task parallelism
- [ ] CodeGen
- [ ] Adaptive execution enhancement
Spill To Disk
- [x] Sort Node
- [ ] HashJoin Node
- [ ] Aggregation Node
- [ ] Sort Merge Join
- [ ] Sort Aggregation
- [ ] Optimize Spill To Disk, e.g. compression, encryption, and spill disk management
- [ ] New query management framework built on Spill To Disk
Workload manager for hybrid workloads
- [ ] Resource isolation based on pipeline engine (CPU, memory, IO)
- [ ] Resource queue
- [ ] Async execution
- [ ] Query priority
- [ ] Query scheduler

Semi-Structured Data Analysis

Complex Data Type
- [x] Array data type & functions
- [x] JSONB data type & functions
- [ ] Map data type & functions
- [ ] Struct data type & functions
- [ ] IPv4 & IPv6 data type & functions
- [ ] GEO data type & functions
Index Enhancement
- [x] NGram Bloom filter index
- [x] Full-Text index for string/number/date
- [x] BKD numeric index for string/number/date
- [x] Full-Text & BKD index for Array
- [ ] Full-Text & BKD index for Map
- [ ] Full-Text & BKD index for Struct
- [ ] BKD index for IPv4 & IPv6
- [ ] BKD index for GEO
Dynamic Schema Table
- [x] Dynamic Schema Table syntax
- [x] Dynamic Schema Table write and read
- [x] Dynamic Schema Table index

Lakehouse & Data Integration

Query acceleration for data lake and lakehouse
- [x] Parquet, CSV, ORC files
- [x] Iceberg
- [x] Hudi COW
- [ ] Hudi MOR
- [ ] Delta Lake
- [ ] Flink Table Store
Catalog & Cloud Storage integration
- [x] Hive Meta Store
- [x] AWS Glue
- [x] Alibaba Cloud DLF
- [x] Object storage on AWS, Azure, GCP, Alibaba Cloud, Tencent Cloud, Huawei Cloud
Managed lake engine
- [ ] Parquet writer
- [ ] ORC writer
- [ ] Doris Catalog for Iceberg
- [ ] Managed Iceberg lake engine
Data Security
- [x] Kerberos
- [x] KMS
- [x] Apache Ranger integration
- [ ] Public Cloud (Alibaba Cloud, AWS) IAM Role
New Spark/Flink Load
- Write Doris data format files externally.
- Refactor the Spark/Flink Load framework to support batch load.
Hive/Presto/Spark function compatibility
Graph database federated query support

New Optimizer (Nereids)

Features
- [x] Support all features of, or replace, the old query optimizer
- [ ] DML (insert, update, merge)
- [ ] Query cache
Performance
- [x] Optimize the time consumption of the plan stage
- [ ] RBO Rules enhancement
- [ ] CBO Rules enhancement, inline CTE, etc.
Support for hybrid workloads
- [ ] Optimize rules for data lake engine
- [ ] Adaptive query plan
- [ ] Adaptive sort/agg algorithm
Statistics enhancement
- [ ] Optimize statistics derivation: improve accuracy and support complex expressions
- [ ] Richer statistics to support non-uniformly distributed data
- [ ] Optimize statistics persistence and caching mechanism
- [ ] Auto collect statistics
- [ ] Optimize the cost model to be more adaptable to distributed scenarios

Cost Efficiency & Performance

Cloud Native
- [x] Cold & Hot Data Separation
- [x] Elastic Compute Node
Low-latency, high-concurrency point query
Aggregating index & projection
Performance Self-Tuning
Multi-Table Materialized View
- [ ] Automatic Incremental refresh
- [ ] Automatic query rewriting

Data Modeling & Storage Engine

Cross Cluster Replication (CCR) & Binlog
- [ ] CCR to enable higher HA
- [ ] Binlog to enable streaming computing
Unique Key Constraint
- [x] Merge-on-Write (MoW) Unique Key Table
- [ ] Partial Column Update on MoW UNIQUE Key Table
DDL Simplification
- [x] Support functions in partitioning
- [x] Auto Bucket Number
Unified Data Model
General Delete, Update, Merge Support
Light Schema Change
- [ ] Do not affect historical data; apply only to newer data

Ecosystem

Enhance BI tools compatibility
- Metabase
- Superset
- Tableau
Enhance doris-dbt
Enhance Doris-Airbyte
Enhance integration with cloud data integration tools

Utility & Stability

RBAC (Role-Based Access Control) enhancement
Profiling / Tracing enhancement
Doris Manager enhancement
Multi-language UDF
More fuzz tests
All HTTP APIs support HTTPS and authorization
Full support for K8s deployment

Apache Doris

Apache Doris is a real-time analytical database based on MPP architecture, known for its high performance and ease of use. It supports both high-concurrency point queries and high-throughput complex analysis. (https://github.com/apache/doris)
