At QuestDB, we've built an open-source time series database focused on performance. We started QuestDB so that we could bring our experience in low-latency trading and the technical approaches we developed in this domain to a variety of real-time data processing use cases.
The journey to today's version of QuestDB began with the original prototype in 2013, and we've described what happened since in a post published during our HackerNews launch last year. Our users deploy QuestDB to make time series analysis fast, efficient, and convenient in financial services, IoT, application monitoring, and machine learning.
In the early stages of the project, we were inspired by vector-based append-only systems like kdb+ because of the speed advantages and the simple code path this model brings. QuestDB's data model uses what we call time-based arrays, a linear data structure. This allows QuestDB to slice data into small chunks during ingestion and process them all in parallel. Data that arrives out of time order is reordered in memory before being persisted to disk, so data lands on disk already ordered by time. As such, QuestDB does not rely on computationally intensive indices to reorder data for time-series queries.
This linear model differs from the LSM-tree or B-tree-based storage engines found in other open-source databases such as InfluxDB or TimescaleDB.
Beyond ingestion capabilities, QuestDB’s data layout enables CPUs to access data faster. Our codebase leverages modern CPU architecture with SIMD instructions to request that the same operation be performed on multiple data elements in parallel. We store data in columns and partition it by time to lift the minimal amount of data from the disk for a given query.
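To make the benefit of this layout concrete, here is a minimal sketch (not QuestDB's actual code; the `partition_t` type and function names are hypothetical) of why columnar, time-partitioned storage helps: partitions outside the query's time range are skipped entirely, and the per-column loop is tight enough for compilers to auto-vectorize with SIMD. Within-partition timestamp filtering is omitted for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a time-partitioned column. Values are stored
   contiguously, so the inner loop is SIMD-friendly, and only partitions
   that overlap the queried time range are touched at all. */
typedef struct {
    int64_t ts_lo, ts_hi;   /* time bounds of this partition */
    const double *values;   /* contiguous column data */
    size_t count;
} partition_t;

double sum_in_range(const partition_t *parts, size_t nparts,
                    int64_t lo, int64_t hi) {
    double total = 0.0;
    for (size_t p = 0; p < nparts; p++) {
        /* skip partitions entirely outside the query window */
        if (parts[p].ts_hi < lo || parts[p].ts_lo > hi)
            continue;
        const double *v = parts[p].values;
        for (size_t i = 0; i < parts[p].count; i++)
            total += v[i];  /* tight, branch-free: auto-vectorizable */
    }
    return total;
}
```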
We saw the Time Series Benchmark Suite (TSBS) regularly coming up in discussions about database performance and decided we should provide the ability to benchmark QuestDB along with other systems. The TSBS is a collection of Go programs to generate datasets and then benchmark read and write performance. The suite is extensible so that different use cases and query types can be included and compared across systems.
Here are our results of the benchmark with the cpu-only use case, using up to fourteen workers on an AWS EC2 m5.8xlarge instance with sixteen cores.
We reach maximum ingestion performance using four workers, whereas the other systems require more CPU resources to hit maximum throughput. QuestDB achieves 959k rows/sec with 4 threads. We find that InfluxDB needs 14 threads to reach its max ingestion rate (334k rows/sec), while TimescaleDB reaches 145k rows/sec with 4 threads. ClickHouse hits 914k rows/sec with twice as many threads as QuestDB.
When running on 4 threads, QuestDB is 1.7x faster than ClickHouse, 6.5x faster than InfluxDB and 6.6x faster than TimescaleDB.
When we ran the suite again on an AMD Ryzen 5 processor, we hit a maximum throughput of 1.43 million rows per second using 5 threads, compared to the Intel Xeon Platinum in our reference m5.8xlarge instance on AWS.
Re-ordering data that arrives out of order (O3) during ingestion proved particularly challenging. It is a new approach that we wanted to detail a little more in this article. Our idea for handling out-of-order ingestion was a three-stage approach:
- Keep the append model until records arrive out-of-order
- Sort uncommitted records in a staging area in-memory
- Reconcile and merge the sorted out-of-order data and persisted data at commit time
The first two steps are straightforward and easy to implement, and handling append-only data is unchanged. The heavy out-of-order commit kicks in only when there is data in the staging area. The bonus of this design is that the output consists of vectors, so our vector-based readers remain compatible.
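The three stages can be sketched in miniature (a hypothetical, single-column simplification, not QuestDB's implementation; a real table would stage all columns and persist to disk). While the staging area is empty, ingestion stays on the fast append path; once it is not, rows are deferred and reconciled at commit time:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAX 64

/* Hypothetical sketch: append while ordered, stage out-of-order rows,
   then sort and merge at commit time. */
typedef struct {
    int64_t persisted[MAX]; size_t plen;   /* always sorted by time */
    int64_t staging[MAX];   size_t slen;   /* uncommitted O3 rows */
} table_t;

void ingest(table_t *t, int64_t ts) {
    if (t->slen == 0 && (t->plen == 0 || ts >= t->persisted[t->plen - 1]))
        t->persisted[t->plen++] = ts;      /* stage 1: fast append path */
    else
        t->staging[t->slen++] = ts;        /* defer until commit */
}

static int cmp64(const void *a, const void *b) {
    int64_t x = *(const int64_t *)a, y = *(const int64_t *)b;
    return (x > y) - (x < y);
}

void commit(table_t *t) {
    if (t->slen == 0) return;              /* nothing out of order */
    /* stage 2: sort the staging area */
    qsort(t->staging, t->slen, sizeof(int64_t), cmp64);
    /* stage 3: reconcile and merge with persisted data */
    int64_t merged[MAX]; size_t i = 0, j = 0, k = 0;
    while (i < t->plen && j < t->slen)
        merged[k++] = t->persisted[i] <= t->staging[j]
                    ? t->persisted[i++] : t->staging[j++];
    while (i < t->plen) merged[k++] = t->persisted[i++];
    while (j < t->slen) merged[k++] = t->staging[j++];
    memcpy(t->persisted, merged, k * sizeof(int64_t));
    t->plen = k; t->slen = 0;
}
```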
This pre-commit sort-and-merge adds an extra processing phase to ingestion with an accompanying performance penalty. We nevertheless decided to explore this approach and see how far we could reduce the penalty by optimizing the out-of-order commit.
Processing a staging area gives us a unique opportunity to analyze the data holistically, where we can avoid physical merges altogether and get away with fast, straightforward memcpy or similar data-movement methods. Such methods can be parallelized thanks to our column-based storage, and we can employ SIMD and non-temporal data access where it makes a difference.
We sort the timestamp column from the staging area via an optimized version of radix sort, and the resulting index is used to reshuffle the remaining columns in the staging area in parallel:
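A simplified LSD radix sort illustrates the idea (hypothetical names; QuestDB's version is further optimized, and the reshuffle runs in parallel across columns rather than serially). Sorting the timestamps produces an index of original row positions, which is then used to permute every other column; this sketch assumes non-negative timestamps.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified byte-at-a-time LSD radix sort of a timestamp column that
   also yields a sort index. Assumes non-negative timestamps. */
void radix_sort_index(const int64_t *ts, size_t n, size_t *index) {
    int64_t *keys = malloc(n * sizeof *keys), *tk = malloc(n * sizeof *tk);
    size_t  *ti   = malloc(n * sizeof *ti);
    for (size_t i = 0; i < n; i++) { keys[i] = ts[i]; index[i] = i; }
    for (int shift = 0; shift < 64; shift += 8) {
        size_t count[257] = {0};
        /* histogram of the current byte */
        for (size_t i = 0; i < n; i++)
            count[((keys[i] >> shift) & 0xFF) + 1]++;
        /* prefix sums give each bucket's start offset */
        for (int b = 0; b < 256; b++) count[b + 1] += count[b];
        /* stable scatter of keys and their row indices */
        for (size_t i = 0; i < n; i++) {
            size_t pos = count[(keys[i] >> shift) & 0xFF]++;
            tk[pos] = keys[i]; ti[pos] = index[i];
        }
        memcpy(keys, tk, n * sizeof *keys);
        memcpy(index, ti, n * sizeof *ti);
    }
    free(keys); free(tk); free(ti);
}

/* reshuffle one column using the sort index; in a real table this runs
   once per column, in parallel */
void reshuffle(const double *src, double *dst,
               const size_t *index, size_t n) {
    for (size_t i = 0; i < n; i++) dst[i] = src[index[i]];
}
```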
The now-sorted staging area is mapped relative to the existing partition data. It may not be obvious from the start but we are trying to establish the type of operation needed and the dimensions of each of the three groups below:
When merging datasets in this way, the prefix and suffix groups can be persisted data, out-of-order data, or none. The merge group is where more cases occur as it can be occupied by persisted data, out-of-order data, both out-of-order and persisted data, or none.
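The classification above can be sketched by comparing the timestamp ranges of the two sorted datasets (a hypothetical helper; the real logic also has to compute the exact row boundaries of each group, not just who occupies it):

```c
#include <stdint.h>

/* Which dataset occupies a given group of the commit. */
typedef enum { NONE, O3_ONLY, DATA_ONLY, BOTH } group_t;

/* Given the time bounds of the sorted out-of-order (o3) data and the
   persisted partition data, decide who occupies the prefix, merge,
   and suffix groups. */
void classify(int64_t o3_min, int64_t o3_max,
              int64_t data_min, int64_t data_max,
              group_t *prefix, group_t *merge, group_t *suffix) {
    if (o3_max < data_min) {          /* o3 entirely before persisted */
        *prefix = O3_ONLY; *merge = NONE; *suffix = DATA_ONLY;
        return;
    }
    if (data_max < o3_min) {          /* o3 entirely after: pure append */
        *prefix = DATA_ONLY; *merge = NONE; *suffix = O3_ONLY;
        return;
    }
    /* ranges overlap: the earlier start owns the prefix, the later
       end owns the suffix, and the overlap must be physically merged */
    *prefix = o3_min < data_min ? O3_ONLY
            : data_min < o3_min ? DATA_ONLY : NONE;
    *suffix = o3_max > data_max ? O3_ONLY
            : data_max > o3_max ? DATA_ONLY : NONE;
    *merge  = BOTH;
}
```

The two early returns are the cheap cases: no overlap means no merge group, and whole blocks of rows can be moved with memcpy.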
When it's clear how to group and treat data in the staging area, a pool of workers performs the required operations, calling memcpy in trivial cases and shifting to SIMD-optimized code for everything else. With a prefix, merge, and suffix split, the maximum parallelism of the commit (how much additional CPU capacity it can absorb) is the number of affected partitions times the number of columns times the three groups.
Being able to copy data fast is a good option, but we think that heavy data copying can be avoided in most time series ingestion scenarios. Assuming that most real-time out-of-order situations are caused by the delivery mechanism and hardware jitter, we can deduce that the timestamp distribution will be contained by some boundary.
For example, if any new timestamp value has a high probability of falling within 10 seconds of the previously received value, the boundary is 10 seconds, and we call this boundary the lag.
When timestamp values follow this pattern, deferring the commit can turn an out-of-order commit into a normal append operation. The out-of-order system can deal with any degree of lateness, but data that is late by no more than the configured lag is prioritized for faster processing.
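The lag heuristic can be sketched as a simple cut-off (a hypothetical helper, not QuestDB's API): rows older than the newest timestamp seen minus the lag are safe to commit, while newer rows are held back in memory so that late arrivals within the lag window can still be sorted in and appended in order.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns how many leading rows of a time-sorted staging buffer are
   safe to commit, given the newest timestamp seen and the configured
   lag. Rows newer than (max_ts - lag) stay in memory, since a late
   arrival within the lag window could still sort before them. */
size_t committable(const int64_t *sorted_staging, size_t n,
                   int64_t max_ts, int64_t lag) {
    int64_t boundary = max_ts - lag;
    size_t k = 0;
    while (k < n && sorted_staging[k] <= boundary)
        k++;
    return k;
}
```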
We have opened a pull request (Questdb benchmark support) in TimescaleDB's TSBS GitHub repository to add the ability to run the benchmark against QuestDB. In the meantime, users may clone our fork of the benchmark and run the suite to see the results for themselves.
```shell
tsbs_generate_data --use-case="cpu-only" --seed=123 --scale=4000 \
  --timestamp-start="2016-01-01T00:00:00Z" \
  --timestamp-end="2016-01-02T00:00:00Z" \
  --log-interval="10s" --format="influx" > /tmp/bigcpu

tsbs_load_questdb --file /tmp/bigcpu --workers 4
```
Pushing database performance further while making it easy for developers to get started with our product motivates us every day. This is why we are focused on building a solid community of developers who can participate and improve the product through our open source distribution model.
Beyond making QuestDB easy to use, we want to make it easy to audit, review, and make code or general project contributions. All of QuestDB's source code is available on GitHub under the Apache 2.0 license and we welcome all sorts of contributions from GitHub issues to pull requests.