Selecting the right open source DB for your workload

#database

So. Many. Databases.
The latest DB-Engines ranking lists 370 databases according to popularity. And the sheer number may lead some to go back to what they know, or rely on other developers’ recommendations. The more intrepid may even spend some time researching.

It takes some navigation to understand the landscape across all the database flavors. And understanding doesn’t necessarily lead to the right choice. Martin Kleppman literally maps this out in his book, Designing Data-Intensive Applications, O’Reilly. (Highly recommended read)

Ye olde capture-first
Early days of databases had fewer choices, there was Oracle, SQL Server, MySQL and PostgreSQL. You just had to choose between them, as they all had one thing in common, they were built for capture-first. Get the data.

Pick a database
Define your schema
Start loading and integrating
Tune normalization and queries
Add materialized view and query caching
Get ready to switch if this doesn’t work as intended

In this workflow, queries arrive too late to influence the database choice. Finding out later on that certain queries can’t be supported is problematic.

Tackling the paradox of choice
Through this process we quickly find out the chosen database and associated effort doesn’t meet the requirements. Here are a few options:

Stick with what you know
Bribe a trusted DBA or architect
Look at relevant TPC benchmarks
See what database experience is missing from your résumé
Consult a soothsayer, Magic 8 ball, dartboard

Mind-shift to query-first
Start with the end in mind. Choose your database by working backwards from your target read workloads. Think of this as TDD for database architecture.

I’m not saying you can ignore write performance (you can’t) or that write benchmarks don’t matter (they do). But don’t use write capabilities as the most important criteria for which database to use.

Read workloads
Getting to the data is where the value lives in a database. Applying a Zen koan:

“What is the value of a write that can’t be read?”

Writes are simply the cost of expected reads. From this view, re-consider the database options in terms of reads, as each database will have different tricks for managing and accessing the reads. Indexes are extra writes to accelerate reads, and replication are extra writes to ensure reads.

Think query-first. What kinds of reads can you expect and how to support them most efficiently and effectively?

Pick the right database for query-first

Define seed data that approximates a working system
Run read workloads for seed data on multiple DBs
Select the database with best workload fit (ops/sec)
Optimize for loading/maintaining data

If no single database platform is a match, use a distributed query engine like Trino, then replicate data through queues like Kafka or consider other tricks (share your tricks!)

Read Workload	Category	Description
Fetch value for single key	Key/Value	Returns unstructured value
Fetch values for related keys	Key/Value	Returns collection of values
Find single row with criteria	OLTP	Returns tuple (row of named columns) using column indexes
Find group of rows with criteria	OLTP	Returns collection of tuples using column indexes
Read rows within transaction	OLTP	Returns value based on transaction isolation level
Join subset of rows & related rows	OLAP	Returns collection of tuples joined across multiple tables
Join/summarize for few columns	OLAP	Returns count/histogram on a limited set of columns
Find/join/summarize for all columns	DSS	Returns data transformation computed against all available columns

Cut to the chase
For our own use, here’s how we evaluated the options for our read workloads:

READ WORKLOAD	CACHE Redis, Memcached	LSM **** Cassandra, HBase, RocksDB, LevelDB	BTREE **** MySQL, Postgresql, SQLite, SQL Server	COLUMNAR **** Druid, Iceberg, Parquet, Orc	M/R **** Hadoop, Resurface
Fetch value for single key	🥳	😀	🤔
Fetch values for related keys	😀	🥳	😀
Find single row with criteria	😀	😀	🥳
Find group of rows with criteria		😀	🥳
Read row within transaction		🤔	🥳
Join subset of rows & related rows			🥳	😀
Join/summarize for few columns			😀	🥳	😀
Find/join/summarize for all columns			😱	😱	🥳

Bonus! A few advanced read optimizations

🧐
Move data in-memory to eliminate device I/O
Use local/embedded store to eliminate network

😍
Use computed (virtual) columns
Use optimized storage when table scanning

🤩
Immutable writes as safe transactions
In-memory storage via page cache
Push queries closer to data

🤓
I could nerd out on this all day. I often do. Reach out if you want to talk database, query-first architectures, APIs or anything else.

DEV Community

Selecting the right open source DB for your workload

Top comments (0)

Read next

Unlocking DuckDB from Anywhere - A Guide to Remote Access with Apache Arrow and Flight RPC (gRPC)

How I Went From Being a DBA to DevOps Engineer

DynamoDB-style Limits for Predictable SQL Performance?

Everyone Uses Postgres… But Why?