Margo McCabe for HarperDB

Posted on Jun 30, 2020 • Edited on Aug 15, 2022 • Originally published at harperdb.io

Database Architectures & Use Cases - Explained

#database #architecture #webdev #beginners

With over 300 databases on the market, how do you determine which is right for your specific use case or skill set?

We continue to see the common debate of SQL vs. NoSQL and other database comparisons all over social media and platforms like dev.to. In most cases, it’s not that one database is better than the other, it’s that one is a better fit for a specific use case due to numerous factors.

A couple months back, our CTO Kyle Bernhardy, led an awesome talk titled A Deep Dive Into Database Architectures. You can watch this talk at the link, but since this is such a prominent discussion topic we thought it might be helpful to summarize. This article will provide an overview on database architectures, including use cases and pros & cons for each of them.

Let’s start with general considerations when selecting a database. It’s important to understand things such as data type / structure, data volume, consistency, write & read frequency, hosting, cost, security, and integration constraints. The more you know about these factors, the easier it will be to pick the right database for your project.

You may already know that there are generally 3 database hosting options:

On-Premises
Database fully maintained by organization on servers running within their data center(s)
More control, but usually more expensive and time consuming
Cloud Hosted
Servers are maintained by cloud providers, organizations maintain database software and operating system running on the machine
Flexible scaling and no server upkeep, but no control over physical server and potential network limitations
Database-as-a-Service (DBaaS)
Database maintained by service provider, organizations only charged for usage of service
Cost effective and zero upkeep, but data stewardship and potential network limitations

Now for the part you’ve been waiting for - database architectures.

Relational Databases

We’ll start with the most commonly used. Relational (SQL) Databases such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, and SQLite, organize data into tables with columns, each with a specified name and datatype. Additionally:

Rows are identified with a unique attribute, or grouping of attributes, called a primary key (typically a single column)
Relationships between tables are defined through foreign keys which reference primary keys
Strict schema (data model) enforcement
Data accessed via Structured Query Language (SQL)
ACID compliance (Atomicity, Consistency, Isolation, and Durability)
Extra features like Triggers & Stored Procedures

On the plus side, relational databases use mature technology that is widely understood and well-documented, SQL standards are well-defined, defined constraints enforce data integrity, they avoid data duplication and are highly secure and ACID-compliant. However on the negative side, SQL databases cannot handle unstructured or semi-structured data, their tables don’t necessarily map to objects, they require complicated ETL (Extract, Transform, Load) and maintenance, have row locking, and pricing for some products (Oracle, SAP) are out of reach for developers and some organizations.
Note: While some RDBMS systems can now handle JSON, they are not purpose built to do so.

There are several awesome use cases for relational databases; situations where data integrity is absolutely paramount (financial applications, defense and security, private health information), highly structured data, and automation of internal processes.

Then comes everything else…

Relational databases are the most common database in production today, but they were not designed for the scale and agility of modern applications. About 10 years ago the NoSQL movement caught on to address these concerns and changed the database landscape forever. Note:

NoSQL doesn’t mean anything (Non-SQL, Non-relational SQL, Not Only SQL)
Designed to address structure, performance, data volume, and scalability
Relational databases have since addressed many of these concerns

While databases are typically categorized as SQL or NoSQL, there are many intricacies to NoSQL databases. Let’s get into it.

Key-Value Stores

Key-value stores are often used as the underlying storage for higher level databases. For example, MongoDB uses a key value store called WiredTiger as their default storage engine. Key-Value Stores, such as Redis, DynamoDB, and Cosmos DB, are:

Simple, basic, & schemaless
Provide basic functionality for retrieving arbitrary data via a specific key
Values can be anything: single values, arrays, objects, files, etc.
Database does not evaluate the data it is storing
Data structure can be referred to as a dictionary or hash table

For pros, key value stores provide fast, low-complexity access to data, are flexible, and can scale quickly and cheaply. However, they have extremely limited functionality, cannot handle complex structures or query or search by anything other than key, do not scale well as data models grow, and they require more programming overhead for complex implementations.

Example use cases for key value stores would be embedded systems, URL shorteners, configuration data, application variables and flags for web applications, state information, and data represented by a dictionary or hash.

Document Stores

Document stores, such as MongoDB, DynamoDB, Couchbase, and Firebase, are similar to key-value stores, but the value is a document.

Typically document formats are JSON, BSON, or XML documents
Schemaless, no data structure enforcement (documents can be different)
Data accessed and modified via NoSQL (or proprietary language)
Well-suited for unstructured and semi-structured data
Seen as easier for development

The pros of document stores include flexibility and scalability, schemaless, fast writes, ideal for semi-structured and unstructured data, and developers do not need to know data structure ahead of time / it can change overtime without downtime. The cons are that they are not ACID compliant, limited to querying within a document, relationships/cross references are not enforced, slow searching, cannot join documents/collections in a single query, lack of database enforcement requires developer discipline and vigilance for application level enforcement, and they typically result in data duplication.

Great use cases for document stores are unstructured or semi-structured data, content management, rapid prototyping, and collecting of high traffic data.

Graph Databases

Graph databases, such as Neo4j, OrientDB, and TitanDB, are ideal for when relationships or connections are top priority.

Based on mathematical graph theory
Represent data as a network of related nodes, edges, and properties
Database stores data items within nodes and relationships in edges that connect nodes
Nodes are connected by relationships and grouped according to labels
Facilitate data visualizations and graph analytics
Each node contains free-form data

On the plus side, graph databases have advanced features for relationship querying, traversing, and tracking, are optimized for querying related data, and they avoid row locking. As for the negatives, graph databases have a large ramp up time for developers, high overhead for simple use cases, lack of standardization, poor performance of aggregate queries, and devs typically need to learn a custom query language.

Graph databases are great for analysis of heterogeneous data points, fraud prevention, advanced enterprise operations, social networking, payment systems, and GeoSpatial routing/visualization.

Time Series Databases

We’re almost there! Next up is time series databases, such as InfluxDB, Kdb+, and Prometheus, which are:

Focused on datasets that change over time
Heavily write oriented
Designed to handle constant streams of data
Typically append-only (no modification after ingestion)
Rollup/aggregation/down sampling features to lower archive data footprint

On the positive side, time series databases are designed for dealing with linear data over time, can handle high ingestion rates, have built-in features specifically for dealing with time-based data, a schema optimized for time-series arrays, and batch delete features. As far as negatives, time series databases only deal with time-series data, do not support full SQL, their read speed suffers compared to writes, they have no transaction capability and are append-only (not optimized for updates).

Great use cases for time series databases are managing infrastructure, IoT sensor collection, and log monitoring and alerting.

Search Engines

Last but not least is search engines, such as Elasticsearch, Splunk and Apache Solr, which are:

Built for non-relational, document-based data
Arranged and optimized for storage and rapid retrieval of data
Indexes data across a variety of sources including: file systems, intranets, document -Management systems, e-mail, and databases

Search engine pros include their focus on optimized searching, highly scalable and schemaless, and they have advanced search options like full text search, suggestions, and complex search operations. The cons are that they are expensive, have low durability and poor security, have no transaction support, are not efficient for writing and retrieving data outside of searching, and are difficult to manage.

Search engines are great when search results are top priority, logging, product catalogs, and blogs.

Multi Model Considerations

It should be noted that many real life implementations choose “polyglot persistence,” which is the concept of using different data storage technologies to handle different data storage needs within a given software application. Many database technologies implement this within a single software, referred to as multi-model or data lake technology. This can be ideal for specific use cases, but poses inherent risk such as instability, data inconsistency and corruption, expensive / resource intensive, and data replication on disk and memory.

A few examples of multi-model technologies are:

Hadoop: Software tools running on top of multiple databases
MongoDB: Multiple data storage options in a single database software platform
PostgreSQL: Row-oriented, column-oriented, key-value, and document-oriented data storage options

Finally, as a database company...

It probably makes sense to discuss where HarperDB fits into the mix.
If for no other reason than to highlight that these database “categories” are not all black and white, and some databases take more of a “hybrid” approach by subscribing to several methodologies. Instead of falling into a single bucket, HarperDB can be considered as a structured object store with SQL capabilities. It features:

Built-in REST API
NoSQL & SQL operations including joins
Dynamic schema
Advanced publish-subscribe data replication
Self-service management studio
Traditional drivers/interfaces
Custom Functions (serverless, highly customizable API endpoints that interact with our HarperDB Core operations)

We built HarperDB from the ground up to expand and blend the best capabilities of SQL, NewSQL, and NoSQL products because we felt there were certain use cases that could be better served with another solution. We believe that providing developers with the ability to choose the right tool for the job empowers developers and spurs innovation. Some examples where we feel that HarperDB is a better fit include cases where you need both NoSQL & SQL, rapid application development, integration, edge computing, distributed computing, and real-time operational analytics. Essentially, the SQL vs. NoSQL debate becomes irrelevant with HarperDB because you no longer have to choose!

Hopefully this database architecture overview is helpful in finding the right database for your use case. Please reply below with questions or comments - we would love to discuss!

Top comments (11)

Yaron Levi • Jul 4 '20

Great article!
You can add more though. Regarding real-time OLAP workloads you have Druid and Rockset.
Regarding document DB you have FaunaDB which is fully ACID and also pure "serverless" (pay per read, no memory, no cpu, no connections, no capacity planning)

Margo McCabe • Jul 6 '20

Thanks for the feedback and insights, we really appreciate it!

Mišo • Jul 2 '20

Hi, nice post. Don't get it wrong but I think there are some strong words about what some DBs can/can't do. I don't know all the technologies but 2 sentences resonated in me.

SQL databases cannot handle unstructured or semi-structured data - this could be misleading for the beginner. For example PostgreSQL supports JSON and XML and you are able to even query and index the parts of the JSON data. I know also MySQL has JSON support and I guess all modern DBs has some. Someone could choose the document DB just because of the sentence even if Relational DB with a few JSON columns would be better choice.
poor security, have no transaction support - I used to use ElasticSearch a lot and the security is really good these days. Also there are "optimistic transactions" elastic.co/guide/en/elasticsearch/.... Again, the sentence could mislead someone choosing different technology even if there are possibilities to deal with security and concurrency.

Anyway, thanks for the overview and the new DB I can follow. After CockroachDB, YougabyteDB, FaunaDB this is also interesting piece of SW 🙂

Stephen Goldberg • Jul 6 '20

@Miso,

Thanks so much for the feedback! @margo wrote this article based on a presentation our CTO gave a few months ago. You can watch the entire presentation here:

youtube.com/watch?v=UIAfhmCZbSQ

It goes into more detail than the above article.

That said, to address your points.

I agree with you that this is an over simplification. Obviously a lot of RDBMS systems now can handle JSON, and a lot of Document Stores have adopted Multi-Modal to accomodate SQL like querying. That said, neither are purpose built to handle both and each have a lot of challenges for them. Cost, performance, consistency, security etc.... I have written a pretty lengthy article about this here:

harperdb.io/blog/multimodel-databa...

And I am going to post another article that is a deep-dive on storage engines in a few days on dev.to that will cover this specific concept in more detail. That said, I agree we are over simplifying a bit in this article and appreciate your feedback!

I believe where the team got the feedback that search engines have poor security was from this Info World article here:

infoworld.com/article/3268871/how-...

I think that this also is a bit of an oversimplification, because the point is based on out of the box configuration, because it is referring to "There’s no innate authentication or access control." which is fairly easily remedy if desired.

We will make some edits the article today to clarify these points, and really appreciate your feedback.

Stephen

Stephen Goldberg • Jul 10 '20 • Edited

@Miso,

I put this blog together based on Georgi's feedback, but I think it addresses some of your points as well! Thanks again for engaging us in discussion!

dev.to/harperdb/deep-dive-newsql-d...

Stephen

Mišo • Jul 13 '20 • Edited

Thank you both guys for the replies. I read the articles and I am looking for the video you sent.

Also, I read some articles about the architecture of the HarperDB and it sounds a really interesting approach. I would love to see some feature/performance blog(s) with comparison to MongoDB, Redis, ElasticSearch, PostgreSQL, and other mainstream DBs.

Have a nice day :)