With over 300 databases on the market, how do you determine which is right for your specific use case or skill set?
We continue to see the common debate of SQL vs. NoSQL and other database comparisons all over social media and platforms like dev.to. In most cases, it’s not that one database is better than the other, it’s that one is a better fit for a specific use case due to numerous factors.
A couple months back, our CTO Kyle Bernhardy, led an awesome talk titled A Deep Dive Into Database Architectures. You can watch this talk at the link, but since this is such a prominent discussion topic we thought it might be helpful to summarize. This article will provide an overview on database architectures, including use cases and pros & cons for each of them.
Let’s start with general considerations when selecting a database. It’s important to understand things such as data type / structure, data volume, consistency, write & read frequency, hosting, cost, security, and integration constraints. The more you know about these factors, the easier it will be to pick the right database for your project.
You may already know that there are generally 3 database hosting options:
Database fully maintained by organization on servers running within their data center(s)
More control, but usually more expensive and time consuming
Servers are maintained by cloud providers, organizations maintain database software and operating system running on the machine
Flexible scaling and no server upkeep, but no control over physical server and potential network limitations
Database maintained by service provider, organizations only charged for usage of service
Cost effective and zero upkeep, but data stewardship and potential network limitations
We’ll start with the most commonly used. Relational (SQL) Databases such as Oracle, MySQL, PostgreSQL, Microsoft SQL Server, and SQLite, organize data into tables with columns, each with a specified name and datatype. Additionally:
- Rows are identified with a unique attribute, or grouping of attributes, called a primary key (typically a single column)
- Relationships between tables are defined through foreign keys which reference primary keys
- Strict schema (data model) enforcement
- Data accessed via Structured Query Language (SQL)
- ACID compliance (Atomicity, Consistency, Isolation, and Durability)
- Extra features like Triggers & Stored Procedures
On the plus side, relational databases use mature technology that is widely understood and well-documented, SQL standards are well-defined, defined constraints enforce data integrity, they avoid data duplication and are highly secure and ACID-compliant. However on the negative side, SQL databases cannot handle unstructured or semi-structured data, their tables don’t necessarily map to objects, they require complicated ETL (Extract, Transform, Load) and maintenance, have row locking, and pricing for some products (Oracle, SAP) are out of reach for developers and some organizations.
Note: While some RDBMS systems can now handle JSON, they are not purpose built to do so.
There are several awesome use cases for relational databases; situations where data integrity is absolutely paramount (financial applications, defense and security, private health information), highly structured data, and automation of internal processes.
Relational databases are the most common database in production today, but they were not designed for the scale and agility of modern applications. About 10 years ago the NoSQL movement caught on to address these concerns and changed the database landscape forever. Note:
- NoSQL doesn’t mean anything (Non-SQL, Non-relational SQL, Not Only SQL)
- Designed to address structure, performance, data volume, and scalability
- Relational databases have since addressed many of these concerns
While databases are typically categorized as SQL or NoSQL, there are many intricacies to NoSQL databases. Let’s get into it.
Key-value stores are often used as the underlying storage for higher level databases. For example, MongoDB uses a key value store called WiredTiger as their default storage engine. Key-Value Stores, such as Redis, DynamoDB, and Cosmos DB, are:
- Simple, basic, & schemaless
- Provide basic functionality for retrieving arbitrary data via a specific key
- Values can be anything: single values, arrays, objects, files, etc.
- Database does not evaluate the data it is storing
- Data structure can be referred to as a dictionary or hash table
For pros, key value stores provide fast, low-complexity access to data, are flexible, and can scale quickly and cheaply. However, they have extremely limited functionality, cannot handle complex structures or query or search by anything other than key, do not scale well as data models grow, and they require more programming overhead for complex implementations.
Example use cases for key value stores would be embedded systems, URL shorteners, configuration data, application variables and flags for web applications, state information, and data represented by a dictionary or hash.
Document stores, such as MongoDB, DynamoDB, Couchbase, and Firebase, are similar to key-value stores, but the value is a document.
- Typically document formats are JSON, BSON, or XML documents
- Schemaless, no data structure enforcement (documents can be different)
- Data accessed and modified via NoSQL (or proprietary language)
- Well-suited for unstructured and semi-structured data
- Seen as easier for development
The pros of document stores include flexibility and scalability, schemaless, fast writes, ideal for semi-structured and unstructured data, and developers do not need to know data structure ahead of time / it can change overtime without downtime. The cons are that they are not ACID compliant, limited to querying within a document, relationships/cross references are not enforced, slow searching, cannot join documents/collections in a single query, lack of database enforcement requires developer discipline and vigilance for application level enforcement, and they typically result in data duplication.
Great use cases for document stores are unstructured or semi-structured data, content management, rapid prototyping, and collecting of high traffic data.
Graph databases, such as Neo4j, OrientDB, and TitanDB, are ideal for when relationships or connections are top priority.
- Based on mathematical graph theory
- Represent data as a network of related nodes, edges, and properties
- Database stores data items within nodes and relationships in edges that connect nodes
- Nodes are connected by relationships and grouped according to labels
- Facilitate data visualizations and graph analytics
- Each node contains free-form data
On the plus side, graph databases have advanced features for relationship querying, traversing, and tracking, are optimized for querying related data, and they avoid row locking. As for the negatives, graph databases have a large ramp up time for developers, high overhead for simple use cases, lack of standardization, poor performance of aggregate queries, and devs typically need to learn a custom query language.
Graph databases are great for analysis of heterogeneous data points, fraud prevention, advanced enterprise operations, social networking, payment systems, and GeoSpatial routing/visualization.
We’re almost there! Next up is time series databases, such as InfluxDB, Kdb+, and Prometheus, which are:
- Focused on datasets that change over time
- Heavily write oriented
- Designed to handle constant streams of data
- Typically append-only (no modification after ingestion)
- Rollup/aggregation/down sampling features to lower archive data footprint
On the positive side, time series databases are designed for dealing with linear data over time, can handle high ingestion rates, have built-in features specifically for dealing with time-based data, a schema optimized for time-series arrays, and batch delete features. As far as negatives, time series databases only deal with time-series data, do not support full SQL, their read speed suffers compared to writes, they have no transaction capability and are append-only (not optimized for updates).
Great use cases for time series databases are managing infrastructure, IoT sensor collection, and log monitoring and alerting.
Last but not least is search engines, such as Elasticsearch, Splunk and Apache Solr, which are:
- Built for non-relational, document-based data
- Arranged and optimized for storage and rapid retrieval of data
- Indexes data across a variety of sources including: file systems, intranets, document -Management systems, e-mail, and databases
Search engine pros include their focus on optimized searching, highly scalable and schemaless, and they have advanced search options like full text search, suggestions, and complex search operations. The cons are that they are expensive, have low durability and poor security, have no transaction support, are not efficient for writing and retrieving data outside of searching, and are difficult to manage.
Search engines are great when search results are top priority, logging, product catalogs, and blogs.
It should be noted that many real life implementations choose “polyglot persistence,” which is the concept of using different data storage technologies to handle different data storage needs within a given software application. Many database technologies implement this within a single software, referred to as multi-model or data lake technology. This can be ideal for specific use cases, but poses inherent risk such as instability, data inconsistency and corruption, expensive / resource intensive, and data replication on disk and memory.
A few examples of multi-model technologies are:
- Hadoop: Software tools running on top of multiple databases
- MongoDB: Multiple data storage options in a single database software platform
- PostgreSQL: Row-oriented, column-oriented, key-value, and document-oriented data storage options
It probably makes sense to discuss where HarperDB fits into the mix.
If for no other reason than to highlight that these database “categories” are not all black and white, and some databases take more of a “hybrid” approach by subscribing to several methodologies. Instead of falling into a single bucket, HarperDB can be considered as a structured object store with SQL capabilities. It features:
- Built-in REST API
- NoSQL & SQL operations including joins
- Dynamic schema
- Advanced publish-subscribe data replication
- Self-service management studio
- Traditional drivers/interfaces
We built HarperDB from the ground up to expand and blend the best capabilities of SQL, NewSQL, and NoSQL products because we felt there were certain use cases that could be better served with another solution. We believe that providing developers with the ability to choose the right tool for the job empowers developers and spurs innovation. Some examples where we feel that HarperDB is a better fit include cases where you need both NoSQL & SQL, rapid application development, integration, edge computing, distributed computing, and real-time operational analytics. Essentially, the SQL vs. NoSQL debate becomes irrelevant with HarperDB because you no longer have to choose!
Hopefully this database architecture overview is helpful in finding the right database for your use case. Please reply below with questions or comments - we would love to discuss!