I said it before: I love databases. For me, they fall into the same category of beautiful and complex systems as operating system kernels - because they have to solve the same sort of problems. A life without databases is possible but meaningless ;-)
However, there are so many different incarnations of databases available, implementing so many different concepts, that it can be tough to choose the right concept for the job at hand - and thus the right database software.
In general, a database has one job: organize a collection of data so it can be stored on and retrieved from a computer system. That breaks down into three tasks:
- organization of a data collection (definition of virtual and physical structures)
- storage (creation, insertion, updates, and deletion)
- retrieval (making the data accessible by an external application)
Each of these tasks possibly has a large number of limiting factors or constraints, and these, in turn, have an impact on the others. So database systems can become very, very complex pieces of software. If you ever write your own, you will be confronted with problems you possibly never knew existed ;-)
Imagine you can't use a dedicated database system for your need to "organize a collection of data which can be stored and retrieved from a computer system." What would you do? I think it is fair to say that you would try to avoid writing extra software for that - every modern operating system does a brilliant job at that in the form of a filesystem, and billions of people trust filesystems with their most critical data every day.
So why not use it?
In my opinion, adequately designed filesystem layouts (that is: directory structures and file naming conventions) are an often overlooked way to solve the database problem. Filesystems are (depending on the underlying physical storage and its properties) capable of storing exabytes of data in zillions of files, at high transaction rates and with massive throughput. But saving data as a file is not the problem here. The problems emerge when you try to (re-)access data, or more specifically, a particular data set. How do you find the data in an efficient and thus fast way? If the desired data is the file itself, there are many ways to handle that. But what if it sits somewhere inside, between hundreds or thousands of uninteresting records? You end up creating and maintaining an index. One problem solved. But how do you handle simultaneous reads? And writes? And how do you manage the situation where data is not only hierarchical but has many interdependent connections and links? Now things are getting complicated. My message: as long as the semantics of a filesystem and its layout fit your application's use-case, the benefit of a higher level of abstraction in the form of a database management system might not pay off. Put the "/etc" directory of a Linux system into a database system? That might create more problems than it solves (to make my point and give you an idea).
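As a sketch of what "filesystem as database" can look like - all names here are hypothetical - a directory tree can serve as a simple key-value store, with an atomic rename standing in for transactional writes:

```python
import os
import tempfile

class FileStore:
    """A minimal key-value store backed by the filesystem (illustrative sketch)."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key):
        # Use the first two characters of the key as a subdirectory to keep
        # directory listings small (a common fan-out trick).
        fanout = os.path.join(self.root, key[:2])
        os.makedirs(fanout, exist_ok=True)
        return os.path.join(fanout, key)

    def put(self, key, value):
        # Write to a temp file and rename: on POSIX filesystems, rename is
        # atomic, so readers never see a half-written value.
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "w") as f:
            f.write(value)
        os.replace(tmp, self._path(key))

    def get(self, key, default=None):
        try:
            with open(self._path(key)) as f:
                return f.read()
        except FileNotFoundError:
            return default
```

This covers exactly one access pattern - lookup by key. The moment you need secondary lookups, concurrent writers, or cross-record links, you are back at the index and isolation problems described above.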
Relational theory has been around for quite some time and is almost omnipresent. It is grounded in set theory, and E.F. Codd developed the relational algebra, which in turn led to the invention of the now ubiquitous standard query language, SQL. At its core, relational theory is all about entities and relations. If you have the time, I highly suggest you read his paper, and if you have even more time: read C.J. Date's An Introduction to Database Systems, an in-depth and very intense coverage of all things relational. With this model, it is possible to define relations between entities (even recursive ones!), which enables you to create a precise model of real-world relationships with high flexibility.
You can model the real world in all its glory: is-a, one-to-one, one-to-many, many-to-many. If you need constrained flexibility, this is for you. By this, I mean that you can model almost any possible relationship and ensure that it adheres to given rules, such as unique, not-null, or foreign-key constraints. But these constraints are more or less only structural. What you also should get is adherence to Codd's 12 rules for relational database systems (also given to us mere mortals by him) as well as conformance to the ACID principles:
- Atomicity: every transaction is performed entirely or not at all
- Consistency: the database state is consistent after a transaction completes
- Isolation: concurrent transactions are isolated from the effects of each other (this one is hard to get right!)
- Durability: as soon as a transaction finishes, the data persists on non-volatile storage
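Atomicity can be observed in miniature with SQLite, which ships with Python. A sketch with a hypothetical account-transfer rule:

```python
import sqlite3

# A failed transfer must roll back both legs, leaving the balances
# exactly as they were before the transaction started.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
        # Simulate a business-rule failure mid-transaction:
        overdrawn, = conn.execute(
            "SELECT balance < 0 FROM accounts WHERE name = 'alice'").fetchone()
        if overdrawn:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
except ValueError:
    pass

# Both balances are untouched - the transaction happened "not at all".
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# → {'alice': 100, 'bob': 0}
```

The `with conn:` block is Python's shorthand for commit-or-rollback; a real server-based engine adds the hard parts on top: isolation between concurrent transactions and durable writes to non-volatile storage.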
You might guess that some of these properties are hard to achieve on a single server, and even harder or impossible to have in a distributed system. In such systems you need concepts like clustering, sharding, two-phase commit, etc. But sooner or later you enter the arena of the CAP theorem and its many relatives - which led to the creation of a special breed of NoSQL databases (more on this later).
To summarize: with a relational database engine you get a centralized state machine that transitions into a new state with every transaction while enforcing the structural and referential integrity of your data at every single point in time. If you need true scalability (not only vertical - more resources on a single host - but also horizontal - more hosts), this can become complex and very costly (it once was a unique selling proposition for enterprise-class database engines like Oracle, IBM DB2, MS SQL Server or Sybase Adaptive Server).
Yes, that's a plural. NoSQL is nothing but a tag you can put on any database that does not conform to the relational model. So memcached is a NoSQL database engine, as many others are, and some of them have been around for a long time. The idea is not new; there were NoSQL databases long before there were SQL databases, and even after relational databases came into existence, there were use-cases that required very different approaches. The name might suggest that only SQL is missing, but interestingly, SQL is such common knowledge that many NoSQL databases at least offer a subset of it to make the move not too hard at first (IMHO they would more appropriately be named "non-relational" databases). So this property in the name is sometimes very misleading. Wikipedia lists a large number of different types under the NoSQL umbrella. What they all share is that they do not implement the relational model and thus can omit some of its defining properties or introduce weaker versions thereof (see, for example, eventual consistency).
Let's discuss some of the use-cases that require very different approaches:
Key-value stores, or KV-stores for short, behave much like a Python dictionary or a map in C#, C++, or Java that is accessible over a network protocol. One popular example is memcached. All keys and values are in memory, which makes it very fast. The protocol is straightforward. If you want to retrieve data, you need to know its key. It's that simple. While such an easy pattern is straightforward to replicate in a relational database, the overhead of the ACID guarantees can be huge. In the end, it is easier and faster to write a highly specialized database than to tune a general-purpose relational database to reach the same performance figures - and fail miserably.
A typical architecture pattern is to cache rendered web-content (in Python pseudocode):
def cachedRenderRequest(request):
    return memcached.get(calculateHash(request)) or render_page(request)
If you are unfamiliar with Python: get() returns either the value stored under the given key or None, in which case render_page runs. In this example, the render_page method would need to make sure that the rendered page is also written to the Memcached instance under the same key, so subsequent requests hit the cache. (Note that passing render_page(request) as a default argument to get() would defeat the cache: Python evaluates arguments eagerly, so the page would be rendered on every request.)
This can reduce the number of database queries to the backend by orders of magnitude and lower the response times and latency as well (the fastest code is always the code you never execute!).
A common anti-pattern is to replicate relational tables and their structure by using (compound-)prefixes as keys.
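To make that anti-pattern recognizable, here is roughly what it looks like (keys and values are made up for illustration):

```python
# A "users" table flattened into compound-prefix keys: every column
# becomes its own key, and every query becomes a series of point lookups.
cache = {}
cache["user:42:name"] = "Alice"
cache["user:42:email"] = "alice@example.com"
cache["user:42:created"] = "2020-01-01"

# "SELECT email FROM users WHERE id = 42" degenerates into:
email = cache["user:42:email"]

# ...and anything resembling "WHERE email = ?" requires scanning every
# key, because a KV store has no secondary indexes to fall back on.
matches = [k for k in cache
           if k.endswith(":email") and cache[k] == "alice@example.com"]
```

Point lookups stay fast, but joins, range queries, and referential integrity are all gone - the relational workload is still there, only without the engine built for it.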
If you intend to store data that is highly diverse in structure, then document stores are for you. A table definition is rigid: if you want to store additional or fewer fields, we would need to talk about schema-migration strategies (an interesting field of its own), and sooner or later you end up with ugly designs: hundreds of columns, nested tables, self-joins with automated join generation - you name it. In a document store, collections group documents, but every document in a collection can look different. These databases add full-text indexing and document-addressing schemes. Remember IBM Notes? Its underlying technology, the Notes Storage Facility, is at its heart a NoSQL database extended into a full-fledged PIM application.
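A minimal sketch of the idea - plain Python dicts standing in for a real document engine, with hypothetical field names - shows how a collection can hold documents of entirely different shapes:

```python
# A collection of documents: every entry can have a different shape,
# and no schema migration is needed when a new field appears.
contacts = [
    {"_id": 1, "name": "Alice", "email": "alice@example.com"},
    {"_id": 2, "name": "Bob", "phones": ["+1-555-0100", "+1-555-0199"]},
    {"_id": 3, "name": "Carol", "email": "carol@example.com",
     "notes": "met at the database meetup"},
]

def find(collection, **criteria):
    """Return all documents whose fields match the given criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

# Documents lacking a field simply don't match - no NULL columns involved.
print(find(contacts, name="Bob"))
```

A real document store adds what this sketch lacks: persistence, indexes on arbitrary fields, and full-text search across the documents.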
The often-heard story goes like this: you create a table like this:
CREATE TABLE IOT_DATA (
    TS       TIMESTAMP WITH TIME ZONE,
    DEVICEID VARCHAR(255),
    KEY      VARCHAR(255),
    VALUE    VARCHAR(255)
)
And begin to onboard devices. Each of these devices sends measurements once a minute, let's say humidity, temperature, wind speed, air pressure, and carbon dioxide concentration. So 5 data points every minute makes 1.440 x 5 = 7.200 rows per device per day. In a year, 2.628.000 rows per device. And now imagine that you are successful and you sell 1.000, 10.000 or even more than 100.000 devices. As your devices conquer the market, they gain new measurement capabilities which create even more data points. And perhaps some customers need a measurement frequency of once a second. Or you think about an application in high-frequency trading with thousands of assets on a sub-second timescale. Time series data is guaranteed to blow up your relational database; you can watch it melting down. You will probably survive for a short time by partitioning the tables and using hierarchical storage systems, but at some point it doesn't scale any more on one host, and you begin building clusters. Some time later you see the problems arise again and start to re-design your application to implement some form of sharding. My advice: when you think of time series data, you should evaluate a specialized database for that purpose like TimescaleDB, which is, ironically, a PostgreSQL extension. But what exactly is the problem with having time series data in a relational database? The amount of data. To make queries fast you will need indices, probably one for each of the table's columns. That way, not only the storage and thus RAM requirements rise, but also the necessary IO capacity (every insert or delete operation requires corresponding index updates, for example). As a relational database is designed to uphold the ACID properties, scaling such data is hard, especially when several servers are required.
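The back-of-the-envelope arithmetic fits in a few lines (the fleet size and the per-second scenario are hypothetical figures):

```python
# Growth of the IOT_DATA table above, one measurement per minute.
points_per_minute = 5  # humidity, temperature, wind, pressure, CO2
rows_per_day = 24 * 60 * points_per_minute   # 7,200 rows per device per day
rows_per_year = rows_per_day * 365           # 2,628,000 rows per device

devices = 100_000                            # a hypothetical fleet
fleet_rows_per_year = rows_per_year * devices  # 262.8 billion rows per year

# Switching to one measurement per second multiplies everything by 60.
per_second_rows_per_year = rows_per_year * 60

print(f"{rows_per_day=} {rows_per_year=} {fleet_rows_per_year=:,}")
```

And every one of those inserts also pays for its index updates, which is where the IO budget really goes.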
Some databases are designed to be scalable in every respect; this requires leaving the ACID principles behind and doing things differently. Examples are Apache Cassandra and ScyllaDB. Their design builds on many ideas that have been around since the Dynamo paper, and I can only encourage you to read that paper and dig into the details.
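One of those Dynamo ideas, consistent hashing, fits in a few lines of Python. This is a deliberately simplified sketch - no virtual nodes, no replication - just enough to show how keys map to nodes:

```python
import bisect
import hashlib

def _hash(value):
    """Map a string onto a fixed position on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """A simplified consistent-hashing ring.

    Each key belongs to the first node clockwise from its hash, so adding
    or removing a node only moves the keys in that node's neighboring
    segment instead of reshuffling everything.
    """

    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key):
        hashes = [h for h, _ in self._ring]
        # Wrap around to the first node when we fall off the end of the ring.
        i = bisect.bisect(hashes, _hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("sensor-42")
```

Real implementations add virtual nodes for even load distribution and replicate each key to the next N nodes on the ring for availability.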
Graph databases enable you to store and query graph-structured information: structures where nodes are connected by edges. This type of structure is so different from the relational model that a specialized engine like Neo4j can implement it much more efficiently. With these databases, you can answer questions like "who is connected to me?" or "what is the shortest path from Alice to Bob?".
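The "shortest path" question maps directly onto breadth-first search. A sketch over a hypothetical social graph (a real engine like Neo4j does this at scale, with persistence and indexes):

```python
from collections import deque

# A tiny social graph as an adjacency list; names are illustrative.
friends = {
    "Alice": ["Bob", "Carol"],
    "Bob":   ["Alice", "Dave"],
    "Carol": ["Alice", "Dave"],
    "Dave":  ["Bob", "Carol", "Eve"],
    "Eve":   ["Dave"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns a shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

print(shortest_path(friends, "Alice", "Eve"))
# → ['Alice', 'Bob', 'Dave', 'Eve']
```

Emulating this in SQL means recursive self-joins over an edge table, one join per hop - exactly the kind of query a graph engine is built to avoid.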
Some people try to solve every problem with the same tool (= database), but I think it is essential to understand the use case and pick the best-suited tool for the job. If you only have a hammer, everything else must be a nail, right?!
Happy hacking ;-)
P.S.: the list of database types is not even close to complete; it is based on things I have come across in the past, at least in terms of architecture patterns.
Picture by Axel Hindemith, Public Domain