Author: Evan Weaver
Date: January 8, 2019
Originally posted on the Fauna blog.
The database renaissance began with the NoSQL movement, broadly, around 2008, as the technical demands of a new wave of consumer internet companies operating at huge scale (Twitter, Facebook, Google, Amazon) outstripped the capabilities of the legacy RDBMS. Additionally, the price structures of RDBMS vendors like Oracle were not supportive of the wildly smaller unit economics of these new companies, especially the ad-supported ones. What to do?
Initially, companies turned to custom, distributed sharding services that used legacy databases as storage engines, especially MySQL and InnoDB. Distributed caches like Memcache and later Redis became ubiquitous. However, these systems were not self-operating. They still required DBA intervention to recover from failures; they were not self-sharding; they were not self-healing.
I was at Twitter at the time running the software infrastructure team, and it seemed obvious to us that it was possible to build a system that could transparently deliver scale and flexibility. So, along with others in the industry, we set about to do it as open source, initially investing heavily in Apache Cassandra, Hadoop, and other storage platforms. And when your business requires scale, anything that gets in its way must be abandoned.
The commercial vendors of these systems preached that nobody needed or wanted the traditional features of the RDBMS that their systems could not deliver, which is obviously untrue. Now every business requires scale—but what did we give up to get it with NoSQL? And can we get it back?
The CAP theorem can be effectively summarized as: what happens when the network partitions? Does a system maintain consistency (correctness), and temporarily pause, or does it maintain availability, and make a best-effort to keep working? In practice, though, consistency and availability are choices along an envelope of potential trade-offs, and the real-world story is more nuanced.
The first generation of NoSQL systems were eventually consistent, claiming that after a partition event they would reconcile conflicts in a variable but finite period of time, probabilistically voting on what the data is supposed to be. Few real-world eventually consistent systems are actually this sophisticated; instead, they use a simplistic “last-write-wins” strategy based on the local system time. Even if they were smarter, this is not a useful guarantee for application building: it requires the systems designer to import a whole host of conflict-resolution functionality into the application layer.
An obvious transactional failure pattern is as follows:
- Bob deposits $200 via an ATM.
- A database shard accepts an “eventually consistent” deposit to Bob. His new balance of $200 is queued.
- Bob deposits $100 via a mobile app. This update goes to a different shard and replicates to the rest of the cluster, but not to the original shard.
- The network heals and replication proceeds.
- Because the mobile deposit happened later in physical time, last-write-wins reconciliation sets Bob’s balance on every shard to $100, as per step 3.
Bob has now lost $200 in cash. Eventual consistency guaranteed availability: we could deposit money all day long! But it did not guarantee correctness.
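The failure above is easy to reproduce in miniature. The sketch below models two hypothetical shards that each store a balance plus a physical timestamp and reconcile with last-write-wins; all names are illustrative, not any particular database’s API.

```python
class Shard:
    """A toy replica storing (balance, timestamp), reconciling
    conflicts with last-write-wins -- the simplistic strategy most
    eventually consistent systems actually use."""

    def __init__(self):
        self.balance = 0
        self.ts = 0.0

    def deposit(self, amount, ts):
        # Apply a deposit locally; replication to peers is deferred.
        self.balance += amount
        self.ts = ts

    def reconcile(self, other):
        # Last-write-wins: the later physical timestamp overwrites
        # the other value, discarding any intermediate state.
        if other.ts > self.ts:
            self.balance, self.ts = other.balance, other.ts
        else:
            other.balance, other.ts = self.balance, self.ts

# Bob's $200 ATM deposit lands on shard A; the network then partitions.
a, b = Shard(), Shard()
a.deposit(200, ts=1.0)

# Bob's $100 mobile deposit lands on shard B, which never saw the $200.
b.deposit(100, ts=2.0)

# The partition heals and the shards reconcile.
a.reconcile(b)
print(a.balance)  # 100 -- the $200 deposit is silently lost
```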
A variety of schemes have been proposed to solve this problem, specifically conflict-free replicated data types (CRDTs), which effectively close over all intermediate states so that the final value can be rebuilt correctly. This would record Bob’s transactions as debits and credits rather than final balances. Construction of the final balance can be performed in the database or in the application layer. The degenerate case is a blockchain, which records all transactions for all time on all nodes in the cluster.
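A ledger-style CRDT for Bob’s account can be sketched as a grow-only set of debit/credit events merged by set union, so that no event is ever lost and the final balance is derived rather than stored. This is a minimal illustration, not a production CRDT (which would also need causal delivery, tombstones, and so on).

```python
class LedgerCRDT:
    """Toy ledger CRDT: replicas record individual debits and credits
    keyed by a unique event id, and merge by set union. Union is
    commutative, associative, and idempotent, so replicas converge."""

    def __init__(self):
        self.events = {}  # event_id -> signed amount

    def record(self, event_id, amount):
        self.events[event_id] = amount

    def merge(self, other):
        # Exchange events in both directions; nothing is overwritten.
        self.events.update(other.events)
        other.events.update(self.events)

    def balance(self):
        # The final balance is rebuilt from all intermediate states.
        return sum(self.events.values())

a, b = LedgerCRDT(), LedgerCRDT()
a.record("atm-1", +200)      # Bob's ATM deposit on shard A
b.record("mobile-1", +100)   # Bob's mobile deposit on shard B
a.merge(b)                   # the partition heals
print(a.balance())  # 300 -- both deposits survive the merge
```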
However, in practice this doesn’t solve the problem at all, because a database is not a closed system. The whole point of reading and writing any data at all is to coordinate external effects. If Bob withdrew cash from an ATM, the money is gone, even if later the CRDT reveals that he didn’t have enough money at the time.
This is not at all a theoretical problem. Deliberately inducing withdrawal races at ATMs is a widely reported type of fraud. Thus, databases need “external consistency”, colloquially known as ACID.
Legacy databases (typically the centralized RDBMS) solve this problem by being undistributed. Ultimately, a single machine with a single disk—and for practical purposes, a single mutex on updates to that disk—is the source of truth. Updates to the state of the world are serialized, meaning that they happen in a single, deterministic order—and are externally consistent—they correspond to the order that occurred in physical time. Any scalability comes from vertical scaling of that single machine.
This model is easy to implement but fails to meet a number of critical guarantees, specifically ones related to availability.
Thus, we got the rise of primary/follower replicated systems, where the primary node could be manually replaced by a DBA with a follower node that asynchronously replicated the state of the primary. This isn’t great, but unlike an eventually-consistent distributed system, the scope of the inconsistency is known: it is exactly what might have happened between the last transaction replicated to the follower, and the externally-visible failure of the primary.
This is effectively the world the consumer internet sharding systems were operating in:
- they could perform transactions within a shard (a single machine)
- they could manually or semi-manually fail over to a backup shard
- they could not transact across multiple shards at the same time
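The routing layer in those sharding systems can be sketched as a hash of the key modulo the shard count. Keys that hash to the same shard can share a local transaction; keys on different shards cannot, because nothing coordinates a commit across machines. The function and constants here are illustrative.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size, for illustration

def shard_for(key: str) -> int:
    """Route a key to a shard by hashing it -- the basic pattern the
    custom sharding layers over MySQL used. Deterministic, so every
    client agrees on where a key lives."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

# Two keys on the same shard can share a single-machine transaction;
# two keys on different shards would need a cross-machine commit
# protocol, which these systems did not have.
print(shard_for("user:bob"), shard_for("user:alice"))
```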
For Twitter and Facebook this was more or less fine, but undesirable. For example, it’s confusing to get a notification on your phone about a message you can’t yet read on the website, but it’s not a catastrophe. Product features that required transactionality but not as much scale (username assignment, for example) remained in the legacy RDBMS.
But as products became more complex the downsides of a lack of external consistency became increasingly severe.
Spanner is Google’s solution to both of these problems. Initially available only within Google’s internal infrastructure, it is now available as a managed product on Google Cloud.
It does two things:
- Multi-shard transactions are implemented by a two-phase prepare/commit algorithm—essentially the same protocol implemented by Tuxedo, the 1980s transaction monitor.
- Instead of relying on HA or mainframe-class hardware to maintain availability, shard failover on commodity hardware is automated via Paxos.
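The prepare/commit pattern from the first bullet can be sketched in a few lines. This is a deliberately simplified model (no durable logs, no coordinator recovery), just the vote-then-commit shape of two-phase commit; all class and function names are invented for illustration.

```python
class Participant:
    """Toy two-phase-commit participant (one shard)."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.committed = False

    def prepare(self):
        # Phase 1: vote yes only if the write can be made durable here.
        return self.healthy

    def commit(self):
        # Phase 2: make the prepared write visible.
        self.committed = True

def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every shard.
    if all(p.prepare() for p in participants):
        # Unanimous yes: tell everyone to commit.
        for p in participants:
            p.commit()
        return True
    # Any no-vote aborts the whole transaction on every shard.
    return False

healthy = [Participant(), Participant(), Participant()]
assert two_phase_commit(healthy) is True

degraded = [Participant(), Participant(healthy=False)]
assert two_phase_commit(degraded) is False
```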
This approach works well up to a point. It guarantees serializability—updates to each individual shard are guaranteed to happen in real-time order. But it does not guarantee external consistency, or coordination of real time across shards. To solve this final problem, Spanner does one more thing:
- Physical atomic clock hardware synchronizes the system time on all shards within very small error bounds.
Now the transaction coordinator can essentially say, “Hey shards, I’m doing a transaction at real time T, if you do not see any other updates also arrive within error bounds of our shared view of time, you know that it is unconflicted.” This induces a small delay in every externally-consistent read and write as every shard must wait out the clock ambiguity window.
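The commit-wait idea behind this delay can be sketched as follows: after choosing a commit timestamp, the writer simply waits until the clock-uncertainty window has definitely passed before acknowledging, so no shard can later observe timestamps out of real-time order. The epsilon value below is an assumed figure for illustration, not Spanner’s actual bound, and this is a sketch of the idea rather than Spanner’s code.

```python
import time

CLOCK_EPSILON = 0.007  # assumed ~7 ms clock-uncertainty bound

def commit_wait(commit_timestamp: float) -> None:
    """Block until commit_timestamp + epsilon has certainly passed on
    every correctly synchronized clock in the cluster. Every
    externally consistent write pays this ambiguity-window wait."""
    while time.time() < commit_timestamp + CLOCK_EPSILON:
        time.sleep(0.001)

start = time.time()
commit_wait(start)  # the per-write latency cost described above
elapsed = time.time() - start
print(f"waited ~{elapsed * 1000:.1f} ms")
```

The smaller the clock error bound, the shorter this wait, which is exactly why Google invests in atomic clock hardware and why NTP-based systems (with error bounds of hundreds of milliseconds) cannot afford to wait at all.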
This is great for Google, barring the latency impact. They have the resources to build and manage customized atomic clock hardware and bounded-latency networks. However, there are a variety of new database systems that implement similar protocols without atomic clocks, but always at a cost.
Databases that rely on NTP clock synchronization have a much longer ambiguity window—hundreds of milliseconds. In practice, these systems forgo the wait entirely, and fall back to guaranteeing single-record linearizability without external consistency. This can lead to similar cross-row, double-debit effects.
They also do not offer fast externally serializable reads, but often read from the last known Paxos leader. This can violate serializability as well because the leader may not know that it has been deposed and will happily serve a stale value during the election window, which is typically on the order of multiple seconds. Preventing this window requires inducing a pause.
Finally, if the clocks happen to desynchronize, even the serializability guarantees on writes are lost. This is something Google works mightily to prevent, because in the cloud, all kinds of events unrelated to the clock itself, like VM stalls, can cause it.
Another class of databases queries a single physical clock (a “clock oracle,” as described in the Google Percolator paper) and has an ambiguity window equivalent to the round-trip network latency to that shared clock, which is even worse, along with an obvious single point of failure.
This model is equivalent to the multiprocessor RDBMS—which also uses a single physical clock, because it’s a single machine—but the system bus is replaced by the network. In practice, these systems give up multi-region scale out and are confined to a single datacenter.
It is clear that without physical clock synchronization, distributed external consistency is impossible...or is it?
What if you could construct a logical clock oracle for all your transactions, one that does not rely on any single physical machine and can be widely distributed? This is essentially what Calvin, developed in Daniel Abadi’s lab at Yale University, does.
John Calvin was a French Protestant reformer who believed that the eventual destiny of every soul—either heaven or hell—was predetermined by God before the beginning of the world.
This is how Calvin works as well. The order of transactions is decided by preprocessing—a subset of nodes act as partitioned oracles and assign an externally consistent order to all incoming transactions in a series of pipelined batches. The batches are then committed, 10ms at a time, to a distributed write-ahead log that can be implemented in Paxos or Raft.
This preprocessing has a number of benefits:
- It can be accomplished in a single network round-trip for the commit to the distributed log—no two-phase locking on the replicas.
- Its operational topology is distinct from the replica topology, which reduces long tail latency.
- Serializable reads require no coordination and do not have to wait out an ambiguity window. Instead, they scale out globally just like an eventually consistent system.
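The sequencing step described above can be sketched as a buffer that drains every batch interval, assigning each transaction a deterministic global position before the batch is appended to the log. This is a single-node toy, with a plain list standing in for the replicated Paxos/Raft log; all names are illustrative.

```python
from collections import deque

class Sequencer:
    """Toy Calvin-style preprocessor: transactions are buffered as
    they arrive, then assigned a global, externally consistent order
    in fixed-interval batches appended to a write-ahead log."""

    def __init__(self):
        self.pending = deque()
        self.log = []       # stands in for the distributed WAL
        self.next_seq = 0

    def submit(self, txn):
        self.pending.append(txn)

    def commit_batch(self):
        # Called once per batch interval (e.g., every 10 ms): drain
        # the buffer and give each transaction a deterministic slot.
        # Replicas then execute the log in this predetermined order.
        batch = []
        while self.pending:
            batch.append((self.next_seq, self.pending.popleft()))
            self.next_seq += 1
        self.log.append(batch)
        return batch

s = Sequencer()
s.submit("deposit bob 200")
s.submit("deposit bob 100")
batch = s.commit_batch()
print(batch)  # [(0, 'deposit bob 200'), (1, 'deposit bob 100')]
```

Because the order is fixed before execution, replicas need only a single round-trip to commit the batch to the log, and reads can be served anywhere without waiting out a clock-ambiguity window.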
It necessarily makes some trade-offs, however:
- Since transactions are committed in batches, write latency cannot fall below the time to wait for the next commit, which is 5ms on average, and can be as high as 10ms.
FaunaDB, which implements this model with a variety of performance improvements, is a Relational NoSQL system, not a NewSQL system, although there is nothing that specifically precludes FaunaDB from eventually implementing a SQL interface.
Although it is true that no CP system can maintain liveness during a completely arbitrary partition event, in practice, we see that real-world availability of systems like Spanner and FaunaDB in the cloud is equivalent to the availability of AP systems. Faults beyond 99.9% availability are just as likely to come from implementation problems as from hardware and network failures, and the formal consistency models of the CP systems make them easier to verify under faults than AP systems.
For example, a five datacenter FaunaDB cluster will tolerate the loss of two entire datacenters without losing liveness at all. Also, CP systems like FaunaDB can typically maintain availability for reads at a lower consistency level (like snapshot isolation) even during partition events, which is equivalent or better than the consistency level offered by AP systems all the time.
Theoretically speaking, moving from 99.999% availability for writes to 99.9999% (a difference of a few minutes a year) is not worth it, especially when the cost of accepting writes during that pause is permanent data inconsistency.
Distributed transactions are one of the hardest problems in computer science. We are lucky to be living in a world where these systems are becoming available off-the-shelf. Any kind of bounded consistency is better than eventual consistency—even putting aside all the other problems with legacy NoSQL, like durability, and the legacy RDBMS, like operational overhead.
However, the holy grail of distributed, external consistency without physical clocks still eludes us—except for Calvin. FaunaDB is the only production-ready implementation of the Calvin-style transaction protocol. I encourage you to be mindful of the consistency guarantees your database systems provide, and also to give FaunaDB a try.