Papers We Love: Raft

#pwl #discuss #distributed

Welcome back to Papers We Love! This week, we'll be dipping our toes into the waters of distributed computing with the Raft paper. Raft is a consensus algorithm - in short, it's an algorithm that makes sure that a system stays reliable in the face of crashes.

I suggested we read this paper because I recently finished reading "A Philosophy of Software Design" by John Ousterhout, of Tcl fame. A lot of the examples in that book refer to RAMCloud, a distributed storage system, and the book mentions that RAMCloud uses Raft. I didn't know that much about Raft before reading this paper other than that it's a consensus algorithm and it's supposedly much simpler than Paxos, another popular consensus algorithm, so I figured it would be a good introduction to talking about distributed systems. You can read the paper here:

https://ramcloud.stanford.edu/raft.pdf

A handy companion to the paper is The Morning Paper's breakdown here.

I enjoyed the paper, but I feel like there was information missing from the paper, or at least information I didn't pick up on. For example, one thing I wonder is how Raft manages to be partition tolerant; if a netsplit occurs and you have two separate systems accepting requests for a few seconds, how does the system reconcile that when the partition goes away and a single leader emerges? From the sound of it, the partition that did the most work wins. I'm guessing that Raft is more about "how do we handle a bad server or two" rather than "how do we handle a situation in which the Internet connection between data centers goes down".

Two details I really enjoyed was the use of randomness in leader elections, and the use of fork's copy-on-write property for snapshotting. Randomness is such a useful tool, and I think making use of copy-on-write in this way is such a clever trick!

Top comments (3)

Kasey Speakman • Nov 19 '18 • Edited

For example, one thing I wonder is how Raft manages to be partition tolerant

So one key point here (and also with Paxos) is that all nodes are aware of all other nodes. So that even under partition, each node knows exactly how many votes are needed for a majority. Partitions without enough nodes to form a majority will continually attempt re-election but be unable to contact enough nodes to elect a leader. They will all be in follower or candidate mode. It is possible for multiple partitions to leave no possibility for a leader to be elected in any partition. But it is not possible to achieve a majority vote in more than 1 partition. (Unless somebody cheats, but that's not a covered problem here.)

The way a leader handles a write is that it first replicates the uncommitted write to all nodes. Once a majority of the nodes confirms receipt of the uncommitted write, then the leader applies the write to its own state machine and updates its last committed id. The next time the leader sends new uncommitted writes (or a heartbeat) to followers, it will also include its last commit ID. So then followers will commit any uncommitted writes up to that ID. I didn't see it spelled out explicitly in the paper, but it is implied that when a leader times out waiting for confirmation from a majority of followers, then it goes back to being a follower. When a new leader is elected, it will instruct followers to truncate uncommitted messages after the last known good state. This is similar to a transaction log replay after database crash.

When the partition resolves, the leaderless groups might begin contacting the other nodes and asking for election votes. However, those nodes with a leader have been committing new messages from their leader, so they will vote down the election. Meanwhile, the leader has still been trying to replicate to partitioned-off nodes. So once the partition comes down, the leader will fill in the rejoining nodes on what has happened.

For PACELC classification, I believe this consensus algorithm produces clustering which is PC-EC for writes and PA-EL for reads. A leader is required to update the state machine of whatever service needs cluster consensus. But Raft places no limitations on reading the state machine of individual follower nodes.

To put another way, when there is a network partition it could be that none of the partitions can elect a leader. So no changes to it (writes) can be made. At best only 1 partition will have a leader, and clients in other partitions cannot perform writes. So under partition, writes choose consistency over availability, hence PC. Under normal operating conditions, the leader (writer) chooses consistency (over minimum latency) because it confirms replication to a majority of nodes before committing a change. So EC, or Else Consistent.

For reads, under partition every node can provide its last known state, but if it is partitioned off from the leader, it could be far behind. So reads choose availability (over consistency) under partition, so PA. Under normal operating conditions, reads choose minimum latency over consistency, so EL. Because of the way Raft works, it is not feasible to make reads choose consistency. You might think of waiting to receive a commit of the current log entries before responding to the client. However, the commit notice may come with new uncommitted entries. So then you will be waiting on those. In an active system, it might take a while before you receive a commit with no new uncommitted entries, and reads would essentially behave as if unavailable most of the time. So it is better to just immediately return the current state machine even knowing there are uncommitted entries.

Rob Hoelz • Nov 20 '18

Thanks for the clarification! I guess I didn't realize from the paper that you needed a majority of all machines in the cluster, even those you can't currently talk to!

Ben Halpern • Nov 21 '18

I'm excited to read and weigh in on this. I haven't found the time yet.