Welcome back to Papers We Love! This week, we'll be dipping our toes into the waters of distributed computing with the Raft paper. Raft is a consensus algorithm - in short, it's an algorithm that lets a cluster of servers agree on the same sequence of operations, so that the system as a whole keeps working even when some of the servers crash.
I suggested we read this paper because I recently finished reading "A Philosophy of Software Design" by John Ousterhout, of Tcl fame. A lot of the examples in that book refer to RAMCloud, a distributed storage system, and the book mentions that RAMCloud uses Raft. I didn't know that much about Raft before reading this paper other than that it's a consensus algorithm and it's supposedly much simpler than Paxos, another popular consensus algorithm, so I figured it would be a good introduction to talking about distributed systems. You can read the paper here:
A handy companion to the paper is The Morning Paper's breakdown here.
I enjoyed the paper, but I feel like some information was missing from it, or at least I didn't pick up on it. For example, I wonder how Raft manages to be partition tolerant: if a netsplit occurs and you have two separate groups of servers accepting requests for a few seconds, how does the system reconcile that when the partition goes away and a single leader emerges? From the sound of it, the side that holds a majority of the servers wins: a leader can only commit an entry once a majority of the cluster has stored it, so a minority side can accept requests but never commit them, and its uncommitted entries get overwritten once the partition heals. I'm guessing that Raft is more about "how do we handle a bad server or two" than "how do we handle a situation in which the Internet connection between data centers goes down".
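The paper's majority rule bears on the partition question: a candidate needs votes from a majority of the whole cluster to become leader, and a leader needs a majority to store an entry before it counts as committed. Here is a minimal sketch of just that arithmetic. The function names are my own invention rather than anything from the paper, and the sketch ignores terms, logs, and RPCs entirely.

```python
def has_majority(reachable: int, cluster_size: int) -> bool:
    """True when `reachable` servers (counting the leader itself)
    form a strict majority of the full cluster."""
    return reachable > cluster_size // 2


def partition_outcome(cluster_size: int, side_sizes: list[int]) -> list[bool]:
    """For each side of a partition, report whether that side could
    elect a leader and commit entries (hypothetical helper)."""
    if sum(side_sizes) > cluster_size:
        raise ValueError("sides cannot contain more servers than the cluster")
    return [has_majority(size, cluster_size) for size in side_sizes]


if __name__ == "__main__":
    # Five servers split 3/2: only the three-server side can make progress.
    print(partition_outcome(5, [3, 2]))  # [True, False]
    # Four servers split 2/2: neither side has a majority, so nothing commits.
    print(partition_outcome(4, [2, 2]))  # [False, False]
```

Because the majority is always counted against the full cluster size, two sides of a split can never both have one, which is why at most one side can commit anything while the partition lasts.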
Two details I really enjoyed were the use of randomness in leader elections, where each server picks a random election timeout so that split votes are rare, and the use of fork's copy-on-write property for snapshotting. Randomness is such a useful tool, and I think making use of copy-on-write in this way is such a clever trick!
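To make the randomness idea concrete, here is a rough sketch of randomized election timeouts. It assumes the 150-300 ms range the paper gives as an example; the function names are hypothetical, and the sketch only shows who times out first, not the voting, terms, or heartbeats that a real implementation needs.

```python
import random

# Example range from the paper; a real deployment would tune this.
ELECTION_TIMEOUT_MS = (150.0, 300.0)


def pick_election_timeout(rng: random.Random) -> float:
    """Pick a fresh random timeout, as a follower would after each heartbeat."""
    low, high = ELECTION_TIMEOUT_MS
    return rng.uniform(low, high)


def first_candidate(server_ids, rng):
    """Return the server whose timeout expires first, plus all timeouts.

    That server would be the first to start an election; with timeouts
    spread out like this, it usually wins before anyone else times out.
    """
    timeouts = {sid: pick_election_timeout(rng) for sid in server_ids}
    winner = min(timeouts, key=timeouts.get)
    return winner, timeouts


if __name__ == "__main__":
    rng = random.Random()  # pass a seed for a reproducible run
    winner, timeouts = first_candidate(["s1", "s2", "s3", "s4", "s5"], rng)
    for sid, timeout in sorted(timeouts.items(), key=lambda kv: kv[1]):
        print(f"{sid}: {timeout:.1f} ms")
    print("first to time out:", winner)
```

If every server used the same fixed timeout, they would all become candidates at once and keep splitting the vote; drawing the timeouts at random makes that kind of tie unlikely.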