What Among Us Teaches Us About Byzantine Failure

#distributedsystems #byzantinefailure #algorithms

Besides Among Us [1] facing devops challenges as its users outgrow its server capacity, the concept of the game itself has its roots in a problem that is famously difficult to solve in the world of distributed systems.

Those who have played the game before can probably skip to the next paragraph. The basic concept of the game is that you and your friends are in outer space somewhere, and are confined to a fixed map of either a base station or a spacecraft. You each have a series of tasks to complete, with the exception of the impostor, who is pretending to perform tasks. The impostor is perceived as one of your crewmates, but is actually doing their best to murder everyone on the ship without the others catching on to who it is. If there are enough players, there may be more than one imposter. This may sound this level of betrayal would damage relationships with your friends, but studies have shown [2] that games where the other players are opposition rather than working together breeds the opposite of togetherness, anyway, so Catan is probably doing just as much damage. The game is ended when your crewmates guess correctly who the impostor is during a meeting and launch them into space. A meeting can be called either when a deceased crewmember is found, or may be called a fixed number of times per crewmate (including the impostor) per game. This is the only time that the players may speak during the game, and a fixed amount of time is given to decide on a democratic majority basis. The crewmembers can also win by completing all the tasks before the impostor murders everyone. Murdered crew members can still complete tasks, but cannot communicate during meetings. Conversely, the impostor wins if every crewmember but one is dead.

The real key to the game is that, during a meeting, the crewmembers (who aren't dead) can democratically (majority) choose the member who's the impostor. If they choose the wrong person, then the wrong person has been ejected from the game, and the impostor is that much closer to winning. This introduces the first concept that the game calls on from computer science, which is consensus. Consensus is a problem introduced when one or more components are located on a different entity, such as a computer, and those entities must communicate to accomplish a common goal [3]. To be classified as a consensus problem, entities must put forward their votes, communicate with one another, and eventually agree on a common solution. This is a common problem in distributed systems, where the components are located on different computers [4]. This problem is also found in multi-agent systems, where more than one entity is required to solve the problem due to the size or complexity of the data. In the case of Among Us, the entities are humans, and it can be argued that the humans are acting as a distributed system by passing "the truth" of what they've witnessed around as the messages from one "node" (person) to another. The person who saw the deceased crew member knows it was in a certain room, and another might have seen another crew member coming from that general direction on the map. The nodes have to communicate together to join their components into one coherent story for what occurred during the current portion of the game.

The imposter may lie if they are at all good at the job, and this touches on the fundamental problem of unreliable data or state within consensus; if all entities were truthful, this would be a fairly straight-forward problem. In order to address this problem, systems are built with fault tolerance or resilience. Fault tolerance allows a system to continue operating with the degradation of one or more components [5]. In the case of Among Us, this would be if a crew member was mistaken for the impostor, and unceremoniously ejected. The system is somewhat fault tolerant in that, as long as there are more than three crew members at the time of voting, this mistake may rattle confidence in the non-impostors, but won't end the game. This could be referred to as graceful degradation. One mistake won't likely bring the entire system down, but your dead crew member can no longer communicate with the crew what they've seen, or will see as a ghost.

To reach consensus, it's an acceptable method to require all entities to agree on a majority value, rather than requiring every entity to produce the same value. This method may be adversely affected by an incorrect value (the impostor's vote). However, majority vote circumnavigates the problem of the inability to move on because of the Byzantine failure inherent to this game. This type of failure occurs when the symptoms may look different to different observers, and may send contradictory data down the line [6]. It's interesting to note here that in real-life systems, it may be a fault of the electrical components, and isn't a problem isolated to human bad-actors. But we're all just electricity in meat suits, so same difference. An example of a Byzantine failure in Among Us is in the case that there are enough players to have two impostors at one time, it's a worthwhile strategy to consider voting the other impostor off the ship if it means you won't look sus. An impostor may also call a meeting directly after killing someone and accuse others. Or, in a notable case, you may choose not to debate that you are the impostor, and tell the crew that they may vote you off if they would like with no other defense. For the sake of knowing, the other failure type of consensus is a crash failure. This is where the process stops and does not resume on fail, also called a fail-stop failure.

A system that can handle Byzantine failures are called Byzantium fault tolerant. The idea of developing a Byzantine failure-based consensus was first formalized for a NASA-funded project out of SRI International in 1978 [7]. The goal of this project was to achieve fault tolerance in a system thats failure would result in loss of life — an airplane. The first investigator, who dubbed it an interactive consistency problem, had several computers communicating through pairwise messaging that were plagued by a misbehaving processor. The interactive consistency problem was renamed the Byzantium Generals Problem after a suggestion of calling it the Albanian Generals Problem [8] was understood to perhaps end in future cancellation. A later investigation in 1980 [9] outlined a proof that a system that may overcome Byzantine failure must have at least 3 times the number of non-faulty processors to faulty if signed signatures are used. You may read this as that there needs to be three times as many true crew members as impostors in our game to have a hope of winning. As an aside, the concepts of dependability and fault tolerance were hot topics during that time (1980s), and there was a whole technical committee within the IEEE (Institute of Electrical and Electronics Engineers) called the Dependable Computing and Fault-Tolerance and IFIP Working Group 10.4 on Dependable Computing and Fault Tolerance [10]. Just a very interesting tidbit.

To handle a consensus problem, protocols are introduced. To be considered a protocol, the algorithm must eventually decide a value for a process. The protocol also must have integrity, or validity, which is the assertion that if a correct process decides a value, then this means a correct process must have suggested the value. The integrity may be on a spectrum of strong to weak. The integrity may be increased in a Byzantine failure type by making this value more stringent. The third requirement for a protocol is that all correct processes will agree on a single consensus. Without these attributes, the protocol breaks down, and the imposter(s) win.

The first of the modern (1999) algorithms that attempted to solve the problem is called the Practical Byzantine Fault Tolerance algorithm [11]. A more recent algorithm (2018), building off of the PBFT algorithm, is the Tendermint BFT [12], which Proof of Stake (e.g. Etherium) blockchains use today due to its handling of partially asynchronous networks.

Attempting to use one of the original protocols [8] proposed as "A Solution with Oral Messages" in 1982 in your next Among Us game might be fun. The protocol is summarized as follows :

The person calling the meeting (the "general") describes what they saw to call the meeting
Every member of the group (the "lieutenants") should hear what the person calling the meeting has stated, or should not vote if the person calling the meeting doesn't tell them anything pointing to a definitive imposter. A different person from the person calling the meeting should be selected to each ask every other member what they've seen.
Comparing the value that the selected member received from the person who called the meeting, each of the other values should be weighed democratically, or no one should be voted for if one or more of the members doesn't give a value.

Several rounds of game are required to reach the solution. In the paper, both the team member and the person who called the meeting (the lieutenant and general) are assumed to potentially be the "traitor". However, you may see a glaring hole in this plan: several rounds of this may be difficult to apply here, as the answer given to the question of "Where were you in relation to the dead team member?" is different each time; it's nearly impossible to replicate the previous rounds' result to see if the answers are different for each round. As it is, the paper states "...it is the traitors' ability to lie that makes the Byantine Generals Problem so difficult.". The paper goes on to state that the problem becomes simpler with the introduction of signed messages. This is also not a viable solution in this particular usecase. The problem with designing a protocol for this particular game lies in that when voting, it's nearly impossible to prove validity through proof of work or proof of stake, as a solution to the original paper's suggested signed messages. Proof of stake and proof of work are both used today in applications such as cryptocurrency, as noted above.

A solution that may have some bearing in Among us was proposed by a paper titled "An Efficient Algorithm for Byzantine Agreement without Authentication" [13]. Handy, right, since we have no means for authentication? It seems to add to the previous work done by adding in the idea of Low and High thresholds. The basic summary is that the low threshold of one correct process must support an assertion (number of impostors plus one), and correct processes that support the low threshold will support the high threshold (2*number of impostors + 1). The idea being here that incorrect data won't skew results, if you ask for confirmation of a fact before it's even considered for the high processes and asking for agreement from the rest of the processes. The Low (sus) processes won't be able to add false claims without support from one correct process. This algorithm also ignores assertions critical to decision making that are too close to the end of the algorithm to be communicated effectively with all processes — meaning here that the impostor could not throw everyone off by giving information 5 second before the deadline to vote in the game. The protocol for this algorithm is as follows

The "transmitter" (person who called the meeting) relays their message to all team members
During each meeting after the first one (k > 1) each team member will state whether they are a direct or indirect supporter of the claim made. At least (number of impostors)+1 must support the claim for the claim to be taken with any seriousness
The high process is handled by the game itself when you vote (majority will expel the proposed impostor, however, I don't remember in the game if this is exactly 2 * (number of impostors) +1).
After round 2 * (number of impostors) +3, the value is confirmed, and the processes all agree on an impostor, or another impostor choice is agreed upon.

Well, hold on, step 4 still doesn't really jive with how many rounds there are likely to be - most games end before the minimum of the fifth meeting. A strategy may be to have as many meetings as possible in a short amount of time, but overall, a useful thing to take away from the paper is not letting incorrect data be casually added by an impostor without analysis.

There's a reason why Byzantine Failure is considered one of the hardest problems in distributed systems. If anyone has an effective algorithm that they've applied to their Among Us games, I would love to hear it! The tweet that originally sparked my curiosity on the topic used a heartbeat protocol [14], which seems to violate the rules of not speaking between each round, but is still a very cool idea. It was implemented by calling out to other players every so often, and if a crewmember didn't respond, they has more clues about who a likely impostor was because they knew that crew member was dead without calling a meeting [15]. Lastly, I just want to stick this white paper here on the topic of possible scenarios where it might be impossible to solve the Byzantium Generals Problem [16], because although I didn't have a place for it in this article, it was a very fun read about how this particular problem is hard.

Works Cited

[1] http://www.innersloth.com/gameAmongUs.php

[2] https://www.alieward.com/ologies/ludology

[3] https://en.wikipedia.org/wiki/Consensus_(computer_science)

[4] https://en.wikipedia.org/wiki/Distributed_computing

[5] https://en.wikipedia.org/wiki/Fault_tolerance

[6] https://en.wikipedia.org/wiki/Byzantine_fault

[7] https://www.microsoft.com/en-us/research/uploads/prod/2016/12/Design-and-Analysis-of-a-Fault-Tolerant-Computer-for-Aircraft-Control.pdf

[8] https://lamport.azurewebsites.net/pubs/byz.pdf

[9] https://lamport.azurewebsites.net/pubs/reaching.pdf

[10] https://web.archive.org/web/20150402141319/http://www.dependability.org/

[11] http://www.pmg.lcs.mit.edu/papers/osdi99.pdf