Because of Math. We have a mathematical proof that to tolerate (n) amount of malicious nodes, you need (2*n* + 1) of good nodes.
In a distributed system, it is not guaranteed that there is nothing wrong with the message sent by the other node. When a wrong node is sending wrong messages to everyone else, the problem can be easily solved by blocking the node itself.
However, when the right message is sent to partial nodes and wrong message sent to other, it becomes very difficult to find the problem because in a distributed system, each node doesn’t know the state of every other node. Nodes sending wrong messages to only some nodes is called the byzantine failure model.
Byzantine failure model is the most difficult model to solve when it comes to networks, and at the same time, a problem that absolutely must be solved. Especially in p2p, where other nodes cannot be trusted, byzantine failure model must be assumed before going on to resolve abnormal situations.
Just because a distributed system consists of only certified nodes, it does not mean that byzantine failure model assumption can be ignored. The node itself can be managed by a certified person, but it could have been hacked, sent the wrong message due to bugs, or there could have simply been some sort of hardware failure.
Byzantine fault tolerance (a.k.a. BFT) is a system that operates normally within a byzantine failure model. However, even BFT does not operate when there exists numerous faulty nodes. Thus when discussing BFT, the amount of faulty nodes that can be tolerated must be mentioned. For example, if N = 5f, even if ⅕ of the nodes suffer from byzantine failure, the entire system operates properly. Likewise, if N=3f+1, ⅓ of the nodes could suffer from byzantine failure and the entire system will still operate with no issues.
If we have two systems that are BFT, then obviously the one that can handle more faulty nodes is the superior system. However there is no system that assumes a faulty node of greater than ⅓ of the entire system. This is because theoretically, ⅓ is the greatest amount of faulty nodes that the system can handle.
It's also well known in the industry that it is not possible for an asynchronous system to provide both safety (the guarantee that all non-malicious nodes will eventually agree on what progress was made) and liveness (the ability to continue to make forward progress) with more than this number of malicious failures.
If we are talking about a system for transactions: You can trivially ensure safety by simply making no forward progress at all. And you can trivially make forward progress unsafely by just letting each node do whatever they want. Neither of these modes of operation are useful. So, something has to give on each end and make a middle approach that's both as safe and as lively as we can make a system.
So, we have the following requirements:
- Our system is asynchronous.
- Some participants may be malicious, faulty, or tampered.
- We want safety, that is, we do not want one honest participant honoring one transaction and one honest participant honoring the other.
- We want livenesses, that is, it's not fair just saying we never clear any transactions ever. Sure, that's safe, but not useful. We want to be sure that we eventually agree on which transactions to clear.
So, now the question arises -- how many dishonest participants can we tolerate in our asynchronous system and still guarantee both safety and liveness? and thats where math is used.
As a simple way to get the gist of the proof, though it is not rigorous:
Suppose we have nnodes of which hare honest and dare dishonest. Obviously, n = h + d. Now the system needs to come to consensus on which of two checks to clear.
Think about the case where all the honest nodes are evenly split about the two directions the system could make forward progress. The malicious nodes could tell all the honest nodes that they agree with them. That would give h/2 + dnodes agreeing on each of two conflicting ways the system could make forward progress.
In this case, the honest nodes must not make forward progress or they will go in different directions, losing safety. Thus, the number of nodes required to agree before we can make forward progress must be greater than half the number of honest nodes plus the number of malicious nodes, or we lose safety.
If we call tthe threshold required to make forward progress, that gives us: t > (h/2) + d. This is the requirement for safety.
But the malicious nodes could also fail to agree at all. So the number of nodes required to agree before we can make forward progress must be no more than the number of honest nodes or we lose liveness.
This gives us t <= h. Or h >= t. This is the condition for liveness.
Combining the two results, we get:
h >= t > (h/2) + d h > (h/2) + d (h/2) > d d < (h/2)
Thus the number of faulty nodes we can tolerate is less than half the number of honest nodes. Thus we cannot tolerate 1/3 or more of the nodes being dishonest or we lose either safety or liveness
For a more visual explanation see the following video: