Sriram R

Posted on Feb 17, 2023

Distributed System Models

#distributedsystems #computerscience

System Models

As we saw in the first part of this series, a distributed system is made up of two parts: the nodes and the network. Each part can behave, and fail, in different ways, and the assumptions we make about that behaviour must be taken into account when building distributed systems. We call these sets of assumptions System Models.

The behaviours we use to create variations are usually based on two things:

How the different parts of a distributed system work together.
How the parts of a distributed system fail.

Network Behaviour

Reliable Link

A reliable link guarantees that every message sent across the network is eventually delivered to its destination, without being lost.
In practice, this assumption mostly holds inside a single machine, where components can talk to each other reliably.

Fair Loss Link

A fair loss link means that a message sent across the network might get lost, duplicated, or reordered. We can build a Reliable Link on top of it by retransmitting a message until it gets through and discarding duplicates at the receiver. With retries, the message will eventually reach its destination, but there is no upper bound on how long that takes. In theory, it could take 100 years.
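The retry idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: `FairLossLink`, `Receiver`, and `reliable_send` are hypothetical names, the link is an in-memory simulation, and the boolean returned by `send` stands in for an acknowledgement (in a real network the acknowledgement itself could be lost). Reordering is not handled here.

```python
import random


class FairLossLink:
    """Hypothetical in-memory link that randomly drops or duplicates messages."""

    def __init__(self, loss_rate=0.5, dup_rate=0.2, seed=42):
        self.rng = random.Random(seed)
        self.loss_rate = loss_rate
        self.dup_rate = dup_rate

    def send(self, message, deliver):
        """Attempt one transmission; return True if a copy arrived.

        The return value is a stand-in for an acknowledgement.
        """
        if self.rng.random() < self.loss_rate:
            return False  # the message was lost
        deliver(message)
        if self.rng.random() < self.dup_rate:
            deliver(message)  # the network delivered a duplicate copy
        return True


class Receiver:
    """Delivers each message ID to the application at most once."""

    def __init__(self):
        self.seen_ids = set()
        self.delivered = []

    def on_message(self, message):
        msg_id, payload = message
        if msg_id in self.seen_ids:
            return  # duplicate: ignore it
        self.seen_ids.add(msg_id)
        self.delivered.append(payload)


def reliable_send(link, receiver, msg_id, payload, max_attempts=1000):
    """Retransmit until the link reports delivery; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        if link.send((msg_id, payload), receiver.on_message):
            return attempt
    raise RuntimeError("gave up after max_attempts retransmissions")


link = FairLossLink()
receiver = Receiver()
for msg_id, payload in enumerate(["a", "b", "c"]):
    reliable_send(link, receiver, msg_id, payload)
```

Note that `reliable_send` returns the number of attempts, but nothing bounds that number in advance, which is exactly the "no upper bound on delivery time" property described above.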

Arbitrary Link

In this model, a third party can interfere with messages sent between nodes: it may modify, spoof, eavesdrop on, block, or replay them.

This model is a good representation of what happens when we use the internet over an untrusted network, such as the public Wi-Fi in a Starbucks or another coffee shop. Whoever runs the network can capture the packets passing through it and misuse them.

With TLS, a third party that intercepts packets can no longer read or modify them without being detected, but TLS does not stop that party from blocking the communication altogether.
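As a much-simplified illustration of the integrity part of that protection (this is not how TLS itself is implemented), the sketch below uses an HMAC from Python's standard library so that the receiver can detect a modified message. `SHARED_KEY`, `sign`, and `verify` are hypothetical names chosen for this example, and the sketch assumes both nodes already share a secret key.

```python
import hashlib
import hmac

# Assumption: both nodes already share this secret (hypothetical value).
SHARED_KEY = b"hypothetical-shared-secret"


def sign(message: bytes, key: bytes = SHARED_KEY) -> bytes:
    """Compute an authentication tag for the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(message: bytes, tag: bytes, key: bytes = SHARED_KEY) -> bool:
    """Return True only if the tag matches the message (constant-time compare)."""
    return hmac.compare_digest(sign(message, key), tag)


original = b"transfer 10 coins to alice"
tag = sign(original)

# An attacker on the link changes the payload but cannot forge a matching tag.
tampered = b"transfer 1000 coins to mallory"
accepted_original = verify(original, tag)
accepted_tampered = verify(tampered, tag)
```

A tag like this detects modification and spoofing, but on its own it does nothing against eavesdropping, replay, or an attacker who simply drops the message.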

Node Behaviour

Crash Stop

This model says that when a node becomes faulty, it stops functioning permanently and never recovers.
For example, if you drop your phone under a train, it won't work again.

Crash Recovery

In this model, a node may crash at any moment and resume working after an arbitrary amount of time.
For example, if the operating system in a virtual machine (VM) crashes, a machine restart can fix it and make the node healthy again.
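One practical consequence of this model is that whatever a node held only in memory is gone after a crash, while data written to stable storage is still there when it recovers. The following is a minimal sketch of that distinction; the `Node` class is hypothetical, and a plain dictionary stands in for the disk.

```python
class Node:
    """Hypothetical node with volatile memory and a stand-in for durable storage."""

    def __init__(self, disk):
        self.disk = disk    # survives crashes (stand-in for a real disk)
        self.memory = {}    # volatile: lost when the node crashes
        self.alive = True

    def write(self, key, value, durable):
        if not self.alive:
            raise RuntimeError("node is down")
        self.memory[key] = value
        if durable:
            self.disk[key] = value

    def read(self, key):
        if not self.alive:
            raise RuntimeError("node is down")
        return self.memory.get(key)

    def crash(self):
        """Simulate a crash: the node stops and its in-memory state is wiped."""
        self.alive = False
        self.memory = {}

    def recover(self):
        """Simulate a restart: rebuild in-memory state from durable storage."""
        self.memory = dict(self.disk)
        self.alive = True


node = Node(disk={})
node.write("committed", 1, durable=True)
node.write("in_flight", 2, durable=False)
node.crash()
node.recover()
```

After `recover()`, `node.read("committed")` still returns the stored value, while `node.read("in_flight")` returns `None` because that write never reached the disk.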

Byzantine

A node is considered faulty if it deviates from the algorithm it is supposed to run. A Byzantine node can deviate in arbitrary ways, for any reason or for no apparent reason at all.

This typically occurs when a malicious actor or a bug compromises the node's software.

There's a famous thought experiment called the Byzantine Generals Problem that explains Byzantine behaviour in detail.

Blockchains are a great example of systems designed for the Byzantine model: their algorithms assume that some participants may behave in a Byzantine way.
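To give a feel for how a system can tolerate Byzantine nodes, here is a small sketch of one common building block: a client asks several replicas for a value and accepts it only if at least f + 1 of them agree, where f is the assumed maximum number of Byzantine replicas. With at most f liars, any value backed by f + 1 replies was reported by at least one honest replica. The function name `read_with_voting` is hypothetical, and the sketch assumes that all honest replicas hold the same value; it is not a full consensus algorithm.

```python
from collections import Counter


def read_with_voting(replies, f):
    """Return a value reported by at least f + 1 replicas, or None.

    replies: values returned by the replicas, one per replica.
    f: assumed upper bound on the number of Byzantine replicas.
    """
    if not replies:
        return None
    value, votes = Counter(replies).most_common(1)[0]
    if votes >= f + 1:
        return value
    return None  # not enough agreement to trust any value


# Four replicas, at most one of them Byzantine.
trusted = read_with_voting(["x", "x", "x", "evil"], f=1)

# Too few matching replies: nothing can be trusted yet.
undecided = read_with_voting(["evil", "x"], f=1)
```

Under a Crash-Stop assumption a single reply would be enough, which shows how the chosen node model changes the algorithm.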

Timing Behaviour

Synchronous System

A synchronous system assumes a known upper bound on how long a message takes to reach its destination and a known upper bound on how long a node takes to process it.

For example, if you want to write something into RAM, the RAM has guarantees about how long it could take in the worst case.

Building a truly synchronous distributed system is nearly impossible, and wrongly assuming synchrony can be devastating.

Let's say you think a node can process a message in 5ms, but in the middle of the process, the Operating System does a Context Switch or a [long GC pause](https://docs.datastax.com/en/dse-trblshoot/doc/t
In this case, the assumption is wrong, and your system goes down with it.
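To make this failure mode concrete, here is a minimal sketch of a timeout-based failure detector that assumes every response arrives within a fixed bound. The function name `false_suspicions` and the response times are made up for illustration; every node in the example is actually alive.

```python
def false_suspicions(response_times_ms, timeout_ms):
    """Return the indexes of responses that arrived after the timeout.

    Every response in the list did arrive, so each index returned here is a
    healthy node that a fixed-timeout detector would wrongly declare dead.
    """
    return [
        index
        for index, elapsed in enumerate(response_times_ms)
        if elapsed > timeout_ms
    ]


# Hypothetical measurements: the third request hit a long pause on the node.
observed_ms = [3, 4, 250, 2]
wrongly_suspected = false_suspicions(observed_ms, timeout_ms=5)
```

The node behind the third request was never faulty; only the 5ms assumption was. Any logic that reacts to the suspicion, such as electing a new leader, would act on wrong information.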

Asynchronous System

In this model, we make no assumptions about processing time or message delivery time. Any message can be delayed arbitrarily and without warning.

Algorithms built for asynchronous systems are very robust because their correctness does not depend on timing, so network delays and latency cannot break them. However, designing such algorithms is hard.

Partially Synchronous System

In this model, we assume that the system behaves synchronously most of the time, but may behave asynchronously for unpredictable periods. This provides a decent compromise between the synchronous and asynchronous models.
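One way algorithms cope with this model is to treat a timeout as a guess rather than a guarantee, and to enlarge it whenever the guess turns out to be too small. The sketch below shows that idea; `AdaptiveTimeout`, its default values, and the doubling policy are assumptions made for this example, not a prescribed algorithm.

```python
class AdaptiveTimeout:
    """Hypothetical timeout that grows each time it proves too optimistic."""

    def __init__(self, initial_ms=5.0, factor=2.0):
        self.timeout_ms = initial_ms
        self.factor = factor

    def observe(self, response_ms):
        """Record one response time.

        Return True if the response arrived within the current timeout.
        Otherwise enlarge the timeout and return False.
        """
        if response_ms <= self.timeout_ms:
            return True
        self.timeout_ms *= self.factor
        return False


timeout = AdaptiveTimeout(initial_ms=5.0, factor=2.0)
# Hypothetical response times: a slow period starts at the second request.
outcomes = [timeout.observe(ms) for ms in [3, 12, 12, 12, 4]]
```

During the slow period the detector is wrong twice, then its timeout has grown enough to match the new delays. Once the system becomes synchronous again, even with a larger bound than first assumed, the timeout stops producing false alarms.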

Why should you care about System Models?

Your distributed system is only as strong as the assumptions it is built on. These models describe the different ways a system can behave, and if you make a wrong assumption, your system could break.

For instance, blockchain algorithms were developed under the assumption of Byzantine node behaviour, which means any node might be malicious. This is necessary because the ledger must stay correct even if someone tries to tamper with it. If the designers had assumed a Crash-Stop model instead, the algorithms would have been flawed: they would not have handled the scenario in which a node misbehaves, and malicious people would have exploited that weakness.

As a result, understanding System Models and selecting an accurate System Model based on your use case is critical when designing large-scale systems.
