Concepts of Distributed Systems - Part 1

What Are Distributed Systems?

There are many definitions of distributed systems. For example, Wikipedia defines a distributed system as follows:

A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another. The components interact with one another in order to achieve a common goal.

Similarly, Techopedia defines a distributed system as follows:

A distributed system is a network that consists of autonomous computers that are connected using a distribution middleware. They help in sharing different resources and capabilities to provide users with a single and integrated coherent network.

Whichever definition you choose, there are three important things to notice. First and foremost, a distributed system consists of components or computers that are autonomous. Second, to any user or program, a distributed system appears to be a single system (coherent, achieving a common goal, etc.). Third, these autonomous components need to communicate and coordinate with each other in some way.

One of the keys to building distributed systems lies in how communication and coordination are established between these autonomous components.

Characteristics of Distributed Systems

There are certain characteristics which are common to distributed systems. We will discuss these characteristics in the following sections.

Transparency

A distributed system needs to hide most of its details from the user (an end user or another system). That is, the user of a distributed system is unaware of any differences between the components (software stack, libraries, etc.) or computers (hardware details, operating system, etc.), or of how they communicate. The user is also unaware of how the different components are organised internally.

A distributed system is generally assumed to be available, even if parts of the system are temporarily unavailable. Users should not be aware that certain parts are unavailable, or being fixed or removed, or that other parts are being added to the system.

In general, an important characteristic of a distributed system is the ability to hide the fact that the system consists of physically distributed components and present itself as if it were a single system or computer. A system which accomplishes this is said to provide transparency.

A distributed system can provide different kinds of transparency.

Transparency Type	Transparency Details
Location	Hide where a resource is located
Migration	Hide the fact that a resource may be moved or relocated while in use
Replication	Hide the fact that a resource may be replicated
Concurrency	Hide that a resource may be shared by multiple users
Failure	Hide the fact that resources of the system may fail, recover, be removed or added
Data	Hide differences in data formats and representation
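
As a rough illustration of location, replication and failure transparency, the sketch below puts a facade in front of two copies of the same data. The names `Replica`, `TransparentStore`, `eu-1` and `us-1` are hypothetical and exist only for this example; the replicas are in-memory stand-ins rather than real networked machines.

```python
class Replica:
    """Hypothetical in-memory stand-in for a store running on one machine."""

    def __init__(self, name, data, up=True):
        self.name = name
        self.data = dict(data)
        self.up = up

    def read(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.data[key]


class TransparentStore:
    """Facade: callers see one store, not the replicas behind it."""

    def __init__(self, replicas):
        self._replicas = list(replicas)

    def get(self, key):
        last_error = None
        for replica in self._replicas:
            try:
                return replica.read(key)
            except ConnectionError as err:
                last_error = err  # hide the failure and try the next copy
        raise RuntimeError("all replicas are unavailable") from last_error


store = TransparentStore([
    Replica("eu-1", {"user:1": "alice"}, up=False),  # simulated outage
    Replica("us-1", {"user:1": "alice"}),
])
print(store.get("user:1"))  # prints "alice"
```

The caller never learns which replica answered, or that the first one was down. The failure only becomes visible when every copy is unavailable.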

As with any design choice, there is a tradeoff. Aiming for a high level of transparency can adversely affect performance and make the system harder to understand, among other things. Not every kind of transparency is achievable, or even required, for every system. Certain use cases warrant certain kinds of transparency; however, we will not cover these in the interest of brevity.

Scalability

Another important characteristic of distributed systems is the ability to scale. Scalability means that the system can cope with an increased load (number of users, storage, compute or resources) without degrading the quality of service it offers. There are many facets to scaling a distributed system. However, a common theme is to move away from centralised services, centralised data and centralised algorithms. Centralised components, whether services, data or algorithms, not only become single points of failure but also become bottlenecks when the load on the system exceeds the capacity they can handle.

In distributed systems, decentralisation is the key. Decentralised systems (services, data or algorithms) have certain characteristics:

No machine has complete information about the state of the system. Decisions that need a global view are therefore based on what a majority of the nodes agree on. Such agreement can be reached using techniques like quorums, total order broadcast or consensus.
Machines make local decisions based on local information.
Faults can occur and some machines can fail, but the system as a whole continues to work. This is often called resilience. (Faults can be subdivided into hardware faults, which are random and weakly correlated with one another; software faults, which manifest under certain conditions and are strongly correlated; and human errors.)
There is no implicit assumption of a global clock (refer to Time and Order in Distributed Systems for details).
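
The idea of deciding by majority can be sketched in a few lines. This is a minimal illustration rather than a full quorum protocol: the function names `majority_value` and `quorums_overlap` are hypothetical, and the responses are plain values rather than messages from real nodes.

```python
from collections import Counter


def majority_value(responses, cluster_size):
    """Return the value reported by a strict majority of the cluster, else None.

    `responses` holds the values returned by the nodes that answered;
    `cluster_size` is the total number of nodes, including silent ones.
    """
    if not responses:
        return None
    value, count = Counter(responses).most_common(1)[0]
    return value if count > cluster_size // 2 else None


def quorums_overlap(n, w, r):
    """With n replicas, a write quorum w and a read quorum r share a node when w + r > n."""
    return w + r > n


print(majority_value(["v2", "v2", "v1"], cluster_size=3))  # prints "v2"
print(majority_value(["v2", "v1"], cluster_size=5))        # prints "None"
print(quorums_overlap(n=3, w=2, r=2))                      # prints "True"
```

Counting against the cluster size rather than the number of responses matters: two matching answers out of five nodes are not a majority, even if only two nodes replied.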

There are subtle differences in how you approach scaling a system across geographical locations. Across wide area networks, latencies can be as much as three orders of magnitude higher than across local area networks. A local area network is generally reliable and based on broadcast, whereas a wide area network is generally unreliable and point to point.

Systems designed to run in local area networks typically use a synchronous model, where a client (some system) sends a request and then blocks until it receives a response from the server (a different system). However, such synchronous mechanisms do not work effectively across geographically distributed systems, because each blocked request waits for a full wide-area round trip.
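
The cost of blocking on every request can be sketched with a simulated round trip. In the example below, `fetch` is a hypothetical remote call and `SIMULATED_LATENCY` is an assumed value chosen only for illustration; no real network is involved. The first style waits for each response before sending the next request, while the second sends all requests and then collects the responses.

```python
import asyncio
import time

SIMULATED_LATENCY = 0.05  # assumed round-trip time in seconds, illustration only


async def fetch(request_id):
    """Hypothetical remote call; the sleep stands in for network latency."""
    await asyncio.sleep(SIMULATED_LATENCY)
    return f"response-{request_id}"


async def blocking_style(n):
    """Wait for each response before sending the next request."""
    return [await fetch(i) for i in range(n)]


async def non_blocking_style(n):
    """Send all requests first, then collect the responses."""
    return await asyncio.gather(*(fetch(i) for i in range(n)))


def timed(coro_fn, n):
    """Run a coroutine function to completion and measure the elapsed time."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


sequential, t_seq = timed(blocking_style, 5)
concurrent, t_con = timed(non_blocking_style, 5)
print(f"blocking: {t_seq:.2f}s, non-blocking: {t_con:.2f}s")
```

Both styles return the same responses, but the blocking style pays the round-trip time once per request, whereas the non-blocking style pays it roughly once in total. The gap widens as latency grows, which is why the difference matters most across wide area networks.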

Most scaling problems manifest as performance problems caused by the limited capacity of servers or the network. In general, these problems can be resolved using the following techniques:

Asynchronous communication - Certain applications lend themselves well to asynchronous communication. In such systems, the requester does not block while waiting for the response to arrive. This is generally accomplished using some kind of callback or event handler that is triggered when a response is received.
Replication - In distributed systems, it often makes sense to replicate components. Replication not only increases availability but also helps to balance the load across multiple components, leading to better performance. For geographical scaling, replicas can be set up closer to the clients they serve. The challenge with replication is that consistency must be maintained across multiple copies. If the copies never changed, replication would be simple to accomplish.
Partitioning - Partitioning is a form of decentralisation. For example, if the data volumes are too large to fit on a single machine, the data may be split into smaller chunks and stored on different machines. Services can be split by function (called Y-axis splitting), and each service can have multiple replicas (sometimes called an m-n topology, where there are m components, each having n replicas). Certain applications need data partitioned by key ranges or by a hash function that operates on some key. Partitioning and replication typically go hand in hand.
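
Hash partitioning combined with replication can be sketched as follows. This is a minimal example under stated assumptions: the function names, the `p{partition}-r{replica}` node naming scheme and the choice of SHA-256 are all hypothetical, picked only to keep the sketch self-contained.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition using a stable hash.

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used here to keep the mapping repeatable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replicas_for(key, num_partitions, replicas_per_partition):
    """Return the hypothetical nodes holding a key: n replicas of one of m partitions."""
    partition = partition_for(key, num_partitions)
    return [f"p{partition}-r{r}" for r in range(replicas_per_partition)]


# An m-n topology with m = 4 partitions and n = 3 replicas each.
print(replicas_for("user:42", num_partitions=4, replicas_per_partition=3))
```

Every client that runs the same function sends a given key to the same partition without consulting a central directory. One limitation of plain modulo hashing is that changing `num_partitions` remaps most keys, so this scheme suits a fixed number of partitions.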

Building distributed systems can seem a formidable task. However, there are design principles which can be used to build reliable and robust distributed systems. Issues often arise when systems are built on false assumptions, known as the fallacies of distributed computing. These fallacies were catalogued by L. Peter Deutsch and others at Sun Microsystems:

The network is reliable
Latency is zero
Bandwidth is infinite
The network is secure
Topology doesn't change
There is one administrator
Transport cost is zero
The network is homogeneous
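
Taking the first fallacy as an example, code that does not assume a reliable network has to expect failed calls and decide what to do about them. The sketch below retries a failed call with exponential backoff. `call_with_retries` and `FlakyService` are hypothetical names, and the failures are simulated rather than caused by a real network.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=0.01):
    """Run `operation`, retrying on connection errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)


class FlakyService:
    """Hypothetical service that fails a fixed number of times before succeeding."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def ping(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated dropped request")
        return "pong"


service = FlakyService(failures=2)
result = call_with_retries(service.ping)
print(result, service.calls)  # prints "pong 3"
```

Retrying is only safe when the operation is idempotent, because a request may have been processed even though its response was lost.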

In this post, we have covered what distributed systems are, what some of their characteristics are and why building them is a difficult task. I have glossed over quite a few things for the sake of brevity; covering every aspect in detail would make the post excessively long.

References

https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing

https://microservices.io/articles/scalecube.html

https://www.cl.cam.ac.uk/~jac22/books/ods/ods/node18.html

http://crystal.uta.edu/~kumar/cse6306/papers/mantena.pdf

https://en.wikipedia.org/wiki/Replication_(computing)

https://en.wikipedia.org/wiki/Atomic_broadcast

https://en.wikipedia.org/wiki/Quorum_(distributed_computing)

https://en.wikipedia.org/wiki/Consensus_(computer_science)