Kostas Kalafatis

Originally published at dfordebugging.wordpress.com

Fallacies of Distributed Systems

The network is reliable

Before we talk about this first fallacy, let's quickly look at what reliability means. Reliability is the degree to which a product or service continues to conform to its specifications while in use, even when parts of it fail. In other words, reliability is the quality of uptime: the assurance that functionality and the end-user experience are preserved as effectively as possible.

Now, let us return to our fallacy. Networks are complicated, dynamic, and frequently unpredictable. A network failure or network-related issue can occur for a variety of reasons, including a switch or power failure, misconfiguration, an entire datacenter becoming unavailable, DDoS attacks, and so on. Networks are unreliable due to their complexity and general unpredictability.

Many systems are interconnected by nature, which means that during normal operation, they may make calls to other systems via one or more networks to meet functionality requirements. In any tiered architecture, the underlying process and mechanism of communication will (hopefully) be abstracted away - for example, accepting a payment on behalf of a customer may be as simple as calling a method like this:

```csharp
var processor = PaymentStrategyFactory.GetCardPaymentProcessor();
processor.Process(paymentRequest);
```

This level of abstraction is both useful and desirable, but it does tend to bias our perception of success. What happens when an underlying network call to a third-party payment processor fails due to a timeout exception, such as an HTTP timeout? Depending on when and how such an error occurs, the request may be retryable without causing any negative consequences. However, the operation must be idempotent for a retry to be safe. Retrying the above request N times without idempotence may result in the customer being charged N times.
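
To illustrate, here is a minimal sketch of a retry loop built around an idempotency key, assuming the processor accepts one via an `Idempotency-Key` header (as many payment APIs do). The endpoint URL and class names are hypothetical:

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public class PaymentClient
{
    private readonly HttpClient _http = new HttpClient();

    public async Task ProcessAsync(string paymentJson, int maxAttempts = 3)
    {
        // One key per logical payment, reused across every retry, so the
        // processor can recognise duplicates and charge at most once.
        var idempotencyKey = Guid.NewGuid().ToString();

        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                using var request = new HttpRequestMessage(
                    HttpMethod.Post, "https://payments.example.com/charges");
                request.Headers.Add("Idempotency-Key", idempotencyKey);
                request.Content = new StringContent(paymentJson);

                var response = await _http.SendAsync(request);
                response.EnsureSuccessStatusCode();
                return; // charged exactly once
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Transient network failure: back off, then retry with the SAME key.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
            }
        }
    }
}
```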

So, how do you make your distributed system dependable, ensuring that it continues to function normally even in the presence of untrustworthy networks? Accepting and treating failure as a given is critical. In the face of adversity, you should design your system to be able to mitigate inevitable failures and continue to operate as expected.

From an infrastructure perspective, your system must be fault-tolerant and highly redundant.


Aside from infrastructure concerns, you must also consider connections being dropped and messages and API calls being lost due to network failures. Data integrity is critical for some use cases (for example, a real-time chat app), and all messages must be delivered exactly once and in order to end users — at all times (even if failures are involved). Your system must be stateful in order to ensure data integrity. Mechanisms such as automatic reconnection and retries, deduplication (or idempotency), and methods to enforce message ordering and guarantee delivery are also required.

A queuing system can be used to leverage a Retry and Acknowledgement design to mitigate network unreliability.

Queuing systems, such as RabbitMQ and ActiveMQ, typically use the Store and Forward pattern, in which the message is stored locally before being forwarded to the recipient(s). If the forwarding fails, the queuing system will automatically retry. If the message still fails after a certain number of attempts, it is moved to the Dead Letter Queue. Until the queue receives an acknowledgement of receipt from the recipient, the message is considered undelivered. But beware: queuing requires a shift from a synchronous request/response model to an asynchronous one. This is not a trivial exercise in any system of any complexity, and it may necessitate re-evaluating other aspects of your system, not least the user experience.
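
As a concrete illustration, here is a minimal consumer sketch using the RabbitMQ .NET client (v6-style API). It assumes a `payments.dlx` dead-letter exchange has been declared separately; the queue name and handler are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using RabbitMQ.Client;
using RabbitMQ.Client.Events;

var factory = new ConnectionFactory { HostName = "localhost" };
using var connection = factory.CreateConnection();
using var channel = connection.CreateModel();

// Durable queue whose rejected messages are routed to a dead-letter exchange.
channel.QueueDeclare(queue: "payments",
                     durable: true,
                     exclusive: false,
                     autoDelete: false,
                     arguments: new Dictionary<string, object>
                     {
                         ["x-dead-letter-exchange"] = "payments.dlx"
                     });

var consumer = new EventingBasicConsumer(channel);
consumer.Received += (_, ea) =>
{
    try
    {
        var message = Encoding.UTF8.GetString(ea.Body.ToArray());
        Console.WriteLine($"Handling {message}");          // your business logic here
        channel.BasicAck(ea.DeliveryTag, multiple: false); // confirm receipt
    }
    catch (Exception)
    {
        // requeue: false hands the message over to the dead-letter exchange.
        channel.BasicNack(ea.DeliveryTag, multiple: false, requeue: false);
    }
};

// autoAck: false means the broker keeps the message until we acknowledge it.
channel.BasicConsume(queue: "payments", autoAck: false, consumer: consumer);
Console.ReadLine(); // keep the consumer alive
```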

Latency is zero

Before we delve too deeply into networks—not to mention the fallacies associated with them—there are a few key terms to understand. When we talk about sending messages over a network, we usually mean sending data across the network. However, we can discuss data and how it is transmitted in two ways: latency and bandwidth.

Latency is a measure of how long data takes to arrive at its destination: how long it takes for data to travel from node A to node B in our system. Many developers are already familiar with the concept, because we frequently quote latency in milliseconds. For example, if a request was served from a server (node A) to a client (node B) in 400 milliseconds, we can say that the latency of that request was 400 milliseconds, or that the message took 0.4 seconds to travel from the server to the client. Many performance-conscious developers consider latency when deciding what to optimize or improve in their applications or systems.

Worse, latency is not constant: on a bad network, it can occasionally be measured in seconds. By definition, there is no guarantee of the order in which individual packets will be serviced by a network, or even that the requesting process will still be running when a response arrives; latency exacerbates both problems. Furthermore, where applications compensate for high latency by sending multiple concurrent requests, the extra traffic can make the very congestion that caused the latency even worse.

When running apps in your local environment, latency can be close to zero, and it's often negligible on a local area network. In a wide area network, however, latency quickly degrades. Because the network may span large geographical areas, data in a WAN frequently has to travel further from one node to another (this is usually the case with large-scale distributed systems). When designing your system, keep in mind that latency is an inherent limitation of networks. We should never assume that there will be no delay or zero latency between sending and receiving data.
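
A quick way to see this for yourself: time a few round trips to any remote endpoint and watch how the numbers vary between runs (the URL below is a placeholder):

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

var http = new HttpClient();

for (var i = 1; i <= 5; i++)
{
    var stopwatch = Stopwatch.StartNew();
    await http.GetAsync("https://api.example.com/ping"); // placeholder endpoint
    stopwatch.Stop();

    // On a LAN this may be single-digit milliseconds; across a WAN it can be
    // hundreds, and it is rarely the same number twice.
    Console.WriteLine($"Round trip {i}: {stopwatch.ElapsedMilliseconds} ms");
}
```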


Latency is primarily limited by distance and the speed of light. Of course, there is nothing we can do about the latter. Even in theoretically perfect network conditions, packets cannot exceed the speed of light. However, we can do something about distance by bringing data closer to clients through edge computing. If you're building a cloud-based system, make sure your availability zones are close to your clients and route traffic accordingly.

There are a few ways to cut down on latency or at least make it less noticeable:

  • Caching: Browser caching can assist in lowering latency and the number of requests sent to the server. A CDN can also be used to cache resources in multiple locations around the world; once cached, the client can fetch them from the datacenter or point of presence closest to them instead of from the origin server.
  • Using an event-driven protocol: Depending on the nature of your application, you may want to consider a communication protocol such as WebSockets. WebSockets have a much shorter round-trip time than HTTP (once the connection has been established). Furthermore, a WebSocket connection is kept open, allowing data to be passed back and forth between server and client in real time as soon as it becomes available, a significant improvement over the HTTP request-response model (see the sketch after this list).
  • Improving server performance: Server performance (processing speed, hardware used, available RAM) is strongly related to latency. To keep your network from getting clogged and your servers from becoming overloaded, you need to be able to (dynamically) increase the capacity of your server layer and reassign load.
  • Inverting the flow of data: Instead of querying other services on demand, we can subscribe to their changes (Pub/Sub) and store the data locally, so it is already there when we need it. This adds complexity, of course, but it can be a useful tool in the toolbox.
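
To illustrate the event-driven option above, here is a minimal client sketch using .NET's built-in ClientWebSocket; the URL is hypothetical. After the one-time handshake, messages flow in both directions over the same open connection:

```csharp
using System;
using System.Net.WebSockets;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

using var ws = new ClientWebSocket();

// One-time handshake; afterwards the connection stays open.
await ws.ConnectAsync(new Uri("wss://chat.example.com/socket"), CancellationToken.None);

// Send without re-establishing a connection per message.
var hello = Encoding.UTF8.GetBytes("hello");
await ws.SendAsync(hello, WebSocketMessageType.Text,
                   endOfMessage: true, CancellationToken.None);

// Receive data as soon as the server pushes it.
var buffer = new byte[4096];
var result = await ws.ReceiveAsync(buffer, CancellationToken.None);
Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, result.Count));
```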

Bandwidth is infinite

Bandwidth is a measurement of how much data can be transmitted from one location to another in a given amount of time. We frequently use the term "bandwidth" in the context of internet speed (for example, megabits per second or Mbps) because it refers to how much data can be sent over our specific network or the capacity of our network to send some amount of data in a given amount of time. While latency is the actual speed at which data travels through the network to its destination, bandwidth is the network's capacity to send a certain amount of data to its destination.
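
As a rough worked example of the difference: pushing a 25 MB file (200 megabits) over a 100 Mbps link takes about 2 seconds of transmission time no matter how low the latency is, while the delivery time of a 1 KB message over the same link is dominated almost entirely by latency. Bandwidth tells you how wide the pipe is; latency tells you how long it is.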


Without a doubt, computing technology has advanced significantly since the 1990s, when the eight fallacies were coined. Network bandwidth has also increased, allowing us to send more data across our networks. However, even with these advancements, network capacity is not infinite (partially because our appetite for generating and consuming data has also increased). Users may not be able to enjoy consistent, high-bandwidth network connections, or they may experience highly variable connectivity at times, such as when using mobile networks. Furthermore, as bandwidth has increased in general, so has the need to support large volumes of data such as audio and video streams. When a large amount of data attempts to flow through the network and there is insufficient bandwidth support, the following problems can occur:

  • Queueing delays, bottlenecks, and network congestion.
  • Packet loss resulting in lower service quality guarantees, such as messages being lost or delivered out of order.
  • Poor network performance and even system instability.

The third fallacy (bandwidth is infinite) pulls in the opposite direction from the second (latency is zero): to reduce the number of network round trips, you batch more data into each transfer, but to conserve bandwidth, you reduce the amount of data being transferred. Finding the sweet spot between these two forces determines how much information you send over the wire.

There are several ways to make better use of the bandwidth you have. They include:

  • "Comprehensive monitoring": Monitoring and logging network activity is critical for detecting and correcting problems (such as the source of excessive bandwidth use).
  • Multiplexing: Multiplexing is a technique that improves bandwidth utilization by allowing you to combine data from multiple sources and send it over the same communication channel/medium, and it is supported by protocols like HTTP/2, HTTP/3, and WebSockets.
  • Lightweight data formats: You can save network bandwidth by using data exchange formats designed for speed and efficiency, such as JSON. Another option is MessagePack, a compact binary serialization format that produces even smaller messages than JSON (see the sketch after this list).
  • Network traffic control: Consider throttling or rate limiting, congestion control, exponential backoff, and other mechanisms.
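
As a sketch of the lightweight-format option above, here is what serialization might look like with the MessagePack-CSharp library; the type and values are illustrative. The binary encoding is typically noticeably smaller than the equivalent JSON:

```csharp
using System;
using MessagePack; // MessagePack-CSharp NuGet package

[MessagePackObject]
public class Reading
{
    [Key(0)] public long Timestamp { get; set; }
    [Key(1)] public double Value { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var reading = new Reading { Timestamp = 1_700_000_000, Value = 21.5 };

        // Compact binary encoding versus the equivalent JSON text.
        byte[] packed = MessagePackSerializer.Serialize(reading);
        string json = MessagePackSerializer.SerializeToJson(reading);

        Console.WriteLine($"MessagePack: {packed.Length} bytes");
        Console.WriteLine($"JSON:        {json.Length} bytes");
    }
}
```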

In general, we should exercise caution when assuming that high bandwidth is a universal experience. Whatever the network bandwidth, it will never come close to the speed at which co-hosted processes can communicate.

The network is secure


As Gene "Spaf" Spafford once said:

The only truly secure system is one that is powered off, cast in a block of concrete, and sealed in a lead-lined room with armed guards—and even then I have my doubts.

A network can be attacked or hacked in many ways, such as through bugs, weaknesses in operating systems and libraries, communication that isn't encrypted, mistakes that let unauthorized people access data, viruses and malware, cross-site scripting (XSS), and DDoS attacks, just to name a few (the list is long).

Suppose HTTPS is the only way to reach your e-commerce website, so you can rest easy knowing that all communication between your customers and your website is encrypted. But some of the scripts on your website that collect data about your customers load over plain HTTP. Also, that third-party service that you privately call from your own API to validate customer addresses works over HTTP, because HTTPS was once thought to be "too expensive."

In April 2014, the Heartbleed vulnerability was disclosed. The SSL/TLS protocol has a "heartbeat" feature that lets one side of a connection send a short message to the other side to check whether it is still available. Heartbleed was a buffer over-read in OpenSSL: by sending malformed heartbeat requests that OpenSSL failed to bounds-check, attackers could trick servers into returning random chunks of memory, even for failed connections. Because those chunks of memory often contained data belonging to already-authenticated users, attackers could harvest sensitive details simply by repeatedly pinging the server.

A quick look at the Have I Been Pwned? website shows that attackers have exposed millions of user details by exploiting flaws such as Heartbleed, improperly configured application stacks, insecure databases, and so on.

So definitely, the network is not secure.

Even though there is no such thing as true (absolute) security in the world of distributed computing, you should still do everything you can to prevent your system from being compromised when you design, build, and test it. The goal is to minimize both the frequency and the impact of security incidents.

Here are a few things to think about:

  • Threat modeling: A structured process is recommended for identifying potential security threats and vulnerabilities, quantifying their severity, and prioritizing mitigation techniques.
  • Defense in depth: Use a layered approach, with different security checks at the network, infrastructure, and application levels (see the sketch after this list).
  • Security mindset: Keep security in mind when designing your system and follow best practices as well as industry recommendations and advice, such as OWASP's Top 10 list, which covers the most common web application security risks that your system should be prepared to address.
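
As one illustration of layering independent controls at the application level, an ASP.NET Core pipeline (the rate limiter requires .NET 7 or later) might stack checks like this; the policy values are illustrative only:

```csharp
using System;
using System.Threading.RateLimiting;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.RateLimiting;
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Layer 1: rate limiting to blunt brute-force and DoS attempts.
builder.Services.AddRateLimiter(options =>
    options.AddFixedWindowLimiter("per-client", o =>
    {
        o.PermitLimit = 100;
        o.Window = TimeSpan.FromMinutes(1);
    }));

var app = builder.Build();

app.UseHsts();              // Layer 2: tell browsers to insist on HTTPS.
app.UseHttpsRedirection();  // Layer 3: bounce plain-HTTP requests to HTTPS.
app.UseRateLimiter();

app.MapGet("/", () => "hello").RequireRateLimiting("per-client");
app.Run();
```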

Topology doesn't change

In a nutshell, network topology refers to how the links and nodes of a network are organized and related to one another. The network topology of a distributed system changes all the time. Sometimes it's due to an error or an issue, such as a server crash. Sometimes it's on purpose — we add, upgrade, or remove servers. When designing your distributed system, it's critical not to rely on the network topology being consistent or to expect it to behave consistently. There are numerous network topologies to choose from, each with its own set of benefits and drawbacks.


Each node in a ring topology, for example, connects to exactly two other nodes. Data is passed from node to node, with each node handling one packet. A ring topology does not require a central node to manage connectivity, and it is relatively simple to reconfigure if nodes are added or removed. Furthermore, it can be made full-duplex (dual ring topology), allowing data to flow both clockwise and counterclockwise. On the other hand, a single ring topology is extremely prone to failure. If just one node fails, the entire network can go down.

Consider another kind of topology. There is no central connection point in a mesh topology because the nodes are interconnected. Mesh topologies use two methods to transmit data: routing (in which nodes determine the shortest distance from source to destination) and flooding (information is sent to all nodes within the network). Because the network's nodes are interconnected, it is durable and fault-tolerant, with no single point of failure. No single device can bring the network down; new nodes can be added without disrupting the network; and high-speed data transfers are possible (when routing is used). However, due to the large number of connections required for interconnectivity, it is a complex topology that necessitates meticulous planning, a lengthy setup time, and constant monitoring and maintenance.

Given that network topology is indeed volatile, there are two broad approaches to mitigating topology changes. 

Abstract network specifics

  • Instead of directly referencing IP addresses, use DNS hostnames to refer to resources (see the sketch after this list).
  • For microservice architectures, consider using a "service discovery pattern" and tools such as Zookeeper and Consul.
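
For the DNS point above, the sketch below resolves a stable hostname at call time instead of pinning an IP address; the hostname is a placeholder. When the node behind it is replaced, only the DNS record changes, not your code:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;

// Resolve the current address(es) behind a stable name at call time.
var addresses = await Dns.GetHostAddressesAsync("db.internal.example.com");

foreach (IPAddress address in addresses)
{
    Console.WriteLine(address); // whatever the topology looks like today
}
```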

Plan for failure

  • Build your architecture to avoid irreplaceability, and assume that any server could fail.
  • Implement chaos engineering and test for system behavior in the event of infrastructure, network, or application failures.

There is one administrator


If your system is self-contained and deployed in environments that you completely control and manage, you may only need one administrator. However, it is more likely that any non-trivial system will rely on one or more in-house or third-party services for operation and will be deployed in environments that you do not completely control or that differ in a variety of ways, such as operating system and/or application framework versions, security patching policies, UAC, shared resources, firewall rules, and so on. To deploy and support the system, multiple in-house teams may be involved, each operating outside of the processes governing that system's development. Third-party services, including their availability, compatibility, and development cadence, are typically beyond your control.

As your systems grow, they will rely on systems that are not under your control. Take a moment to consider all of your dependencies: everything from your code down to the servers you run it on.

It is critical to have a clear method of managing your systems and their configurations. As the number of systems with different configurations grows, it gets harder to keep track of and manage them. Infrastructure as Code (IaC) can aid in the codification of these variations in your systems.

But what is Infrastructure as Code? Infrastructure as Code (IaC) refers to the management and provisioning of infrastructure using code rather than manual processes.

IaC generates configuration files containing your infrastructure specifications, making it easier to edit and distribute configurations. It also ensures that the same environment is always provisioned. IaC facilitates configuration management and helps you avoid undocumented, ad-hoc configuration changes by codifying and documenting your configuration specifications.
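
For a taste of what IaC looks like when the "code" really is code, here is a minimal sketch assuming the Pulumi .NET SDK and its AWS provider; the resource name is illustrative. Running the same program against each environment provisions the same infrastructure every time:

```csharp
using System.Collections.Generic;
using Pulumi;
using Aws = Pulumi.Aws;

// Declares an S3 bucket as code; `pulumi up` makes reality match the program.
return await Deployment.RunAsync(() =>
{
    var bucket = new Aws.S3.Bucket("app-assets");

    // Exported outputs are visible to other tools and stacks.
    return new Dictionary<string, object?>
    {
        ["bucketName"] = bucket.Id,
    };
});
```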

Diagnosing problems can be difficult even in the best of circumstances, and it is especially difficult in distributed systems. To learn more about how a distributed system works, make sure that the Three Pillars of Observability (centralized logging, metrics, and tracing) have been taken into account as key design elements.
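
On the tracing pillar, .NET's built-in System.Diagnostics API gives a feel for what a span looks like. In production you would typically wire up an exporter such as OpenTelemetry instead of the console listener sketched here, and the names are illustrative:

```csharp
using System;
using System.Diagnostics;

var source = new ActivitySource("Shop.Checkout");

// A listener stands in for a real exporter (e.g., OpenTelemetry).
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Shop.Checkout",
    Sample = (ref ActivityCreationOptions<ActivityContext> _) =>
        ActivitySamplingResult.AllData,
    ActivityStopped = a =>
        Console.WriteLine($"{a.OperationName} took {a.Duration.TotalMilliseconds} ms")
};
ActivitySource.AddActivityListener(listener);

using (var activity = source.StartActivity("ProcessPayment"))
{
    activity?.SetTag("payment.amount", 42.00);
    // ... the actual work being traced goes here ...
}
```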

Transport cost is zero

Just as latency is not zero, transporting data from one location to another has a cost that is not insignificant.


First and foremost, networking infrastructure is not free. Servers, network switches, load balancers, proxies, and firewalls all cost money, and so do operating, maintaining, and securing the network, along with the staff needed to keep it all running smoothly. The larger the network, the greater the financial cost.

Aside from money, we must consider the time, effort, and difficulty involved in designing a distributed system that operates over a highly available, reliable, and fault-tolerant network. It is often less risky, easier, and cheaper to hand this complexity off to a fully managed, battle-tested solution built just for this purpose.

While you may not be able to do much about the price of infrastructure, you can influence how it is used. Depending on the features offered by the system, a significant amount of time can be saved by eliminating the need to marshal data between Layers 7 and 4. And for systems that provide audio and video streaming, telephony, or realtime collaboration features, using an optimized data format such as protocol buffers instead of JSON or XML can significantly reduce the cost of transport (see the sketch below).
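
As a sketch of the protocol buffers point, using the protobuf-net library (the message type is illustrative), the encoded frame is typically a fraction of the size of the equivalent JSON or XML:

```csharp
using System;
using System.IO;
using ProtoBuf; // protobuf-net NuGet package

[ProtoContract]
public class Frame
{
    [ProtoMember(1)] public long SequenceNumber { get; set; }
    [ProtoMember(2)] public byte[] Payload { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var frame = new Frame { SequenceNumber = 42, Payload = new byte[] { 1, 2, 3 } };

        using var stream = new MemoryStream();
        Serializer.Serialize(stream, frame);  // compact binary wire format

        Console.WriteLine($"Encoded size: {stream.Length} bytes");
    }
}
```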

The network is homogeneous

Often, not even your home network is homogeneous. It's enough to have just two devices with different configurations (e.g., laptops or mobile devices) using different transport protocols, and your network is heterogeneous.


Most distributed systems need to integrate with multiple types of devices, adapt to various operating systems, work with different browsers, and interact with other systems. Therefore, it’s critical to focus on interoperability, thus ensuring that all these components can "talk" to each other, despite being different.

Instead of proprietary protocols, use open, widely supported ones whenever possible; HTTP, WebSockets, SSE, and MQTT are all examples. Similar considerations apply to data formats, where choices like JSON and MessagePack are typically the most suitable.

Conclusion

Though the term "fallacies of distributed computing" was coined decades ago, it continues to accurately describe common misconceptions about this type of computing. This is because the defining features and fundamental issues of distributed systems have not changed significantly over time. Developments like cloud computing, automation, and DevOps make life easier and lessen the impact of these misconceptions, but they don't eliminate them entirely.

You could be forgiven for thinking that the eight erroneous assertions don't matter; assuming your system will have no problems is unquestionably convenient. But while you may not encounter these fallacies regularly, that's no reason to ignore them. On the contrary, knowing these constraints will help you design more robust and efficient distributed systems.

It will be fascinating to observe how technology develops and whether the fallacies of distributed computing still matter 10, 20, or 30 years from now. For the time being, they do. Because of this, developing trustworthy distributed systems remains a formidable technical challenge. Thinking otherwise is the biggest mistake you could make.
