Program Testing can be used to show the presence of bugs, but never to show their absence. -- Edsger Dijkstra
Fault Tolerant systems
A fault tolerant system is a system which is able to operate even when faults are present. Important to a fault tolerant system is a specification, which defines what it means for a system to operate without failure. The specification defines the expected behavior, like 99.99% availability.
Fault, Error and Failure are often considered synonymous, however there are differences between all of them. A fault is a latent defect is a system which can lead to an error when activated. An error is a manifestation of a fault.
Failure occurs when a system no longer complies with its defined specification. A system crashes when its not supposed to, or a system does not respond to a request, or a system computes incorrect results are all examples of failure.
Fault tolerance happens at execution time. What this implies is that a fault tolerant system has been designed in a way that the system will behave correctly and not have failures, even if errors occur during execution of the system.
Architectural Building Blocks
Units Of Mitigation
When building fault tolerant applications, it is paramount to identify and architecturally define the right granularity for units of mitigation. A unit of mitigation identifies the basic units for error containment and recovery. Choosing the right granularity of units of mitigation is paramount to building fault tolerant system.
If the unit of mitigation is too large it can lead to unavailability of the system as a whole. As an example, consider a monolith or a micro-service deployed as a homogeneous cluster. If each node is considered as a unit of mitigation, then not only is the whole node unavailable in the face of errors, but the complete cluster could be rendered unavailable. This could be since the same set of conditions can lead to the same errors manifesting across the cluster.
A better approach would be to evaluate finer levels of granularity. In some cases, these could be tiers within an application. An important design criteria is to have well defined interfaces, through which these tiers interact. Assuming that the front-end, middle-tier and the data access layers are identified as units of mitigation, the error containment and recovery could be handled at the tier levels especially at the peripheral interfaces.
Web-services could have units of mitigation as service endpoints. In a publisher-consumer architecture, the units of mitigation could be at the producer and consumer level. A request response architecture could have units of mitigation as the Request and response handlers. Identifying units of mitigation at the correct level of granularity would allow better system availability as well as make it easier to apply other patterns of error recovery. We will discuss some of these error recovery patterns later in this post.
Fault tolerant systems assume the existence of unreliable components which can fail at any time. However the system as a whole should continue to function even in the presence of these failures. This assumes that there is a certain degree of redundancy in the system. Typical distributed systems comprise of a n+m topology. This means there are n heterogeneous components each with m replicas/nodes.
In considering interactions with a fault tolerant system, we want to look at its behavior as a black box. From outside, requests are sent to the system for processing. These requests could be RPC or Web style requests. To be robust, these requests can be retried by the source. In classical fashion, once a request is issues and a timer has expired, the request is retried. From the perspective of the fault tolerant system, these requests should be idempotent, otherwise retires would occasionally result in duplicate work. As a design choice, it makes sense to break up a fault tolerant system into sub-systems which are each idempotent. By capturing state between the sub-systems and ensuring that state is kept across failure, ensures that the system as a whole would be fault tolerant.
In the world of distributed systems, idempotency translates to the fact that an operation can be invoked repeatedly without changing the result. Idempotence requires that the system does not duplicate work in the case of a failed response.
Assume a scenario where a client sends a request to a system to debit $100 from his account. Assuming an initial account balance of $1000, in the success case, the system would send a response with the result, $900.
However, there can be two failure cases which can lead to the system to behave erroneously.
Scenario 1 -- The system debits the account, however, the response times out.
Scenario 2 -- The node at which the request was sent fails before acting on the request, either due to a network partition or a crash.
Unique Identifiers for Unit of Work
In scenario 1, if the response times out, the client would retry the request. In such a case the system would erroneously debit the account twice, leading to the account balance of $800, although the client expected a single debit. This would fail the requirements of idempotency. Ideally we want Exactly Once semantics. That is if there are duplicate messages, the system should perform work just once. One of the ways of dealing with this scenario is to have Uniquefiers - Unique identifiers for unit of work.
Each request received by the system is associated with a uniquefier. When the system receives a request which has a uniquefier it has not seen before, it treats it as a new request. A retry would have a uniquefier which the system has seen before, in which case the system sends the previously computed response to the client. These uniquefiers can be maintained in some kind of a near cache for lookup by the system, to find out whether its a previously seen uniquefier or a new one.
Scenario 2 is simple. Since the system never processed the request, the client can retry the request. Irrespective of which node the request goes to, the system would behave correctly.
Commutative operations are operations which can be performed in any order without affecting the end result of the system. Commutative operations guarantee the same result in the event of out of order requests. Commutative operations are a common means of achieving CRDT's(Conflict Free replicated data types). CRDT's are used to ensure eventual consistency without the overhead of expensive synchronization or consensus. CRDT's come in two flavors
When a replica receives an update, it first updates its local copy. At some later time, it sends its complete state to another replica. Occasionally, each replica would send its complete state to some other replica. The receiving replica would apply a merge function to merge its local state with the state it just received. Similarly this replica also occasionally sends its state to another replica, so every update eventually reaches all replicas in the system. If the merge function is idempotent, commutative( and hence associative), then all the replicas are guaranteed to converge to the same value.
As an example, consider a simple merge function that takes the maximum value of a counter. In such a case even if local maximums would differ across replicas, eventually all the replicas would converge to a single overall maximum value of the counter.
In this approach, the replica does not send across the state, which can be huge, but just the operations it has executed as updates as a broadcast to all the other replicas. The other replicas are supposed to replay the updates locally. Since these are broadcasts, there might be multiple updates lets say u1 and u2 which are sent to two other replicas r1 and r2. The replicas might receive these updates in different orders, lets say r1 receives u2 before u1 and r2 receives u1 before u2. However if these operations are commutative, the replicas would converge to the same state.
We've used operations based CRDT's in e-commerce and order fulfillment systems where we used graphs to model the CRDT with vertices as start and end state, with the edge weight capturing the quantity for the operation. Duplicate operations would result in the same edge. Additionally, all replicas converged to the same digraph eventually.
Recovering from Errors
Error recovery continues execution even with a detected error by placing the system in a state that does not contain the error. This means resuming execution at a known place, allowing it to continue processing at least as well as it did before the error was detected. Error recovery consists of two main parts. The first part involves undoing the bad effects of the error. The second part involves creating an error free state in the system that can resume execution. Both of these must use a minimum amount of time in order to maximize availability. We discuss some of the error recovery mechanism's here in brief.
A Checkpoint is an incrementally saved state that facilitates rapidly restoring processing to a point at which the state was saved. Restoring from the saved state decreases the time required to return to the same state that existed at the time of the error. Instead of having to replay the entire sequence of events from the beginning, processing can resume quickly from an intermediate state. This decrease in recovery time increases the amount of time that the system is available for service, which is its availability
Rollback involves taking the system back to a known good state. If a Checkpoint is available then the system state can be restored to the one that it was in at the point to which the rollback will proceed. If there are no checkpoints then the rollback should take processing back to right before the last requests were started. Requests should be saved until completed so that if a rollback occurs the requests still exist and can be processed again.When the system state is rolled back some work will be done twice, once before the error and once afterwards. Care must be taken to ensure that side effects of this repeated work does not cause new problems.
Failover can be used in cases where the system has redundancy and if an active replica fails, processing can be continued on a standby. There are various topologies but we will assume a n+m redundancy topology. Ideally, to continue execution the non-active unit should be placed in the same state as the active unit at the time when the error occurs.Data that is used by the failing active unit must be persistent in some way so that the redundant unit that takes over can have access to it. Other issues with failover can manifest in requiring a coordinator to automatically route traffic to the new active replica rather than the old one.
Retries are a mechanism to handle either transient failures or leverage failover to continue handling requests. However retries should be limited since faults are deterministic; when a latent fault is given the same stimuli it will activate in the same way. Reprocessing stimuli from before an error can result in the error reoccurring. If the events to be processed are being delivered to a distributed system with multiple nodes in, for example, an n+m topology redundancy arrangement then the failure will march through the entire system, failing one processor after another, leading to catastrophic failure. Other important aspects are to buffer the time between retries, typically using some kind of exponential mechanism.
In this post, we have covered some ground around architectural considerations for fault tolerant systems, as well as some of the error recovery strategies. There are a lot of common patterns like System Monitors, Circuit Breakers, Bulkhead, Queue based Request management which are used actively, which I've not covered here.