Design for failure

Once you design your application as a collection of stateless microservices, there are a lot of moving parts, which means there is a lot of potential for things to go wrong.
Services can occasionally become unresponsive or even break, so you can't always rely on them being available when you need them. Ideally these outages are transient, but you don't want your application to fail just because a dependent service is running slow or the network has high latency on a given day. That's why you need to design for failure at the application level. Since failure is inevitable, you must build your software to withstand it and to scale horizontally.

Embrace failure

Failure will happen. That is why we must design for it.
Failure is the only constant. We must change our thinking from how to avoid failure to how to identify failure when it happens and what to do to recover from it. This is one of the reasons DevOps measurements moved from "mean time to failure" to "mean time to recovery." It's not about trying not to fail. It's about making sure that when failure happens, and it will, you can recover quickly.

Plan to be throttled: retry and degrade gracefully

Plan to be throttled. You pay for a certain quality of service from your backing services in the cloud, and the provider holds you to that agreement. Let's say you choose a plan that allows 20 database reads per second. When you exceed that limit, the service will throttle you: you get a 429 Too Many Requests response instead of 200 OK, and your application needs to deal with it.
In this case, you would retry. This logic needs to be in your application code. When you retry, you want to back off exponentially on failure. The idea is to degrade gracefully.
Also, if you can, cache where appropriate so you don't always have to make remote calls to these services if the result won't change.
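To make that concrete, here is a minimal sketch in Python, assuming a hypothetical JSON endpoint and the `requests` library; the in-memory cache, the retry limit, and treating `Retry-After` as seconds are illustrative choices, not a prescription:

```python
import time
import requests

_cache = {}  # naive in-memory cache keyed by URL

def fetch_with_throttle_handling(url, max_retries=5):
    """Fetch a URL, backing off when the service throttles us with 429."""
    if url in _cache:
        return _cache[url]  # cache where appropriate: skip the remote call entirely

    delay = 1
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code == 429:  # throttled: wait, then try again
            # honor the server's Retry-After header if present (assumed to be seconds)
            wait = int(response.headers.get("Retry-After", delay))
            time.sleep(wait)
            delay *= 2  # exponential backoff between attempts
            continue
        response.raise_for_status()
        _cache[url] = response.json()
        return _cache[url]

    raise RuntimeError(f"still throttled after {max_retries} attempts: {url}")
```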

Retry Pattern


This pattern allows the application to handle transient failures by transparently retrying failed operations when connecting to a service or network resource. I have heard developers say, "you have to deploy the database before starting my service, because it expects the database to be there at startup." That is a fragile design that is not suitable for cloud-native applications. If the database is not there yet, your application should wait patiently and retry. You must be able to connect and reconnect, as well as fail to connect and connect again. This is how you design robust cloud-native microservices.

The key is retry logic that backs off exponentially, with longer delays between each attempt. Rather than retrying 10 times in a row and overwhelming the service, you make a call and let it fail. You wait a second and retry. Then you wait 2 seconds, then 4 seconds, then 8 seconds. Each time you retry, you increase the wait a bit, until all the retries have been used up and you return an error condition. This gives the back-end service time to recover from whatever caused the failure, which could be just a temporary network delay.
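A minimal sketch of that backoff loop as a Python decorator; `connect_to_database` is a hypothetical stand-in for whatever call needs protecting, and catching `ConnectionError` is just one example of a transient failure worth retrying:

```python
import functools
import time

def retry_with_backoff(max_attempts=5, base_delay=1.0):
    """Retry a flaky operation, doubling the wait between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except ConnectionError:
                    if attempt == max_attempts:
                        raise              # retries used up: surface the error
                    time.sleep(delay)      # wait 1s, 2s, 4s, 8s, ...
                    delay *= 2
        return wrapper
    return decorator

@retry_with_backoff(max_attempts=5)
def connect_to_database():
    ...  # hypothetical connection call; replace with your real client
```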

Circuit Breaker Pattern


The circuit breaker pattern is similar to the electrical circuit breaker in your home. You may have experienced a tripped breaker: you did something that exceeded the circuit's power limit and the lights went out, so you took a flashlight down to the basement and reset the breaker to get the lights back on. The circuit breaker pattern works the same way. It is used to identify a problem and then do something about it to avoid a cascading failure, which is when one unavailable service causes the services that depend on it to fail in turn. With the circuit breaker pattern, you avoid this by tripping the breaker and routing calls to an alternate path that returns something useful until the original service is restored and the breaker closes again.

It works like this: as long as the circuit breaker is closed, everything flows normally. The breaker monitors failures up to a certain threshold. Once that threshold is reached, the breaker trips (opens), and all further invocations return an error immediately, without even calling the protected service. Then, after a timeout, the breaker enters a half-open state and tries to call the service again. If that call fails, the breaker goes back to the open state; if it succeeds, the breaker closes again and traffic flows normally.
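Here is a small sketch of that state machine in Python; the threshold, the timeout, and the choice to trip on any exception are assumptions made for illustration:

```python
import time

class CircuitBreaker:
    """Closed -> open after repeated failures, then half-open after a timeout
    to probe whether the protected service has recovered."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failure_count = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"   # timeout elapsed: allow one probe call
            else:
                raise RuntimeError("circuit open: failing fast without calling the service")

        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        else:
            self.failure_count = 0         # success: close the breaker again
            self.state = "closed"
            return result

    def _record_failure(self):
        self.failure_count += 1
        if self.state == "half-open" or self.failure_count >= self.failure_threshold:
            self.state = "open"            # trip: stop calling the failing service
            self.opened_at = time.time()
```

While the breaker is open, callers get an immediate error (or could be routed to a fallback) instead of waiting on a service that is known to be down.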

Bulkhead Pattern

The bulkhead pattern can be used to isolate a failing service and limit the scope of the failure. A common implementation uses separate thread pools: if the pool serving one dependency is exhausted by, say, a failed database connection, traffic to other dependencies is still served by their own pools. The pattern gets its name from the design of bulkheads on ships. The compartments below the waterline are separated by walls called "bulkheads," so if the hull is damaged, only one compartment fills with water; the bulkheads keep the water from spreading to other compartments and sinking the ship. Using the bulkhead pattern isolates consumers from cascading service failures, allowing the application to retain some functionality when a service fails: other services and functions continue to work.
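A minimal sketch of the thread-pool flavor of the pattern in Python; the pool sizes and the orders/payments service names are made up for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# One small pool per downstream dependency. If the orders service hangs and
# exhausts its pool, calls to the payments service still have threads available.
order_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix="orders")
payment_pool = ThreadPoolExecutor(max_workers=5, thread_name_prefix="payments")

def call_order_service(order_id):
    # hypothetical downstream call; imagine it can hang or time out
    return f"order {order_id} fetched"

def call_payment_service(payment_id):
    return f"payment {payment_id} fetched"

# Submitting through separate pools keeps one failing dependency from
# starving the threads the rest of the application needs.
order_future = order_pool.submit(call_order_service, 42)
payment_future = payment_pool.submit(call_payment_service, 7)
print(order_future.result(), payment_future.result())
```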

In general, failure is inevitable, so we design for failure rather than trying to avoid it. Developers need to build in resilience so applications can recover quickly. The retry pattern retries failed operations, the circuit breaker pattern avoids cascading failures, and the bulkhead pattern isolates failed services to limit the scope of a failure.

Thanks for reading.
