Recently I was reading a blog post on the Circuit Breaker pattern by Martin Fowler. I decided to summarise it here, based on the post and my own experience with the topic.
In distributed systems, a service often needs to call remote services to get the data it needs. If a remote service goes down and a lot of calls queue up until they time out, this can cause critical resource exhaustion and cascading failures across multiple systems. These situations can be avoided by implementing the circuit breaker pattern.
The circuit breaker pattern is simple: if there are too many errors (based on a chosen error threshold), the breaker trips and every new call to the remote service is rejected immediately instead of being sent. Calls resume once communication with the remote service is restored. While the breaker is enabled, it should check the service's health periodically to detect when the service is available again.
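To make the idea concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not Martin Fowler's implementation or any library's API: the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold` and `open_seconds` are hypothetical, and it simply counts failures since the last successful call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    # All names and default values here are illustrative assumptions.
    def __init__(self, failure_threshold=3, open_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.open_seconds = open_seconds
        self.clock = clock          # injectable so the breaker can be tested
        self.failures = 0           # failures since the last success
        self.opened_at = None       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_seconds:
                # Fail fast: do not touch the remote service at all.
                raise CircuitOpenError("circuit is open")
            # The open period has elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the breaker, or re-trip it if the trial call failed.
                self.opened_at = self.clock()
            raise
        # Success: close the breaker and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each remote call as `breaker.call(fetch_data, request)`, where `fetch_data` stands for whatever function performs the remote call, and treat `CircuitOpenError` as "try again later".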
Let's say we implement a circuit breaker around a remote service call. Whenever a request needs to be executed, it is sent to a queue. A consumer then reads the queue and executes the service calls, one at a time or across multiple threads, and sends the responses back.
In this setup we define a rule such as: if 20 service calls in the last 60 seconds time out or return network/gateway errors (5xx errors, 422, etc.), we enable the circuit breaker for 60 seconds.
This means that for the next 60 seconds no service calls are executed: the consumer does not consume any service requests from the queue.
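The rule above could be sketched roughly like this. It is a single-threaded illustration under my own assumptions: `FailureWindow`, `is_failure` and `consume` are hypothetical names, a timeout is represented by a `None` status, and failed requests are put back at the end of the queue for a later retry (the text does not say what happens to them).

```python
import time
from collections import deque


class FailureWindow:
    """Tracks recent failures and decides when consumption must pause."""

    def __init__(self, max_failures=20, window_seconds=60.0, open_seconds=60.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self.open_seconds = open_seconds
        self.clock = clock
        self.failure_times = deque()
        self.open_until = 0.0

    def is_open(self):
        return self.clock() < self.open_until

    def record_failure(self):
        now = self.clock()
        self.failure_times.append(now)
        # Drop failures that are older than the sliding window.
        while self.failure_times and now - self.failure_times[0] > self.window_seconds:
            self.failure_times.popleft()
        if len(self.failure_times) >= self.max_failures:
            # Threshold reached: enable the breaker for open_seconds.
            self.open_until = now + self.open_seconds
            self.failure_times.clear()


def is_failure(status_code):
    """Timeouts (None), 5xx and 422 count as failures, as in the rule above."""
    return status_code is None or status_code >= 500 or status_code == 422


def consume(queue, call_remote, window):
    """Make one pass over the queue, stopping as soon as the breaker is open."""
    processed = 0
    for _ in range(len(queue)):
        if window.is_open():
            break  # remaining requests stay in the queue
        request = queue.popleft()
        if is_failure(call_remote(request)):
            window.record_failure()
            queue.append(request)  # assumption: keep it for a later retry
        else:
            processed += 1
    return processed
```

With 50 consumer threads the window would also need a lock around `record_failure` and `is_open`; I left that out to keep the sketch short.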
In the background, a worker starts checking the health of the remote service every 5 seconds as soon as the circuit breaker is enabled. Once it gets a healthy response, it keeps monitoring for 20 more seconds. After that it either resets the circuit breaker or, if the service is still not stable, extends the circuit breaker timeout by another 60 seconds.
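Here is one way the worker's decision could look. This is my own reading of the steps above, so treat it as a sketch: `probe_until_decision` and `check_health` are hypothetical names, and I assume that "reset" requires the service to stay healthy for the full 20 seconds, while any unhealthy probe restarts that count.

```python
import time


def probe_until_decision(check_health, open_seconds=60.0, interval=5.0,
                         stable_seconds=20.0, clock=time.monotonic,
                         sleep=time.sleep):
    """Probe during one open period and return "reset" or "extend"."""
    deadline = clock() + open_seconds
    healthy_since = None
    while clock() < deadline:
        if check_health():
            if healthy_since is None:
                healthy_since = clock()  # first healthy response
            elif clock() - healthy_since >= stable_seconds:
                return "reset"  # stayed healthy long enough
        else:
            healthy_since = None  # unhealthy again: restart the count
        sleep(interval)
    # The open period ended without a stable run of healthy probes.
    return "extend"
```

The caller would run this in a background thread and, on "extend", keep the breaker enabled for another 60 seconds and call the function again.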
Benefit: Assume we have a million service requests in the queue and the consumer has 50 threads to process it. Without the circuit breaker, it would try to execute all the requests even if the remote service is down or returning errors. With the circuit breaker, as soon as the errors reach the threshold we hold the requests until the service is back online and responding normally. This keeps interactions between distributed services smooth and reduces the amount of error handling needed.
Sample Package to Check: I looked at one of the Python packages for circuit breakers. Though I have not used it personally, it seems to offer some functionality out of the box. I will definitely try it later.