If at first you don't get an answer...

#aws #architecture #microservices #distributedsystems

give up quickly and ask someone else. That's my favorite microservices retry policy. To understand what works and doesn't work, we need to think about how distributed systems of services and databases interact with each other, not just a single call. Like my last post on why systems are slow sometimes, I will start simple, pick out key points, and gradually explain the things that will kill your distributed system, if you don't have timeouts and retries setup properly. It's also one of the easiest things to fix, once you understand what's going on, and it generally doesn't need new code, just configuration changes.

A bad timeout and retry policy can break a distributed systems architecture, but is relatively easy to fix.

I often find the timeout and retry policy across an application is set to whatever the default is for the framework or sample code you copied when starting to write each microservice. If every microservice uses the same timeout, this is guaranteed to fail, because the microservices deeper in the system are still retrying when the microservices nearer the edge have given up waiting for them.

Don't use the same default timeout settings across your distributed system.

This causes an avalanche of extra calls, work amplification, that is triggered when just one microservice or database deep in the system slows down for some reason. I used to work on a system that fell over at 8pm every Monday night because the monolithic backend database backup was scheduled then, and database queries slowed down enough to trigger a retry storm that overloaded the application servers. The simplest way to reduce work amplification is to limit the number of retries to zero or one. I've seen defaults that are higher, and seen people react to seeing timeouts by increasing the number of retries, but this is just going to multiply the amount of extra work the system is trying to do when it's already struggling. Use zero retries in more deeply nested microservice architectures that will retry closer to the user.

Limit the number of retries to zero or one, to minimize work amplification.

This should be a fairly easy configuration change to implement, and a good checklist question that should be answered for your inventory of microservices is to ask how the timeout and retry settings are set, and what is needed to change the configuration. It's really nice if the settings can be changed dynamically and take effect in a few seconds. That provides a mechanism that may be able to bring a catatonic system back to life during an outage. Unfortunately in many cases the configuration is set once when the application starts, or could be hard coded in the application, or is so deep inside some imported library that no-one knows that a timeout setting exists, or how to change it.

Discovering and documenting current settings and how to change timeouts and retries can be a pain, but it's worth it.

Many people don't realize that there are two different kinds of timeout, some coding platforms and libraries bundle them together, and others expose them separately. The best analogy is to consider the difference between sending a letter, and making a phone call. If you send a letter, with an RSVP, you wait for a response. Maybe after a timeout you send another letter. It's the simplest request/response mechanism, and UDP based protocols like DNS work this way. A phone call is different because the first step is to make a connection by establishing a phone call. The other party has to pick up the phone and acknowledge that they can hear you, and you have to be sure you are talking to the right person. If they don't pick up, you retry the connection by calling again. Once the call is in progress, you make the request, and wait for the other party to respond. You can make several requests during a call. TCP based protocols which underlie many APIs, work this way. If the phone line goes dead, you have to call again, and establish a new connection.

Be clear about the difference between connections and requests for the protocols you are using.

The connection timeout is how long you wait before giving up on getting a connection made to the next step in your request flow. The request timeout is how long you wait for the rest of the entire flow to complete from that point. The common case is to bundle everything into the request at the application level, and have some lower level library handle the connections. Connections may be pre-setup during an initialization and authentication phase or occur when the first request is made. Subsequent application requests could cause an entire new connection to be made, or a keep-alive option would keep it around to get a quicker response for the next request.

Dig deep to find any hidden connection timeout and keep-alive settings.

How should you decide how to set the connection and request timeouts?
While connection timeouts depend on the network latency to get to the service interface, request timeouts depend on the end-to-end time to call through many layers of the system. The closer to the end user, the longer the request timeout needs to be. Deeper into the system, request timeouts need to be shorter. For simple three-tier architectures like a front end web service, back end application service and database it's clear what order things are called in, and nested or telescoped timeouts can be setup for each tier. The timeouts must be reduced at every step as you go from the edge to the back-end.

Edge timeouts must be long enough, and back end timeouts short enough, that the edge doesn't give up while the back end is still working.

Web browsers and mobile apps often have a default timeout of many seconds before they give up. However humans tend to hit the refresh button or go and look at something else if they are staring at a blank screen for too long. Some web site data says you should target 2-4 seconds for page load time. However they are waiting in an app for something like a movie to start streaming, then users will wait a bit longer, maybe up to 10 seconds. If you have too many retries and long timeouts between your services, it's easy to add up to more than a few seconds, and while your system is still working to get a response, the end user will ignore the result as they have already given up and sent a new request. The best outcome is that your application returns before they retry manually, with a message that tells them you gave up, and asks if they want to try again. This minimizes the chance that you'll get flooded by work amplification.

For human interactions, if you can't respond within 10s, give up and respond with an error. Try to keep web page load times below 4s.

One question is what to do after giving up on a call. This can be problematic, as described in the Amazon Builders Library piece Avoiding fallback in distributed systems, the fallback code to handle problems is often poorly tested, if at all. A common problem is that a system that times out calls elsewhere will crash, return bad or corrupted data, or freeze.

Injecting slow and failed responses from the dependencies of a service is an important chaos testing technique.

In my last post I talked about how end to end response time is made up of residence times for every step along the way, and those residence times are a combination of waiting time in a queue and service time getting the work done. During a timeout, there is an increase in wait time, and you can think of this as an additional timeout queue, where work is being parked that can't move forward. This can clog up systems by creating very long queues of stale requests. Processing them in order from the oldest first is often a bad policy, as you may never get to do work that someone is still actually waiting for. A better policy is to discard the entire queue of requests waiting for timeout when it hits a size limit. Each retry adds to service time, as the request is processed again.

If there is only one instance of a service endpoint or database, you have no-where else to go. Attempting to connect to it repeatedly and quickly is unlikely to help it recover, so in this case you need to use a back-off algorithm. This causes it's own problems, so should only be used where there is no alternative. Some systems implement exponentially increasing back-off, but it's important to use bounded exponential back-off with an upper limit, otherwise you are creating a big queue which will take too long to get connected, and upstream systems will already have given up waiting. This is also sometimes called capped exponential back-off. Adding some randomness to the back-off is a better policy, as it will disperse the thundering herd of clients that could end up backing off and returning in a synchronized manner. Deterministic jittered backoff is a way to give each instance a different back-off, but with a pattern that can be used as a signature for analysis purposes. Marc Brooker of AWS describes this technique in Timeouts, Retries and Backoff with Jitter.

Only if you have to, use random or jittered back-off to disperse clients and avoid thundering herd behaviors.

Consider a deeply nested microservice architecture, where the request flow depends on what kind of request is being made, and the ideal request timeout ends up being flow dependent. If we can pass a timeout value through the chain of individual spans in the flow then dynamic timeouts are possible. One way to implement this is to pass a deadline timestamp through the flow. If you can't respond within the deadline, give up. This requires closely synchronized clocks across the system, and may be a useful technique for relatively large timeouts of tens of seconds to minutes. Some batch management systems implement deadline scheduling for long running jobs. For short latency microservice architectures a better policy would be to decrement a latency budget for each span in the flow, and give up when there's not enough budget left to get a result, but there is still time to respond back up and report the failure cleanly before the user times out at the edge. The question is how much latency to deduct at each stage, and to solve this properly would need some extra instrumentation based on analysis of flows through the system.

Dynamic request timeout policies are needed for deeply nested microservices architectures with complex flows.

Connection timeouts depend on the number of round trips needed to setup a connection (at least two, more for secure connections like HTTPS), and network latency to the next service in line, they should be very short, a small multiple of the latency itself. Within a datacenter or AWS availability zone, round trip network latency is generally around a millisecond, and a few milliseconds between AWS availability zones. Sometimes people talk about one-way latency numbers, so be clear what is being discussed, as a round trip is twice as long.

Use a short timeout for connections between co-located microservices, whether direct, or via a service mesh proxy like Envoy.

Round trip latency between AWS regions in the same continent is likely to be in the range 10-100ms, and between continents, a few hundred milliseconds. Calls to third party APIs or your own remote services will need a much higher timeout setting. Your measurements will vary, but will be limited by signals propagating at less than the speed of light, about 300,000km/s, which is about 150km round trip per millisecond. Processing, queueing and routing the network packets etc. will always make it much slower than this, but unfortunately the speed of light is not increasing so this is a fundamental latency limit!

Use a longer timeout for connections between AWS regions, and calls to third party APIs.

If you set your connection timeout to be shorter than the round trip, it will fail continuously, so it's common to use a high value. This works fine until there's a problem, then it magnifies the problem, and causes requests to timeout when they could be succeeding over a different connection. It's much better to fail fast and give up quickly on a connection that isn't going to work out. It would be ideal to have the connection timeout be adaptive, to start out large, and to shrink to fit, for the actual network latency situation. TCP does something like this for congestion control once connections are setup, but I don't know of any platforms that learn and adapt to latency, and invite readers to let us know in the comments if they do, and I'll update this post.

Set connection timeouts to be about 10x the round trip latency, if you can...

The default connection retry policy is to just repeat the call. Like phoning someone over and over again on the same number. However, if that person has a home number and a mobile number, you might try the other number after the first failure, and increase your chance of getting through. It's common to have horizontally scaled services, and if you've built a stateless microservices architecture (that doesn't use sessions to route traffic to locally cached state), you should try to connect to a different instance of the same service, as your default connection retry policy. Unfortunately many microservices frameworks don't have this option, but it was built into the Netflix open source service mesh released in 2012. NetflixOSS Ribbon and the Eureka service registry implemented a Java based service mesh, Envoy implements a language agnostic "side-car" process based service mesh with a configurable connection timeout, and defaults retries to one, with bounded and jittered backoff.

If you have horizontally scaled services, don't retry connections to the same one that just failed.

There's a high probability that an instance of a microservice that failed to connect will fail again for the same reason. It could be overloaded, out of threads, or doing a long garbage collection. If an instance is not listening for connections because it's application process has crashed or it's hit a thread or connection limit, you get a fast fail and a specific error return, as the connection will be refused immediately. It's like an unobtainable number error for a phone call. Immediately calling again in this case is clearly pointless.

Fast failures are good, and don't retry to the same instance immediately if at all possible.

I've seen some microservice based applications behaving in a fragile or brittle manner, that collapsed when conditions weren't perfect. However after fixing their timeout and retry policies along these lines they became stable and robust, absorbing a lot of problems so that customers saw a slightly degraded rather than an offline service. I hope you find these ideas and references useful for improving your own service robustness.

Thanks to Seth and Jim at Amazon for feedback, corrections and clarifications.

Photo taken by Adrian in March 2019 at the Roman Amphitheater, Mérida, Spain. A nice example of a low latency, high fan-out, communication system.