Darius Juodokas

"Avoid Round-Robin in PROD" or "The tale of the bad raspberry"

Why do I care?

It's all good while it's good. When things go south, you might want your balancing mechanism to reliably keep the show on the road. Round-robin works in many cases, but there are cases when RR will slow your service down to a complete halt. And adding more instances won't do you any good. Believe me, this is not a corner-case :)

What is round-robin

Round-robin is an algorithm that decides which item to choose next from a list. It's the second simplest algorithm there is (the simplest would be "always select the *N*th item", and the third would be "always choose a random item"). Suppose you have 5 raspberries lined up on your table. These are freshly picked, red, big, juicy, sweet raspberries. Oh, I know you want them! Which one will you take first? Which one after that? And then?
Willingly or not, you will apply one algorithm or another to the sequence in which you pick those raspberries off the table and om-nom-nom 'em real nice. You can take them randomly, sequentially (left-to-right or right-to-left), middle-out, out-middle, biggest first, or by any other algo.

Since RR doesn't really apply to items that aren't reusable, let's assume I'm a generous person. Once you take one of the 5 raspberries, I put another one in the place of the one you just took. Sounds good?

RR looks like this. You assign indices to your raspberries 1 to 5. Say the left-most is 1 and the right-most is 5.

  • you take the 1st raspberry [and I put another one in its place]
  • you take the 2nd raspberry [and I put another one in its place]
  • you take the 3rd raspberry [and I put another one in its place]
  • you take the 4th raspberry [and I put another one in its place]
  • you take the 5th raspberry [and I put another one in its place]

and then round we go

  • you take the 1st raspberry [and I put another one in its place]
  • you take the 2nd raspberry [and I put another one in its place]
  • you take the 3rd raspberry [and I put another one in its place]
  • you take the 4th raspberry [and I put another one in its place]
  • you take the 5th raspberry [and I put another one in its place]

and then round we go

  • you take the 1st raspberry [and I put another one in its place]
  • you take the 2nd raspberry [and I put another one in its place]
  • ...

See, it's a very simple algorithm. You always take raspberries in the same sequence, one-by-one, until you reach the last one. And then you begin the same sequence anew.
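
If the berries were backends and you were the balancer, the whole algorithm fits in a few lines of shell. This is just a toy sketch with made-up backend names, not the code of any real load balancer:

#!/usr/bin/env bash
# Toy round-robin picker: always hand out the next item, wrap after the last one.
backends=(berry1 berry2 berry3 berry4 berry5)   # hypothetical backend names
i=0

next_backend() {
  echo "${backends[$i]}"
  i=$(( (i + 1) % ${#backends[@]} ))            # 1→2→3→4→5→1→2→...
}

for request in $(seq 1 12); do
  echo "request $request -> $(next_backend)"
done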

Scale up the example

Scaling the berry eaters (consumers)

Let's enhance our example. Now you are not eating them yourself. You are in a room full of people and they keep on coming for the raspberries. A person approaches your desk, you pick the next raspberry and give it to that person. I replace the berry on your desk with another one and the person walks away.

Nothing really changes, right? You can still apply the RR algorithm. You know which berry you took last and you know which one you'll pick next.

Scaling the berry distributors (producers)

Now the example becomes a bit more complex. We're scaling YOU. To make things go faster, we're now assigning 2 more people to distribute the raspberries. Now there are 3 people working in a round-robin manner. This doesn't really change anything; I'm scaling YOU just to make the example more realistic. If you were alone, it would be difficult for you to serve several people at the same time. Now that there are 3 of you, multitasking becomes very realistic.

However, you are still picking the same 5 raspberries. You don't have 3 different sets of berries. You only have one set.

One bad berry

I have made a decision. Every time you pick a berry at spot #3, I'll no longer replace it. Instead, you will have to come and take it from the bucket yourself and put it on the table. This will slow you down considerably. If you could serve 2 people in 1 second before, now you'll find it hard to serve 1 person in 5 seconds. Your throughput dropped 10-fold: from 2pps to 0.2pps (people-per-second).

But that's alright since there are 3 of you and there still are 4 berries "cached" on the desk all the time!

Not really though...

The halt

Do not expect the three of you to always pick berries in the same order. One of you is faster, another one is slower. You will work at different paces. And the people - they will come randomly: sometimes the queue will be 20 peeps long, other times there will be only 2 folks, both of whom you can handle at the same time (there are 3 of you, the distributors).

And this is the reason why at times 2 of you (or even all 3) will be handing over the 3rd berry at once. While you were running towards me to get that berry, the other 2 folks completed the full cycle (4→5→1→2→3), maybe even twice, and now they are running after you, to get their own 3rd berry from me.

What's happening at the client side of the desk? People are waiting. They are getting anxious, because there are 3 of you, there are 4 berries on the table and you are running around to get that 1 berry that isn't there.

Then you all come back, serve the 3rd berry, complete the cycle and again you go running. And again people are waiting.

Raspberries in PROD

They can be anything you are iterating over in an RR manner: DNS records, servers in a pool, next hops in a load balancer, etc. If you have multiple consumers and a set of items you are serving in an RR pattern, and one of the items is considerably slower, all the consumers will notice the slowdown. And every consumer will be slowed down equally, because the slowdown is not alleviated by the number of items in the list. If a consumer gets the slow instance - it gets the slowdown.

The only thing that adding more items alleviates is the frequency: how often the slowdown will occur.

If raspberries were servers

Suppose there are 60 web servers in the pool. Normally a webserver responds in 100ms. Great! One of the 60 servers' MEM% has reached 100% and it's currently swapping. CPU% immediately sky-rockets to 100% too. It is still capable of serving requests, but... very, VERY slowly. It now takes some 30sec to serve 1 request. And your liveness probe timeout is 40 seconds, so the server still responds to health check polls on time. It can still accept new requests.

What happened to the raspberries happens here too. All the requesters eventually iterate over all the 60 well-working servers and end up at the one that's terribly slow. Since a browser makes many calls to load a single webpage, it's very easy to complete 1 iteration over the set of 60 servers! And if your webpage fetches 120 items to load the page, the same page load will probably hit the bad server twice (if no one else is using the system). Hit that server once and you'll wait 30+ seconds for the page to load. Hit it twice and you'll wait 60+ seconds. And so on.
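
If you want to feel that effect without spinning up 60 servers, a toy shell sketch is enough. The latencies below are made up; item #3 plays the bad berry. Every full round-robin pass is dominated by the slow item, no matter how fast the others are:

#!/usr/bin/env bash
# Toy demo: 5 backends served in RR order; #3 answers in 5s, the rest in 0.1s.
latencies=(0.1 0.1 5 0.1 0.1)    # hypothetical per-request latencies, in seconds

for round in 1 2 3; do
  start=$SECONDS
  for t in "${latencies[@]}"; do
    sleep "$t"                   # "serving" one request
  done
  echo "round $round took $(( SECONDS - start ))s"   # ~5s instead of ~0.5s
done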

How many users are willing to wait 30+ seconds for your webpage to load?

Why am I telling you this?

Because we've stepped on that very landmine. As I said in the beginning, RR is a very simple and good algorithm as long as everything works. Heck, it's even a great algorithm!

But it takes ONE bad berry to halt your RR for good. If you can remove the bad berry from the items set - great! RR is now running all fine again. But if you can't, or it's still kicking and doesn't want to go away...

We are running an application scaled rather widely horizontally. Think hundreds of instances. And we are running load tests generating solid amounts of requests in a very short period of time. During one of the tests, I decided to take a heap dump of one of the JVMs in the cluster. A heap dump halts all the threads for half a minute or so but doesn't kill the JVM. And then I noticed the phenomenon: even though there were hundreds of other servers working in parallel and I was only freezing one of them, the load on ALL the servers dropped completely (from ~80% CPU to ~5% CPU). So freezing a single server froze the entire application. For good! Now, what if I had taken that heap dump in the PROD cluster? Users' browsers would stop loading the page.

Another phenomenon: heavy workers attract more work

The problem I had

I recall now why I was taking that heap dump. That JVM's memory usage was higher than on other JVMs in the cluster. And the CPU% was higher too. It didn't make a lot of sense: all the instances are the same, why is THAT one getting more load?

It looked like this:

[Image: initial problem graph]

Notice how the red server load is significantly higher than on the other servers. Then it drops and another server immediately takes over. Now look at the beginning of the test: the load is ramping up on all the servers, and then that red server goes rogue and the load drops on all other servers. While the load generators keep on generating the same amount of load.

I looked everywhere: LB distribution, proxies' access logs, application logs, configurations, thread dumps, GC logs, stickiness... Nothing was off. According to the setup, all the instances should be receiving the same amount of workload. Yes, yes, some requests are heavier than others. But we are talking about tens of thousands of parallel requests and hundreds of servers in the pool. I'd expect more than one server to exhibit that behaviour!

I thought hard. I was modelling different request paths in my head for several days, and then it hit me: what if the CPU% is the cause and not the effect? Let's try it out. I ran several loops like the one below on one of the well-performing servers:

while :; do :; done   # an endless no-op loop that pins one CPU core at 100%

to increase the cpu% usage and effectively slow down the throughput of that JVM. And it worked. I got control of the phenomenon: I found a way to break an instance. See here:

[Image: while loop slowing down the JVM]

I could deliberately make any of the instances go bad. This is a step in the right direction. Does this only work with CPU%? Let's try a SIGSTOP.

[Image: effects of SIGSTOP]

Oh my... That I did NOT expect. You can clearly see where I issued a SIGSTOP ON A SINGLE INSTANCE. All the instances halted. The SIGSTOP was followed by a short pause of several minutes and then a SIGCONT, to keep the app alive.
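
For the record, the whole experiment boils down to two signals sent to a single JVM process (the JVM_PID variable is a placeholder for whichever instance you pick):

kill -STOP "$JVM_PID"   # freeze the process: all threads stop, the process stays alive
# ...wait a few minutes and watch the load drop on ALL the instances...
kill -CONT "$JVM_PID"   # unfreeze it and let the cluster recover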

As you see, freezing (not KILLING, not restarting, not shutting down, not removing from the network, but freezing) a single instance in a cluster halts all the other instances. It doesn't happen immediately - there's a delay of several seconds (it could be minutes: the more requests are coming in, the shorter the delay will be). And it doesn't really help to have thousands of instances in the pool... The result will be the same. Just the delay might be slightly longer.

Why did I have that problem?

It might seem like there are two problems in my case above, but... really, it's the same one. It's round-robin load balancing and a bad raspberry in the cluster.

Single node freezes the whole cluster

Remember the raspberry example? If I slow down one node (or if I freeze you when you try to take raspberry #3), the whole flow eventually lines up at the slower node and all the requests slow down at that point. If the node is halting requests, then all the requests will halt. They won't be rejected or dropped - they will just... stop. For as long as the bad node is standing still.

Heavy worker steals work

Now for the initial problem. It was a head-scratcher. The problem was that the server was slightly slowing down when processing one particular kind of request. Let's call that request POST_X. Processing a POST_X request caused 100% cpu% usage for a very short time, which slowed all the transactions a tiny little bit. However, that little spike slowed the JVM just enough for another two POST_X requests to reach that server. Now the 100% CPU usage lasted twice as long. Which caused another bunch of POST_X requests to get trapped in that server. And so on and so forth. Eventually, that instance was doing nothing but processing POST_X requests (and a few others). It's easy to imagine that the CPU% was 100% all the time. It became a bad raspberry. And because it was a slow server, it eventually attracted all sorts of requests, not just POST_X. This explains why all the other servers lost their load and that one bad berry attracted most of the requests sent to the cluster.

There was only one bad berry. Other instances also had to process POST_X requests and they also used to slightly spike their cpu%. However, the server that went bad first acted as a bottleneck - requests got held in that single server and fewer requests-per-second reached all the other servers. Meaning that on the other servers the POST_X-induced JVM pause was not long enough for another POST_X request to arrive before the peak ended (there were fewer requests floating around, as e2e flows were stuck in the bad server).

See the change of winds on the graph? Sometimes the bad berry jumps over to another server. I haven't checked that, but I assumed it could be JVM GC on one (or several) of the good servers kicking in, holding some of the requests (those e2e flows didn't iterate over to the current bad berry) and giving the bad berry some time to cope with its current workload. As it did, its CPU% dropped. As soon as the GC ended, someone had to become the bad berry. If we have 100 nodes in the cluster, every node has roughly a 1/100 probability of becoming the bad berry. It could be the same node, it could be the one that GCed, it could be any other node.
However, that's just a hypothesis I haven't reliably confirmed (nor refuted) with factual data.

The berry became bad because of POST_X requests accumulating, but soon enough there were lots of other requests jamming that same CPU. POST_X was a firestarter.

The problem was fixed by changing the application code and making the POST_X less CPU-intensive (something about drools and a negative one.. don't ask).

What's better than RR?

Well, RR is an amazingly simple and easy algorithm and it seems to be just enough for nearly every use-case. However, that's only true as long as things work well.

In my particular case, it would be better to load balance between nodes by applying either a least_cpu_load policy or, even better (my favourite), a least_active_connections policy. These are more complex, more sensitive policies and require more precise maintenance, but they should prevent your cluster from halting completely if one of the nodes freezes.
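
Most proxies and balancers ship such a policy out of the box (for example, nginx has least_conn in an upstream block, HAProxy has balance leastconn). The idea itself is tiny; here's a hedged sketch in shell, where conns() is a stand-in you'd replace with a real metric source (proxy stats, an exporter, etc.):

# Sketch of a least-active-connections pick. conns() is a stand-in stub.
conns() { echo $(( RANDOM % 10 )); }    # replace with the real active-connections count

backends=(10.0.0.1 10.0.0.2 10.0.0.3)   # made-up addresses

pick_backend() {
  local b c best="" best_count=""
  for b in "${backends[@]}"; do
    c=$(conns "$b")                     # active connections on this backend right now
    if [ -z "$best_count" ] || [ "$c" -lt "$best_count" ]; then
      best="$b"; best_count="$c"
    fi
  done
  echo "$best"
}

pick_backend                            # prints the least-loaded backend at this moment

The nice property: a frozen or slow node hoards its in-flight connections, its counter keeps growing, and the balancer naturally stops sending it new work.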

If you want (or have) to stick with RR, make sure you have more than adequate liveness monitoring. For instance, if a server fails to respond to a h/c request in 5 seconds - remove it from the request-receivers pool and let it drain its workload while other nodes handle all the new requests. When that node manages to respond to h/c requests in under 5sec - put it back online. If the node responds to a h/c with an error code, or an ECONNRESET, or anything else erroneous - remove that node from the request-receivers pool and kill it as soon as its workload drains (if it ever does).

Kubernetes does its routing using iptables, and iptables (netfilter) has a concept of conntrack. Removing a node from the request-receiving pool is as simple as updating the iptables rule matching its IP from

-j ACCEPT

to

-m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

which will allow the server to keep accepting traffic on already established connections, but not any new connections. The server will also still be able to send responses.
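
Assuming the node's ACCEPT rule sits at a known position in a known chain (the chain name, rule number and port below are purely illustrative, not anything Kubernetes-specific), the swap is a single iptables -R (replace) call:

# Take the node out of the new-connections pool:
iptables -R INPUT 1 -p tcp --dport 8080 \
  -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT

# Put it back online once it behaves again:
iptables -R INPUT 1 -p tcp --dport 8080 -j ACCEPT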

