Recently, we enabled an APM (Application Performance Monitoring) tool for our webshop. This gave us incredible insights into what's happening in every single request. We can now see time spent on retrieving things from databases, rendering HTML and middleware processing for a single request.
One thing immediately caught our attention: Most of our requests' time was spent communicating with Redis. We use Redis for caching and sharing data between instances in our cluster, for example for session information.
Looking at a couple of examples, we saw talking to Redis was mostly really fast, 9 times out of 10 we saw sub-millisecond response times. But then sometimes, it would take ~42ms to execute a command. This happened for both
SET type commands sometimes, and was always around 42ms. And it also happened when talking to a Redis server running on the same host.
Now that's what I call a weird problem.
Redis is single-threaded and all commands are executed in sequence. Naturally, our initial assumption was that there was some very big thing being retrieved or set. This could block the whole process for some time, and all other processes would have to wait. We took a look at the actual data being stored and retrieved in some slow cases, and it wasn't close to a size that would cause problems.
Redis provides a very extensive latency debugging guide. We went through this step by step (a couple of times) to find the cause of our latency issues.
Did we suffer from intrinsic latency? Yes, but not even close to the 42ms kind of latency we were seeing.
Did our network have latency? Well, we were seeing about the same numbers for localhost and remote communication. The cause can't really be in network latency.
Were these commands we were executing just really heavy? We enabled slowlog and looked at the results. We were still seeing 42ms latency from our application hundreds of times every minute, but there wasn't even a single command that took more than 10ms on the Redis side of things.
The same way we went through all other possible issues. The
transparent_hugepage setting was already disabled. There was no swapping going on. We had already disabled persistence (since this store is only used for caching). Even on a Redis server with only a couple of keys we could reproduce the issue, so expires were not affecting our latency.
Our conclusion: Redis wasn't the issue here. So what was?
We tried to reproduce the issue using another handy built-in tool, redis-benchmark. This way we could run any kind of command with arbitrary payload sizes. We tried to reproduce the cases we saw as closely as possible, but didn't get close to the latency we were seeing from our application.
Then we did the same, but now with the client that our application was using. This application is written in PHP, and we use Predis as our Redis client. With a small benchmark script, we could easily reproduce the issue.
So is Predis just slow? It's really just a small wrapper around a TCP connection to Redis. We managed to determine that actually in our benchmark script all time was spent waiting on a response from Redis. And every time almost exactly 42ms. So Redis is fast, Predis is fast, but for some reason we are waiting 42ms for local network communication.
This is when we found a really old Github issue referencing this problem. The issue is actually a combination of TCP optimizations that work against each other. One is Nagle's algorithm, which tries to reduce the amount of packets that need to be sent over the line by bundling packets. The other is Delayed acknowledgement, which tries to reduce the amount of acknowledgement packets that need to be sent by bundling those. The whole issue is explained really nicely in this blog post by Julia Evans.
Fortunately by now a fix has been implemented in Predis, so we could enable
TCP_NODELAY on our connections. This immediately improved latency.
The TCP issue that we ran into can happen on any TCP connection, not just with Redis. It's just that Redis is normally extremely low latency which makes it very obvious to see something is wrong. For libcurl, the default setting has actually been updated somewhat recently, so you shouldn't see problems there. For other HTTP/TCP clients, this might be something you want to investigate for your applications.
Enabling APM has given us a lot of insights into what our application is doing. If you haven't tried it yet, I would highly recommend it for any web application. There are lots of options out there, we've had good experiences in trying out Datadog APM.