Hunting down large-scale Node.js application performance bottlenecks

#devops #node

Here’s a case study from the rspective site reliability team. This time we want to share a story about one of our customers. To reduce infrastructure costs (above 60 000 EUR per month), the customer decided to give up AWS in favour of a bare-metal + Kubernetes solution. The move brought the expected benefits (far fewer machines hosting databases and cache, a much cheaper CDN) but, for unknown reasons, it also exposed a problem: increased resource consumption and growing latency on the machines hosting Node.js applications.

A threat of a partial return to expensive AWS hung over the business. So we decided to take a closer look at the problem from the inside.

We start profiling. The first step immediately brings the first tough nut to crack: the application we use locally to generate call graphs and flame graphs doesn’t work in production. We switch to manual V8 profiling, which means starting the node process with the --prof flag.

Unfortunately, downloading and processing the logs fails on Node 8.10. The cause? A bug. The same thing happens on 8.12; fortunately, 10.x lets us move on.
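For readers who haven't used manual V8 profiling, here is a minimal sketch of the workflow. The file name hot-path.js and the sumOfSquares function are hypothetical and exist only for illustration; the --prof and --prof-process flags are the standard Node.js ones, and the hex part of the log file name differs for every process.

```javascript
// hot-path.js: a hypothetical script, used only to illustrate the --prof workflow.
//
// 1. Record a V8 tick log (written to the working directory):
//      node --prof hot-path.js
// 2. Turn the log into a readable summary:
//      node --prof-process isolate-0xNNNNNNNNNNNN-v8.log > processed.txt
//    (replace the file name with the isolate-*.log file that step 1 produced)

// Deliberately CPU-bound function, so that it shows up in the profile summary.
function sumOfSquares(n) {
  let total = 0;
  for (let i = 0; i < n; i++) {
    total += i * i;
  }
  return total;
}

const result = sumOfSquares(1e6);
console.log(result);
```

The processed summary lists where ticks were spent (JavaScript, C++, GC), which is what you read to spot the functions eating CPU time.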

We analyze the logs to check the CPU peaks and find out what takes up most of the processor time. We have a suspect - it's lodash's “find” method. We optimize how it is used and that helps. Converting the data format from an array to an object is one of the remedies; it shaves 20-30 ms off the latency of several endpoints.
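To illustrate the kind of change involved, here is a small sketch, not the customer's code. The products data, the id field and the helper names are assumptions, and the built-in Array.prototype.find stands in for lodash's find to keep the example dependency-free.

```javascript
// Hypothetical data: the record shape and ids are assumptions for illustration.
const products = [
  { id: 'a1', name: 'Keyboard' },
  { id: 'b2', name: 'Mouse' },
  { id: 'c3', name: 'Monitor' },
];

// Before: a linear scan on every lookup, O(n) per call.
function findByIdScan(items, id) {
  return items.find((item) => item.id === id);
}

// After: build the index once, then every lookup is a direct property access.
function indexById(items) {
  const index = Object.create(null); // no prototype, so ids cannot clash with inherited keys
  for (const item of items) {
    index[item.id] = item;
  }
  return index;
}

const productsById = indexById(products);

console.log(findByIdScan(products, 'b2').name); // Mouse
console.log(productsById['b2'].name);           // Mouse
```

The index costs one pass over the data up front, so it pays off when the same collection is queried many times, which is typical for data shared across requests.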

Clearly, we aren't satisfied yet. Profiling brings more suspects. One of them is a piece of code that affects every request processed in the backend.

It turns out that another lodash function - cloneDeep, introduced over a year ago to provide immutability - now hurts both latency and CPU consumption at the current volume of data.

This problem was hard to catch because its impact on overall performance had been growing gradually. As usually happens during optimization, the long-sought trouble disappears after a simple change. In this case, it turns out to be replacing cloneDeep with Object.freeze.
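The idea can be sketched as follows. This is a minimal illustration, not the production code: sharedConfig and all function names are hypothetical, a JSON round trip stands in for lodash's cloneDeep to avoid a dependency, and the recursive deepFreeze helper is an addition for this sketch, needed because Object.freeze on its own is shallow.

```javascript
// Hypothetical shared object; its shape is an assumption for illustration.
const sharedConfig = { region: 'eu', limits: { perMinute: 100 } };

// Before: a defensive deep copy on every request. Its cost grows with the data size.
// (JSON round-tripping stands in for lodash's cloneDeep here.)
function getConfigCopy() {
  return JSON.parse(JSON.stringify(sharedConfig));
}

// After: freeze once at startup and hand the same reference to every request.
// Object.freeze is shallow, so nested objects are frozen recursively.
// The isFrozen check also stops the recursion on cyclic structures.
function deepFreeze(value) {
  if (value !== null && typeof value === 'object' && !Object.isFrozen(value)) {
    Object.freeze(value);
    Object.values(value).forEach(deepFreeze);
  }
  return value;
}

const frozenConfig = deepFreeze(sharedConfig);

function getConfigFrozen() {
  return frozenConfig; // no per-request allocation
}

// In strict mode, writing to a frozen object throws a TypeError
// (in sloppy mode the write is silently ignored).
function tryWrite(config) {
  'use strict';
  try {
    config.limits.perMinute = 1;
    return false;
  } catch (err) {
    return err instanceof TypeError;
  }
}

console.log(tryWrite(getConfigFrozen()));      // true
console.log(getConfigFrozen().limits.perMinute); // 100
```

The trade-off is that callers can no longer mutate their own copy; code that relied on that has to create a new object explicitly.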

We verify the patches on a single Kubernetes pod. The result - CPU consumption decreases by 30% and the overall average latency drops from 140 ms to 30 ms. We decide to roll the patches out to all production machines.

The final effect looks satisfactory. With the patches applied across 700 pods, average CPU consumption drops from 30% to 8% - which means we can reduce the number of pods.

By removing pods in batches of 100, we get down to the 200-pod mark with CPU consumption of 44% at peak time. That is a better result than the initial peak with 700 pods (~55%).

What have we achieved? We have freed up a lot of resources and gained space to handle more traffic and upcoming features. And, of course, the client does not have to go back to the expensive AWS.
