Do you also feel the tension of the cover image? It is time for a battle again. 🥊 After I published the first part of my comparison, I was overwhelmed by the amount of feedback I received, whether as comments on my post or in discussions on Twitter and LinkedIn.
The fact that the initial post triggered a lot of inspiring discussions is very valuable. While reading through your feedback, it became obvious that a second part was needed.
Much of the feedback was about optimizations for AWS Lambda, and people were curious how these affect performance in comparison to our state machine. We will also take a closer look at costs to get a more complete view of how the services differ.
Here we are.
As in the first part, all experiments are triggered using Apache Bench with the following parameters:
ab -n 15000 -c 1 https://hash.execute-api.eu-central-1.amazonaws.com/.../
-n configures the total number of requests that are triggered - in our case 15,000
-c is the number of concurrent requests - in our setup 1
⚠️ IMPORTANT: keep in mind that the results from Apache Bench are not 100% accurate. The measured throughput depends on the hardware and network capabilities of my local workstation. For upcoming benchmarks, I am considering using something like CloudShell.
But Apache Bench gives some very early feedback and potential indications. Hence we use these results in combination with the Lambda duration and the Step Functions execution duration.
So what is the goal of our upcoming experiments? We want to apply some optimizations to our Lambda function with a clear focus on decreasing latency. Based on the feedback I got, there were two main approaches for optimization:
- Reusing downstream HTTP connections by activating keep-alive settings.
- Improving overall execution performance by increasing the allocated memory.
For short-lived operations, such as writing to and reading from S3 in our case, the latency overhead of setting up a TCP connection might be greater than the operation itself. To activate HTTP keep-alive, you simply have to set an environment variable in your Lambda function configuration.
Environment:
  Variables:
    AWS_NODEJS_CONNECTION_REUSE_ENABLED: 1
Let us deploy the change and start our first test, beginning with the Apache Bench reports. The complete reporting is available on GitHub. Here are some highlights:
- The Lambda function was able to process all requests 43 seconds faster compared to the state machine.
- Both the state machine and the Lambda function were able to process roughly 7 requests per second.
- The mean time per request for the Lambda function was 131ms and 134ms for state machine.
Looking at these results, this little tweak of activating keep-alive helped a lot to speed up the Lambda function. In terms of end-to-end performance and latency, both solutions are now very close to each other.
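Because the benchmark runs with a concurrency of 1, throughput and total runtime follow directly from the mean time per request. Here is a minimal sketch of that arithmetic, using the rounded means reported above. The function names are purely illustrative and not part of Apache Bench.

```python
TOTAL_REQUESTS = 15_000  # value passed to ab via -n


def requests_per_second(mean_ms: float, concurrency: int = 1) -> float:
    """Throughput implied by the mean time per request."""
    return concurrency * 1000.0 / mean_ms


def total_seconds(mean_ms: float,
                  total_requests: int = TOTAL_REQUESTS,
                  concurrency: int = 1) -> float:
    """Total benchmark runtime implied by the mean time per request."""
    return total_requests * mean_ms / 1000.0 / concurrency


# Rounded means from the first test run
lambda_mean_ms = 131.0
state_machine_mean_ms = 134.0

lambda_rps = requests_per_second(lambda_mean_ms)
state_machine_rps = requests_per_second(state_machine_mean_ms)
gap_seconds = total_seconds(state_machine_mean_ms) - total_seconds(lambda_mean_ms)

print(f"Lambda: {lambda_rps:.1f} req/s, state machine: {state_machine_rps:.1f} req/s")
print(f"Implied runtime gap: {gap_seconds:.0f} s")
```

With the rounded means, the implied gap is about 45 seconds, which is in line with the 43 seconds from the report. The small difference comes from rounding the mean values to whole milliseconds.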
Let us take a closer look into CloudWatch and X-Ray to confirm the observations.
The average execution time of the state machine is 46.4ms, and Lambda performs with 49ms.
Things are still looking interesting here. The average Lambda function duration still shows some ups and downs during the test, while the duration of the state machine is stable. Both solutions show some cold-start behavior, although the state machine seems to need less time to become "warm".
But in total the impact on the Lambda function performance is very impressive compared to the results in the first part.
But the question is: how much memory does my Lambda function need? The range is quite large, from 128 MB to 10,240 MB.
There is an awesome open source tool called "Lambda Power Tuner" that helps you to determine your memory settings based on different strategies like speed, cost or balanced.
If you use "cost", the state machine will suggest the cheapest option (disregarding its performance), while if you use "speed", the state machine will suggest the fastest option (disregarding its cost). When using "balanced", the state machine will choose a compromise between "cost" and "speed".
In my case the "Lambda Power Tuner" suggested 256 MB as "Best cost" and 2048 MB as "Best Time".
Awesome, now we have a good start for the final tests.
As we aim to reduce latency, let us first start with the proposed "Best Time" setting of 2048 MB memory and let us have a look at the apache-bench metrics:
- The Lambda function was able to process all requests 81 seconds faster compared to the state machine.
- Both the state machine and the Lambda function were able to process roughly 8 requests per second.
- The mean time per request for the Lambda function was 121ms and 127ms for state machine.
Compared to our first test, there is some improvement, but on average it seems to be marginal. Let us try to get some more insights using CloudWatch and X-Ray.
For the most part, the duration of the Lambda function is just below the execution time of the state machine.
The average execution time of the state machine is 45.1ms, and Lambda shines with 41.8ms.
What would happen, if we set our memory configuration to the setting considered as "Best cost"? Let us review the results in the next chapter.
Again, in short, our Apache Bench metrics:
- The Lambda function was able to process all requests 155 seconds faster compared to the state machine.
- The state machine was able to process 7.5 requests per second, while the Lambda function processed 8 requests per second.
- The mean time per request for the Lambda function was 122ms and 132ms for state machine.
The CloudWatch and X-Ray numbers confirm that the two are very close.
The average execution time of the state machine is 54.8ms, and Lambda is just in the lead with 50.5ms.
At the scale of my test, the AWS Cost Explorer was not really helpful, as the load I generated was too low. The AWS Pricing Calculator is a helpful tool to better compare the costs of both services.
The estimate is publicly available if you want to have a detailed look.
I calculated with 5 million invocations per month per service. Based on our test results, I was able to determine very precise values for the parameters that influence pricing, like Lambda invocation duration, state machine execution duration, or consumed memory. The monthly costs are:
- 8 USD for AWS Lambda with 2048MB memory (Best time)
- 1.83 USD for AWS Lambda with 256MB memory (Best cost)
- 5.52 USD for the AWS Step Functions Express Workflow
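To make these numbers easier to follow, here is a rough sketch of the arithmetic behind such an estimate. The prices are assumptions based on the public list prices (they can differ by region and change over time), the free tier is ignored, and I assume that the Express Workflow is billed in 100ms steps with 64 MB of memory. The function names are only for illustration.

```python
import math

# Assumed list prices in USD; check the current pricing pages for your region.
LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
EXPRESS_PRICE_PER_MILLION_REQUESTS = 1.00
EXPRESS_PRICE_PER_GB_SECOND = 0.00001667


def lambda_monthly_cost(invocations: int, avg_duration_ms: float, memory_mb: int) -> float:
    """Request charge plus duration charge, free tier ignored."""
    gb_seconds = invocations * (avg_duration_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * LAMBDA_PRICE_PER_MILLION_REQUESTS
    return request_cost + gb_seconds * LAMBDA_PRICE_PER_GB_SECOND


def express_monthly_cost(executions: int, avg_duration_ms: float, memory_mb: int = 64) -> float:
    """Request charge plus duration charge, assuming 100ms and 64 MB billing steps."""
    billed_ms = math.ceil(avg_duration_ms / 100) * 100
    billed_mb = math.ceil(memory_mb / 64) * 64
    gb_seconds = executions * (billed_ms / 1000) * (billed_mb / 1024)
    request_cost = executions / 1_000_000 * EXPRESS_PRICE_PER_MILLION_REQUESTS
    return request_cost + gb_seconds * EXPRESS_PRICE_PER_GB_SECOND


INVOCATIONS = 5_000_000

# Average durations measured in the "Best Time" test run
lambda_best_time = lambda_monthly_cost(INVOCATIONS, 41.8, 2048)
express_workflow = express_monthly_cost(INVOCATIONS, 45.1)

print(f"Lambda 2048 MB: {lambda_best_time:.2f} USD")    # about 7.97
print(f"Express Workflow: {express_workflow:.2f} USD")  # about 5.52
```

Under these assumptions the sketch lands close to the 8 USD and 5.52 USD from the estimate. The "Best cost" figure depends on the exact duration entered into the calculator, so it is not reproduced here.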
In this part we covered some important aspects, like options to improve the performance of a Lambda function. I think it is again very important to mention that this benchmark should not be interpreted as "use Step Functions whenever you can".
My goal was rather to start a discussion about why you should not base your decisions on hypotheses or rumors. Base them on data, so that you can make the best possible decision.
Use Lambda to transform not to transport
Or in my words: the best code is the code that is never written.
☝️ And here comes the thing, and this is very important to keep in mind:
BOTH SERVICES ARE AWESOME.
If you need to write a Lambda function, you will be able to solve a lot of problems. But depending on what you want to achieve, Step Functions give you a lot of power to get the same results without writing ANY line of code and without having to think about things like keep-alive settings or how to figure out the best memory setting. In all tests, AWS Lambda showed the well-known cold-start behavior, which is something you should keep in mind. AWS Step Functions also needs some warm-up time, but it is not really comparable to AWS Lambda cold starts. There was an interesting discussion around this on Twitter:
It only remains to say: happy coding AND happy orchestrating! 🥳 I really hope that my analysis and the approach to decision-making helps you in deciding towards or against one of these services for your individual use cases.
About the author:
👋 Hi, my name is Christian. I am working as an AWS Solution Architect at DFL Digital Sports GmbH. I am based in Cologne with my beloved wife and two kids. I am interested in all things around ☁️ (cloud), 👨‍💻 (tech) and 🧠 (AI/ML).
With 10+ years of experience in several roles, I have a lot to talk about and love to share my experiences. I worked as a software developer in several companies in the media and entertainment business, as well as a solution engineer in a consulting company.
I love the challenge of building highly scalable systems for millions of users. And I love to collaborate with lots of people to design systems in front of a whiteboard.