DEV Community

Christian Bonzelet

AWS Step Functions vs. AWS Lambda benchmark

Looking into the AWS ecosystem of serverless services, AWS Step Functions is one of my personal favorites. I recently had a chat with some colleagues about a potential use case for Step Functions instead of AWS Lambda. While we discussed the general concept of AWS Step Functions, one of my beloved colleagues argued for AWS Lambda along these lines:

Let us use AWS Lambda because a workflow described as a state machine sounds like it is much slower.

I could neither substantiate this statement nor refute it. So I set out to examine the original assumption "Step Functions is slower than Lambda" with facts. Time for a benchmark!

For me the results were crystal clear 😆

One does not simply

Just kidding! Let us first get a common understanding of what AWS Step Functions and AWS Lambda are. If you are familiar with these services, you can jump right into the section about the test setup and results.

By the way: the source code is also available on GitHub.

🤹 What is AWS Step Functions?

AWS Step Functions was launched in 2016 as a serverless orchestration service. I think the following definition explains very well what kind of problems AWS Step Functions solves:

Step Functions is a serverless orchestration service that lets you combine […] AWS services to build business-critical applications. Through Step Functions’ graphical console, you see your application’s workflow as a series of event-driven steps.

Step Functions is based on state machines and tasks. A state machine is a workflow. A task is a state in a workflow that represents a single unit of work that another AWS service performs. Each step in a workflow is a state.
Source: What is AWS Step Functions? - AWS Step Functions

State machines can be invoked both asynchronously and synchronously. Step Functions offers several ways to invoke your state machine, for example:

  • via an explicit StartExecution call using your favourite AWS SDK,
  • on each HTTP request hitting your Amazon API Gateway,
  • as a target on your Amazon EventBridge event bus.
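
As a minimal sketch of the first option, assuming the AWS SDK for JavaScript v2: the helper below only builds the request parameters, so it runs without AWS credentials, and the SDK call itself is shown as a comment. The state machine ARN, the payload, and the helper name are hypothetical placeholders.

```javascript
// Builds the parameters for a Step Functions StartExecution call.
// The ARN and payload used below are hypothetical placeholders.
function buildStartExecutionParams(stateMachineArn, payload) {
  return {
    stateMachineArn,
    // StartExecution expects the input as a JSON string, not as an object.
    input: JSON.stringify(payload),
  };
}

const params = buildStartExecutionParams(
  "arn:aws:states:eu-central-1:123456789012:stateMachine:example",
  { orderId: "42" }
);
console.log(params);

// With the AWS SDK for JavaScript v2 installed and credentials configured,
// the call itself would look roughly like this:
//   const AWS = require("aws-sdk");
//   await new AWS.StepFunctions().startExecution(params).promise();
// Express state machines can also be invoked synchronously via
// startSyncExecution, which accepts the same parameters.
```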

Typical use cases for AWS Step Functions include data processing, machine learning, microservices orchestration, and governance and security automation. Since the launch of the AWS SDK service integrations, you can use out-of-the-box integrations with more than 200 AWS services that are supported by the AWS SDK. This offers you a huge number of new opportunities to integrate with AWS services without writing a single line of code.

When creating a new state machine, you choose between two workflow types, “Standard” and “Express”. Each type has its own characteristics and strengths. Standard workflows are a good fit for long-running workflows, while Express workflows are a good fit for high-traffic workloads, data streaming, or mobile application backends.

⚡️ What is AWS Lambda?

Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda runs your code on a high-availability compute infrastructure and performs all of the administration of the compute resources, including server and operating system maintenance, capacity provisioning and automatic scaling, code monitoring and logging. With Lambda, you can run code for virtually any type of application or backend service.
Source: https://docs.aws.amazon.com/lambda/latest/dg/welcome.html

Don’t get me wrong, I am also a big fan of AWS Lambda. But since AWS announced the game-changing SDK service integrations for Step Functions, I have started to think more about what the typical use cases for AWS Lambda are, so that in the future I use AWS Lambda for the things it is amazing at.

Or to quote Eric Johnson at the serverless office hours:

Use Lambda to transform, not to transport

⏰ Benchmarking latencies

The goal of this benchmark is not to say that service A is better or worse than service B. Each service has its strengths and weaknesses.
What we want to achieve is a better understanding of the latencies we can measure for AWS Step Functions and how they compare to a similar integration based on AWS Lambda.

General setup

We want to measure the time it takes to write data to and read data from Amazon S3, both from a state machine and from an AWS Lambda function.

We test the behavior in two versions. Version 1 simply writes to S3. Version 2 extends this by executing a GetObject operation afterwards. The Lambda function is written in JavaScript; the listing below shows version 2.



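const AWSXRay = require("aws-xray-sdk-core");
// Wrap the AWS SDK (v2) so that every S3 call shows up in the X-Ray trace.
const AWS = AWSXRay.captureAWS(require("aws-sdk"));
// Create the client outside the handler so warm invocations reuse it.
const S3 = new AWS.S3();
const bucketName = process.env.DestinationBucketName;

exports.lambdaHandler = async (event, context) => {
  try {
    console.log("EVENT: " + JSON.stringify(event));
    // Use the API Gateway request ID as a unique object key.
    const key = "lambda/" + event.requestContext.requestId;
    await S3.putObject({
      Bucket: bucketName,
      Key: key,
      Body: new Date().toISOString(),
    }).promise();

    // Version 2 only: read the object back after writing it.
    await S3.getObject({
      Bucket: bucketName,
      Key: key,
    }).promise();

    const response = {
      statusCode: 200,
      isBase64Encoded: false,
    };
    return response;
  } catch (err) {
    console.error(err);
    // Return a well-formed proxy response instead of the raw error object.
    return {
      statusCode: 500,
      isBase64Encoded: false,
    };
  }
};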



The state machine workflow is similarly straightforward and chains the same Amazon S3 calls as the AWS Lambda function.

State machine graph
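
To give an idea of what such a workflow can look like, here is a hypothetical sketch of an Amazon States Language definition for version 2, written as a JavaScript object so that it can be checked and printed. It is not the definition used in the benchmark (that one lives in the GitHub repository): the bucket name, state names, key prefix, and the small helper function are assumptions for illustration.

```javascript
// Hypothetical placeholder; the benchmark's real bucket name is not shown here.
const bucketName = "my-benchmark-bucket";

// Sketch of a state machine that writes an object to S3 and reads it back,
// using the AWS SDK service integrations (no Lambda function involved).
const definition = {
  Comment: "Write an object to S3, then read it back",
  StartAt: "PutObject",
  States: {
    PutObject: {
      Type: "Task",
      Resource: "arn:aws:states:::aws-sdk:s3:putObject",
      Parameters: {
        Bucket: bucketName,
        // Assumption: derive a unique key from the execution name.
        "Key.$": "States.Format('stepfunctions/{}', $$.Execution.Name)",
        "Body.$": "$$.Execution.StartTime",
      },
      ResultPath: null, // discard the PutObject result, keep the state input
      Next: "GetObject",
    },
    GetObject: {
      Type: "Task",
      Resource: "arn:aws:states:::aws-sdk:s3:getObject",
      Parameters: {
        Bucket: bucketName,
        "Key.$": "States.Format('stepfunctions/{}', $$.Execution.Name)",
      },
      End: true,
    },
  },
};

// Small sanity check: every transition must point to a state that exists.
function findDanglingTransitions(def) {
  const names = new Set(Object.keys(def.States));
  const problems = [];
  if (!names.has(def.StartAt)) problems.push(`StartAt -> ${def.StartAt}`);
  for (const [name, state] of Object.entries(def.States)) {
    if (state.Next !== undefined && !names.has(state.Next)) {
      problems.push(`${name} -> ${state.Next}`);
    }
  }
  return problems;
}

console.log(findDanglingTransitions(definition)); // an empty array means all transitions resolve
console.log(JSON.stringify(definition, null, 2));
```

For version 1 of the experiment, a definition along these lines would drop the GetObject state and mark PutObject with End: true instead of Next.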

Both the AWS Lambda function and the state machine are invoked synchronously via Amazon API Gateway; the state machine is an Express workflow. All experiments are triggered using Apache Bench with the following parameters:

ab -n 15000 -c 1 https://hash.execute-api.eu-central-1.amazonaws.com/Prod/invoke-lambda/

-n configures the total number of requests that are triggered, in our case 15,000
-c is the number of concurrent requests, in our setup 1

I decided to use this setting because I want to generate a moderate stream of load for both integrations.
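
To make concrete what a run with a concurrency of 1 measures, here is a minimal sketch of a sequential load loop together with the two statistics used later on, the mean and the 99th percentile. This is not how Apache Bench is implemented; the function names and the fake request are assumptions for illustration, and the percentile uses the simple nearest-rank method.

```javascript
const { performance } = require("perf_hooks");

// Sends requests strictly one after another (the equivalent of -c 1) and
// records how long each one took in milliseconds.
async function runSequential(totalRequests, sendRequest) {
  const latenciesMs = [];
  for (let i = 0; i < totalRequests; i++) {
    const start = performance.now();
    await sendRequest(i);
    latenciesMs.push(performance.now() - start);
  }
  return latenciesMs;
}

function mean(values) {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Nearest-rank percentile: the smallest sample such that at least
// p percent of all samples are less than or equal to it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical stand-in for an HTTP call; replace it with a real request.
const fakeRequest = () => new Promise((resolve) => setTimeout(resolve, 5));

runSequential(20, fakeRequest).then((latencies) => {
  console.log("mean ms:", mean(latencies).toFixed(2));
  console.log("p99 ms:", percentile(latencies, 99).toFixed(2));
});
```

Because each request waits for the previous one to finish, the total duration of such a run is roughly the number of requests multiplied by the mean time per request.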

X-Ray is activated on all integration layers so that we get a complete trace from API Gateway down to S3.

Experiment 1 - Writing to S3

The first experiment focuses only on the execution of a PutObject, without reading the files afterwards. The automatic Amazon CloudWatch dashboards for AWS Lambda, Amazon API Gateway and AWS Step Functions are a good starting point and give us valuable insights.

Let us start by analyzing the Apache Bench reports. The complete reports are available on GitHub. Here are some highlights:

  • The state machine processed all requests 539 seconds faster than the Lambda function.
  • The state machine processed 2.07 more requests per second.
  • The mean time per request for the state machine is 35.92 ms lower than for the Lambda-based integration.
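
Because the requests were sent one at a time, the first and the third highlight are two views of the same number: the difference in total time is roughly the number of requests multiplied by the difference in mean time per request. A quick check of that arithmetic (the function name is just for illustration):

```javascript
// Total time difference in seconds for a sequential run, given the number
// of requests and the difference in mean time per request in milliseconds.
function totalTimeDifferenceSeconds(totalRequests, meanDifferenceMs) {
  return (totalRequests * meanDifferenceMs) / 1000;
}

const differenceSeconds = totalTimeDifferenceSeconds(15000, 35.92);
console.log(differenceSeconds.toFixed(1)); // prints "538.8", in line with the reported 539 seconds
```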

API Gateway latencies

A closer look at the Amazon CloudWatch dashboard underlines what Apache Bench tells us. Over the complete length of the benchmark, the average latency of Step Functions is consistently below that of AWS Lambda.

Average latencies on API Gateway

Both integration types show a drop in latencies, which hints at some kind of cold start behavior. On average, the drop is more pronounced for Step Functions than for AWS Lambda.

When we take a closer look at the 99th percentile, we see some more spikes but in general a similar result over time.

99th percentile latencies on API Gateway

State machine and AWS Lambda function execution

Let us now jump into the next integration layer and take a look at the duration of the AWS Lambda function and the state machine itself. Not surprisingly, the state machine is much faster, in the end by roughly 60% compared to the duration of the Lambda function.

State machine and Lambda execution

The AWS Lambda function runs with the default memory setting of 128 MB and the default timeout of 3 seconds. Depending on the concrete use case, fine-tuning your memory settings might have a significant impact on the Lambda metrics.

Downstream service latencies

I was very surprised to see that the connection between Step Functions and S3 seems to be much more efficient. Looking at our X-Ray service map and traces, the average latency between Lambda and S3 is 63 ms, compared to 28 ms for the integration with Step Functions. It may be a coincidence that the relative difference is also almost 60%. Or it might reveal that Step Functions does some optimization in handling the AWS client SDK under the hood.

X-Ray service map experiment 1
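
The relative difference between the two downstream latencies can be checked with a one-line calculation. The sketch below takes the Lambda latency as the baseline; the function name is an assumption for illustration.

```javascript
// Relative reduction of a latency compared to a baseline, as a fraction.
function relativeReduction(baselineMs, candidateMs) {
  return (baselineMs - candidateMs) / baselineMs;
}

const reduction = relativeReduction(63, 28);
console.log(`${(reduction * 100).toFixed(1)}%`); // prints "55.6%"
```

So the measured gap to S3 is a little under 56%, somewhat below the rounded 60% quoted for the execution duration.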

Experiment 2 - Write and read from S3

I was interested to know whether the amount of work a state machine has to do impacts latencies and execution times compared to my AWS Lambda function. Hence we extended our experiment to also read data from S3 after writing it.

Again, let us first check the report from Apache Bench:

  • The state machine processed all requests 1287 seconds faster than the Lambda function.
  • The state machine processed 3.01 more requests per second.
  • The mean time per request for the state machine is 85.83 ms lower than for the Lambda-based integration.

API Gateway latencies and execution duration

Long story short, the results are comparable to the ones from the first experiment. But it is interesting to see that the gap between the state machine and the Lambda function is getting bigger. Some factors will influence this, like the chosen implementation and runtime of the AWS Lambda function.

💡 Please check out the awesome article by my fellow AWS Community Builder Alexandr Filichkin about a performance comparison of the different Lambda runtimes.

The AWS Lambda function is not able to get closer to the latency behavior of the state machine implementation.

API Gateway latencies experiment 2

The AWS Lambda function needs almost double the amount of time to write data to and read it from S3.

Execution duration experiment 2

It is also interesting to see that the average latency between my AWS Lambda function and Amazon S3 seems to increase slightly compared to the first experiment. AWS Step Functions keeps on optimizing the connection to Amazon S3 🤩.

X-Ray service map experiment 2

Conclusions

Based on the things I learned, what would I answer now if someone states

Let us use AWS Lambda because a workflow described as a state machine sounds like it is much slower.

My general answer would be: measure first. My specific answer on the comparison of AWS Step Functions and an AWS Lambda function is that this is not true in all cases. Our little experiment revealed some interesting insights:

  • AWS Step Functions scales and is much faster in our setup compared to my AWS Lambda function.
  • In this experiment, the state machine communicates more efficiently with S3 than my custom code implementation.
  • Comparing the Step Functions implementation with AWS Lambda, it is obvious that we do not have to write custom code to achieve the same results.
  • The new capabilities of the Step Functions Workflow Studio and the SDK service integrations lower the barrier to achieving the same result in this use case while reducing time-to-market.

But be cautious in generalizing the test results. There is a lot you can do to optimize your AWS Lambda functions for performance efficiency. Your results might also differ in other use cases. These results should not discourage you from creating additional benchmarks that include your specific use cases to measure what is important to you.

Please also ask yourself whether you really have to optimize for performance, and consider whether your use case could also be implemented asynchronously.


About the author:

👋 Hi, my name is Christian. I work as an AWS Solution Architect at DFL Digital Sports GmbH. I am based in Cologne with my beloved wife and two kids. I am interested in all things around ☁️ (cloud), 👨‍💻 (tech) and 🧠 (AI/ML).

With 10+ years of experience in several roles, I have a lot to talk about and love to share my experiences. I worked as a software developer in several companies in the media and entertainment business, as well as a solution engineer in a consulting company.

I love the challenge of providing highly scalable systems for millions of users. And I love to collaborate with lots of people to design systems in front of a whiteboard.

You can find me on LinkedIn or Twitter.


Cover Image by Mateusz Wacławek on Unsplash

Top comments (21)

Lisa Copeland

I'm having an issue with making dynamoDB calls in a step function vs lambda - I am seeing queries taking 7-10 milliseconds in the lambda vs about 250-775 milliseconds in the step function - any thoughts as to what would cause this? Any input would be appreciated

Christian Bonzelet

Hard to guess. Do you see some throttling on DynamoDB that could cause the increased latency?

Lisa Copeland

The issue turned out to be X-Ray: when we disabled it, the calls in the state machine took as long as the calls in the Lambda function.

Imaduddin Haetami

It's weird to me that you're using Step Functions as a Lambda replacement. Both can possibly do the same task, but Step Functions is a much better interface for scheduling, i.e. "What should we execute next? When?", and for microservice orchestration, and it is especially good at passing messages along among compute services, while Lambda focuses on general-purpose computing.

As you can see, you have managed to "put" then "get" to S3 with both Lambda and Step Functions. However, the focus here should be on Step Functions: you have two computing services, one is get, the other one is put, and you execute one after the other. Whereas in Lambda, you possibly have one code base with the AWS SDK to access S3 that does the same thing.

Frankly, given the same task, I will simply use Lambda as it will be the cheaper option. However, if you are benchmarking this way, you have misunderstood the power of Step Functions.

Christian Bonzelet

Hi @artidata, thanks for your reply. Step Functions gives you an interesting option when orchestrating services or a workflow. Both services are awesome and have their use cases, which overlap quite a bit. As I stated: the comparison is not meant to argue explicitly for or against a given service. It is meant to offer more options for some use cases.

I will look deeper into a cost comparison in an upcoming benchmark.

trobert2 • Edited

I think it's important to point out that the Lambda benchmark might be different for other programming languages.
I've noticed that SDK calls from Go Lambda functions are faster than from Node.js functions. I just wanted to add this dimension so that we don't just blame Lambda for the latency but consider other factors.

Christian Bonzelet • Edited

Absolutely. There is a lot you can improve in your Lambda function code. The thing is: you have to know all these details if you have to optimize for performance efficiency. It seems like the state machine in Step Functions simply works pretty well without maintaining a bunch of code or configuration.

Optimizing your Lambda function can be a really complex task.

Jonathan Goldwasser

Does the conclusion hold when setting the env var AWS_NODEJS_CONNECTION_REUSE_ENABLED in the Lambda function?

docs.aws.amazon.com/sdk-for-javasc...

Christian Bonzelet

That is worth testing in an upcoming version, as is increasing the memory configuration of the Lambda function. First indications are already discussed on Twitter:

twitter.com/marekq/status/14536579...

Lucas Ferreira

I've been using step function for a while and I've loved it.

One question, how do you set the things inside the step function block that calls the S3 API? For example, how do we set the content of the PutObject action?

Note: I've not seen the step function after this great update.

Christian Bonzelet

@bigluxio you should play around with this awesome new feature :D
With regard to your question: in the first step I wanted to keep it simple, so I just write the execution ID from the context object to S3. You can get the whole state machine definition from here:

github.com/cremich/aws-sf-lambda-b...

My recommendation would be: start with the new Workflow Editor to set up your workflow, export the state machine definition and provision it using your favorite infrastructure as code tool.

Anthony Hildoer

Analyzing the performance difference between step functions and lambdas is like examining the performance of the Vespas versus mopeds. 😬

Christian Bonzelet

So which service is the vespa and which one the moped? :D

Anthony Hildoer • Edited

I don't think it matters. I drive a freight train. ;-)

Lambda, and things built on top of that, are great for small/medium projects. But, once running at enterprise/Internet scale, one needs to switch to services that directly are built on top of EC2. Even Fargate managed container service is orders of magnitude more scalable (horizontally and vertically) than anything that depends on Lambda.

For me, a project crosses the border from small/medium to enterprise/internet scale at around a sustained load of 10s of API requests / second, or an SLA > 99.9%, whichever comes first. This is the moment when I see dependencies on Lambda fall apart.

Rolf Streefkerk

It would be interesting to also take cost into consideration

Christian Bonzelet

Definitely. I will prepare a second part focusing on tuning the Lambda settings but also looking into the effect on costs.

M. Mitch

It's an interesting article, thanks for sharing it.

I was just wondering about something, did you use step functions in an asynchronous way and lambda synchronous from API gateway?

Gerard Ketuma

Looking at the codebase, it is an Express state machine executed synchronously:

github.com/cremich/aws-sf-lambda-b...
github.com/cremich/aws-sf-lambda-b...

Also lambda is executed synchronously.

Christian Bonzelet

Thanks for the feedback @sirmomster
I will give the article an update to be more precise in the general setup section.

Yes, like @gerardketuma already wrote: both lambda and the state machine are executed synchronously. In this case I used an express state machine.

Jeremiah Russell

I wonder how the Lambda code impacts this result? Would a Rust Lambda compare more favourably?

Christian Bonzelet

It is likely to have an impact, as I tried to touch on without going into too much detail, since this is worth a complete article on its own:

Some factors will influence this, like the chosen implementation and runtime of the AWS Lambda function.
💡 Please check out the awesome article by my fellow AWS Community Builder Alexandr Filichkin about a performance comparison of the different Lambda runtimes.

[...] be cautious in generalizing the test results. There is a lot you can do to optimize your AWS Lambda functions for performance efficiency.