Building with a serverless mindset brings many benefits: high availability, resiliency, pay-for-value pricing, managed operational excellence, and more.
You can often achieve cost and performance improvements as well, compared to more traditional computing platforms.
At the same time, the best practices that allow you to design well-architected serverless applications have kept evolving over the last five years. Many techniques have emerged, such as avoiding "monolithic" functions, optimizing runtime dependencies, minifying code, filtering out uninteresting events, externalizing orchestration, etc. You can read about many of these practices in the AWS Serverless Application Lens whitepaper (last update: Dec 2019).
In this article, I'd like to dive deep into an optimization technique that I consider particularly useful as it doesn't require any code or architecture refactoring.
I'm referring to optimizing the resource allocation of your Lambda functions.
AWS Lambda resource allocation (power)
You can allocate memory to each individual Lambda function, from 128MB up to 3GB of memory.
Before you stop reading because "who cares about memory utilization?", let me clarify that it's much more appropriate to talk about power rather than memory, because with more memory also comes more CPU, I/O throughput, etc.
So for the rest of this article, I'm going to call it power. When I say "512MB of power" it will correspond to 512MB of memory for your Lambda function.
So why does it matter?
It matters because more power means that your function might run faster. And with AWS Lambda, faster executions mean cheaper executions too. Since you are charged in 100ms intervals, reducing the execution time often reduces the average execution cost.
For example, let's assume that by doubling the power of your Lambda function from 128MB to 256MB you could reduce the execution time from 310ms to 160ms. This way, you've reduced the billed time from 400ms to 200ms, achieving a 49% performance improvement for the same cost. If you double the power again to 512MB, you could reduce the execution time even further from 160ms to 90ms. So you've halved the billed time again, from 200ms to 100ms, achieving another 44% performance improvement. In total, that's a 71% performance improvement, without changing a single line of code, for the very same cost.
I understand these numbers are quite hard to parse and visualize in your mind, so here's a chart:
The blue line represents our average execution time: 49% lower at 256MB and 71% lower at 512MB. Since Lambda's execution cost is proportional to the memory allocation, we'd expect to spend more. But because the billed time jumps down to 200ms and 100ms respectively, the orange line (cost) stays constant.
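If you want to double-check the arithmetic, here's a quick back-of-the-envelope sketch. The per-GB-second price below is the published rate at the time of writing and may differ in your region, so treat the absolute numbers as illustrative:

```python
import math

# assumption: Lambda price per GB-second at the time of writing (us-east-1)
PRICE_PER_GB_SECOND = 0.0000166667

def billed_cost(memory_mb, duration_ms):
    # Lambda bills in 100ms increments, rounded up
    billed_ms = math.ceil(duration_ms / 100) * 100
    gb_seconds = (memory_mb / 1024) * (billed_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND

for memory_mb, duration_ms in [(128, 310), (256, 160), (512, 90)]:
    print(memory_mb, 'MB:', round(billed_cost(memory_mb, duration_ms), 10), 'USD per invocation')
```

All three configurations end up billing 0.05 GB-seconds per invocation (roughly $0.00000083), which is why the cost curve stays flat while the execution time keeps dropping.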
What if I don't need all that memory?
It doesn't matter how much memory you need. This is the counterintuitive part, especially if you come from a more traditional way of thinking about cost and performance.
Typically, over-provisioning memory means you're wasting resources. But remember, here memory means power 🚀
Your function might need only 50MB of memory to run correctly, and yet you can allocate 512MB so it will run faster for the same money. In other cases, your function might become faster AND cheaper.
Great, but how do I verify this in practice?
I asked myself the very same question in 2017. One day (on March 27th, around 6 PM CEST), I started working on automating this power-tuning process so my team and I could finally make data-driven decisions instead of guessing.
Meet AWS Lambda Power Tuning: github.com/alexcasalboni/aws-lambda-power-tuning 🎉
AWS Lambda Power Tuning is an open-source tool that helps you visualize and fine-tune the power configuration of Lambda functions.
It runs in your AWS account - powered by AWS Step Functions - and it supports multiple optimization strategies and use cases.
The tool will execute a given Lambda function a few times, parse the logs, crunch some numbers, and return the optimal power configuration.
This process is possible in a reasonable time because there is only one dimension to optimize. Today there are 46 different power values to choose from, and the tool allows you to select which values you want to test. In most cases, you can also afford running all the executions in parallel so that the overall execution takes only a few seconds - depending on your function's average duration.
Here's what you need to get started with Lambda Power Tuning:
- Deploy the power tuning app via Serverless Application Repository (SAR) - there are other deployment options documented here (for example, the Lumigo CLI or the Lambda Power Tuner UI)
- Run the state machine via the web console or API - here's where you provide your function's ARN and a few more options (see the boto3 sketch after this list)
- Wait for the execution results - you'll find the optimal power here
- You also get a handy visualization URL - this is how you'll find the sweet spot visually before you fully automate the process
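For example, here's a minimal sketch of steps 2 and 3 using boto3. The state machine ARN below is a placeholder - use the one created by your own deployment:

```python
import json
import time
import boto3

sfn = boto3.client('stepfunctions')

# placeholder - replace with the state machine ARN created by the SAR deployment
STATE_MACHINE_ARN = 'arn:aws:states:us-east-1:123456789012:stateMachine:powerTuningStateMachine'

# step 2: start the execution with your function's ARN and a few options
execution = sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    input=json.dumps({'lambdaARN': 'your-lambda-function-arn', 'num': 50}),
)

# step 3: wait for the execution to complete
while True:
    status = sfn.describe_execution(executionArn=execution['executionArn'])
    if status['status'] != 'RUNNING':
        break
    time.sleep(5)

# the output contains the optimal power value and the visualization URL
print(json.loads(status.get('output', '{}')))
```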
How do I find the sweet spot visually?
Let's have a look at the two examples below.
The red curve is always (avg) execution time, while the blue curve is always (avg) execution cost.
In both cases, I'm checking six common power values: 128MB, 256MB, 512MB, 1GB, 1.5GB, and 3GB.
Example 1
In this example, I'm power-tuning a long-running and CPU-intensive function. It runs in 35 seconds at 128MB and in about 3 seconds at 1.5GB. The cost curve is pretty flat and decreases a bit until 1.5GB, then increases at 3GB.
The optimal power value is 1.5GB because it's 11x faster and 14% cheaper with respect to 128MB.
Example 2
The average execution time goes from 2.4 seconds at 128MB to about 300ms at 1GB. At the same time, cost stays precisely the same. So we run 8x faster for the same cost.
Before we proceed with more examples...
Remember: we may not need 1GB or 1.5GB of memory to run the two functions above, but it doesn't matter because in both cases we get much better performance for similar (or even lower) cost.
Also note: if you are a data geek like me, you've probably noticed two more things to remember when interpreting these charts.
- The two y-axes (speed and cost) are independent of each other, so the point where the two curves cross each other is not necessarily the optimal value.
- Don't assume that untested power values (e.g. 768MB) correspond to the curve's interpolated value - testing additional power values in between might reveal unexpected patterns.
What does the state machine input/output look like?
Here's the minimal input:
```json
{
    "lambdaARN": "your-lambda-function-arn",
    "num": 50
}
```
But I highly encourage you to check out some of the other input options too (full documentation here):
```json
{
    "lambdaARN": "your-lambda-function-arn",
    "num": 50,
    "parallelInvocation": true,
    "payload": {"your": "payload"},
    "powerValues": [128, 256, 512, 1024, 1536, 3008],
    "strategy": "speed",
    "dryRun": false
}
```
For special use cases - for example, when you need to power-tune functions with side-effects or varying payloads - you can provide weighted payloads or pre/post-processing functions.
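For example, here's a sketch of what a weighted-payload input might look like - invocations are distributed across the payloads proportionally to their weights (check the project's documentation for the exact format):

```json
{
    "lambdaARN": "your-lambda-function-arn",
    "num": 50,
    "payload": [
        {"payload": {"type": "read"}, "weight": 30},
        {"payload": {"type": "write"}, "weight": 70}
    ]
}
```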
Here's what the output will look like (full documentation here):
```json
{
    "results": {
        "power": "512",
        "cost": 0.0000002083,
        "duration": 2.906,
        "stateMachine": {
            "executionCost": 0.00045,
            "lambdaCost": 0.0005252,
            "visualization": "https://lambda-power-tuning.show/#<encoded_data>"
        }
    }
}
```
`power`, `cost`, and `duration` represent the optimal power value and its corresponding average cost and execution time.

`stateMachine` contains details about the state machine execution itself, such as the cost related to Step Functions and Lambda. This information is particularly useful if you want to keep track of optimization costs without surprises - although typically we are talking about $0.001 for the whole execution (excluding additional costs that your function might generate by invoking downstream services).

Last but not least, you'll find the visualization URL (under lambda-power-tuning.show), an open-source static website hosted on AWS Amplify Console. If you don't visit that URL, nothing happens. And even when you visit the URL, there is absolutely no data sharing with any external server or service. The `<encoded_data>` mentioned above only contains the raw numbers needed for client-side visualization, without any additional information about your Account ID, function name, or tuning parameters. You are also free to build your own custom visualization website and provide it at deploy-time as a CloudFormation parameter.
Show me more examples, please!
Depending on what your function is doing, you'll find completely different cost/performance patterns. With time, you'll be able to identify at first glance which functions might benefit the most from power-tuning and which aren't likely to benefit much.
I encourage you to build a solid hands-on experience with some of the patterns below, so you'll learn how to categorize your functions intuitively while coding/prototyping. Until you reach that level of experience and considering the low effort and cost required, I'd recommend power-tuning every function and playing a bit with the results.
Cost/Performance patterns
I've prepared a shortlist of 6 patterns you may encounter with your functions.
Let's have a look at some sample Lambda functions and their corresponding power-tuning results. If you want to deploy all of them, you'll find the sample code and SAM template here.
1) The No-Op (trivial data manipulation)
When I say no-op functions, I mean functions that do very little - and they are more common than you might think. It happens pretty often that a Lambda function is invoked by other services to customize their behavior, and all you need is some trivial data manipulation: maybe a couple of `if`'s or a simple format conversion - no API calls or long-running tasks.
Here's a simple example:
```python
def lambda_handler(event, context):
    print("NOOP")
    response = 'OK'
    if event['something'] == 'KO':
        response = 'KO'
    return {
        'output': response
    }
```
This kind of function will never exceed 100ms of execution time. Therefore, we expect its average cost to increase linearly with power.
(click on the image to open the interactive visualization)
Even though there is no way to make no-op functions cheaper, sometimes you can make them run 3-5x faster. In this case, it might be worth considering 256MB of power, so it runs in less than 2ms instead of 5ms. If your function is doing something more than a simple `if`, you might see a more significant drop - for example, from 30ms to 10ms.
Does it make sense to pay a bit more just to run 20ms faster? It depends :)
If your system is composed of 5-10 microservices that need to talk to each other, shaving 20ms off each microservice might allow you to speed up the overall API response by a perceivable amount, resulting in a better UX.
On the other hand, if this function is entirely asynchronous and does not impact your final users' experience, you probably want to make it as cheap as possible (128MB).
2) The CPU-bound (numpy)
This function requires numpy, a very common Python library for scientific computing - which is available as an official Lambda layer.
```python
import numpy as np

# make this execution reproducible
np.random.seed(10)

def lambda_handler(event, context):
    # create a random matrix (1500x1500)
    matrix = np.random.rand(1500, 1500)
    # invert it (this is CPU-intensive!)
    inverted_matrix = np.linalg.inv(matrix)
    print(inverted_matrix)
    return {'OK': 'OK'}
```
The function creates a random matrix (1500 rows, 1500 columns) and then inverts it.
So we are talking about a very CPU-intensive process that requires almost 10 seconds with only 128MB of power.
The good news is that it will run much faster with more memory. How much faster? Check the chart below.
(click on the image to open the interactive visualization)
Yes, it will run almost 21x faster (2100%) with 3GB of power. And that's for a cost increase of only 23%.
Let me repeat that: we can run this function in 450ms instead of 10 seconds if we're happy about paying 23% more.
If you can't afford a 23% cost increase, you can still run 2x faster for a 1% cost increase (256MB). Or 4x faster for a 5% cost increase (512MB). Or 7x faster for a 9% cost increase (1GB).
Is it worth it? It depends :)
If you need to expose this as a synchronous API, you probably want it to run in less than a second.
If it's just part of some asynchronous ETL or ML training, you might be totally fine with 5 or 10 seconds.
The important bit is that this data will help you find the optimal trade-off for your specific use case and make an informed decision.
Note: the numbers above do not take cold starts into consideration. By default, Lambda Power Tuning ignores cold executions, so these averages are not biased by cold starts. This allows you to reason about the vast majority of (warm) executions.
3) The CPU-bound (prime numbers)
Let's consider another long-running function. This function also uses numpy to compute all the prime numbers below 1M, 1,000 times in a row.
```python
import numpy as np

def lambda_handler(event, context):
    # do the same thing 1k times in a row
    for i in range(1000):
        # compute all prime numbers below 1M
        primes = compute_primes_up_to(1000000)
    return {'OK': 'OK'}

def compute_primes_up_to(n):
    # this is the fastest single-threaded algorithm I could find =)
    # from https://stackoverflow.com/questions/2068372/fastest-way-to-list-all-primes-below-n-in-python/3035188#3035188
    sieve = np.ones(int(n/3) + (n % 6 == 2), dtype=bool)
    sieve[0] = False
    for i in range(int(int(n**0.5)/3 + 1)):
        if sieve[i]:
            k = 3*i + 1 | 1
            sieve[int((k*k)/3)::2*k] = False
            sieve[int((k*k + 4*k - 2*k*(i & 1))/3)::2*k] = False
    return np.r_[2, 3, ((3*np.nonzero(sieve)[0] + 1) | 1)]
```
The function takes almost 35 seconds to run with only 128MB of power.
But good news again! We can make it run much much faster with more memory. How much faster? Check the chart below.
(click on the image to open the interactive visualization)
Yes, it will run more than 14x faster (1400%) with 1.5GB of power. And that's with a cost DECREASE of 13.9%.
Let me repeat that: we can run this function in 2 seconds instead of 35 seconds, while at the same time we make it cheaper to run.
We could make it even faster (17x faster instead of 14x) with 3GB of power, but unfortunately the algorithm I found on StackOverflow cannot leverage multi-threading well enough (you get two cores above 1.8GB of power), so we'd end up spending 43% more.
This could make sense in some edge cases, but I'd still recommend sticking to 1.5GB.
Unless...
Unless there was an even more optimal power value between 1.5GB and 3GB. We aren't testing all the possible power values. We are trying only 6 of them, just because they are easy to remember.
What happens if we test all the possible values? We know that our best option is 1.5GB for now, but we might find something even better (faster and cheaper) if we increase the granularity around it.
```json
{
    "lambdaARN": "your-lambda-function-arn",
    ....
    "powerValues": "ALL",
    ....
}
```
Here's what happens if you test all the possible values:
(click on the image to open the interactive visualization)
It turns out the (global) sweet spot is 1.8GB - which allows us to run 16x faster and 12.5% cheaper.
Or we could pick 2112MB - which is 17x faster for the same cost as 128MB (still 20ms slower than 3GB, but at a better average cost).
Remember: when you see an increasing or decreasing trend (in cost or speed), it's likely to continue for a while, including for power values you aren't testing. Generally, I'd suggest increasing your tuning granularity to find globally optimal values.
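For example, once you know the sweet spot lives somewhere between 1.5GB and 3GB, you could re-run the tuning with a finer-grained list of values in that range instead of testing all of them (at the time of writing, Lambda accepts any multiple of 64MB) - a sketch:

```json
{
    "lambdaARN": "your-lambda-function-arn",
    "num": 50,
    "powerValues": [1536, 1792, 2048, 2112, 2304, 2560, 2816, 3008]
}
```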
4) The Network-bound (3rd-party API)
Let's move on and talk about our first network-bound example. This function interacts with an external API that's public and not hosted on AWS: The Star Wars API.
```python
import json
import urllib.request

# this is my (public) third-party API
URL = 'https://swapi.dev/api/people/?format=json'

def lambda_handler(event, context):
    # prepare request
    req = urllib.request.Request(URL)
    # fetch and parse JSON
    with urllib.request.urlopen(req) as response:
        json_response = json.loads(response.read())
    # extract value from JSON response
    count = json_response['count']
    return {
        'count': count,
    }
```
The function performs a GET request to fetch the number of characters available via the Star Wars API (we could have used the official swapi-python library for a higher-level interface, but that wasn't the point).
As we could have predicted, this external API's performance isn't impacted at all by the power of our Lambda function. Even though additional power means more I/O throughput, we are only fetching 5KB of data, so most of the execution time is spent waiting for the response, not transferring data.
(click on the image to open the interactive visualization)
The red curve above is pretty flat and the blue curve is always increasing, which means we cannot do much to speed up this function or make it cheaper.
We might save 50-100 milliseconds with additional power, but usually that's not enough to reduce the cost or keep it constant.
In this case, we can run a little bit faster with 256MB or 512MB of power - up to 16% faster if we're happy to triple the average execution cost.
Is it worth it? It depends.
If your monthly Lambda bill is something like $20, how do you feel about bumping it to $60 to run a customer-facing function 15-20% faster? I would think about it.
If it's not a customer-facing API, I'd stick to 128MB and make it as cheap as possible. And there might be other factors at play when it comes to third-party APIs. For example, you may need to comply with some sort of rate-limiting; if you're performing batches of API calls in series, a function that runs slower might be a good thing.
5) The Network-bound (3x DynamoDB queries)
This pattern is pretty common: a function that uses the AWS SDK to invoke a few AWS services and coordinate some business logic. We are still talking about a network-bound function, but it shows a different pattern. In this case, we are performing three `dynamodb:GetItem` queries in sequence, but the same pattern holds with other services such as SNS or SQS.
```python
import boto3

dynamodb = boto3.client('dynamodb')

def lambda_handler(event, context):
    # three identical queries in series
    # this is just an example
    # usually you'd have 3 different queries :)
    for i in range(3):
        response = dynamodb.get_item(
            TableName='my-table',
            Key={
                'id': {
                    'S': 'test-id',
                }
            }
        )
    message = response['Item']['message']
    return {
        'message': message,
    }
```
We are talking about AWS services, quite likely operating in the same AWS region. So our API calls won't leave the data center at all.
Surprisingly, we can make this function run much faster with additional power: this pattern is very similar to the first example we analyzed at the beginning of this article.
(click on the image to open the interactive visualization)
The function runs in about 350ms at 128MB, 160ms at 256MB, and 45ms at 512MB.
In practice, every time we double its power we also halve the billed time, resulting in constant price until 512MB.
After that, we cannot make it cheaper, so 512MB is our sweet spot.
But we could get an additional 40% performance improvement (28ms execution time) at 3GB, if we are ready to pay 6x more. As usual, this tradeoff is up to you and it depends on your business priorities. My suggestion is to adopt a data-driven mindset and evaluate your options case by case.
6) The Network-bound (S3 download - 150MB)
This is not a very common pattern, as downloading large objects from S3 is not a typical requirement. But sometimes you really need to download a large image/video or a machine learning model, either because it wouldn't fit in your deployment package or because you receive a reference to it in the input event for processing.
```python
import os
import boto3

s3 = boto3.client('s3')

# from the Amazon Customer Reviews Dataset
# https://s3.amazonaws.com/amazon-reviews-pds/readme.html
BUCKET = 'amazon-reviews-pds'
KEY = 'tsv/amazon_reviews_us_Watches_v1_00.tsv.gz'
LOCAL_FILE = '/tmp/test.gz'

def lambda_handler(event, context):
    # download 150MB (single thread)
    s3.download_file(BUCKET, KEY, LOCAL_FILE)
    size_bytes = os.stat(LOCAL_FILE).st_size
    total = size_bytes / 1024 / 1024
    unit = 'MB'
    if total > 1024:
        total = total / 1024
        unit = 'GB'
    # print "Downloaded 150MB"
    print("Downloaded %s%s" % (round(total, 2), unit))
    return {'OK': 'OK'}
```
Because we are trying to store a lot of data in memory, we won't test lower memory configurations such as 128MB and 256MB.
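With Lambda Power Tuning, that simply means restricting the list of power values so the lower configurations are skipped - for example:

```json
{
    "lambdaARN": "your-lambda-function-arn",
    "num": 50,
    "powerValues": [512, 1024, 1536, 2048, 3008]
}
```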
At first glance, the cost/performance pattern looks quite similar to our first network-bound example: additional power doesn't seem to improve performance. Execution time is pretty flat around 5 seconds, therefore cost always increases proportionally to the allocated power (at 3GB it's almost 5x more expensive on average).
(click on the image to open the interactive visualization)
From this chart, it looks like we can't do much to improve cost and performance. If we go for 1GB of power, we'll run 23% faster for a cost increase of 55%.
Can we do better than this?
Good news: this kind of workload will run much faster with a very simple code change:
```python
# download 150MB from S3 with 10 threads
transfer_config = boto3.s3.transfer.TransferConfig(max_concurrency=10)
s3.download_file(BUCKET, KEY, LOCAL_FILE, Config=transfer_config)
```
With the new code above, we're simply providing a custom `TransferConfig` object to enable multi-threading.
Now the whole process will complete a lot faster by parallelizing the file download with multiple threads, especially since we get two cores above 1.8GB of power.
Here's the new cost/performance pattern:
(click on the image to open the interactive visualization)
Not only is the pattern very different, but the absolute numbers are much better too. We run in 4.5 seconds at minimum power (which is already 10% faster than what we could do before at maximum power). But then it gets even better: we run another 40% faster at 1GB for a cost increase of 23%.
Surprisingly, we run almost 4x faster (1.1 seconds) at 3GB of power. And that's for the same cost (+5%) with respect to the single-threaded code at 512MB.
Let me rephrase it: adding one line of code allowed us to run 4x faster for the same cost.
And if performance didn't matter in this use case, the same change would allow us to make this function 45% faster AND 47% cheaper with minimum power (512MB).
I believe this is also an interesting example where picking a specific programming language might result in better performance without using additional libraries or dependencies (note: you can achieve the same in Java with the TransferManager utility or in Node.js with the S3 Managed Download module).
Conclusions
We've dived deep into the benefits of power-tuning for AWS Lambda: it helps you optimize your Lambda functions for performance or cost. Sometimes both.
Remember that memory means power and there is no such thing as over-provisioning memory with AWS Lambda. There is always an optimal value that represents the best trade-off between execution time and execution cost.
I've also introduced a mental framework to think in terms of workload categories and cost/performance patterns, so you'll be able to predict what pattern applies to your function while you are coding it. This will help you prioritize which functions might be worth optimizing and power-tuning.
AWS Lambda Power Tuning is open-source and very cheap to run. It will provide the information and visualization you need to make a data-driven decision.
Thanks for reading, and let me know if you find new exciting patterns when power-tuning your functions.
Top comments (13)
I often see different numbers than yours: when going from 128 to 256 (or other doublings of power), I get more than a halving of execution time. But I can't figure out why this is true. Is it because of the scheduler, CPU cache, or bandwidth?
Have you seen these cases, and if so, how do you explain it?
Edit: most of my Lambdas are a DynamoDB query, update, and value return.
Hi Ross, thanks for asking!
For DDB queries, it should be a mix of better CPU and better I/O throughput. It is quite common to see such an improvement when you double your function's power - it looks exactly like pattern 5 in the article :)
Yes, thanks! I think your example for 512MB goes below 100ms, so it's actually showing a bigger time reduction that's not reflected in the price information. Glad it's not just me seeing this pattern.
Great article! Thanks a lot.
When using Amplify, we have CloudFormation templates generated by Amplify that we can customize if we want (very often I do that to add new permissions).
How should I hardcode the power I need in this scenario? As far as I know, my entire system should be described using these templates.
Thanks a lot.
Hi Ricardo, very good question :)
You should be able to edit (or add) the `MemorySize` property of your function. It's always an integer value in MB and you can customize it for each individual function (CloudFormation doc here). I don't think you can do this with Amplify templates, but when using AWS SAM you can also customize the memory of all the functions in your template using the Globals section.
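For example, a stripped-down CloudFormation resource (in JSON, similar to what Amplify generates) might look like this - the resource name is a placeholder and other required properties are omitted for brevity:

```json
{
    "Resources": {
        "MyLambdaFunction": {
            "Type": "AWS::Lambda::Function",
            "Properties": {
                "Runtime": "python3.8",
                "Handler": "index.lambda_handler",
                "MemorySize": 512
            }
        }
    }
}
```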
With Amplify you can directly modify the CloudFormation template for Lambda functions and the changes will be preserved.
Hi Alex. Awesome article! Given the obvious benefits, why isn't AWS Lambda doing this optimization automatically for you (via opt-in, for example)?
Very good question, German :)
I can think of two good reasons why this is hard to provide as a managed solution:
1) you need to provide a valid input event for power-tuning, as every function requires its own specific payload; while this is something that can be automated, it's also very function-specific and domain-specific (Alexa skill vs S3 event vs Kinesis stream vs HTTP request)
2) power-tuning requires executing your functions in a real environment, with real downstream dependencies (databases, 3rd-party APIs, legacy systems, etc.), so you need to be aware of those downstream services and prepare them for the load, otherwise you'll get noisy or inconsistent results
But never say never :)
This is an awesome article and a brilliant deep dive. You can't imagine the number of debates I have had with my teammates when trying to select the best memory for our lambda functions.
Thanks 🙏 I'm glad it was a useful read for you (and your team)
I feel like I'm missing something here. The cost of lambda goes up roughly linearly with memory, as does performance, so without the 100ms binning of billing you'd expect things to more or less break even. Because of the 100ms step though it's generally better (for cost) to use less power. In your first example the billed cost actually goes up by 30% even though the time goes down by 30%. For fast functions (on the order of a small number of 100ms) it will generally be cheaper to use as little power as possible, because the 100ms billing rounding error creates substantial cost unless 100ms is small compared to execution time.
This depends a lot on the use case. Some of the workload categories I've described do benefit from using more power (especially the CPU-bound ones) since it often results in lower average cost too.
Thanks, I think you've found a typo in my first example. As I've reshaped that example a few times, in the latest version the average cost is not constant anymore. The idea was that often you can double the power and halve the execution time, resulting in the same average execution cost. I'll fix it now.
I think this is good advice only if you don't care about performance (i.e. for most asynchronous executions, where nobody is waiting for a response).
Well explained. Great article, Alex. Thanks for putting this together.