Steve Bjorg for LambdaSharp

Posted on Oct 14, 2022

Lessons Learned on Optimizing .NET on AWS Lambda

#aws #dotnet #serverless

I learned a lot diving into the foundational parts of the AWS Lambda implementation for .NET, but I also have some questions left. Maybe I will pick this up sometime later, or maybe someone will feel inspired to venture deeper into this topic.

Conclusions

Before sharing my conclusions, I want to stress again that you should use them as a starting point and benchmark your own code to see what makes sense for your situation.

Also, make sure you understand what optimal means for your application. There is no one-size-fits-all. Know what your objectives are ahead of time.

Tiered Compilation

Based on the data gathered by the benchmarks, Tiered Compilation should only be used if minimizing cold start duration is your top priority. From a cost perspective, enabling it never paid off in these benchmarks.

ReadyToRun

This option is easy to recommend. If you know the target CPU architecture ahead of time, ReadyToRun is an obvious choice. Even with Tiered Compilation disabled, the Lambda function performs very well. This was quite a surprise, because ReadyToRun generates less optimized code that takes the place of Tier 0, and without Tiered Compilation enabled, that code is never rejitted. However, the measurements show that the rejitting overhead is so onerous that 100 warm starts are not enough to make up the difference.
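The trade-off above can be framed as a break-even calculation. With Tiered Compilation enabled, rejitting adds a one-time overhead, and every later invocation then runs somewhat faster on Tier 1 code. The sketch below is a hypothetical model and is not part of the benchmarks. The function name and the input numbers are illustrative assumptions, and in practice the overhead is spread over several invocations rather than paid all at once.

```python
import math


def break_even_invocations(rejit_overhead_ms: float, savings_per_invoke_ms: float) -> float:
    """Warm invocations needed before the one-time rejit overhead is repaid.

    rejit_overhead_ms: total extra time spent rejitting ReadyToRun code
        when Tiered Compilation is enabled (hypothetical input).
    savings_per_invoke_ms: how much faster each invocation runs once the
        rejitted Tier 1 code is in place (hypothetical input).
    """
    if savings_per_invoke_ms <= 0:
        # Rejitted code is not faster, so the overhead is never repaid.
        return math.inf
    return math.ceil(rejit_overhead_ms / savings_per_invoke_ms)


# Hypothetical numbers, not measurements from these benchmarks.
print(break_even_invocations(250.0, 2.0))  # → 125
```

An execution environment may be recycled before it serves enough warm invocations to reach the break-even count. In that case, leaving Tiered Compilation disabled is the cheaper choice.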

Pre-JIT .NET

I would recommend setting the AWS_LAMBDA_DOTNET_PREJIT environment variable to Always unless cold start duration is critical. If anything, I would explore how to pre-JIT even more of the code during the INIT phase, since that phase is free and, for lower memory configurations, runs faster than the INVOKE phase.
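To see why moving work into the INIT phase pays off, here is a toy billing model for one execution environment: one INIT phase followed by a series of invocations, each rounded up to the nearest millisecond. It assumes, as described above, that INIT is not billed for the managed .NET runtime. The `init_billed` flag lets you model a runtime where it is billed. The function names and all durations in the example are hypothetical.

```python
import math


def billed_ms(init_ms: float, invoke_ms: list[float], init_billed: bool) -> int:
    """Billed duration for one execution environment (toy model).

    Each invocation is rounded up to the nearest millisecond. INIT time is
    only added when init_billed is True (assumed False for the managed
    .NET runtime, per the text above).
    """
    total = sum(math.ceil(ms) for ms in invoke_ms)
    if init_billed:
        total += math.ceil(init_ms)
    return total


def gb_seconds(billed: int, memory_mb: int) -> float:
    """Convert billed milliseconds at a given memory size into GB-seconds."""
    return billed / 1000 * (memory_mb / 1024)


# Hypothetical scenario: pre-JIT moves 400 ms of JIT work from the first
# invocation into INIT; the 99 warm invocations are unaffected.
warm = [20.0] * 99
no_prejit = billed_ms(300, [900.0] + warm, init_billed=False)
with_prejit = billed_ms(700, [500.0] + warm, init_billed=False)
print(no_prejit, with_prejit)  # → 2880 2480
```

In this toy model, pre-JIT saves 400 ms of billed time when INIT is free. It saves nothing once INIT is billed, because both variants then total 3180 ms. This is why the billing of the INIT phase matters so much.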

CPU Architecture

The ARM64 architecture is the exciting new kid on the block for .NET and AWS Lambda, but the venerable x86-64 architecture should not be discounted. In these benchmarks, x86-64 often delivered better performance, but at a higher cost. ARM64 also has more known issues, so make sure to check the issue tracker to see whether any of them might affect your project.

Future Work

Here are some areas I would like to explore further when time permits.

Are all AWS Lambda regions the same?

My benchmarks were conducted in us-west-2. I would assume all regions perform the same, but it would be interesting to confirm that this is indeed the case.

Benchmarking with I/O Operations

I received some feedback that my benchmarks were not representative because they lacked I/O operations, such as calls to other services. That is correct, and it was intentional. I/O operations are at least an order of magnitude slower than compute operations. My goal was to understand the interplay of various compiler options, CPU architectures, and memory configurations. Adding I/O into the mix would have prevented establishing a clean baseline. That said, I agree that benchmarking these scenarios with I/O operations is both interesting and valuable.

.NET 7 Native

One of the features I'm most excited about in .NET 7 is native compilation (Native AOT). Preliminary results shared by others have shown very promising performance improvements. However, because natively compiled functions run on the generic custom Lambda runtime, the INIT phase is no longer free. Does that also mean that the INIT phase no longer runs at full speed? If so, it changes everything about how we need to approach minimizing execution costs. Still, the promise of much faster execution is tantalizing, to say the least.

Self-Hosted .NET

I thought about benchmarking self-hosted .NET Lambda functions, but given that native compilation is a much better alternative, I did not bother. In my experience, self-hosted functions are large and slow. The only time it made sense to consider them was with .NET 5, to access newer features before AWS Lambda supported .NET 6. For .NET 7, I would focus on native compilation and ignore the self-hosted option.

More Pre-Jitting during INIT Phase

As mentioned, I think there is something to be said about pre-jitting more code during the INIT phase. I don't know what else would make sense, but I would explore this area to shift some billable execution time to the free INIT phase. I feel there is some untapped potential here.

Custom Amazon.Lambda.RuntimeSupport Package

I can't shake the feeling that there are some opportunities to specialize the Amazon.Lambda.RuntimeSupport package for greater flexibility and better performance.

To better understand how Lambda functions interact with the AWS Lambda service, I created a mock implementation of the Lambda Runtime API and its four endpoints:

/2018-06-01/runtime/invocation/next: This endpoint returns the payload for the next Lambda invocation request.
/2018-06-01/runtime/invocation/{awsRequestId}/response: This endpoint receives the response of a successful Lambda invocation.
/2018-06-01/runtime/invocation/{awsRequestId}/error: This endpoint receives the error message of a failed Lambda invocation.
/2018-06-01/runtime/init/error: This endpoint receives the error message of a failed Lambda initialization.
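For illustration, here is one way such a mock could look in Python. This is not the author's implementation. The class and method names are hypothetical, and only the four paths come from the list above. The mock returns just the Lambda-Runtime-Aws-Request-Id header, whereas the real /next endpoint also sends headers such as the deadline and trace ID. Result posts get 202 Accepted, as in the real API. Rejecting unknown request IDs with 400 is a choice made for this mock. Requests are routed in memory, so the mock can be exercised without a network.

```python
import json
import queue
import re
import uuid

PREFIX = "/2018-06-01/runtime"
# Matches /invocation/{awsRequestId}/response and /invocation/{awsRequestId}/error.
_RESULT = re.compile(rf"^{PREFIX}/invocation/([^/]+)/(response|error)$")


class MockRuntimeApi:
    """In-memory stand-in for the Lambda Runtime API (illustrative only)."""

    def __init__(self):
        self._pending = queue.Queue()  # (request_id, payload) not yet fetched
        self.in_flight = set()         # request ids handed out, not yet answered
        self.results = {}              # request_id -> ("response" | "error", body)
        self.init_error = None         # body of a reported initialization error

    def enqueue(self, payload: dict) -> str:
        """Queue an invocation payload and return its generated request id."""
        request_id = str(uuid.uuid4())
        self._pending.put((request_id, json.dumps(payload).encode()))
        return request_id

    def handle(self, method: str, path: str, body: bytes = b""):
        """Return (status, headers, body) for one runtime request."""
        if method == "GET" and path == f"{PREFIX}/invocation/next":
            # Like the real endpoint, block until a payload is available.
            request_id, payload = self._pending.get()
            self.in_flight.add(request_id)
            return 200, {"Lambda-Runtime-Aws-Request-Id": request_id}, payload
        if method == "POST" and path == f"{PREFIX}/init/error":
            self.init_error = body
            return 202, {}, b""
        match = _RESULT.match(path)
        if method == "POST" and match:
            request_id, kind = match.groups()
            if request_id not in self.in_flight:
                return 400, {}, b"unknown request id"
            self.in_flight.discard(request_id)
            self.results[request_id] = (kind, body)
            return 202, {}, b""
        return 404, {}, b"not found"
```

Wrapping `handle` in an `http.server` request handler could turn the mock into a local server for a runtime client to talk to.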

The request to /2018-06-01/runtime/invocation/next is suspended until the next invocation payload is available. That means it is technically possible to respond to a request using /2018-06-01/runtime/invocation/{awsRequestId}/response and then do some additional post-response clean-up work, before requesting the next payload, without impacting the responsiveness of the Lambda function. For example, the garbage collector could be explicitly triggered. I don't know if that makes sense in practice, but it's an interesting notion to explore.

It also bugs me that baseline .NET Core 3.1 runs faster than .NET 6. With all the hard work that went into optimizing the performance of .NET 6, it feels wrong. Maybe something could be done at the custom runtime level to improve things further.

More C# Compiler Options

I only benchmarked the most obvious compilation options, such as Tiered Compilation and ReadyToRun, but there are more options that might be interesting to explore.

Profile-Guided Optimizations (PGO)

This is one of the exciting features that got away this time. Profile-Guided Optimization (PGO) enables the .NET runtime to gather execution information that can be fed back into the compiler (the JIT or ReadyToRun compilation) to produce better machine code. In essence, it's a smart optimizer that looks at real-world data to produce the best possible code.

I don't know how one would instrument a Lambda function to collect the profile data, but if it is possible, it would be very interesting to make it part of a CI/CD pipeline, with steps akin to the following:

Build and deploy an unoptimized version of the Lambda function
Run integration tests against the deployed Lambda function and collect profile data
Re-build the Lambda function with the profile data

Parting Thoughts

While there are stones left unturned on this journey, I hope that some of this work can already be put to good use. I find measuring performance very rewarding, because it can be objectively assessed. I also think it's important, because the faster our code runs, the less harm we do to the environment. Last, but not least, there is also an attractive notion of minimalism, easily captured as: execute only what is needed and nothing more.

If you have any questions, suggestions, or corrections, please leave them in the comments and I will update these posts accordingly.

DEV Community