Reducing cold starts on AWS Lambda with Java runtime - Future ideas about SnapStart, GraalVM and Co

#aws #java #serverless #coldstart

Introduction

In the previous 8 parts of our series about AWS Lambda SnapStart we measured the cold starts of Lambda functions with the Java 11 and 17 runtimes, first without and then with SnapStart enabled, and also with various optimization techniques such as priming applied on top of SnapStart. The cold start times measured with GraalVM Native Image are available for comparison as well. Current measurements reveal that the fastest cold start times can be achieved with GraalVM Native Image, followed by SnapStart with priming (in case you can apply such an optimization for your use case, for example when you’re using DynamoDB as your database of choice), followed by SnapStart without any optimizations. Of course, you will experience the slowest cold start times when using neither GraalVM Native Image nor SnapStart. See the summarized measurements in my previous articles of this series or in one of my presentations like this one.

In this article I’d like to share some thoughts on how AWS can further improve its offering around reducing cold starts on AWS Lambda with Java Runtime and also improve the developer experience.

Thoughts on how AWS can further improve its offering around reducing cold starts on AWS Lambda with Java Runtime

Let’s start with the potential improvements for SnapStart-enabled Lambda functions:

Now that the correct snapshot restore durations have been reported since the end of September 2023, we see that there is big potential to reduce the time required for a snapshot restore. I’m quite sure that AWS is already working on it.
Another potential improvement is in the area of how SnapStart is implemented. The snapshot is taken in the deployment phase, which currently adds a bit more than 2 minutes to the deployment. When I first measured it, it took 2 minutes and 40 seconds, so some improvement has already been achieved there. I’d personally like to have a configurable option to take the Firecracker microVM snapshot on the first Lambda function invocation (instead of during the deployment phase). As long as this operation is in progress, regular cold starts should occur as if SnapStart had been disabled. With that we trade a quicker deployment time (which improves the developer experience) for slower cold starts for about 2 minutes after the first invocation following the Lambda function deployment. In such a scenario, once the Firecracker microVM snapshot is fully created and ready to be restored, SnapStart automatically takes effect.
I’d like to have the same option available for Lambda functions that haven’t been invoked for 14 days. In such a case the microVM snapshot is deleted, which leads to its re-creation during the next invocation. This results in huge cold start times and even timeouts, which currently makes SnapStart not really usable for Lambda functions whose invocation pattern allows them to remain uninvoked for 14 days.
In parts 5 and 6 of this series we introduced an optimization technique called “priming”, based on optional Lambda runtime hooks using the CRaC API, and discussed how it can be applied to already known scenarios. Priming is purely a matter of education: you need to understand how things work behind the scenes in order to know whether it is worth applying and, if so, how. In the end it comes down to nearly everybody writing the same boilerplate code, for example to prime the invocation to DynamoDB. I personally expect that not only AWS (they can for sure provide some help with priming, at least for AWS services) but also the Java community will provide open source frameworks capable of applying priming “out of the box”. Such frameworks would analyze which AWS services, libraries and frameworks we use in our Lambda function and then generate the corresponding priming code in Lambda hooks based on the CRaC API behind the scenes (for example during the compilation phase) for us.
Another improvement has to be achieved around the reliability of the microVM snapshot creation itself. I know that several people have reported errors occasionally occurring there during the deployment phase without any given reason, which leads to a CloudFormation stack rollback. AWS has recently provided some improvements for troubleshooting such cases, which we’ll explore in a future article.
Another interesting potential optimization with SnapStart, not directly related to cold start times, concerns achieving the peak performance of the Lambda function. It is currently considered a best practice for reducing cold start times to use tiered compilation (typically stopped at level 1, i.e. the C1 compiler) for Lambda functions using the Java managed runtime. As Lambda functions are often small and single-purposed, it could be possible to call the function several thousand times (usually around 10,000) during the Firecracker microVM snapshot creation phase, so that C2 compilation kicks in and the function reaches its peak performance for warm executions as well. Of course, there is a huge trade-off here between the gain (performance) on the one side and, on the other side, the costs of executing the Lambda function so many times and the developer experience, as the Firecracker microVM snapshot creation will then take much longer. Once again, snapshot creation during the first Lambda execution, as discussed above, can improve at least the developer experience.
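To make the proposed "snapshot on first invocation" behaviour more tangible, here is a minimal sketch of how the start path could be chosen. This is purely a thought model and not an AWS API: the class `StartPathSketch`, its enums and the method names are hypothetical, and the 14-day retention window is simply taken from the text above.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of the "take the snapshot on first invocation" idea.
// None of these names exist in any AWS SDK or in the Lambda service API.
final class StartPathSketch {

    enum SnapshotState { NOT_CREATED, CREATING, READY }

    enum StartPath { REGULAR_COLD_START, SNAPSTART_RESTORE }

    // Assumption: snapshot is dropped after 14 days without invocation.
    static final Duration RETENTION = Duration.ofDays(14);

    private SnapshotState state = SnapshotState.NOT_CREATED;
    private Instant lastInvocation;

    // Decides how a new execution environment is started at time "now".
    StartPath onColdStart(Instant now) {
        boolean expired = state == SnapshotState.READY
                && lastInvocation != null
                && Duration.between(lastInvocation, now).compareTo(RETENTION) > 0;
        if (expired) {
            state = SnapshotState.NOT_CREATED; // snapshot was deleted
        }
        lastInvocation = now;

        switch (state) {
            case READY:
                return StartPath.SNAPSTART_RESTORE;
            case NOT_CREATED:
                // Snapshot creation would be kicked off in the background here.
                state = SnapshotState.CREATING;
                return StartPath.REGULAR_COLD_START;
            default:
                // Snapshot still being created: behave as if SnapStart were disabled.
                return StartPath.REGULAR_COLD_START;
        }
    }

    // Called once the (background) snapshot creation has finished.
    void snapshotCompleted() {
        if (state == SnapshotState.CREATING) {
            state = SnapshotState.READY;
        }
    }

    public static void main(String[] args) {
        StartPathSketch function = new StartPathSketch();
        Instant deployedAt = Instant.now();
        System.out.println(function.onColdStart(deployedAt));
        function.snapshotCompleted();
        System.out.println(function.onColdStart(deployedAt.plusSeconds(180)));
    }
}
```

The point of the sketch is that the function is always invocable: the deployment finishes quickly, and the slower regular cold start only applies while no usable snapshot exists, both right after deployment and after the snapshot has expired.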

Now let’s look at GraalVM Native Image and a potential AWS offering around it:

I personally still think that providing a Lambda managed GraalVM (Native Image) runtime would give developers more options and benefit everybody. Delivering GraalVM Native Image through the Lambda Custom Runtime or a Lambda Container Image shifts many operational responsibilities, like scaling CI/CD pipelines, to the developers, but gives them the freedom to use the runtime version they’d like, including the newest ones. Maybe the new release of the Amazon Linux 2023 runtime for AWS Lambda can become the base for it.
I’d also suggest that, in case such an offering becomes available in Lambda in the future, AWS should give developers an option similar to the one discussed above for SnapStart: to create the Native Image (which currently takes between 20 seconds and several minutes) either in the Lambda deployment phase or during the first Lambda invocation. In the latter case, Lambda would fall back to pure GraalVM with Just-in-Time compilation or even to the Amazon Corretto Java runtime as long as the native image hasn’t been fully created.
Ahead-of-Time compilation has its own set of challenges, and developers fear runtime errors, especially as we don’t control all dependencies and whether they are GraalVM-ready. Maybe for such a case developers could provide a set of tests which would be executed by AWS on their behalf in a sandboxed Lambda GraalVM Native Image execution environment, and only if they pass would the environment become active. Otherwise Lambda would once again fall back to pure GraalVM with Just-in-Time compilation or even to the Amazon Corretto Java runtime and notify the developers about the errors during the test execution. Of course this approach is tricky to implement and may not be required at all, as GraalVM publishes more and more GraalVM Native Image ready dependencies.
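The test-gated activation described above could look roughly like the following sketch. It only illustrates the decision logic under my own assumptions: `NativeImageGateSketch`, `ActiveRuntime` and `Decision` are hypothetical names, the developer-provided tests are modelled as simple named boolean checks, and nothing here corresponds to an existing Lambda feature.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a test gate in front of a native image runtime.
// All names are made up for illustration; this is not an AWS API.
final class NativeImageGateSketch {

    enum ActiveRuntime { NATIVE_IMAGE, JIT_FALLBACK }

    static final class Decision {
        final ActiveRuntime runtime;
        final List<String> failedTests; // reported back to the developer

        Decision(ActiveRuntime runtime, List<String> failedTests) {
            this.runtime = runtime;
            this.failedTests = Collections.unmodifiableList(failedTests);
        }
    }

    // Runs every developer-provided test against the native image sandbox.
    // The native image only becomes active if all of them pass.
    static Decision evaluate(Map<String, BooleanSupplier> smokeTests) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> test : smokeTests.entrySet()) {
            boolean passed;
            try {
                passed = test.getValue().getAsBoolean();
            } catch (RuntimeException e) {
                // A test that throws a runtime exception counts as failed.
                passed = false;
            }
            if (!passed) {
                failed.add(test.getKey());
            }
        }
        ActiveRuntime runtime = failed.isEmpty()
                ? ActiveRuntime.NATIVE_IMAGE
                : ActiveRuntime.JIT_FALLBACK;
        return new Decision(runtime, failed);
    }

    public static void main(String[] args) {
        Map<String, BooleanSupplier> tests = new LinkedHashMap<>();
        tests.put("serializes-response", () -> true);
        tests.put("loads-driver-reflectively", () -> {
            throw new IllegalStateException("class not reachable in native image");
        });
        Decision decision = evaluate(tests);
        System.out.println(decision.runtime + " failed=" + decision.failedTests);
    }
}
```

In this model the function never ends up in a broken state: a failing test suite simply keeps the JIT-based runtime active while the list of failed tests is handed back to the developer.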

Conclusion

In this article we looked at possible improvements to SnapStart, the existing AWS offering for reducing the cold start times of Lambda functions using the Java managed runtime, and also looked at how a managed GraalVM (Native Image) Lambda runtime could benefit developers. The Java runtime itself is also subject to improvement for the serverless use case. I encourage you to look at Project Leyden, whose primary goal is to improve the startup time, time to peak performance, and footprint of Java programs. I also encourage you to watch the video about the state of Project Leyden as of October 2023, presented by the Java architects at Oracle.