AWS Lambda SnapStart - Part 7 Re-measuring of Java 11 Lambda cold starts

#java #aws #serverless #coldstart

What happended in the previous parts

In the 1st part of my article series we measured the cold start time of the Lambda function using Corretto Java 11 and 1024 MB memory execution environment with SnapStart enabled. First we used the pure Java and AWS SDK for Java (version 2) without other frameworks like Micronaut, Quarkus and SpringBoot. After that we introduced optimization technique called priming described in the 5th part of my article series and measured the cold start times once again. Here are the results of those measurements:

Framework	p50	p90	p99
cold start time w/o Priming	1266.05	1306.85	1326.81
cold start time with Priming	352.45	401.43	433.76

Even if the results looked promissing, especially in case of priming optimization which could significally further reduce the cold start times, measuring the API end-to-end latency showed the quite significant overall latency which I couldn't completely explain by adding up the cold start (which mainly consisted of snapshot restore duration time for the Lambda function with SnapStart enabled), function execution (warm start) time and API Gateway latency.

For example "Restore Duration" of the snapshot displayed in the Lambda CloudWatch Log took (only) between 30 and 280 ms (see one of the CloudWatch Log examples in the image below).

The function execution time (warm start time) was between 7 and 50 ms. The API Gateway (integration) latency was quite small (only several milliseconds). But overall API Gateway end-to-end latency looked like this (see also the 5th article):

Framework	p50	p90	p99
API Gateway end2end latency w/o Priming	1750	1957	1973
API Gateway end2end latency with Priming	877	1090	1098

This revealed that I missed roughly between 300 and 600ms somewhere in the invocation chain which I couldn't clearly link to some specific action. I had the assumption (but it wasn't more than assumption), that "Restore Duration displayed in the CloudWatch Log of the Lambda function was a bit too short taking into the account the complexity of the Firecracker microVM snapshot restore procedure (including the whole JVM).

Re-Measuring the cold start times

From time to time I looked into the CloudWatch metrics to see if I could identify some changes and indeed several days (starting from September 20) ago I noticed that the "Restore Duration" significantly increased, but the API Gateway end to end latency remained the same. Let's take another look at it. Let's use the same project for our test with the updated dependencies (for example I use AWS SDK for Java version "2.20.156") and re-do some tests. This is one of the examples of how the "Restore Duration" time displayed now (the results varies of course, I saw the values between 600 and 1000 ms) :

Now with the same CloudWatch Logs Insights Query for both our Lambda function (with Java 11 execution environment and 1024 MB memory) both with SnapStart enabled: GetProductByIdWithPureLambda (without priming) and GetProductByIdWithPureLambdaAndPriming (with priming)



filter @type = "REPORT"
  | parse @message /Restore Duration: (?<restoreDuration>.*?) ms/
  | stats
count(*) as invocations,
pct(@duration+coalesce(@initDuration,0)+coalesce(restoreDuration,0), 50) as p50,
pct(@duration+coalesce(@initDuration,0)+coalesce(restoreDuration,0), 90) as p90,
pct(@duration+coalesce(@initDuration,0)+coalesce(restoreDuration,0), 99) as p99
group by function, (ispresent(@initDuration) or ispresent(restoreDuration)) as coldstart
  | sort by coldstart desc

for approximately 100 cold starts we got the following results:

Framework	p50	p90	p99
cold start time w/o Priming	1656.83	1837.64	2026.72
cold start time with Priming	765.21	902.16	1013.86

Conclusion

Now the measurements of cold start times (and overall API Gateway end-to-end latency) make complete sense and I assume that AWS made a fix to display the "Restore Duration" and thefore the entire cold start time correctly. Nethertheless these are still quite big cold start times (especially if you can't prime some invocations to reduce them), comparing to other Lambda runtimes like Node.js and Python. Obviously GraalVM Native Image shipped with Lambda Custom Runtime can provide much better results (see Results from GraalVM Native images running in custom runtime) but with different set of trade-offs. Of course in case of using SnapSart, the snapshot restore times can be improved in the future, so I hope that we can further reduce the cold start times with Java by several hundred additional milliseconds.