In the 1st part of my article series we measured the cold start time of the Lambda function using the Corretto Java 11 runtime with a 1024 MB memory execution environment and SnapStart enabled. First we used pure Java and the AWS SDK for Java (version 2) without additional frameworks like Micronaut, Quarkus and Spring Boot. After that we introduced an optimization technique called priming, described in the 5th part of my article series, and measured the cold start times once again. Here are the results of those measurements:
| | p50 (ms) | p90 (ms) | p99 (ms) |
|---|---|---|---|
| cold start time w/o Priming | 1266.05 | 1306.85 | 1326.81 |
| cold start time with Priming | 352.45 | 401.43 | 433.76 |
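The priming technique mentioned above relies on the CRaC runtime hooks that SnapStart exposes. The sketch below is a minimal, self-contained illustration of the pattern; the `Resource` interface is a simplified stand-in for `org.crac.Resource` (the real `beforeCheckpoint`/`afterRestore` methods take a `Context` parameter and may throw `Exception`), and `PrimedHandler`/`primeInvocation` are hypothetical names, not from the article's project.

```java
// Minimal sketch of the priming pattern used with Lambda SnapStart.
// NOTE: this Resource interface is a stand-in for org.crac.Resource,
// simplified so the example compiles without the org.crac dependency.
interface Resource {
    void beforeCheckpoint();
    void afterRestore();
}

public class PrimedHandler implements Resource {

    boolean primed = false;

    public PrimedHandler() {
        // With the real API you would call Core.getGlobalContext().register(this)
        // so the Lambda runtime invokes the hooks around snapshot creation.
    }

    // Runs just before the Firecracker microVM snapshot is taken: execute the
    // expensive code path once, so class loading and JIT compilation happen
    // before the snapshot instead of during the first real request.
    @Override
    public void beforeCheckpoint() {
        primeInvocation();
        primed = true;
    }

    // Runs after the snapshot is restored: re-create anything invalidated by
    // snapshotting, e.g. network connections or expired credentials.
    @Override
    public void afterRestore() {
    }

    // Hypothetical warm-up call, e.g. a throwaway DynamoDB GetItem request
    // that initializes the SDK's request and (un)marshalling machinery.
    private void primeInvocation() {
        System.out.println("priming: warm-up invocation executed");
    }

    public static void main(String[] args) {
        new PrimedHandler().beforeCheckpoint();
    }
}
```

The design idea is simply to move one-time initialization cost from the first invocation into the snapshot itself, which is why the primed variant shows much lower cold start times in the table above.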
Even if the results looked promising, especially in the case of the priming optimization, which reduced the cold start times significantly further, measuring the API end-to-end latency revealed quite a significant overall latency which I couldn't completely explain by adding up the cold start time (which, with SnapStart enabled, mainly consists of the snapshot restore duration), the function execution (warm start) time and the API Gateway latency.
For example, the "Restore Duration" of the snapshot displayed in the Lambda CloudWatch Log took only between 30 and 280 ms (see one of the CloudWatch Log examples in the image below).
The function execution time (warm start time) was between 7 and 50 ms, and the API Gateway (integration) latency was quite small (only several milliseconds). But the overall API Gateway end-to-end latency looked like this (see also the 5th article):
| | p50 (ms) | p90 (ms) | p99 (ms) |
|---|---|---|---|
| API Gateway end-to-end latency w/o Priming | 1750 | 1957 | 1973 |
| API Gateway end-to-end latency with Priming | 877 | 1090 | 1098 |
This revealed that I was missing roughly between 300 and 600 ms somewhere in the invocation chain which I couldn't clearly attribute to any specific action. My assumption (but it was nothing more than an assumption) was that the "Restore Duration" displayed in the CloudWatch Log of the Lambda function was a bit too short, taking into account the complexity of the Firecracker microVM snapshot restore procedure (which includes restoring the whole JVM).
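To make the gap concrete, here is a quick back-of-the-envelope check using the p50 values measured above (the class and method names are just for illustration):

```java
public class LatencyGap {

    // Unaccounted latency: end-to-end latency minus the cold start time the
    // logs explain (the API Gateway integration latency of a few ms is ignored).
    static double gap(double endToEndMs, double coldStartMs) {
        return endToEndMs - coldStartMs;
    }

    public static void main(String[] args) {
        // p50 values from the measurements above (ms)
        System.out.printf("gap w/o priming:  %.2f ms%n", gap(1750.0, 1266.05)); // ~484 ms
        System.out.printf("gap with priming: %.2f ms%n", gap(877.0, 352.45));   // ~525 ms
    }
}
```

Both gaps fall into the 300 to 600 ms range that could not be attributed to any logged phase of the invocation.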
From time to time I looked into the CloudWatch metrics to see if I could identify any changes, and indeed, several days ago (starting from September 20) I noticed that the "Restore Duration" had increased significantly while the API Gateway end-to-end latency remained the same. Let's take another look at it. Let's use the same project for our test with updated dependencies (for example, I use AWS SDK for Java version "2.20.156") and re-do some tests. This is one example of how the "Restore Duration" time is displayed now (the results vary, of course; I saw values between 600 and 1000 ms):
Now let's run the same CloudWatch Logs Insights query for both of our Lambda functions (with the Java 11 execution environment and 1024 MB memory), both with SnapStart enabled: GetProductByIdWithPureLambda (without priming) and GetProductByIdWithPureLambdaAndPriming (with priming):
```
filter @type = "REPORT"
| parse @message /Restore Duration: (?<restoreDuration>.*?) ms/
| stats count(*) as invocations,
        pct(@duration + coalesce(@initDuration, 0) + coalesce(restoreDuration, 0), 50) as p50,
        pct(@duration + coalesce(@initDuration, 0) + coalesce(restoreDuration, 0), 90) as p90,
        pct(@duration + coalesce(@initDuration, 0) + coalesce(restoreDuration, 0), 99) as p99
  group by function, (ispresent(@initDuration) or ispresent(restoreDuration)) as coldstart
| sort by coldstart desc
```
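As a side note, the `parse` step in the query above can be reproduced locally with the same regular expression. The sketch below shows this in plain Java; the sample REPORT line is illustrative (the values are made up, not taken from a real log):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RestoreDurationParser {

    // The same pattern the Logs Insights query uses in its parse step.
    private static final Pattern RESTORE =
            Pattern.compile("Restore Duration: (?<restoreDuration>.*?) ms");

    // Returns the extracted restore duration in ms, or 0.0 if the line has
    // none (e.g. a warm invocation), mirroring coalesce(restoreDuration, 0).
    public static double parseRestoreDuration(String reportLine) {
        Matcher m = RESTORE.matcher(reportLine);
        return m.find() ? Double.parseDouble(m.group("restoreDuration")) : 0.0;
    }

    public static void main(String[] args) {
        // Illustrative REPORT line with made-up values
        String line = "REPORT RequestId: 1234 Duration: 12.34 ms Billed Duration: 13 ms "
                + "Memory Size: 1024 MB Max Memory Used: 150 MB Restore Duration: 680.52 ms";
        System.out.println(parseRestoreDuration(line)); // prints 680.52
    }
}
```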
For approximately 100 cold starts we got the following results:
| | p50 (ms) | p90 (ms) | p99 (ms) |
|---|---|---|---|
| cold start time w/o Priming | 1656.83 | 1837.64 | 2026.72 |
| cold start time with Priming | 765.21 | 902.16 | 1013.86 |
Now the measurements of the cold start times (and the overall API Gateway end-to-end latency) make complete sense, and I assume that AWS shipped a fix to display the "Restore Duration", and therefore the entire cold start time, correctly. Nevertheless, these are still quite big cold start times (especially if you can't prime some invocations to reduce them) compared to other Lambda runtimes like Node.js and Python. Obviously GraalVM Native Image shipped with a Lambda Custom Runtime can provide much better results (see Results from GraalVM Native images running in custom runtime), but with a different set of trade-offs. Of course, in the case of using SnapStart, the snapshot restore times can be improved in the future, so I hope that we can further reduce the cold start times with Java by several hundred additional milliseconds.