I visualized the lag it takes for logs to be ingested into CloudWatch Logs and compared it by source service!
0. "timestamp" and "ingestionTime"
A CloudWatch Logs event includes the following two times.
- (i) timestamp = The time the event occurred
- (ii) ingestionTime = The time the event was ingested into CloudWatch Logs
There is a lag between "timestamp" and "ingestionTime".
The "timestamp" (=event occurrence time) is passed by the parameter InputLogEvent of PutLogEvents.
This is something we're rarely aware of if we're using an integrated service or CloudWatch agent
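For reference, here is a minimal sketch of calling PutLogEvents yourself with boto3 and setting the timestamp explicitly. The log group and stream names are placeholders and must already exist.

import time
import boto3

logs = boto3.client("logs")

# "timestamp" is supplied by the caller as epoch milliseconds;
# "ingestionTime" is recorded by CloudWatch Logs when the event is received.
logs.put_log_events(
    logGroupName="/my/sample/group",   # placeholder: an existing log group
    logStreamName="my-sample-stream",  # placeholder: an existing log stream
    logEvents=[
        {
            "timestamp": int(time.time() * 1000),  # event occurrence time (ms)
            "message": "hello from put_log_events",
        }
    ],
)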
1. Visualize CloudWatch Logs ingestion lag
We can get both from the log data in Logs Insights, so we will aggregate them using the stats command.
1-1. Query the "timestamp" and "ingestionTime"
The default query in Logs Insights only displays @timestamp, so let's display @ingestionTime as well.
fields @timestamp, @ingestionTime, @message, @logStream, @log
| sort @timestamp desc
Result
You can see that there is a time lag of several seconds.
1-2. Query the time difference (= lag)
fields @timestamp, @ingestionTime, toMillis(@ingestionTime) - toMillis(@timestamp) as diff
| sort @timestamp desc
Result
toMillis(@ingestionTime) - toMillis(@timestamp) as diff
"toMillis" function converts the timestamp found in the named field into a number representing the milliseconds.
1-3. Visualize the time difference (= lag) with stats
fields @timestamp, @ingestionTime, toMillis(@ingestionTime) - toMillis(@timestamp) as diff
| sort @timestamp desc
| stats max(diff), pct(diff,50), avg(diff) by bin(30min)
The maximum, median (50th percentile), and mean of the difference are output at 30-minute intervals. The period is specified as 2 weeks.
The maximum number of result rows that Logs Insights can return is 1,000, so the bin interval and query period must be adjusted so the number of bins fits within that limit.
Results
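As a side note, the same aggregation can also be run programmatically via StartQuery / GetQueryResults. A minimal boto3 sketch, assuming a placeholder log group name and the last 2 weeks as the period:

import time
import boto3

logs = boto3.client("logs")

QUERY = """fields @timestamp, @ingestionTime, toMillis(@ingestionTime) - toMillis(@timestamp) as diff
| stats max(diff), pct(diff, 50), avg(diff) by bin(30min)"""

end = int(time.time())
start = end - 14 * 24 * 60 * 60  # the last 2 weeks, in epoch seconds

query_id = logs.start_query(
    logGroupName="/my/sample/group",  # placeholder log group
    startTime=start,
    endTime=end,
    queryString=QUERY,
)["queryId"]

# Poll until the query finishes, then print one row per 30-minute bin.
while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] not in ("Scheduled", "Running"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})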
2. Comparison of lag by source service
Let's run the visualization query against different log groups to see whether the lag differs depending on the log source!
2-1. AWS Lambda
AWS Lambda is integrated with CloudWatch Logs and automatically outputs logs when given the appropriate permissions, so we query its log group. (/aws/lambda/{function name})
The median is a little under 9 seconds.
2-2. EC2 + CloudWatch Agent
Query a log group that an EC2 instance writes to via the CloudWatch agent.
The median is a little over 5 seconds, but the maximum is around 16 seconds.
This may depend on the contents of the agent configuration file (the force_flush_interval setting, perhaps?).
2-3. API Gateway
Query the log group that API Gateway writes to when execution logging is enabled on a stage.
(API-Gateway-Execution-Logs_{stage ID})
The median time is over 20 seconds, the longest among those compared.
Conclusion
There were differences in lag tendencies among the services. It would be fun to try to guess the causes.
I used Logs Insights to visualize the data, and found it easy and useful.
This post is an English rewrite of a post I originally wrote in Japanese.