DEV Community

Shuichi
Shuichi

Posted on • Edited on

Visualize CloudWatch Logs ingestion lag with Logs Insights

I have visualized the lag time it takes to ingest logs to CloudWatch Logs and compared it by source service!

0. "timestamp" and "ingestionTime"

CloudWatch Logs event includes the following two times.

  • (i) timestamp = The time the event occurred
  • (ii) ingestionTime =The time the event was ingested into CloudWatch Logs

There is a lag between "timestamp" and "ingestionTime".

sequence

The "timestamp" (=event occurrence time) is passed by the parameter InputLogEvent of PutLogEvents.
This is something we're rarely aware of if we're using an integrated service or CloudWatch agent

1. Visualize CloudWatch Logs ingestion lag

We can get both from the log data in Logs Insights, so we will aggregate them using the stats command.

1-1. Query the "timestamp" and "ingestionTime"

The initial query of Logs Insights only displays timestamp, but let's display ingestionTime as well

fields @timestamp, @ingestionTime, @message, @logStream, @log
| sort @timestamp desc
Enter fullscreen mode Exit fullscreen mode

Result

logs

You can see that there is a time lag of several seconds.

1-2. Query the time difference(=lag)

fields @timestamp, @ingestionTime, toMillis(@ingestionTime) - toMillis(@timestamp) as diff
| sort @timestamp desc
Enter fullscreen mode Exit fullscreen mode

Result

logs

toMillis(@ingestionTime) - toMillis(@timestamp) as diff

"toMillis" function converts the timestamp found in the named field into a number representing the milliseconds.

1-3. Visualize the time difference(=lag) with stats

fields @timestamp, @ingestionTime, toMillis(@ingestionTime) - toMillis(@timestamp) as diff
| sort @timestamp desc
| stats max(diff), pct(diff,50), avg(diff) by bin(30min)
Enter fullscreen mode Exit fullscreen mode

The maximum, median (50th percentile), and mean of the difference are output at 30-minute intervals. The period is specified as 2 weeks.
The maximum number of log fields that Logs Insights can output is 1,000, so the interval and period must be adjusted accordingly.

Results

visualization

2 Comparison of lag by source service

Let's run a visualization query to see if the time difference differs depending on the log source!

2-1. AWS Lambda

AWS Lambda is integrated with CloudWatch and automatically outputs logs with appropriate permissions, so we query the group. (/aws/lambda/{function name})

aws lambda
The median time is about less than 9 seconds.

2-2. EC2 + CloudWatch Agent

Query the log group that EC2 is outputting via CloudWatch Agent.

ec2

The median is a little over 5 seconds, but the maximum is around 16 seconds.
It may depend on the contents of the agent configuration file (per force_flush_interval?)

2-3. API Gateway

Query the log group that is output when log output is enabled on API Gateway.
(API-Gateway-Execution-Logs_{stage ID})

api gateway

The median time is over 20 seconds, the longest among those compared.

Conclusion

There were differences in lagging tendencies among the services. It would be fun to try to guess the cause.

I used Logs Insights to visualize the data, and found it easy and useful.

This post is an English rewrite of an post I wrote in Japanese.

Top comments (0)