Shuichi

Posted on Jun 25, 2023 • Edited on Jul 20, 2023

Visualize CloudWatch Logs ingestion lag with Logs Insights

#aws #cloudwatchlogs #cloudwatchlogsinsights

I have visualized the lag time it takes to ingest logs to CloudWatch Logs and compared it by source service!

0. "timestamp" and "ingestionTime"

CloudWatch Logs event includes the following two times.

(i) timestamp = The time the event occurred
(ii) ingestionTime =The time the event was ingested into CloudWatch Logs

There is a lag between "timestamp" and "ingestionTime".

The "timestamp" (=event occurrence time) is passed by the parameter InputLogEvent of PutLogEvents.
This is something we're rarely aware of if we're using an integrated service or CloudWatch agent

1. Visualize CloudWatch Logs ingestion lag

We can get both from the log data in Logs Insights, so we will aggregate them using the stats command.

1-1. Query the "timestamp" and "ingestionTime"

The initial query of Logs Insights only displays timestamp, but let's display ingestionTime as well

fields @timestamp, @ingestionTime, @message, @logStream, @log
| sort @timestamp desc

Result

You can see that there is a time lag of several seconds.

1-2. Query the time difference(=lag)

fields @timestamp, @ingestionTime, toMillis(@ingestionTime) - toMillis(@timestamp) as diff
| sort @timestamp desc

Result

toMillis(@ingestionTime) - toMillis(@timestamp) as diff

"toMillis" function converts the timestamp found in the named field into a number representing the milliseconds.

1-3. Visualize the time difference(=lag) with stats

fields @timestamp, @ingestionTime, toMillis(@ingestionTime) - toMillis(@timestamp) as diff
| sort @timestamp desc
| stats max(diff), pct(diff,50), avg(diff) by bin(30min)

The maximum, median (50th percentile), and mean of the difference are output at 30-minute intervals. The period is specified as 2 weeks.
The maximum number of log fields that Logs Insights can output is 1,000, so the interval and period must be adjusted accordingly.

Results

2 Comparison of lag by source service

Let's run a visualization query to see if the time difference differs depending on the log source!

2-1. AWS Lambda

AWS Lambda is integrated with CloudWatch and automatically outputs logs with appropriate permissions, so we query the group. (/aws/lambda/{function name})

The median time is about less than 9 seconds.

2-2. EC2 + CloudWatch Agent

Query the log group that EC2 is outputting via CloudWatch Agent.

The median is a little over 5 seconds, but the maximum is around 16 seconds.
It may depend on the contents of the agent configuration file (per force_flush_interval?)

2-3. API Gateway

Query the log group that is output when log output is enabled on API Gateway.
(API-Gateway-Execution-Logs_{stage ID})

The median time is over 20 seconds, the longest among those compared.

Conclusion

There were differences in lagging tendencies among the services. It would be fun to try to guess the cause.

I used Logs Insights to visualize the data, and found it easy and useful.

This post is an English rewrite of an post I wrote in Japanese.

DEV Community

Visualize CloudWatch Logs ingestion lag with Logs Insights

0. "timestamp" and "ingestionTime"

1. Visualize CloudWatch Logs ingestion lag

1-1. Query the "timestamp" and "ingestionTime"

Result

1-2. Query the time difference(=lag)

Result

1-3. Visualize the time difference(=lag) with stats

Results

2 Comparison of lag by source service

2-1. AWS Lambda

2-2. EC2 + CloudWatch Agent

2-3. API Gateway

Conclusion

Top comments (0)

Read next

AWS CloudWatch Logging and Live Tail!

Amazon DevOps Guru for the Serverless applications - Part 13 Anomaly detection on Aurora Serverless v2 with Data API (kind of)

Types of EC2 Instances: How to Choose the Right One for Your Needs

Basic Understanding of AWS Cloud Adoption Framework-(CAF)