DEV Community

Cover image for Tactical Cloud Audit Log Analysis with DuckDB - AWS CloudTrail

Tactical Cloud Audit Log Analysis with DuckDB - AWS CloudTrail

Using DuckDB to query Cloud Provider audit logs when you don't have a SIEM available.

⚠️ Just want the code? Check out my gist here

More than once, I have been in a situation where I needed to query CloudTrail logs but was working in a customer environment where they weren’t aggregated to a search interface. Another similar situation is when CloudTrail data events are disabled for cost reasons but need to be temporarily turned on for troubleshooting/audit purposes. While the CloudTrail console offers some (very) limited lookups (for management events only), and Athena is an option, what about DuckDB?

DuckDB offers both the ability to retrieve directly from S3, as well as parse JSON files into queryable tables. This blog is my documentation of working through that process! This blog assumes you already have DuckDB installed, if not, start here.


Start a DuckDB session. DuckDB can operate either fully in memory or utilize disk space to process datasets larger than your available memory. For Cloudtrail in a single account over a day, in memory should be fine, but we can use persistent storage mode to make sure our tables don't disappear when we exit:

duckdb cloudtrail-analysis
Enter fullscreen mode Exit fullscreen mode

Next, load the AWS extension:

Enter fullscreen mode Exit fullscreen mode

This lets you load AWS credentials from your CLI profiles a bit easier than the default workflow. We can load whatever credentials we have configured in our environment or AWS CLI profile using it:

CALL load_aws_credentials();
Enter fullscreen mode Exit fullscreen mode

Before we go down the SQL Rabbit-hole, lets consider the structure of CloudTrail as it gets exported to S3:

# cloudtrail_file.json

            "eventVersion": "1.09",
            "userIdentity": {
                "type": "IAMUser",
                "principalId": "EXAMPLE6E4XEGITWATV6R",
                "arn": "arn:aws:iam::123456789012:user/Mary_Major",
                "accountId": "123456789012",
Enter fullscreen mode Exit fullscreen mode

Takeaway here is: the file is structured as a single JSON object, with a top level key of Records that is an array containing our CloudTrail entries that we are after. We will need to explode the records out of that array into a table to make them useful.

DuckDB’s read_json function by default will attempt to determine the schema of JSON files, and adapt the column data types accordingly. CloudTrail entries have a few common top level fields but tend to be very dynamic when it comes to specific fields for that event (eg RequestParameters). We can use the maximum_depth parameter on our read_json call to override this functionality.

To avoid redownloading the files from S3 over and over again, we can use the CREATE TABLE … AS statement (aka CTAS in the SQL world) to create a table from our read_json query:

CREATE TABLE ct_raw AS SELECT * FROM read_json('s3://org-cloudtrail-111122223333/AWSLogs/o-123456043/111122223333/CloudTrail/us-east-1/2024/05/19/*.gz', maximum_depth=2);
Enter fullscreen mode Exit fullscreen mode

This gets us a table with a single column: Records with a data type of an array of JSON objects. Next, we can explode the list using unnest to access the individual events:

CREATE TABLE ct AS SELECT unnest(Records) AS Event FROM ct_raw;
Enter fullscreen mode Exit fullscreen mode

The JSON datatype allows us to to access the nested values using dot notation, which looks like this: event.userIdentity.arn. While this can offer us some limited querying, when we want to utilize our columns in the WHERE statement, the JSON datatype isn't ideal. To finish, we can extract the keys we care about into separate columns using json_extract_string:

CREATE TABLE cloudtrail_events AS SELECT   json_extract_string(event, '$.eventVersion') AS eventVersion,
    json_extract_string(event, '$.userIdentity.type') AS userType,
    json_extract_string(event, '$.userIdentity.principalId') AS principalId,
    json_extract_string(event, '$.userIdentity.arn') AS userArn,
    json_extract_string(event, '$.userIdentity.accountId') AS accountId,
    json_extract_string(event, '$.userIdentity.accessKeyId') AS accessKeyId,
    json_extract_string(event, '$.userIdentity.userName') AS userName,
    CAST(json_extract_string(event, '$.eventTime') AS TIMESTAMP) AS eventTime,    
    json_extract_string(event, '$.eventSource') AS eventSource,
    json_extract_string(event, '$.eventName') AS eventName,
    json_extract_string(event, '$.awsRegion') AS awsRegion,
    json_extract_string(event, '$.sourceIPAddress') AS sourceIPAddress,
    json_extract_string(event, '$.userAgent') AS userAgent,
    json_extract_string(event, '$.errorCode') AS errorCode,
    json_extract_string(event, '$.errorMessage') AS errorMessage,
    json_extract(event, '$.requestParameters') AS requestParameters,
    json_extract(event, '$.responseElements') as responseElements,
    json_extract(event, '$.resources') AS resources,
  FROM ct
Enter fullscreen mode Exit fullscreen mode

Query time!

Some sample queries:

All actions taken by a particular IAM Principal:

select eventName, eventTime, userAgent from cloudtrail_events where arn = 'REPLACE_ME';
Enter fullscreen mode Exit fullscreen mode

All the unique error messages:

select distinct errorCode from cloudtrail_events;
Enter fullscreen mode Exit fullscreen mode

Get all events in the past 24 hours:

select * from cloudtrail_events where eventtime >= (now() - INTERVAL '1 day');
Enter fullscreen mode Exit fullscreen mode

Happy querying!

Top comments (2)

jasondunn profile image
Jason Dunn [AWS]

A short, but well-structured article that delivers what's needed. Nice work!

jlgore profile image

Very cool write up!