Siddhant Khare

Querying S3 logs in a flash with DuckDB 🦆

Amazon S3 server access logs are a treasure trove of valuable information, detailing requests made to your S3 buckets. These logs record who accessed what and when, providing crucial insights for monitoring, auditing, and troubleshooting. But when you're dealing with large amounts of log data, sifting through it manually quickly becomes a hassle. That's where DuckDB comes in: a game-changing tool that lets you run SQL queries directly on files, without the need for cumbersome data imports.

What's DuckDB? Let me explain like you're 5

Imagine you have a huge toy box (that's your data files), and you want to find all the blue cars. Instead of dumping everything out and sorting through it piece by piece, DuckDB is like having a magical helper that can peek inside the box and instantly tell you about all the blue cars, where they are, and how many you have. Better yet, this helper can look in multiple toy boxes at once, even if they're in different rooms (or, in our case, different storage systems)!

In grown-up terms: DuckDB is an analytical SQL database that lets you query files in various formats from different sources with incredible speed. Think of it as SQLite's cooler cousin who went to tech school.
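
To make that concrete, here's the kind of one-liner DuckDB is built for. A minimal sketch; the file name is just a placeholder for any local or remote CSV/Parquet file you happen to have:

-- Query a file directly: no import step, no CREATE TABLE, no schema definition
SELECT count(*) FROM 'events.parquet';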

The real-world problem: S3 access logs

Amazon S3 creates detailed server access logs that record every request made to your bucket. These logs are like a meticulous diary of who touched what and when. They get dumped into another S3 bucket under a specified prefix, and they hold crucial information about:

  • Who's accessing your data
  • What they're requesting
  • When they're making requests
  • How they're accessing it
  • Whether the requests succeeded
  • And much more

From log chaos to query zen: the DuckDB chronicles

Step 1: Setting up the foundation

First things first, we need to give DuckDB the power to talk to S3:

INSTALL httpfs;
LOAD httpfs;

CREATE SECRET secret (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

Why this matters:

  • httpfs is your key to reading files over HTTP/S3 (think of it as DuckDB's passport to the cloud)
  • PROVIDER CREDENTIAL_CHAIN lets DuckDB use your AWS credentials seamlessly
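
If the credential chain doesn't fit your setup, DuckDB secrets can also take explicit keys. A minimal sketch, not from the original post; the key values and region are placeholders:

CREATE SECRET s3_explicit (
    TYPE S3,
    KEY_ID 'AKIA...',            -- placeholder access key ID
    SECRET 'your-secret-key',    -- placeholder secret access key
    REGION 'us-east-1'           -- placeholder region
);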

Step 2: The failed attempts (because learning from mistakes is fun)

My first innocent attempt:

SELECT * FROM 's3://path/to/log';

Result? "Table does not exist!" Because life's never that easy.

Then I tried:

SELECT * FROM read_csv('s3://path/to/log');

This treated the log like one giant column - not exactly helpful when you're trying to analyze specific fields!

Step 3: Understanding the log format

S3 access logs look like this:

79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be amzn-s3-demo-bucket1 [06/Mar/2024:00:00:31 +0000] 192.0.2.3 ...

The key realization: These fields are space-delimited, not comma-delimited. This might seem obvious in hindsight, but it's crucial for proper parsing.
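
Before committing to column positions, it helps to peek at how DuckDB actually splits one file; note that the bracketed timestamp alone lands in two columns because of the space before the timezone offset. A quick sketch with a placeholder path:

-- Read a single log file with no type hints and eyeball the first few rows
SELECT *
FROM read_csv('s3://bucketname/path/to/one-log-file',
              delim=' ',
              header=false)
LIMIT 5;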

Step 4: Working solution

Here's the query that actually works:

SELECT column05, COUNT(1)
FROM read_csv('s3://bucketname/path/to/2024-12-07*', 
              delim=' ', 
              header=false, 
              types={'column13': 'VARCHAR'})
GROUP BY column05;

Let's break down why each part matters:

  • delim=' ': Tells DuckDB to split on spaces instead of commas
  • header=false: Prevents the first log line from being misinterpreted as column names
  • types={'column13': 'VARCHAR'}: Handles those pesky - characters that sometimes appear instead of numbers
  • 2024-12-07*: Globbing pattern to read all logs from December 7th
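
Once that works, the next step is making the output readable by aliasing the positional columns. A sketch building on the same query; the positions below are my guess at where the remote IP and operation land after the space split, so verify them against your own files first:

-- Same query, with human-readable names and sorted output
SELECT column04 AS remote_ip,
       column07 AS operation,
       COUNT(*) AS requests
FROM read_csv('s3://bucketname/path/to/2024-12-07*',
              delim=' ',
              header=false,
              types={'column13': 'VARCHAR'})
GROUP BY ALL
ORDER BY requests DESC;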

The devil in the details

The type conversion trap

When you're reading multiple files using the * pattern, you might hit this error:

Column with name: "column13" is expected to have type: BIGINT But has type: VARCHAR

This happens because DuckDB tries to be smart about inferring types, but log formats can be inconsistent. The solution? Explicitly tell DuckDB to treat potentially problematic columns as strings.
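
If you'd rather not play whack-a-mole with individual columns, a blunter alternative (a sketch, not the approach used above) is to read everything as text and cast back only the fields you actually aggregate; TRY_CAST turns the '-' placeholders into NULL instead of erroring out:

-- Read every column as VARCHAR, then cast just the numeric field you need
SELECT SUM(TRY_CAST(column13 AS BIGINT)) AS column13_total
FROM read_csv('s3://bucketname/path/to/2024-12-07*',
              delim=' ',
              header=false,
              all_varchar=true);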

The header conundrum

Always set header=false when dealing with logs. Why? Because log files don't have headers, and you don't want DuckDB treating your first log entry as column names. Trust me, I learned this the hard way.

Why this matters

The beauty of this solution isn't just in its simplicity - it's in its scalability. You can:

  • Process gigabytes of logs without breaking a sweat
  • Query logs directly from S3 without downloading them first
  • Use familiar SQL syntax instead of learning new tools
  • Get results blazingly fast thanks to DuckDB's columnar processing

The bigger picture

What started as a simple need to analyze access patterns turned into a discovery of a powerful tool that could replace various data processing scripts and workflows. DuckDB isn't just for logs - it's becoming my go-to for any task that involves:

  • Quick data analysis
  • File format conversions (sketched after this list)
  • Data aggregation
  • Ad-hoc querying
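
That file-format-conversion bullet is often a one-liner: the same read_csv call can feed a Parquet export. A sketch with placeholder paths:

-- Convert a day's worth of raw logs into a single Parquet file
COPY (
    SELECT *
    FROM read_csv('s3://bucketname/path/to/2024-12-07*',
                  delim=' ',
                  header=false,
                  all_varchar=true)
) TO 'access-logs-2024-12-07.parquet' (FORMAT PARQUET);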

Conclusion

DuckDB turned what could have been a complex ETL process into a simple SQL query. It's like having the power of a data warehouse with the simplicity of SQLite. And the best part? This is just scratching the surface of what it can do.

Now if you'll excuse me, I have some more logs to analyze... and maybe a few data pipelines to simplify. 😉

P.S. - Yes, I might have been looking for an excuse to use DuckDB, but can you blame me after seeing these results?
