Davide de Paolis

Posted on Oct 22, 2021

Find and Count files on AWS S3 by size

#aws #shell #productivity #bestpractices

TL;DR

If you want to search an AWS S3 Bucket for all the files that are empty, take advantage of the aws s3api CLI and JMESPath queries for a simple and effective one-liner:

aws s3api list-objects --bucket BUCKET_NAME --output json \
  --query 'Contents[?Size==`0`]' > bucket_content-empty.json

While working on a legacy project recently, I realised that an AWS S3 Bucket that had been used for years as a destination for files generated by different ETLs (extract, transform, load scripts) was in very bad shape.

Literally tons of files, some of them very old, and some of them even empty.

I wanted to understand what we were keeping in that Bucket, and how and when it was used, in order to do some clean up, optimise storage costs and simplify access to those files. Navigating the files via the web console, however, was really a pain.

How can I get a list of all the files in the bucket and immediately know which ones are empty?

After some reading in the AWS docs and on Stack Overflow, here are some approaches.

Recursively list files in the bucket

aws s3 ls s3://BUCKET_NAME  --recursive --summarize

This was the first solution I tried, a classic. In my case it was rather slow, but it was enough to get a list of all the files.
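Even this plain listing can give a rough count of empty files if you post-process it. The following is only a sketch: count_empty_from_ls is a hypothetical helper name, and it assumes each object line has the columns date, time, size in bytes and key.

```shell
# Hypothetical helper: reads `aws s3 ls --recursive` output on stdin and
# reports how many of the listed objects have a size of 0 bytes.
# Assumes the columns are: date, time, size in bytes, key.
count_empty_from_ls() {
  awk '
    # Only object lines start with a date; this skips the --summarize totals.
    $1 ~ /^[0-9]+-[0-9]+-[0-9]+$/ {
      total++
      if ($3 == 0) empty++
    }
    END { printf "%d empty of %d\n", empty, total }
  '
}

# Usage (BUCKET_NAME is a placeholder):
#   aws s3 ls s3://BUCKET_NAME --recursive --summarize | count_empty_from_ls
```

It depends on the column layout of the listing, so treat the result as a quick estimate rather than a robust report.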

Unfortunately, the possibilities for filtering and searching were limited. Then I discovered that alongside s3 another CLI command exists: s3api, which thanks to JSON output and JMESPath queries grants more flexibility and granularity.

For more information, check this article: Leveraging S3 and S3API Commands

Enter S3API

aws s3api list-objects --bucket BUCKET_NAME \
  --output json --query "[length(Contents[])]"

Counts all files in the bucket, while

aws s3api list-objects --bucket BUCKET_NAME \
  --output json --query 'Contents[].{Key: Key, Size: Size}'

nicely gives you a JSON list with the Key (name) and Size of each file.

I was happy with that, and from there I moved to my own territory: jq!

I added to the command above a redirect to save the result to a file ( > bucket_content.json ) and then started working with jq filters (I found the Cookbook very useful!).

jq -c '[.[] | select(.Size != 0)]' bucket_content.json > bucket_content-not-empty.json

Now extract just the names and count them:

jq -r '.[] | .Key' bucket_content-not-empty.json | wc -l

Improve the solution

Once I was done with jq, though, I was very curious about the query language used by s3api, and a comment on Stack Overflow blew my mind: JMESPath queries are very powerful (and very similar to jq, by the way).

Here is the command that returns all the files that are not empty, without piping multiple times and without jq:

aws s3api list-objects --bucket BUCKET_NAME --output json \
  --query 'Contents[?Size!=`0`]' > bucket_content-not-empty.json
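If you only need the number and not the list, the same filter can be wrapped in JMESPath's length() function. A minimal sketch, where count_empty_objects is a hypothetical helper name and the bucket is assumed to contain at least one object:

```shell
# Hypothetical helper: prints how many objects in the given bucket are empty.
# $1 = bucket name.
# Caveat: on a bucket with no objects the response has no Contents key,
# so length() receives null and the query is expected to fail.
count_empty_objects() {
  aws s3api list-objects --bucket "$1" --output json \
    --query 'length(Contents[?Size==`0`])'
}

# Usage (BUCKET_NAME is a placeholder):
#   count_empty_objects BUCKET_NAME
```

Swapping == for != in the query would count the files that do have content instead.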

Where to go from here

What I discovered was satisfying and unpleasant at the same time: out of hundreds of thousands of files in the Bucket, only a few thousand had actual content.

That opens up huge room for improvement and clean up.

Remove all empty files.
Where possible, change the ETLs to avoid saving a file to S3 if, after Extract and Transform, we realize there is no data.
Alternatively, add an expiration policy, or rules/Lambdas that check for and clean up empty files regularly.
Move very old files that are no longer used or accessed, but for which we still need a backup, to less expensive storage such as Amazon S3 Glacier. But this is a topic for a different post.
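As a starting point for the first item, here is a cautious sketch that only simulates the removal. remove_keys_dry_run is a hypothetical helper name; it assumes one object key per line on stdin and relies on the --dryrun flag of aws s3 rm, so nothing is actually deleted.

```shell
# Hypothetical helper: reads object keys from stdin (one per line) and asks
# the CLI to show what would be removed, without deleting anything.
# $1 = bucket name.
remove_keys_dry_run() {
  bucket="$1"
  while IFS= read -r key; do
    # Skip blank lines so we never build a path that is only the bucket.
    [ -n "$key" ] || continue
    aws s3 rm "s3://$bucket/$key" --dryrun
  done
}

# Usage (BUCKET_NAME is a placeholder; jq turns the JSON array into lines):
#   aws s3api list-objects --bucket BUCKET_NAME --output json \
#     --query 'Contents[?Size==`0`].Key' | jq -r '.[]' \
#     | remove_keys_dry_run BUCKET_NAME
```

Review the dry-run output carefully before dropping --dryrun, since an empty file may still be a marker that some other job relies on.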

Hope this helped
