Originally published at cloudforecast.io/blog
Following up on my last blog post (Using Parquet on Athena to Save Money on AWS), I wanted to share another thought about Athena, specifically how the S3 bucket is being used by Athena to store query results.
On the first use of Athena, AWS will automatically create a new bucket to store the query results (bucket name aws-athena-query-results-<ACCOUNTID>-<REGION>). Athena will store a raw result file (QueryId.csv) and a metadata file (QueryId.csv.metadata) for every query. By storing the data under the QueryId, Athena lets you access a previous query's results without re-running it (saving you money, since you don't need to rescan the data).
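As a rough sketch of where those cached results live, the object key is just the query ID plus .csv inside the default results bucket. The account ID, region, and query ID below are hypothetical, for illustration only (in practice you would get the real result location from the Athena API, e.g. boto3's get_query_execution):

```python
def athena_result_key(query_id: str) -> str:
    """Object key under which Athena stores a query's raw CSV result."""
    return f"{query_id}.csv"


def athena_result_uri(account_id: str, region: str, query_id: str) -> str:
    """Full S3 URI of a cached Athena result, following the default
    aws-athena-query-results-<ACCOUNTID>-<REGION> bucket naming."""
    bucket = f"aws-athena-query-results-{account_id}-{region}"
    return f"s3://{bucket}/{athena_result_key(query_id)}"


# Hypothetical values, for illustration only:
print(athena_result_uri("123456789012", "us-east-1",
                        "5a1b2c3d-1111-2222-3333-444455556666"))
```

Because the key is deterministic per query execution, fetching that object again is a plain S3 GET, with no Athena scan charge.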
However, you are the owner of this bucket and therefore responsible for its storage costs, and here are a couple of reasons why it could cost you a LOT of money:
#1 All the queries are being stored! ALL OF THEM!
Athena stores every query's results in the bucket. Query data will just accumulate forever, costing more and more $$$.
#2 Your data may be compressed but the results are not
S3 stores the results as raw CSV. Your source data may be compressed (GZIP, Snappy, ...), but the results will be raw CSV. As an example, I ran an accidental
SELECT * FROM flights.parquet_snappy_data on an 84M dataset stored as Parquet, which resulted in a 977MB file on S3.
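To see how this adds up, here is a back-of-the-envelope calculation. The query volume and the ~$0.023/GB-month S3 Standard rate are assumptions for illustration; only the 977MB result size comes from the example above:

```python
# Rough S3 cost of Athena result files accumulated over one month.
# Assumptions (not from the article): 100 queries/day producing results
# the size of the example above, priced at S3 Standard ~$0.023/GB-month.
RESULT_SIZE_GB = 977 / 1024          # ~0.95 GB per query result
QUERIES_PER_DAY = 100
S3_PRICE_PER_GB_MONTH = 0.023

accumulated_gb = RESULT_SIZE_GB * QUERIES_PER_DAY * 30   # after 30 days
monthly_cost = accumulated_gb * S3_PRICE_PER_GB_MONTH
print(f"~{accumulated_gb:.0f} GB accumulated, ~${monthly_cost:.2f}/month and growing")
```

And since nothing ever expires, month two is paying for two months of results, month three for three, and so on.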
How to fix this?
It's actually pretty easy. If (and only if) you don't plan to re-use old query results, make sure to set up a lifecycle rule on your bucket using a Transition or Expiration action. For example, you could delete query results after 1 or 7 days. At CloudForecast, we actually don't persist
QueryIds since they're not useful to us, so we expire the S3 files after 1 day.
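A lifecycle rule like the one below is one way to do this. The sketch just builds the rule as a dict; the bucket name uses the default Athena results pattern with a hypothetical account ID, and applying it requires boto3 plus AWS credentials, so that call is shown only in a comment:

```python
# Lifecycle rule that expires Athena result files one day after creation.
# To apply it for real (needs boto3 and valid AWS credentials):
#
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket=BUCKET, LifecycleConfiguration=LIFECYCLE)

# Default Athena results bucket pattern; account ID is hypothetical.
BUCKET = "aws-athena-query-results-123456789012-us-east-1"

LIFECYCLE = {
    "Rules": [
        {
            "ID": "expire-athena-query-results",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},       # apply to the whole bucket
            "Expiration": {"Days": 1},      # delete results after 1 day
        }
    ]
}

print(LIFECYCLE["Rules"][0]["Expiration"])
```

Swap the Expiration action for a Transition action (e.g. to a cheaper storage class) if you want to keep old results around but pay less for them.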