Originally published at cloudforecast.io/blog
Following up on my last blog post (Using Parquet on Athena to Save Money on AWS), I wanted to share another thought about Athena, specifically how the S3 bucket is being used by Athena to store query results.
On the first use of Athena, AWS will automatically create a new bucket to store the query results (bucket name aws-athena-query-results-<ACCOUNTID>-<REGION>). Athena will store a raw result file (QueryId.csv) and a metadata file (QueryId.csv.metadata) for every query. By storing the data under the QueryId, Athena lets you access a previous query's results without re-running it (saving you money, since you don't need to rescan the data).
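As a rough sketch of where those cached results live, the object key is just the query ID plus .csv inside the default results bucket. The account ID, region, and query ID below are hypothetical, for illustration only (in practice you would get the real result location from the Athena API, e.g. boto3's get_query_execution):

```python
def athena_result_key(query_id: str) -> str:
    """Object key under which Athena stores a query's raw CSV result."""
    return f"{query_id}.csv"


def athena_result_uri(account_id: str, region: str, query_id: str) -> str:
    """Full S3 URI of a cached Athena result, following the default
    aws-athena-query-results-<ACCOUNTID>-<REGION> bucket naming."""
    bucket = f"aws-athena-query-results-{account_id}-{region}"
    return f"s3://{bucket}/{athena_result_key(query_id)}"


# Hypothetical values, for illustration only:
print(athena_result_uri("123456789012", "us-east-1",
                        "5a1b2c3d-1111-2222-3333-444455556666"))
```

Because the key is deterministic per query execution, fetching that object again is a plain S3 GET, with no Athena scan charge.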
However, you are the owner of this bucket and therefore responsible for its storage costs, and here are a couple of reasons why it could cost you a LOT of money:
#1 All the queries are being stored! ALL OF THEM!
Athena stores every query's results in the bucket. Query data will just accumulate forever, costing more and more $$$.
#2 Your data may be compressed but the results are not
S3 stores the results as raw CSV. Your source data may be compressed (GZIP, Snappy, ...), but the results will be raw CSV. As an example, I ran an accidental
SELECT * FROM flights.parquet_snappy_data on an 84M dataset stored as Parquet, which resulted in a 977MB file on S3.
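To see how this adds up, here is a back-of-the-envelope calculation. The query volume and the ~$0.023/GB-month S3 Standard rate are assumptions for illustration; only the 977MB result size comes from the example above:

```python
# Rough S3 cost of Athena result files accumulated over one month.
# Assumptions (not from the article): 100 queries/day producing results
# the size of the example above, priced at S3 Standard ~$0.023/GB-month.
RESULT_SIZE_GB = 977 / 1024          # ~0.95 GB per query result
QUERIES_PER_DAY = 100
S3_PRICE_PER_GB_MONTH = 0.023

accumulated_gb = RESULT_SIZE_GB * QUERIES_PER_DAY * 30   # after 30 days
monthly_cost = accumulated_gb * S3_PRICE_PER_GB_MONTH
print(f"~{accumulated_gb:.0f} GB accumulated, ~${monthly_cost:.2f}/month and growing")
```

And since nothing ever expires, month two is paying for two months of results, month three for three, and so on.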
How to fix this?
It's actually pretty easy. If (and only if) you don't plan to re-use old query results, make sure to set up a lifecycle rule on your bucket using a Transition or Expiration action. For example, you could delete query results after 1 or 7 days. At CloudForecast, we actually don't persist
QueryIds since they're not useful to us, so we expire the S3 files after 1 day.
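A lifecycle rule like the one below is one way to do this. The sketch just builds the rule as a dict; the bucket name uses the default Athena results pattern with a hypothetical account ID, and applying it requires boto3 plus AWS credentials, so that call is shown only in a comment:

```python
# Lifecycle rule that expires Athena result files one day after creation.
# To apply it for real (needs boto3 and valid AWS credentials):
#
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket=BUCKET, LifecycleConfiguration=LIFECYCLE)

# Default Athena results bucket pattern; account ID is hypothetical.
BUCKET = "aws-athena-query-results-123456789012-us-east-1"

LIFECYCLE = {
    "Rules": [
        {
            "ID": "expire-athena-query-results",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},       # apply to the whole bucket
            "Expiration": {"Days": 1},      # delete results after 1 day
        }
    ]
}

print(LIFECYCLE["Rules"][0]["Expiration"])
```

Swap the Expiration action for a Transition action (e.g. to a cheaper storage class) if you want to keep old results around but pay less for them.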