Danny Reed

2 Important Optimizations for Athena with S3

Do you use Athena to query data in S3? Coupled with the Parquet file format, this is a really powerful combination. S3 is very cheap, Athena is very cheap, and wait a second... my bill is steadily going up? What happened?

That's what happened to me, and it took a while to identify two key configuration mistakes that could have been avoided easily. I will share them both here so you can do these right the first time and save yourself one major headache and one minor one.

Optimization 1: Athena Query Results

Did you know that by default, Athena saves a CSV of the output of every query you run? It stashes it in S3, in the query result location configured for your workgroup. If your system runs Athena queries periodically, this can add up (especially in the Standard storage tier). Honestly, these files served no purpose for us, and they wound up adding terabytes of unnecessary data.

Fixes

  1. One fix is to create a lifecycle rule on the "folder" (prefix) in which Athena stores these results. For our purposes, it's sufficient to keep just a few days' worth of results for debugging. You'll need to identify where those results are stored, then create the appropriate lifecycle rule to delete old files.

  2. You can also create a rule to at least move these to a cheaper-than-Standard storage tier to reduce costs without deleting any data. This should at least minimize the costs if you truly have a reason to keep all your query results.
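Both fixes can be sketched as a single lifecycle configuration. This is a minimal sketch, not the exact rule we run: the rule ID, the 7-day retention, and the example URI are assumptions for illustration, and the boto3 import is kept inside apply_lifecycle so the builder functions can be tried without AWS access.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def athena_results_lifecycle(prefix, expire_days=7, transition_days=None):
    """Build a lifecycle configuration for the Athena results prefix.

    expire_days deletes old results (fix 1). transition_days, if set, moves
    results to STANDARD_IA first (fix 2). Pass expire_days=None to keep
    everything and only change the storage class.
    """
    rule = {
        "ID": "athena-query-results",  # hypothetical rule name
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
    }
    if transition_days is not None:
        rule["Transitions"] = [
            {"Days": transition_days, "StorageClass": "STANDARD_IA"}
        ]
    if expire_days is not None:
        rule["Expiration"] = {"Days": expire_days}
    return {"Rules": [rule]}


def apply_lifecycle(results_uri, **kwargs):
    """Apply the rule with boto3 (requires AWS credentials)."""
    import boto3  # lazy import: the builders above need only the stdlib

    bucket, prefix = split_s3_uri(results_uri)
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=athena_results_lifecycle(prefix, **kwargs),
    )
```

For example, apply_lifecycle("s3://my-bucket/athena-results/", expire_days=7) would keep a week of results (the bucket name is made up). Two things to watch: put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so merge in any rules you already have, and S3 only allows transitions to STANDARD_IA after at least 30 days.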

Optimization 2: KMS

This one was sneaky. It took me quite a while to figure out why my KMS costs were going through the roof. It turns out that when Athena reads my .parquet files from S3, a kms:Decrypt call is made for every file it retrieves. When you're retrieving a hundred thousand files dozens of times per day, this gets really expensive.
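To get a feel for the scale, here is a back-of-the-envelope estimate. The numbers are illustrative rather than my actual bill: it assumes one Decrypt request per file read, 24 queries per day, and a price of $0.03 per 10,000 KMS requests, which you should check against current pricing for your region.

```python
def monthly_kms_request_cost(files_per_query, queries_per_day,
                             price_per_10k_requests=0.03, days=30):
    """Estimate monthly KMS request cost, assuming one Decrypt call per file read."""
    requests = files_per_query * queries_per_day * days
    return requests / 10_000 * price_per_10k_requests


# Illustrative numbers only: 100,000 files, read by 24 queries a day.
estimate = monthly_kms_request_cost(files_per_query=100_000, queries_per_day=24)
print(f"~${estimate:,.2f} per month in KMS requests")
```

Under those assumptions that is 72 million requests a month, or roughly $216, for data that may cost far less than that to store.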

Root Cause

The issue was how the bucket holding my .parquet files was encrypted. Per-object kms:Decrypt calls only happen when objects are encrypted with a KMS key (SSE-KMS) and no Bucket Key is in use; SSE-S3 on its own does not generate KMS requests. SSE-KMS without a Bucket Key works fine for many use cases, but when you frequently retrieve many files it gets expensive, and you should prefer SSE-KMS with a Bucket Key.

Fix

Use a bucket key. Bucket keys only exist for SSE-KMS, so if your bucket's default encryption is SSE-S3, you'll need to move to SSE-KMS with Bucket Key enabled. The use of a bucket key is really the crux of the solution here.
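As a sketch of what that setting looks like through boto3 (the function names are mine, and the key ARN is whatever KMS key you already use), the default encryption configuration with a Bucket Key could be built like this:

```python
def bucket_key_encryption_config(kms_key_arn):
    """Default-encryption config: SSE-KMS with an S3 Bucket Key enabled."""
    return {
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": kms_key_arn,
                },
                "BucketKeyEnabled": True,
            }
        ]
    }


def enable_bucket_key(bucket, kms_key_arn):
    """Apply the config with boto3 (requires AWS credentials)."""
    import boto3  # lazy import so the builder above works without boto3

    boto3.client("s3").put_bucket_encryption(
        Bucket=bucket,
        ServerSideEncryptionConfiguration=bucket_key_encryption_config(kms_key_arn),
    )
```

This only changes the default for objects written from now on; existing objects keep whatever encryption settings they were stored with.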

From AWS docs:

Amazon S3 Bucket Keys reduce the cost of Amazon S3 server-side encryption using AWS Key Management Service (SSE-KMS). This new bucket-level key for SSE can reduce AWS KMS request costs by up to 99 percent by decreasing the request traffic from Amazon S3 to AWS KMS. With a few clicks in the AWS Management Console, and without any changes to your client applications, you can configure your bucket to use an S3 Bucket Key for AWS KMS-based encryption on new objects.

Note that at the end of the quote it says, "on new objects."

This is a depressing detail if you've already stored data without bucket key enabled. You'll have to do what I did and follow this process:

  1. Enable the bucket key for new objects
  2. Enable a daily S3 Inventory report on your bucket (this produces the manifests the batch job reads)
  3. Wait a day or two for manifests to be generated
  4. Make a backup of your data (some will be encrypted with bucket key, most will not)
  5. Create a "Batch Operations" job in S3 to COPY your files over themselves, and have the job enable bucket key encryption as it overwrites.
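For those who prefer scripting over the console, steps 2 and 5 could look roughly like the following. Treat it as a sketch under assumptions: every ARN, the config ID, the prefixes, and the priority are placeholders, and the IAM role must already allow S3 Batch Operations to read the manifest and copy objects in your bucket.

```python
import uuid


def inventory_config(report_bucket_arn, config_id="bucket-key-audit"):
    """Daily S3 Inventory configuration (step 2); config_id is a made-up name."""
    return {
        "Id": config_id,
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        # Extra columns that show which objects already use a bucket key.
        "OptionalFields": ["EncryptionStatus", "BucketKeyStatus"],
        "Destination": {
            "S3BucketDestination": {
                "Bucket": report_bucket_arn,
                "Format": "CSV",
                "Prefix": "inventory",
            }
        },
    }


def copy_job_request(account_id, role_arn, data_bucket_arn, kms_key_arn,
                     manifest_arn, manifest_etag, report_bucket_arn):
    """Arguments for s3control create_job (step 5): copy each object onto itself."""
    return {
        "AccountId": account_id,
        "ConfirmationRequired": True,  # job waits for you to confirm before it runs
        "Priority": 10,
        "RoleArn": role_arn,
        "ClientRequestToken": str(uuid.uuid4()),
        "Operation": {
            "S3PutObjectCopy": {
                "TargetResource": data_bucket_arn,  # same bucket: overwrite in place
                "SSEAwsKmsKeyId": kms_key_arn,
                "BucketKeyEnabled": True,
            }
        },
        "Manifest": {
            "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
            "Location": {"ObjectArn": manifest_arn, "ETag": manifest_etag},
        },
        "Report": {
            "Enabled": True,
            "Bucket": report_bucket_arn,
            "Prefix": "batch-reports",
            "Format": "Report_CSV_20180820",
            "ReportScope": "FailedTasksOnly",
        },
    }
```

With boto3, the first dictionary would go to put_bucket_inventory_configuration on the s3 client (as InventoryConfiguration, alongside Bucket and Id), and the second would be unpacked into create_job on the s3control client. Keeping ConfirmationRequired set to True gives you one last chance to review the job before it touches anything.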

This process is scary because you're essentially overwriting your production data. I recommend thorough testing in a test environment and good study of the related documentation.
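One way to take some of the fear out of it is to count how many objects have a bucket key before and after the job, first in the test environment and then in production. The sketch below assumes your inventory report includes the optional BucketKeyStatus column and that you pass in the column list from the inventory manifest; the schema and rows shown are made-up sample data.

```python
import csv
from collections import Counter


def bucket_key_summary(file_schema, csv_lines):
    """Count objects per BucketKeyStatus value in one inventory CSV data file.

    file_schema is the comma-separated column list from the inventory
    manifest; csv_lines is an iterable of text lines (no header row).
    """
    columns = [name.strip() for name in file_schema.split(",")]
    status_index = columns.index("BucketKeyStatus")
    return Counter(row[status_index] for row in csv.reader(csv_lines) if row)


# Made-up sample rows, shaped like an inventory CSV.
schema = "Bucket, Key, Size, BucketKeyStatus"
sample = [
    '"data-bucket","a.parquet","1024","DISABLED"',
    '"data-bucket","b.parquet","2048","ENABLED"',
    '"data-bucket","c.parquet","4096","DISABLED"',
]
print(bucket_key_summary(schema, sample))
```

If the count of objects without a bucket key does not drop to zero in the inventory generated after the job, the batch job's failure report is the place to look next.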

Outcome

We have seen our S3 storage costs go down by close to 70%, and our KMS costs go down by 83% for an overall cost savings of hundreds of dollars per month.

These small details are easy to overlook since they don't impact the functionality of your application. Until these two pieces are configured correctly, however, you'll miss out on the cost effectiveness that these powerful services can offer.

Resources

AWS Docs: S3 Lifecycle Rules
AWS Docs: Athena Query Results
AWS Docs: Bucket Keys
AWS Docs: S3 Batch Operation to Re-Encrypt
