Kelvin Tay

Posted on Aug 28, 2021

Understanding AWS Athena 101

#aws #athena #introduction #comparison

Preface

We assume the reader:

has some prior knowledge of AWS and its services, but is new to AWS Athena
hopes the understand the gist of AWS Athena quickly

As such, this document is more of an explanation, and not a how-to guide nor tutorial.

For a more in-depth guide, particular on how-tos, I do recommend the Athena Guide as a follow-up, just before you start architecting something with Athena.

What is Athena?

Athena is a serverless querying service from AWS.

You can use Athena to run data analytics, with just standard SQL (Presto).

It supports many different data sources, including S3, DynamoDB, CloudWatch metrics, and many more.

For example, we can have transactions stored as CSV files in S3, and run aggregation queries on these files via Athena.

Importantly, it is also important to discuss what Athena is not.

Athena is not intended to be an ETL solution.

AWS Glue or EMR are services closer to full ETL solutions.

You can of course run Athena queries to transform and load your data into S3, but notice how Athena does not have any built-in features to do scheduled queries. In essence, Athena is not positioning itself as any ETL solution, but merely a querying service.

Athena is not intended to be an ETL solution.

Athena does not store any data.
It simply reads from other sources and outputs the query result into S3. It does not modify any existing data.

How do we run Athena queries?

You can run Athena queries via the AWS console UI (similar to Google’s BigQuery), or via the StartQueryExecution API.

An Athena query looks something like this:

SELECT field1, field2
FROM
   "glue_database"."glue_table"
WHERE
   partitioned_field3 < 9999;

Notice the intentional example database and table names.

This is so, because Athena queries are run against Glue Data Catalog (consisting of Glue Database and Tables).

You can think of a Glue Table as a logical table that describes an underlying data source (e.g., S3).

Specifically, a Glue Table will describe:

where the data resides (e.g., S3 location), and
what the schema of the data is (e.g., Parquet format, with field1, field2, and a partitioned field3).

A Glue database is simply a collection or grouping of Glue Tables. You can think of it as a namespace.

Because query requests are asynchronous, the requester receives an Athena query execution ID which they can then poll to see if the query completes, and subsequently fetch the result.

What is the difference between Glue and Athena?

This is my attempt to differentiate the 2 services.

Glue	Athena
“Glue“ components that help build ETL solutions, particular around transforming & loading big amount of data.	Querying engine that does not deal with data preparation. It assumes your data source is already cleaned and prepared as described by the Glue Table.
Includes: Glue Database Glue Table Glue Job (for ETL, e.g., PySpark jobs) Glue Trigger (to invoke Glue jobs based on events or conditions)	Utilizes: Glue Database Glue Table S3

Before you go

These are some pointers or tips I found out and learned the hard way. I hope this list becomes useful so you can avoid building something that may not scale well.

Athena uses a specific Presto version, which is unlikely to be the latest. Depending on your workgroup and thus the Athena engine version, the supported Presto version can be rather different. Be mindful of this when looking up Presto documentation then.
DML queries (i.e., SELECT queries) have a hard limit of 30 minutes.
Try to optimize your Athena queries with these tips. Of course, it is incredibly important to know our access patterns (i.e., how would most queries on this table look like).
By default, query results are saved as CSV files in S3. Use UNLOAD if you require other formats like Parquet or JSON.
Query results saved in S3 have filenames {query_execution_id:uuid}.csv. Right now, there is no way to customize the filenames, unfortunately.
While Glue Databases and Tables require lower-case names, Athena further requires that we use underscores, and not hyphens.

DEV Community