Christopher Thompson H. for AWS Community Builders

Posted on Oct 4, 2021 • Edited on Dec 20, 2021

Best practices for AWS Athena

#watercooler

In this blog I will mention some of the best practices recommended by AWS for building queries in Athena based on my experience and the following resources:

General Recommendations

Always use WHERE on partition field

This mainly in order to speed up time and cost.
For example:
Avoid:

select * from table1 where cast(col_1 as integer) = cast('201912' as integer) - 1

and prefer:

select * from table1 where col_particion = '201911' //speedup ~85%, savings ~95%

Avoid using ORDER BY without LIMIT

It is extremely important to understand that the ORDER BY function must be done in a single node, since it is a slow and time-consuming function. Ideally it should be avoided, however, if within the use case you are implementing you must use it, I always recommend placing a LIMIT.
For example:
Avoid:

select * from table1 order by date;

and prefer:

select * from table1 order by date limit 1000; //speedup ~98%, avoid 'Query exhausted resources at this scale factor'

Select only the columns to retrieve the final result

This recommendation is very simple. In practical effect it is to avoid the SELECT * FROM.
For example:
Avoid:

create table tmp_table
as select
A.col_1
A.col_2
B.col_3
from select * from table_1 A
left join (select * from table_2) B on A.col_1 = B.col_1

Instead use:

create table tmp_table
as select
A.col_1
A.col_2
B.col_3
from select col_1,col_2 from table_1 A
left join (select col_1,col_3 from table_2) B on A.col_1 = B.col_1

Schedule data aggregation for small files

The numbers speak for themselves:

Query	Number of files	Run time
SELECT COUNT(*) FROM lineitem	5000	8.4 seg
SELECT COUNT(*) FROM lineitem	1	2.31 seg
Speedup		72% faster

Prefer the use of regular expressions over 'LIKE'

Query	Run time
SELECT COUNT(*) FROM lineitem WHERE text_column LIKE '%wake%' OR text_column LIKE '%some%' OR text_column LIKE '%express%' OR text_column LIKE '%hello%'	20.56 seg
SELECT COUNT(*) FROM lineitem WHERE regexp_like(text_column,'...')	15.87 seg
Speedup	17% faster

Note: The expression would be

regexp_like(text_column, 'wake|some|express|hello')

When using group by for multiple fields. order them from highest to lowest cardinality

This will avoid memory errors and reduce the time to deliver results.
For instance:
Avoid:

select * from people group by column_genre,department;

Instead use:

select * from people group by department,column_genre;

In case of using Crawlers to automatically obtain the structure of the data stored in S3, respect the data types supported for the source engine. Likewise, do not forget to run the crawler after a data update that may generate changes in the structure, this in order to update the structure in the glue catalog.

Use MSCK REPAIR TABLE only if the folders are created with the structure 'field1 = / field2 = /.../ fieldN = ' and only after creating the table, since it is' msck repair table 'is expensive operation and it is preferable to use' alter table add partition 'or glue api to add partitions.

and do you have any good practice that you recommend? Comment it in the comment box.

I hope this blog is useful for you. Greetings!

Top comments (2)

Stephen Sennett AWS Community Builders • Oct 4 '21

Nice work! The hardest thing I found when first working with JSON was flattening the rows before processing it; a trick needed with Hive which I'd never seen before. Such a tricky tool to learn, but super-powerful once you have it.

Christopher Thompson H. AWS Community Builders • Oct 4 '21

Oh yeah, I think that in itself is also part of the difficulty of working with JSON with the SQL standard. In this case, as Athena is under Presto, although there are functions to work under that nomenclature, it is a bit complex if we talk about custom structures. Very good tip!