DEV Community

Cover image for Best practices for AWS Athena
Christopher Thompson H. for AWS Community Builders

Posted on • Edited on

Best practices for AWS Athena

In this blog I will mention some of the best practices recommended by AWS for building queries in Athena based on my experience and the following resources:

General Recommendations


Always use WHERE on partition field

This mainly in order to speed up time and cost.
For example:
Avoid:

select * from table1 where cast(col_1 as integer) = cast('201912' as integer) - 1
Enter fullscreen mode Exit fullscreen mode

and prefer:

select * from table1 where col_particion = '201911' //speedup ~85%, savings ~95%
Enter fullscreen mode Exit fullscreen mode

Avoid using ORDER BY without LIMIT

It is extremely important to understand that the ORDER BY function must be done in a single node, since it is a slow and time-consuming function. Ideally it should be avoided, however, if within the use case you are implementing you must use it, I always recommend placing a LIMIT.
For example:
Avoid:

select * from table1 order by date;
Enter fullscreen mode Exit fullscreen mode

and prefer:

select * from table1 order by date limit 1000; //speedup ~98%, avoid 'Query exhausted resources at this scale factor'
Enter fullscreen mode Exit fullscreen mode

Select only the columns to retrieve the final result

This recommendation is very simple. In practical effect it is to avoid the SELECT * FROM.
For example:
Avoid:

create table tmp_table
as select
A.col_1
A.col_2
B.col_3
from select * from table_1 A
left join (select * from table_2) B on A.col_1 = B.col_1
Enter fullscreen mode Exit fullscreen mode

Instead use:

create table tmp_table
as select
A.col_1
A.col_2
B.col_3
from select col_1,col_2 from table_1 A
left join (select col_1,col_3 from table_2) B on A.col_1 = B.col_1
Enter fullscreen mode Exit fullscreen mode

Schedule data aggregation for small files

The numbers speak for themselves:

Query Number of files Run time
SELECT COUNT(*) FROM lineitem 5000 8.4 seg
SELECT COUNT(*) FROM lineitem 1 2.31 seg
Speedup 72% faster

Prefer the use of regular expressions over 'LIKE'

Query Run time
SELECT COUNT(*) FROM lineitem WHERE text_column LIKE '%wake%' OR text_column LIKE '%some%' OR text_column LIKE '%express%' OR text_column LIKE '%hello%' 20.56 seg
SELECT COUNT(*) FROM lineitem WHERE regexp_like(text_column,'...') 15.87 seg
Speedup 17% faster

Note: The expression would be

regexp_like(text_column, 'wake|some|express|hello')


When using group by for multiple fields. order them from highest to lowest cardinality

This will avoid memory errors and reduce the time to deliver results.
For instance:
Avoid:

select * from people group by column_genre,department;
Enter fullscreen mode Exit fullscreen mode

Instead use:

select * from people group by department,column_genre;
Enter fullscreen mode Exit fullscreen mode

In case of using Crawlers to automatically obtain the structure of the data stored in S3, respect the data types supported for the source engine. Likewise, do not forget to run the crawler after a data update that may generate changes in the structure, this in order to update the structure in the glue catalog.

Use MSCK REPAIR TABLE only if the folders are created with the structure 'field1 = / field2 = /.../ fieldN = ' and only after creating the table, since it is' msck repair table 'is expensive operation and it is preferable to use' alter table add partition 'or glue api to add partitions.

and do you have any good practice that you recommend? Comment it in the comment box.

I hope this blog is useful for you. Greetings!

Top comments (2)

Collapse
 
ssennettau profile image
Stephen Sennett

Nice work! The hardest thing I found when first working with JSON was flattening the rows before processing it; a trick needed with Hive which I'd never seen before. Such a tricky tool to learn, but super-powerful once you have it.

Collapse
 
cthompsonh profile image
Christopher Thompson H.

Oh yeah, I think that in itself is also part of the difficulty of working with JSON with the SQL standard. In this case, as Athena is under Presto, although there are functions to work under that nomenclature, it is a bit complex if we talk about custom structures. Very good tip!