DEV Community

Amzar
Amzar

Posted on • Updated on

How to query CSV file stored in AWS S3 using Athena?

CSV (Comma Separated Values) can consider as a database, and it stores data in a table format. As far as I know, CSV has been used everywhere; from reports to machine learning to store the dataset and else.

Believe it or not, most companies still use CSV as a 'db.' Then, when we talk about data and you're a passionate Data Analyst, we must analyze it using SQL (in BI tools) so, if you're working in AWS environment do follow this steps.


AWS Logo

1- Prepare a CSV File

Skip if you already have CSV file on your local.

  1. As we will use Python, do install Python on your local
  2. Install faker library to get dummy data
  3. Prepare the CSV with 'first_name', 'last_name', 'age' as headers
  4. Now, open the folder and dummy-data.csv should created inside it
import csv
from faker import Faker

fake = Faker()
number_of_records = 8000

with open('dummy-data.csv', mode='w+') as file:
    file_writer = csv.writer(file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    file_writer.writerow(['first_name', 'last_name', 'age'])

    for _ in range(number_of_records):
        file_writer.writerow([fake.first_name(), fake.last_name(), fake.numerify("@#")])
Enter fullscreen mode Exit fullscreen mode

2- Create a bucket in Amazon S3

Skip if you already have a bucket and use the existing one

  1. Go to AWS Console -> search S3 -> click Create Bucket create bucket
  2. Fill up the necessary fields -> Create bucket fill up fields
  3. New bucket is created. Now you can decide whether to create a new folder inside the bucket. For me, I want to have a separate folder for dummy data; hence I created dummy_data folder
  4. Click Create folder Create folder
  5. Fill up the necessary fields -> Create folder
  6. New folder is created in the bucket
  7. Now, upload the .csv file by drag to the page upload file
  8. You can refresh or go back to the bucket folder to see the imported CSV file Image description
  9. Please don't forget to copy the S3 URI of the folder to create Athena table

3- Create a table in AWS Athena

Note: If you want to create another database, can use below query
CREATE DATABASE myDataBase

  1. Create table in Athena; you need to define the structure of the table so that Athena table able to know the data structure in CSV file.
    1. AWS Management Console
    2. AWS CLI
  2. For this post, I am using AWS Management Console. Like the S3 service, you can open your AWS Console and search Athena.
  3. Populate the table using below query
CREATE EXTERNAL TABLE dummy_person (
  first_name string,
  last_name string,
  age string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
LOCATION '[S3 URI]'
TBLPROPERTIES ("skip.header.line.count"="1");

Enter fullscreen mode Exit fullscreen mode

Table is created.
table

Run SQL queries to check the data has been populated

select * from dummy_person 
Enter fullscreen mode Exit fullscreen mode

At this moment, you can already see the data

data


Next, you can connect Athena to your BI tools and start to develop the dashboard 😊

Top comments (0)