Wendy Wong for AWS Community Builders

How to ETL with AWS Glue and AWS Glue Studio to transform your data - Part 1

How do you clean data quickly?

Real-world data is messy. Across startups, consulting, construction, government and insurance, a first use case for cloud adoption is often data integration and cleaning. The journey to the cloud is a long transformation project, and customers sit at different maturity levels of the data analytics life cycle, from descriptive through predictive to prescriptive.

Organizational structure determines whether there is budget, and therefore headcount, for data engineers to prepare data for analytics and machine learning, which is the future state.

If an organization has no full-time data engineers, data scientists and business analysts may complete ETL (extract, transform and load) jobs themselves, cleaning and transforming small datasets with tools such as Tableau Prep, Alteryx or Microsoft Power Query.

Learning Outcomes

In the AWS Management Console, you will learn how to:

  • Create IAM permissions to use multiple AWS services

  • Use AWS Glue and AWS Glue Studio to clean and prepare your data for analytics

  • Create an external table using Amazon Athena

  • Create visualizations to draw business insights using Amazon QuickSight


What is AWS Glue?

AWS Glue is a serverless data integration service, which means that you do not need to provision servers in advance to run ETL jobs against a cloud data warehouse such as Amazon Redshift or an Amazon S3 data lake.

Data engineers and developers: AWS Glue Studio provides a visual editor to create an ETL job, which is rendered as a Python script, and to schedule ETL jobs. AWS Glue can add crawlers to identify the schemas of datasets and create a database and tables from data stored in an Amazon S3 bucket.

Data scientists: AWS Glue DataBrew is a visual data preparation tool that can be used to normalize data, for example changing date formats, viewing distributions, creating logical statements and inspecting data quality for missing values.

Business Analysts: May use all three services AWS Glue DataBrew, Glue Studio and Glue for different business use cases for their stakeholders.

AWS Glue - High Level Architecture

a) AWS Glue - create a unified catalog to access data from data stores


b) AWS Glue Studio - monitor ETL jobs without coding


Tutorial 1: Getting started in ETL with AWS Glue

Pre-requisites

Step 1: Read the blog Create an AWS account

Step 2: Read the blog Create IAM access for Administrator

Step 3: To run an AWS Glue ETL job and use AWS Glue Studio, Amazon Athena and Amazon QuickSight, you will need to attach additional policies to the role to grant access to these AWS services.

Step 4: To run an ETL job using AWS Glue, read the official AWS documentation Create an IAM Role and follow the steps below to create a role named 'AWSGlueServiceRole' that the crawler and ETL jobs can use:

a) Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.


Click Users on the left-hand side. There will be an existing administrator user.

b) In the left navigation pane, choose Roles.

c) Choose Create role.

d) For role type, find and choose Glue, and choose Next: Permissions.


e) On the Attach permissions policy page, choose the policies that contain the required permissions, e.g. AWSGlueServiceRole for general AWS Glue and AWS Glue Studio permissions, and the AWS managed policy AmazonS3FullAccess for access to Amazon S3 resources.


f) Add additional policies as needed for additional data stores or services.

g) Choose Next: Review.

h) For Role name, enter a name for your role; for example, AWSGlueServiceRole.


i) Choose Create Role to finish creating the role.
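For repeatability, steps a) through i) can also be scripted with boto3 (assumed installed and configured with administrator credentials). This is a minimal sketch; the role and policy names mirror the console walk-through above.

```python
import json

# Trust policy allowing the AWS Glue service to assume the role (step d).
GLUE_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Managed policies from step e). AmazonS3FullAccess is broad; scope it
# down to the tutorial bucket in a production account.
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]

def create_glue_role(role_name: str = "AWSGlueServiceRole") -> None:
    import boto3  # assumed installed and configured with admin credentials
    iam = boto3.client("iam")
    iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(GLUE_TRUST_POLICY),
    )
    for arn in POLICY_ARNS:
        iam.attach_role_policy(RoleName=role_name, PolicyArn=arn)

# create_glue_role()  # uncomment to create the role in your own account
```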

Step 5: Read the AWS documentation and follow the steps Set up IAM permissions for AWS Glue Studio to create a Glue Studio role.

a) Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.


b) In the navigation pane, choose Policies.

c) In the list of policies, select the check box next to AWSGlueConsoleFullAccess. You can use the Filter menu and the search box to filter the list of policies.

d) Choose Policy actions, and then choose Attach.

Repeat the previous steps to attach additional policies to the user.

Give the role a name, e.g. AWSGlueServiceRole-Studio.


Step 6: Download the sales_records.csv file from the course Getting Started with AWS Glue in AWS Skill Builder.

Step 7: Create an Amazon S3 bucket and upload the sales_records.csv file.

You can read the blog to Create an Amazon S3 bucket

a) In the search bar type 'S3'.


b) Click Create bucket

c) Under Bucket name, enter a name, e.g. learn-glue-

d) Create a folder and name it 'sales-data'


e) Within the sales-data folder, create two sub-folders and name them 'raw' and 'processed'.


f) Upload the sales_records.csv into the S3 bucket in the folder named 'raw'.

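Steps b) through f) can also be scripted with boto3. In the sketch below, the bucket name is hypothetical (bucket names are globally unique, so substitute your own), and the Sydney region is an assumption based on the region used elsewhere in this post.

```python
# Hypothetical bucket name; the tutorial uses a name starting with learn-glue-.
BUCKET = "learn-glue-example"
REGION = "ap-southeast-2"  # assumption: Sydney, as used elsewhere in this post
RAW_KEY = "sales-data/raw/sales_records.csv"
PROCESSED_PREFIX = "sales-data/processed/"

def create_bucket_and_upload() -> None:
    import boto3  # assumed installed and configured
    s3 = boto3.client("s3", region_name=REGION)
    s3.create_bucket(
        Bucket=BUCKET,
        CreateBucketConfiguration={"LocationConstraint": REGION},
    )
    # S3 "folders" are just key prefixes, so uploading the file creates
    # sales-data/ and raw/ in one step.
    s3.upload_file("sales_records.csv", BUCKET, RAW_KEY)
    # An empty object with a trailing slash makes processed/ visible in the console.
    s3.put_object(Bucket=BUCKET, Key=PROCESSED_PREFIX)

# create_bucket_and_upload()  # uncomment to run against your own account
```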

g) Type the word 'Glue' in the search bar to navigate to AWS Glue


(Note: follow the recommended steps from the AWS Skill Builder course 'Getting Started with AWS Glue'.)

h) Click 'Crawlers' in the menu on the left, then click Add crawler.


i) Name the crawler 'learn-glue-crawler' and click Next.


j) Select the options Data stores and Crawl all folders within the S3 bucket. Click Next.


k) Navigate to your S3 bucket for the saved sales_records data and search for the path for raw folder.


l) Click Yes to create another data store.


m) Under the specified path details, navigate to the S3 bucket related to the processed folder.


n) No additional data stores are needed. Click No, then Next.

o) Select Create an IAM role and name it 'learn-glue-role'

p) Schedule to run the crawler on-demand.


q) Click Add database and name the database 'Sales-data' and click Next.


r) Inspect the details entered and click Finish.

s) Click 'learn-glue-crawler' and Run Crawler.


t) After the crawler has finished and its status returns to 'Ready', click Tables in the menu on the left and you will see a table created called 'raw'.

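Steps h) through t) above can also be performed with boto3. A minimal sketch, assuming the IAM role and bucket created earlier (both names here are placeholders, substitute your own):

```python
def crawler_request(role_name: str, bucket: str) -> dict:
    """Build the create_crawler request for steps i) to q)."""
    return {
        "Name": "learn-glue-crawler",
        "Role": role_name,
        "DatabaseName": "sales-data",  # catalog names are stored in lowercase
        "Targets": {
            "S3Targets": [
                {"Path": f"s3://{bucket}/sales-data/raw/"},
                {"Path": f"s3://{bucket}/sales-data/processed/"},
            ]
        },
        # No Schedule key: the crawler runs on demand, as in step p).
    }

def run_crawler() -> None:
    import boto3  # assumed installed and configured
    glue = boto3.client("glue")
    glue.create_crawler(**crawler_request("learn-glue-role", "learn-glue-example"))
    glue.start_crawler(Name="learn-glue-crawler")

# run_crawler()  # uncomment to run against your own account
```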

Tutorial 2: Getting started with ETL using AWS Glue Studio

Note: This tutorial is based on this link.

Step 1: On the search bar type 'Glue Studio' to navigate to AWS Glue Studio.

Step 2: Click on View Jobs


Step 3: Select the option Visual with a source and target and click Create.


Step 4: Under the S3 bucket source node, select the option Data Catalog table and choose the table 'raw'. Then choose the ApplyMapping node.


Step 5: Click on the Output schema tab to preview the variables within the dataset; here you can drop variables and change data types.


Step 6: Click on node Data target - S3 bucket. Under the drop-down selections choose Format as Parquet, Compression Type as Snappy and the S3 path for the 'processed' folder.


Step 7: Navigate to the 'Script' tab to inspect the Python script that was generated from the visual ETL editor. This script can be downloaded.

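The generated script follows a standard shape: read the 'raw' table from the Data Catalog, apply the column mapping, then write Snappy-compressed Parquet to the processed path. Below is a trimmed sketch; the column names in MAPPINGS and the bucket name are illustrative stand-ins, and your generated script will list the full sales_records schema.

```python
# Illustrative (source column, source type, target column, target type)
# tuples for ApplyMapping; substitute the real sales_records columns.
MAPPINGS = [
    ("region", "string", "region", "string"),
    ("units_sold", "long", "units_sold", "long"),
    ("total_profit", "double", "total_profit", "double"),
]

def run_job() -> None:
    # awsglue and pyspark are only importable inside the Glue job runtime.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source node: the 'raw' table catalogued by the crawler in Tutorial 1.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sales-data", table_name="raw"
    )
    mapped = ApplyMapping.apply(frame=source, mappings=MAPPINGS)

    # Target node: Parquet with Snappy compression, as chosen in Step 6.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://learn-glue-example/sales-data/processed/"},
        format="parquet",
        format_options={"compression": "snappy"},
    )
    job.commit()
```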

Step 8: Click on the tab 'Job Details', provide a name for the job and a description for the data transformation. Change the number of workers to 2.


Step 9: Click Save and run the ETL job.


Step 10: The ETL job will take a few minutes to run, and its status will change to 'Succeeded'.
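If you want to start the job and wait on its status from code rather than the console, the run can be polled with boto3. The job name below is hypothetical; use whatever name you entered in Step 8.

```python
# Terminal JobRunState values for an AWS Glue job run.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}

def is_finished(state: str) -> bool:
    return state in TERMINAL_STATES

def run_and_wait(job_name: str = "sales-etl-job") -> str:  # hypothetical job name
    import time
    import boto3  # assumed installed and configured
    glue = boto3.client("glue")
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        if is_finished(run["JobRunState"]):
            return run["JobRunState"]
        time.sleep(30)  # poll every 30 seconds until a terminal state

# print(run_and_wait())  # uncomment to run against your own account
```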

Step 11: Navigate to the AWS Glue console, select Crawlers in the left-hand menu and re-run 'learn-glue-crawler'. After the status returns to 'Ready', click Tables to inspect the new 'processed' table in the Data Catalog.


Tutorial 3: Use Amazon Athena to create an external table and create business insights using Amazon QuickSight

Step 1: Type 'IAM' in the search bar to navigate to Identity and Access Management (IAM).


Step 2: Manually attach the two managed policies 'AmazonAthenaFullAccess' and 'AWSQuicksightAthenaAccess' to the existing IAM administrator user.


Step 3: Review the policies for:

  • Amazon Athena
  • Amazon QuickSight
  • AWS Glue
  • AWS Glue DataBrew
  • AWS Glue Studio
  • Amazon S3


Step 4: Type 'Athena' in the search bar.


Step 5: Click on the Settings tab to ensure you have given Amazon Athena access to your S3 bucket.


Step 6: Click Create and select the option to create a table from an S3 bucket location.

Step 7: Open a second tab with AWS Glue, navigate to Tables and inspect the schema of the 'processed' table.


Step 8: Create a table and database from this link

Provide the following details:

a) Provide a table name.

b) Under Database Configuration, select the option 'Create a database' and provide a database name.

c) Dataset - provide the S3 bucket path for the 'processed' folder

d) Select the data format 'Parquet' (the files written to the 'processed' folder in Tutorial 2 are Parquet, not CSV).

e) Enter the column details and their data type.


f) Preview the SQL code generated for the table and click Create table.

g) Save the query

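If you prefer DDL over the wizard, steps a) to f) amount to a single CREATE EXTERNAL TABLE statement, which can be submitted with boto3. The database, table, column and bucket names below are hypothetical, and because Tutorial 2 wrote Snappy-compressed Parquet to the 'processed' folder, the DDL declares Parquet storage:

```python
# Hypothetical names and a subset of columns; match yours to the 'processed'
# schema. Run CREATE DATABASE IF NOT EXISTS sales first (step b).
CREATE_TABLE_SQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS sales.processed_sales (
  region STRING,
  units_sold BIGINT,
  total_profit DOUBLE
)
STORED AS PARQUET
LOCATION 's3://learn-glue-example/sales-data/processed/'
"""

def create_table() -> None:
    import boto3  # assumed installed and configured
    athena = boto3.client("athena")
    athena.start_query_execution(
        QueryString=CREATE_TABLE_SQL,
        # Athena needs an output location for query results (Step 5 settings).
        ResultConfiguration={"OutputLocation": "s3://learn-glue-example/athena-results/"},
    )

# create_table()  # uncomment to run against your own account
```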

Step 9: Create dashboards using Amazon QuickSight

a) Check that you have the correct policy attached to IAM for Amazon QuickSight.


b) Navigate to the search bar and type 'QuickSight'


c) Create a QuickSight account


d) Select an appropriate support plan


e) You will receive a confirmation after completing the sign-up process.


f) Click New dataset


g) Click on the data source Athena


h) Click on the data catalog and select the 'processed sales' table from Athena


i) Click Visualize to finish creating the data set


j) If you need to troubleshoot the connection between the Amazon Athena table and Amazon QuickSight, navigate to your account menu on the right-hand side, click Manage QuickSight, then click Security & permissions.


Click Manage.

Ensure you check the boxes that relate to the connection to the S3 bucket


k) Data is successfully imported using SPICE


l) Click new analysis to commence creating visuals for your dashboard


m) Create your dashboard, publish it and share the analysis.


n) Clean up and delete resources. Delete the AWS Glue crawler.

Navigate to AWS Glue, click 'Crawlers' on the left-hand side and check the box next to the crawler. Under the Actions drop-down menu, select 'Delete crawler'. It is important to delete the crawler so that you do not incur unnecessary costs.


Confirm the step and click Delete


The crawler is deleted.


o) The final step is to delete the S3 buckets that store the input and output files for AWS Glue.

Navigate to Amazon S3 and delete the two S3 buckets:

  • aws-glue-assets (used by AWS Glue Studio to transform the data)
  • learn-glue (used by AWS Glue to transform the data)

To delete an S3 bucket, you must empty it first: select the S3 bucket, click Empty, type 'permanently delete' in the confirmation box and click the Empty button.

Then return to the main menu, click the radio button to select the S3 bucket and click Delete. Repeat the same steps for any remaining S3 buckets.
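The clean-up in steps n) and o) can also be scripted with boto3. A sketch, assuming the crawler name from Tutorial 1; the bucket names are hypothetical placeholders for the buckets listed above.

```python
# Hypothetical bucket names; substitute the buckets from your own account.
TUTORIAL_BUCKETS = ["aws-glue-assets-example", "learn-glue-example"]

def clean_up() -> None:
    import boto3  # assumed installed and configured
    boto3.client("glue").delete_crawler(Name="learn-glue-crawler")

    s3 = boto3.resource("s3")
    for name in TUTORIAL_BUCKETS:
        bucket = s3.Bucket(name)
        bucket.objects.all().delete()  # a bucket must be empty before deletion
        bucket.delete()

# clean_up()  # uncomment to run against your own account
```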

Final Note

  • Amazon QuickSight Q is not currently available in the region Asia-Pacific (Sydney) - ap-southeast-2 at the time this blog was published.

Happy Learning! 😁

Next Tutorial: How to ETL with AWS Glue and AWS Glue Studio to transform your data - Part 2

The next tutorial will describe ETL using AWS Glue with your own dataset.

Register for Australia's biggest data engineering conference DataEngBytes:

  • Melbourne: 27 September 2022
  • Sydney: 29 September 2022

Register here: https://dataengconf.com.au/
