Introduction
In the previous blog you mastered ETL using sample sales data in both AWS Glue Studio and AWS Glue.
The third tutorial explored how to query data in your Amazon S3 bucket using Amazon Athena, and how to build visualizations in Amazon QuickSight from an Athena table that references the sales database stored in the AWS Glue Data Catalog.
In this new tutorial, you will learn how to:
a) Locate open source datasets for your passion project and build a test or production workload that you can use with AWS services.
b) Complete data transformations on your own data using both AWS Glue and AWS Glue Studio.
Where can you find interesting open source datasets for analytics and machine learning projects?
- Government websites e.g. data.gov, data.gov.uk, data.gov.au.
- Census data e.g. Australian census
Solution Architecture overview - AWS Glue
Tutorial 1: Data Transformation with AWS Glue
Step 1: Download the DOHMH New York City Restaurant Inspections data as a CSV file from this link
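If you prefer to script the download, here is a minimal Python sketch. The export URL below is an assumption based on the usual NYC Open Data CSV export pattern, so copy the actual link from the dataset page:

```python
import requests

# Assumed CSV export link for the DOHMH restaurant inspections dataset;
# replace it with the link shown on the NYC Open Data page
url = "https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD"

response = requests.get(url, timeout=300)
response.raise_for_status()

# Save the raw file locally before uploading it to Amazon S3
with open("restaurant-inspections.csv", "wb") as f:
    f.write(response.content)
```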
A preview of the raw dataset:
Step 2: Log in to your AWS account. If you do not have an AWS account, you may create one here
Step 3: You may follow the instructions here to create your first S3 bucket and name it new-york-restaurant-inspections.
Step 4: Click the bucket and select Create Folder.
Create two folders within the Amazon S3 bucket and name them raw and processed respectively as shown below:
Step 5: Open the raw folder, click Upload, then Add files, and upload the CSV file.
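The same upload can be done with the AWS SDK; a minimal boto3 sketch using the bucket and folder names from the steps above (the local file name is an assumption):

```python
import boto3

s3 = boto3.client("s3")

# Upload the downloaded CSV into the raw/ folder of the tutorial bucket
s3.upload_file(
    Filename="restaurant-inspections.csv",
    Bucket="new-york-restaurant-inspections",
    Key="raw/restaurant-inspections.csv",
)
```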
Step 6: Set up basic IAM permissions to use the AWS Glue console. You may follow this blog for more details.
If you need to access additional AWS resources such as Amazon SageMaker notebooks, you may follow the instructions here.
Step 7: Navigate to the AWS Glue console and, on the left-hand menu, click Crawlers followed by Add crawler.
Step 8: Provide a name for the crawler. For this example, we will name the crawler new-york-restaurant-inspection-crawler. Click Next.
Step 9: For the crawler source type, select Data stores and then Crawl all folders.
Step 10: Add the S3 bucket path, select the CSV file in the raw folder, and click Next.
Step 11: Click Yes to add another data store, then click Next.
Step 12: Repeat Step 10, but select the processed folder for the S3 bucket path. Click Next.
Step 13: Since there are no more data stores to add, click No, then Next.
Step 14: Select an existing IAM role for AWS Glue that was created earlier, for example 'AWSGlueServiceRole-learn-glue-role'. Click Next.
Step 15: Create a schedule for the crawler by selecting Run on demand from the drop-down menu.
Step 16: Configure the crawler output. Click Add database and name the database 'restaurant-inspections'. Add a table name prefix of 'raw'. Click Next.
Step 17: Check the input details and click Finish.
The crawler details are shown below:
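If you would rather script this setup, the same crawler can be created and started with boto3. This is a minimal sketch using the names from the steps above; the IAM role is the example role from Step 14:

```python
import boto3

glue = boto3.client("glue")

# Create a crawler equivalent to the console configuration in Steps 7-17
glue.create_crawler(
    Name="new-york-restaurant-inspection-crawler",
    Role="AWSGlueServiceRole-learn-glue-role",
    DatabaseName="restaurant-inspections",
    TablePrefix="raw",
    Targets={
        "S3Targets": [
            {"Path": "s3://new-york-restaurant-inspections/raw/"},
            {"Path": "s3://new-york-restaurant-inspections/processed/"},
        ]
    },
)

# Run the crawler on demand (the console equivalent of Step 18)
glue.start_crawler(Name="new-york-restaurant-inspection-crawler")
```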
Step 18: Select the crawler 'new-york-restaurant-inspection-crawler' and click Run crawler to execute the job.
Step 19: After the crawler completes and returns to a 'Ready' status, you can inspect the metadata by clicking Tables on the left-hand side.
You may inspect the following schema from the data catalog:
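You can also list the tables and their inferred column types programmatically; a small boto3 sketch using the database name from Step 16:

```python
import boto3

glue = boto3.client("glue")

# Print each table the crawler created, along with every column and its inferred type
for table in glue.get_tables(DatabaseName="restaurant-inspections")["TableList"]:
    print(table["Name"])
    for column in table["StorageDescriptor"]["Columns"]:
        print(f"  {column['Name']}: {column['Type']}")
```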
Solution Architecture overview - AWS Glue Studio
Tutorial 2: Data Transformation with AWS Glue Studio
Step 1: Access 'AWS Glue Studio' from the AWS Glue console.
Step 2: From the AWS Glue Studio console, click View jobs.
Step 3: Select the option Visual with a source and target and click Create.
Step 4: Under Node Properties, select S3 bucket followed by Amazon S3.
Step 5: Under the S3 bucket node, select the Data Catalog option, then choose the database name followed by the table name as shown in the image below.
Choose the Apply Mapping node.
Step 6: Under the Apply Mapping node, click on the Transform tab.
Are there any variables that have been incorrectly categorized in the data catalog?
Are there any variables whose data type you need to change?
Are there any columns that you need to drop from your dataset that won't be used in your data analysis or for machine learning?
You may complete some transformations such as changing date variables from the data type 'string' to 'date'.
You may view a listing of common data types in AWS Glue in the documentation.
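In the generated script, these choices show up as an ApplyMapping transform. Here is a hedged sketch of the pattern; the column names are assumptions based on the NYC restaurant inspections dataset, and the raw table name should be replaced with the one shown under Tables in your Data Catalog:

```python
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw table created by the crawler (table name is an assumption; check the catalog)
source_frame = glue_context.create_dynamic_frame.from_catalog(
    database="restaurant-inspections",
    table_name="raw_inspections",
)

# Each tuple is (source column, source type, target column, target type);
# columns not listed in mappings are dropped from the output
mapped_frame = ApplyMapping.apply(
    frame=source_frame,
    mappings=[
        ("camis", "string", "camis", "long"),
        ("dba", "string", "dba", "string"),
        ("inspection date", "string", "inspection_date", "date"),
        ("score", "string", "score", "int"),
    ],
)
```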
Step 7: Click on the Output Schema tab and check the data transformations.
Step 8: Click on the S3 bucket data target node, select Parquet as the format and Snappy as the compression type, and lastly select the S3 bucket path of the processed folder to output the transformed data.
Select the option to create a table in the Data Catalog on subsequent runs and name this table processed.
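In the generated script, this target node becomes a Glue sink that writes Snappy-compressed Parquet and updates the Data Catalog. A rough sketch of that pattern, assuming the glue_context and mapped_frame from the ApplyMapping sketch above:

```python
# Write the transformed DynamicFrame to the processed folder as Snappy-compressed Parquet
# and create/update a 'processed' table in the Data Catalog on each run
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://new-york-restaurant-inspections/processed/",
    compression="snappy",
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
)
sink.setCatalogInfo(
    catalogDatabase="restaurant-inspections",
    catalogTableName="processed",
)
sink.setFormat("glueparquet")
sink.writeFrame(mapped_frame)
```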
Step 9: Click on the Script tab to view the generated Python code for this ETL job.
Step 10: Click on the Job Details tab, provide a name for the job, e.g. 'restaurant inspection job', and a description of the data transformation. Change the number of workers to 2.
Step 11: Click Save and Run to commence the ETL job.
Step 12: The ETL job will take a few minutes to run, and its status will change to 'Succeeded'.
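You can also start and monitor the run from code; a small boto3 sketch, using the example job name from Step 10:

```python
import boto3

glue = boto3.client("glue")

# Start the ETL job and check the status of the run
run_id = glue.start_job_run(JobName="restaurant inspection job")["JobRunId"]
run = glue.get_job_run(JobName="restaurant inspection job", RunId=run_id)

# The state will move from RUNNING to SUCCEEDED (or FAILED) as the job progresses
print(run["JobRun"]["JobRunState"])
```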
Step 13: Navigate back to the AWS Glue console, select Crawlers on the left-hand menu, and re-run 'new-york-restaurant-inspection-crawler' by clicking Run crawler.
After the status changes to 'Ready', click Tables to inspect the processed table in the Data Catalog and confirm the transformed data types.
Resources
Setting up IAM permissions for AWS Glue
Working with geospatial data in QuickSight
Best practice working with geospatial data in Amazon Athena and Glue
AWS Innovate Online Conference APJ - 23 & 25 August 2022
Register at this link
🚀 Hot off the press: Introducing AWS Glue Flex jobs: Cost savings on ETL workloads
AWS Glue Flex jobs are well suited to use cases including:
- Scheduling ETL jobs to run overnight
- Scheduling test workloads
- One-time bulk upload
- Non-time-sensitive workloads
You can read about the new feature and flex pricing from the AWS Big Data Blog here.
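Flex can also be requested from the API by setting the execution class when starting a job run; a minimal boto3 sketch, reusing the example job name from the tutorial above:

```python
import boto3

glue = boto3.client("glue")

# Run an existing Glue job on Flex capacity instead of standard capacity
glue.start_job_run(
    JobName="restaurant inspection job",
    ExecutionClass="FLEX",
)
```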
Until the next lesson, happy learning ! 😁