In the previous blog, you mastered ETL on sample sales data using both AWS Glue and AWS Glue Studio.
The third tutorial explored how to query data in your Amazon S3 bucket using Amazon Athena, and how to create data visualizations in Amazon QuickSight from an Amazon Athena table that references the sales database stored in the AWS Glue Data Catalog.
In this new tutorial, you will learn how to:
a) Locate open-source datasets for your passion project and build a test or production workload that you can use with AWS services.
b) Complete data transformation on your own data using both AWS Glue and AWS Glue Studio.
Some places to find open datasets:
- Government websites, e.g. data.gov, data.gov.uk, data.gov.au
- Census data, e.g. the Australian census
Step 1: For this tutorial, download the DOHMH New York City Restaurant Inspections data as a CSV file from this link.
A preview of the raw dataset:
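If you want a quick look at the file before uploading it, the standard library's csv module is enough. The column names below come from the public DOHMH dataset; the two rows are made-up values for illustration only:

```python
import csv
import io

# A few rows in the shape of the DOHMH CSV (column names taken from the
# public dataset; the values here are invented for illustration).
sample = io.StringIO(
    "CAMIS,DBA,BORO,CUISINE DESCRIPTION,INSPECTION DATE,SCORE,GRADE\n"
    "41234567,EXAMPLE DINER,Manhattan,American,06/15/2022,12,A\n"
    "41234568,SAMPLE PIZZA,Brooklyn,Pizza,03/02/2022,25,B\n"
)

reader = csv.DictReader(sample)
rows = list(reader)

print(reader.fieldnames)                            # column headers
print(rows[0]["DBA"], rows[0]["INSPECTION DATE"])   # first data row
```

Note that every value, including SCORE and INSPECTION DATE, is read as a string; this matters later when we fix data types in AWS Glue Studio.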
Step 2: Log into your AWS account. If you do not have an AWS account you may create one here
Step 3: You may follow the instructions here to create your first S3 bucket and give it a name new-york-restaurant-inspections.
Step 4: Click the bucket and select Create Folder.
Create two folders within the Amazon S3 bucket and name them raw and processed respectively as shown below:
Step 5: Open the raw folder, click Upload, then Add files, and upload the CSV file you downloaded.
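If you prefer the AWS SDK to the console, Steps 4 and 5 can be scripted with boto3. This is a minimal sketch, assuming boto3 is installed and your AWS credentials are configured; the bucket and folder names are the ones used in this tutorial:

```python
BUCKET = "new-york-restaurant-inspections"  # bucket name from Step 3

def s3_key(folder: str, filename: str) -> str:
    # S3 has no real folders: "raw/" and "processed/" are just key prefixes.
    return f"{folder}/{filename}"

def upload_raw(local_path: str) -> None:
    import boto3  # imported lazily so the helper above works without the SDK
    s3 = boto3.client("s3")
    filename = local_path.rsplit("/", 1)[-1]
    s3.upload_file(local_path, BUCKET, s3_key("raw", filename))

# Example (requires credentials and the downloaded file):
# upload_raw("DOHMH_New_York_City_Restaurant_Inspection_Results.csv")
```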
Step 6: Set up basic IAM permissions to use the AWS Glue console. You may follow this blog for more details.
If you need access to more AWS resources, such as Amazon SageMaker notebooks, you may follow the instructions here.
Step 7: Navigate to the AWS Glue console and, on the left-hand menu, click Crawlers followed by Add crawler.
Step 8: Provide a name for the crawler. For this example, we will name it new-york-restaurant-inspection-crawler. Click Next.
Step 10: Add the S3 bucket path, selecting the CSV file from the raw folder, and click Next.
Step 11: Click Yes to add another data store, then Next.
Step 12: Repeat Step 10, but select the processed folder for the S3 bucket path. Click Next.
Step 13: As there are no more data stores to add, click No, then Next.
Step 14: You may select an existing IAM role for AWS Glue that was created earlier for example 'AWSGlueServiceRole-learn-glue-role'. Click Next.
Step 15: Create a schedule for the crawler by selecting Run on demand from the drop-down menu.
Step 16: Configure the crawler output. Click Add database and provide a name for the database, 'restaurant-inspections'. Add a prefix for the table names, raw. Click Next.
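Alternatively, Steps 7 through 16 can be expressed as a single boto3 call. This is a sketch, assuming the SDK is installed, credentials are configured, and the IAM role from Step 14 already exists; all names and paths are the tutorial's:

```python
# Crawler configuration mirroring Steps 7-16 (names, role, and paths are
# the ones used in this tutorial and assumed to exist in your account).
crawler_config = {
    "Name": "new-york-restaurant-inspection-crawler",
    "Role": "AWSGlueServiceRole-learn-glue-role",
    "DatabaseName": "restaurant-inspections",
    "TablePrefix": "raw",
    "Targets": {
        "S3Targets": [
            {"Path": "s3://new-york-restaurant-inspections/raw/"},
            {"Path": "s3://new-york-restaurant-inspections/processed/"},
        ]
    },
}

def create_and_run_crawler(config: dict) -> None:
    import boto3  # lazy import: the config dict above is usable without the SDK
    glue = boto3.client("glue")
    glue.create_crawler(**config)            # Steps 7-16
    glue.start_crawler(Name=config["Name"])  # equivalent of clicking Run crawler
```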
The crawler details are shown below:
Step 18: Click the crawler 'new-york-restaurant-inspection-crawler' and select Run crawler to execute the job.
Step 19: After the crawler job completes with a 'Ready' status, you can inspect the metadata by clicking Tables on the left-hand side.
You may inspect the following schema from the data catalog:
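Behind the scenes, the crawler infers each column's type by sampling its values. The classifier below is a toy illustration of that idea, not Glue's actual classifier; the type names loosely follow the Data Catalog's:

```python
from datetime import datetime

def infer_type(values: list[str]) -> str:
    """Toy type inference: try progressively looser parsers on every value."""
    def all_match(parse) -> bool:
        try:
            for v in values:
                parse(v)
            return True
        except ValueError:
            return False

    if all_match(int):
        return "bigint"
    if all_match(float):
        return "double"
    if all_match(lambda v: datetime.strptime(v, "%m/%d/%Y")):
        return "date"
    return "string"

print(infer_type(["12", "25", "7"]))             # bigint
print(infer_type(["06/15/2022", "03/02/2022"]))  # date
print(infer_type(["American", "Pizza"]))         # string
```

This also shows why a crawler can mis-classify a column: a ZIP code column like "10029" parses as an integer, which is exactly the kind of thing to check in the next section.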
Step 1: Access 'AWS Glue Studio' from the AWS Glue console.
Step 2: From the AWS Glue Studio console, click View jobs.
Step 3: Select the option Visual with a source and target and click Create.
Step 4: Under Node properties, select S3 bucket followed by Amazon S3.
Step 5: Under the S3 bucket node, select from the option Data catalog the database name followed by the table name as shown in the image below.
Then choose the Apply mapping node.
Step 6: Under the Apply mapping node, click on the Transform tab and consider the following:
- Are there any variables that have been incorrectly categorized in the Data Catalog?
- Are there any variables whose data type you need to change?
- Are there any columns, unused in your data analysis or machine learning, that you should drop from the dataset?
You may complete some transformations, such as changing date variables from the data type 'string' to 'date'.
You can view a list of common data types in AWS Glue in the documentation.
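As a concrete example, the DOHMH inspection dates arrive as strings like "06/15/2022". The string-to-date cast you configure in the Apply mapping node is equivalent to this plain-Python conversion (format string assumed from the dataset's MM/DD/YYYY style):

```python
from datetime import date, datetime

def parse_inspection_date(raw: str) -> date:
    # DOHMH dates are formatted MM/DD/YYYY; the crawler saw them as strings.
    return datetime.strptime(raw, "%m/%d/%Y").date()

d = parse_inspection_date("06/15/2022")
print(d, type(d).__name__)  # 2022-06-15 date
```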
Step 7: Click on the Output schema tab and check the data transformations.
Step 8: Click the S3 bucket data target node, select Parquet as the format and Snappy as the compression type, and lastly select the S3 bucket path for the processed folder as the output location for the transformed data.
Select the option to create a table in the Data Catalog on subsequent runs, and name the table processed.
Step 9: Click on the Script tab to view the generated Python code for this ETL job.
Step 10: Click on the Job details tab, provide a name for the job, e.g. restaurant-inspection-job, and a description of the data transformation. Change the number of workers to 2.
Step 11: Click Save and Run to commence the ETL job.
Step 12: The ETL job will take a few minutes to run; once it finishes, its status will change to 'Succeeded'.
Step 13: Navigate back to the AWS Glue console, select Crawlers on the left-hand menu, and re-run 'new-york-restaurant-inspection-crawler' by clicking Run crawler.
After the status changes to 'Ready', click Tables to inspect the processed table in the Data Catalog and its transformed data types.
Register at this link
AWS Glue Flex execution is a good fit for use cases such as:
- Scheduling ETL jobs to run overnight
- Scheduling test workloads
- One-time bulk upload
- Non-time-sensitive workloads
You can read about the new feature and Flex pricing on the AWS Big Data Blog here.
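If you start jobs from the SDK rather than the console, Flex is selected with the ExecutionClass parameter. A sketch, reusing this tutorial's job name and assuming boto3 and credentials are in place:

```python
JOB_NAME = "restaurant-inspection-job"  # the job created in AWS Glue Studio above

def run_job_flex(job_name: str) -> str:
    import boto3  # lazy import so the constant above is usable without the SDK
    glue = boto3.client("glue")
    # ExecutionClass="FLEX" runs the job on spare capacity at a lower price,
    # in exchange for variable start and completion times.
    resp = glue.start_job_run(JobName=job_name, ExecutionClass="FLEX")
    return resp["JobRunId"]

# Example (requires credentials and the job to exist):
# run_id = run_job_flex(JOB_NAME)
```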
Until the next lesson, happy learning! 😁