If you are looking to deploy enterprise-level analytics workloads in the cloud on AWS, one of the most important choices you will make is your ETL development, integration, and data pipeline platform. On top of that, you need to plan for the most critical stage of any ETL project: data preparation.
Why do you need to focus on data preparation?
Data volumes in organizations are increasing at an unprecedented rate, exploding from terabytes to petabytes and, in some cases, exabytes. Simply learning about the vast sources of data you need to work with for your ETL projects can be overwhelming. Data preparation is a critical and complex stage in itself, and it is the foundation of your data lakes and analytics.
Most data preparation projects involve pooling large data inputs and talking to different business owners and analysts to understand this vast amount of data. The data preparation process should be correct and accurate right from the start, because its output is what you feed into your ETL processes and tasks. Garbage in, garbage out. As with your daily health regimen, you don’t want to feed your ETL projects garbage, right?
With AWS Glue Studio, data preparation for ETL jobs can be done with little to no code scripting.
What is AWS Glue?
AWS Glue is a fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams. AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.
AWS Glue is completely serverless, so there’s no infrastructure to set up or manage, and ETL developers and data scientists don’t have to spend as much time developing scripts.
Who is the target audience for using AWS Glue?
Fundamentally, it's you and me. If you are a cloud engineer or architect involved in ETL projects that leverage the AWS platform, then AWS Glue is for you. AWS Glue is also for:
- Users who are experienced with traditional GUI-based ETL tools such as Talend or Informatica.
- Engineers who define data cleansing and data normalization jobs and are familiar with SQL, but not with Python or Scala. I fall into this group.
- Engineers who do not have experience with Apache Spark. I also fall into this group.
How do you start building ETL projects with AWS Glue?
Simple. Use AWS Glue Studio.
AWS Glue Studio is a fun GUI experience that helps you put together all the components of your ETL project: your source data location, your transformation tasks, and your target data location. Transformation tasks are the changes you need to apply to your source data before it is transferred to the target location.
In AWS Glue Studio, these ETL components and transformation tasks are all just a click away and there’s no complex scripting involved.
The most common transformation tasks:
- ApplyMapping - when you need to change column names, change data types, or drop columns.
- Join - when you need to join datasets (inner, left, right, and full outer joins).
- Filter - when you need to filter rows out of datasets, for example keeping only the first 100 rows that meet a specific criterion.
- SQL Transform - when you need to write Spark SQL to transform data.
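Under the hood, Glue Studio generates PySpark code for these transforms. As a rough illustration of what a Filter transform amounts to, here is a plain-Python sketch (the sample rows and the `fine_amount >= 100` criterion are made up, and this is not the actual Glue API):

```python
# Plain-Python sketch of a Filter transform: keep only rows matching a
# criterion, capped at the first 100 matches. Sample rows are illustrative;
# in a real Glue job this runs as generated PySpark code over the dataset.

rows = [
    {"ticket_number": 1, "fine_amount": 30},
    {"ticket_number": 2, "fine_amount": 150},
    {"ticket_number": 3, "fine_amount": 200},
]

# Keep rows meeting the criterion, limited to the first 100.
filtered = [r for r in rows if r["fine_amount"] >= 100][:100]

print(filtered)
```

The visual editor builds the equivalent filter condition for you from dropdowns, so you never have to write this logic by hand.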
Let me quickly share a short demo of creating an ETL job in AWS Glue Studio using a prepared dataset of Toronto parking tickets. Here’s the CloudFormation stack to help you get the parameters and quickly get started with AWS Glue. Make sure that you are already logged in to your AWS console account before launching the stack creation.
If you do not have an AWS account yet, you can create one here.
Once the stack creation is complete, launch the AWS Glue console and, in the left-side menu, click Databases and then Tables. You should have the same database and tables prepared as mine. Refer to the screenshots below of the 1 database and 3 tables created.
As I have mentioned above, AWS Glue supports a plethora of source and target data platforms. For the purposes of this article, we will be using Data Catalog tables as our sources and S3 as our target.
Now that you have completed the CloudFormation stack, let's open AWS Glue Studio.
Click the ‘Create’ button at the top right and select a blank editor. Now let’s add our first data source, which will be an S3 bucket created by the CloudFormation stack earlier.
Click on ‘Source’ and select S3 from the list of source options.
On the ‘Data Source Properties’ tab, select the database ‘yyz-tickets’ and select ‘trials’ from the table dropdown list. Create another S3 source node and do the same thing, but this time select ‘tickets’ from the table dropdown list. At this point, you should have two S3 nodes as your sources. Feel free to name the S3 sources ‘tickets’ and ‘trials’ respectively.
Let’s do some transformation on the ‘tickets’ source. From ‘Transform’ in the top menu, select ‘ApplyMapping’. On the ‘Apply mapping’ tab, the columns and data types for ‘tickets’ are displayed. For the column ticket_number, change the data type to ‘int’, and drop the columns location1, location2, location3, location4, and province by ticking the checkbox beside each one under the ‘Drop’ column.
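To make the ApplyMapping step concrete, here is a plain-Python sketch of what it does: each mapping entry is a (source field, target field, target type) tuple, casts are applied, and anything not mapped is dropped. The sample row and the helper function are illustrative assumptions, not the Glue API:

```python
# Conceptual sketch of ApplyMapping as (source, target, type) tuples.
# The sample row is made up; real Glue jobs express this as generated PySpark.

mappings = [
    ("ticket_number", "ticket_number", "int"),            # cast string -> int
    ("infraction_code", "infraction_code", "int"),
    ("infraction_description", "infraction_description", "string"),
    # location1..location4 and province are absent from the mapping -> dropped
]

def apply_mapping(row, mappings):
    casts = {"int": int, "string": str}
    return {target: casts[ttype](row[source]) for source, target, ttype in mappings}

row = {
    "ticket_number": "12345",
    "infraction_code": "3",
    "infraction_description": "PARK ON PRIVATE PROPERTY",
    "province": "ON",
    "location1": "NR",
}

mapped = apply_mapping(row, mappings)
print(mapped)
```

In Glue Studio you get the same effect purely by editing the mapping table and ticking the ‘Drop’ checkboxes.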
For the ‘trials’ source, let’s apply some mapping changes too. Click on the ‘Transform’ tab and select ‘ApplyMapping’. In the right pane, where the columns and data types are displayed, change the data type of the parking_ticket_number column to ‘int’.
Let’s create a Join transformation next. Click on the ‘Transform’ tab and select ‘Join’ from the options. In the right pane, under ‘Node Properties’, select both of the ApplyMapping transformations for tickets and trials that we created in the previous steps. This should draw two arrows, one from each ApplyMapping transformation, going into the Join transformation. See the screenshot below.
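Conceptually, this Join node matches each ticket to its trial record on the ticket-number keys from the two mapped sources. A minimal plain-Python sketch of that inner join (sample rows are made up; Glue generates PySpark for the real thing):

```python
# Inner-join sketch: match tickets to trials on their ticket-number keys.
# Sample rows are illustrative only.

tickets = [
    {"ticket_number": 1, "infraction_description": "PARK ON PRIVATE PROPERTY"},
    {"ticket_number": 2, "infraction_description": "PARK FAIL TO DISPLAY RECEIPT"},
]
trials = [
    {"parking_ticket_number": 2, "court_location": "OLD CITY HALL"},
]

# Index trials by their join key for O(1) lookup.
trials_by_key = {t["parking_ticket_number"]: t for t in trials}

# Keep only tickets that have a matching trial, merging the two rows.
joined = [
    {**ticket, **trials_by_key[ticket["ticket_number"]]}
    for ticket in tickets
    if ticket["ticket_number"] in trials_by_key
]

print(joined)
```

The Join node's ‘Join conditions’ section is where you pick these two key columns in the GUI.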
Let's try an Aggregate transformation too. To create an aggregation, let's count all infraction_code values grouped by court_location. In effect, we want to apply the SQL statement below:
```sql
SELECT court_location AS location, infraction_description AS infraction, COUNT(infraction_code) AS total
GROUP BY infraction_description, infraction_code, court_location
ORDER BY court_location ASC
```
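To see what this aggregation produces, here is a small sqlite3 sketch run against made-up sample rows. The table name `joined_tickets` and the data are assumptions added for illustration (the aggregate node operates on the joined dataset, so no FROM clause appears in the editor); in Glue Studio the same logic runs as Spark SQL:

```python
import sqlite3

# Tiny in-memory table standing in for the joined tickets/trials dataset.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE joined_tickets (
        court_location TEXT,
        infraction_description TEXT,
        infraction_code INTEGER
    )
""")
conn.executemany(
    "INSERT INTO joined_tickets VALUES (?, ?, ?)",
    [
        ("OLD CITY HALL", "PARK ON PRIVATE PROPERTY", 3),
        ("OLD CITY HALL", "PARK ON PRIVATE PROPERTY", 3),
        ("NORTH YORK", "PARK FAIL TO DISPLAY RECEIPT", 7),
    ],
)

# The same aggregation as in the article, with an explicit FROM for sqlite3.
rows = conn.execute("""
    SELECT court_location AS location,
           infraction_description AS infraction,
           COUNT(infraction_code) AS total
    FROM joined_tickets
    GROUP BY infraction_description, infraction_code, court_location
    ORDER BY court_location ASC
""").fetchall()

print(rows)
```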
At this point, your visual editor should be similar to below screenshot:
For the target destination, create a target by clicking ‘Target’ in the top menu and selecting ‘S3’. Enter the S3 bucket details, or simply click ‘Browse’ to open a popup window displaying all your S3 buckets and pick one. Select ‘parquet’ as the format and GZIP as the compression type.
Your data properties for S3 should be something like below screenshot:
From the top tabs, select ‘Job Details’ to open the job details screen. For the IAM role, select the default AWSGlueStudioRole. For the job bookmark option, select ‘Disable’, and for the number of retries, key in 1. Leave all other entries as is. Refer to my screenshot below.
Click the ‘Save’ button to save the entire ETL job. Click ‘Run’ to start the ETL job.
While the ETL job is running, you can monitor the task details on the ‘Runs’ tab, or navigate to the ‘Monitoring’ menu in the left pane to see more details about your job run.
For job runs that you need to schedule, click on the ‘Schedule’ tab and create a new schedule.
To verify that the job run was successful, go back to the ‘Runs’ tab and check that the status is ‘Succeeded’. You can then navigate to the target S3 bucket you entered in the ‘Target’ tab and verify that the GZIP-compressed Parquet files are there.
In this post, I shared information about AWS Glue, AWS's ETL tool, and briefly discussed data preparation and why it is critical to your ETL projects. We also went through a simple demo of AWS Glue Studio, showing how easily you can navigate the editor and create ETL jobs through the GUI with little scripting involved.
I hope you are able to progress to more complex use cases and eventually launch to production using AWS Glue and AWS Glue Studio.