Business users routinely format, transform, and summarize data to derive insights. Even when data is loaded into a data warehouse through ETL, users often still have to apply additional transformations in their BI tools.
As modern enterprises increasingly adopt Lake House architectures, there is an even greater need to simplify data preparation in a way that scales to large datasets.
AWS Glue DataBrew is a new visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning.
It allows you to:
- Clean and Transform
- Track Lineage
- Orchestrate and Automate
In the remaining sections, I will demonstrate these capabilities using a simple customer segmentation dataset - Link
The first thing we do with a dataset is profile it, either with code or, as here, with a few clicks in DataBrew.
This takes us to the "Create job" page, where we create a profile job that generates summary statistics on our data.
Once the profile job is complete, it provides a profile overview, detailed column statistics, and lineage.
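For readers who prefer code to clicks, the same profile job can be sketched with boto3's `databrew` client. This is a minimal sketch, not the console's exact behavior: the dataset name, bucket, key prefix, and role ARN are all placeholders, and the dataset is assumed to be registered in DataBrew already.

```python
def build_profile_job_request(dataset_name, bucket, role_arn,
                              key_prefix="profile-output/"):
    """Assemble keyword arguments for DataBrew's create_profile_job call."""
    return {
        "Name": f"{dataset_name}-profile-job",
        "DatasetName": dataset_name,
        # Where the profile results (summary statistics) are written in S3.
        "OutputLocation": {"Bucket": bucket, "Key": key_prefix},
        "RoleArn": role_arn,
    }


def run_profile_job(request):
    """Create the profile job and start a run. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without the SDK

    client = boto3.client("databrew")
    client.create_profile_job(**request)
    return client.start_job_run(Name=request["Name"])["RunId"]


profile_request = build_profile_job_request(
    dataset_name="customer-segmentation",  # hypothetical dataset name
    bucket="my-databrew-bucket",           # hypothetical S3 bucket
    role_arn="arn:aws:iam::123456789012:role/DataBrewRole",  # placeholder role
)
```

Calling `run_profile_job(profile_request)` would submit the job; it is kept separate from the request builder so the request can be inspected before anything is sent to AWS.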
Next, we will transform the data by creating a project on the dataset. We can create a project from the Datasets view by specifying the project name and recipe details. For the recipe, we can either create a new one or pick one from the existing list.
While creating recipes, transforms are applied to sample data. For this project, we will select 5000 random rows.
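The project setup, including the random sample of 5000 rows, could look roughly like this through boto3's `create_project` call. All names and the role ARN are placeholders, and the sketch assumes the referenced recipe is available to the API; the console handles recipe creation for you.

```python
def build_project_request(project_name, dataset_name, recipe_name, role_arn,
                          sample_size=5000):
    """Assemble keyword arguments for DataBrew's create_project call."""
    return {
        "Name": project_name,
        "DatasetName": dataset_name,
        "RecipeName": recipe_name,
        # RANDOM picks rows at random; FIRST_N and LAST_N are the other types.
        "Sample": {"Type": "RANDOM", "Size": sample_size},
        "RoleArn": role_arn,
    }


def create_project(request):
    """Create the project. Needs boto3 and AWS credentials."""
    import boto3  # lazy import keeps the builder usable without the SDK

    return boto3.client("databrew").create_project(**request)["Name"]


project_request = build_project_request(
    project_name="customer-segmentation-project",  # hypothetical names
    dataset_name="customer-segmentation",
    recipe_name="customer-segmentation-recipe",
    role_arn="arn:aws:iam::123456789012:role/DataBrewRole",  # placeholder role
)
```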
The default view of this page shows us sample rows, distributions, and recipe steps. It also shows the list of transformations available for the dataset.
Transformation steps can be added one at a time; together they make up a recipe.
DataBrew provides over 250 built-in transformations to visualize, clean, and normalize your data with an interactive, point-and-click visual interface.
For this dataset, we will create a recipe with two transformations:
- Filter customers with age greater than or equal to 30
- Encode Gender as 'M' or 'F'
From the Clean category, select "Replace value or pattern".
Select the source column and the value to be replaced.
Note - We need two steps here, one for each value to be replaced.
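To make the recipe's effect concrete, here is a plain-Python sketch of what the steps do to each row. It is not DataBrew's recipe format. The column names `Age` and `Gender` and the source values `Male` and `Female` are assumptions about the dataset, and the three sample rows are made up for illustration.

```python
def filter_min_age(rows, min_age=30):
    """Keep only rows whose Age is greater than or equal to min_age."""
    return [row for row in rows if int(row["Age"]) >= min_age]


def replace_value(rows, column, old, new):
    """Replace one exact value in one column; other values are left alone."""
    return [
        {**row, column: new if row[column] == old else row[column]}
        for row in rows
    ]


def apply_recipe(rows):
    """Apply the steps in order, as a recipe would."""
    rows = filter_min_age(rows, 30)
    # Two replace steps, one for each value to be replaced.
    rows = replace_value(rows, "Gender", "Male", "M")
    rows = replace_value(rows, "Gender", "Female", "F")
    return rows


# Made-up sample rows for illustration only.
sample_rows = [
    {"CustomerID": "1", "Gender": "Male", "Age": "19"},
    {"CustomerID": "2", "Gender": "Female", "Age": "30"},
    {"CustomerID": "3", "Gender": "Male", "Age": "45"},
]
transformed = apply_recipe(sample_rows)
```

Customer 1 is dropped by the age filter, and the two remaining rows come out with `Gender` set to `F` and `M` respectively.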
We created a profile job earlier. We will now create a recipe job that transforms the entire dataset.
Once the job is created and has run, we get the transformed dataset along with job run and history details.
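A recipe job can also be created and run from code. The sketch below uses boto3's `create_recipe_job`, `start_job_run`, and `describe_job_run` calls. The job, dataset, recipe, and bucket names, the output key, and the role ARN are placeholders, and it assumes the recipe has already been published so that a version such as `1.0` exists.

```python
import time


def build_recipe_job_request(job_name, dataset_name, recipe_name, bucket,
                             role_arn, recipe_version="1.0"):
    """Assemble keyword arguments for DataBrew's create_recipe_job call."""
    return {
        "Name": job_name,
        "DatasetName": dataset_name,
        "RecipeReference": {"Name": recipe_name, "RecipeVersion": recipe_version},
        # The transformed dataset is written to this S3 location as CSV.
        "Outputs": [
            {"Location": {"Bucket": bucket, "Key": "transformed/"}, "Format": "CSV"}
        ],
        "RoleArn": role_arn,
    }


def run_recipe_job(request, poll_seconds=30):
    """Create the job, start a run, and poll until it leaves an active state."""
    import boto3  # lazy import keeps the builder usable without the SDK

    client = boto3.client("databrew")
    client.create_recipe_job(**request)
    run_id = client.start_job_run(Name=request["Name"])["RunId"]
    while True:
        run = client.describe_job_run(Name=request["Name"], RunId=run_id)
        if run["State"] not in ("STARTING", "RUNNING", "STOPPING"):
            return run["State"]  # for example SUCCEEDED or FAILED
        time.sleep(poll_seconds)


recipe_job_request = build_recipe_job_request(
    job_name="customer-segmentation-recipe-job",  # hypothetical names
    dataset_name="customer-segmentation",
    recipe_name="customer-segmentation-recipe",
    bucket="my-databrew-bucket",
    role_arn="arn:aws:iam::123456789012:role/DataBrewRole",  # placeholder role
)
```

Unlike the project, which works on a sample, this job applies the recipe to the full dataset referenced by `DatasetName`.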
Lineage shows the various components and their relationships to the output.
DataBrew provides a powerful abstraction for data preparation. As shown, it speeds up and simplifies the data prep process, allowing users to focus on the insights and decisions that drive their business.