Chatchai Komrangded (Bas) for AWS Community ASEAN

Posted on Nov 21, 2021

Build your own data quality rules with AWS Glue DataBrew

#aws #awsthai #bigdata #datascience

In a previous post, I showed how AWS Glue DataBrew can help you handle PII data. That post is in Thai, but the screenshots speak for themselves, so English readers can follow along easily.

AWS recently announced that AWS Glue DataBrew users can now define data quality (DQ) rules: customized validation checks that encode the business requirements for specific data. As a result, any data practitioner who does not want to invest in a costly DQ product license, or to use an open-source framework that requires coding knowledge, can define their own quality rules and surface the results in a data quality dashboard and validation report. This makes it quick to view rule outcomes and decide whether the data is fit for use. As a builder, I find that these two new AWS Glue DataBrew features, DQ rules and PII handling, help a lot when designing a DQ process at scale, because there are no compute servers to maintain. In this post I will walk you through how to validate your data quality with AWS Glue DataBrew.

What is AWS Glue DataBrew?
What do you mean by a data quality rule?

Architecture Diagram
Pre-requisites
Preparing dataset
Run data profile without DQ ruleset
Create DQ ruleset
Re-run data profile with DQ ruleset
View DQ dashboard
Conclusion

Architecture Diagram

Pre-requisites

AWS Account❗
Download the data here
Unzip the archive and upload only patient.csv to Amazon S3.
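
If you would rather script the upload than use the S3 console, here is a minimal sketch. It assumes boto3 is installed and your credentials are configured. The bucket name and the databrew/input prefix are placeholders of mine, not values from this walkthrough. The function takes the S3 client as an argument, so you can pass it boto3.client("s3").

```python
from pathlib import Path


def upload_patient_csv(s3_client, local_path, bucket, prefix="databrew/input"):
    """Upload one local CSV file to S3 and return the object key that was used.

    The prefix is a hypothetical default chosen for this example.
    """
    path = Path(local_path)
    if path.suffix.lower() != ".csv":
        raise ValueError(f"expected a .csv file, got {path.name}")
    key = f"{prefix.strip('/')}/{path.name}"
    # upload_file(Filename, Bucket, Key) is the boto3 S3 client upload call.
    s3_client.upload_file(str(path), bucket, key)
    return key


# Example usage (needs boto3 and AWS credentials; "my-bucket" is a placeholder):
# import boto3
# upload_patient_csv(boto3.client("s3"), "patient.csv", "my-bucket")
```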

Preparing dataset

Go to the AWS Glue DataBrew console here

In the left navigation pane, click Datasets to create a dataset for Glue DataBrew.

Click Create new dataset.

Choose the patient.csv file that you just uploaded. You also have the option to get the data from the Glue Data Catalog, Amazon Redshift, Amazon AppFlow, or Snowflake. Then click Create dataset.
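
The same dataset can also be registered from code. The sketch below is not part of the console walkthrough: it assumes the boto3 "databrew" client and its create_dataset call, and the dataset name, bucket, and key in the usage comment are placeholders.

```python
def build_dataset_request(name, bucket, key):
    """Build the keyword arguments for the DataBrew CreateDataset call."""
    return {
        "Name": name,
        "Format": "CSV",
        # Point the dataset at the CSV object uploaded to S3.
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }


def create_dataset(databrew_client, name, bucket, key):
    """Register the dataset and return the name reported by the service."""
    response = databrew_client.create_dataset(**build_dataset_request(name, bucket, key))
    return response["Name"]


# Example usage (needs boto3 and AWS credentials; all values are placeholders):
# import boto3
# create_dataset(boto3.client("databrew"), "Patient", "my-bucket",
#                "databrew/input/patient.csv")
```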

Run data profile without DQ ruleset

Under Datasets, choose the Patient dataset and click "Run data profile".

Wait a few minutes for the data profile job to populate the Data profile overview, Column statistics, Data quality rules suggestions, and Data lineage. Once it is done, click "View data profile".
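
If you want to trigger the profile job from a script instead of waiting in the console, a sketch like the one below could work. It assumes the profile job already exists (the console creates one when you click "Run data profile") and uses the boto3 DataBrew calls start_job_run and describe_job_run. The polling interval and retry limit are arbitrary choices of mine.

```python
import time

# Job run states after which polling should stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def run_profile_job(databrew_client, job_name, poll_seconds=30, max_polls=60,
                    sleep=time.sleep):
    """Start a DataBrew job run and poll until it reaches a terminal state.

    Returns (run_id, state). Raises TimeoutError if the run is still in
    progress after max_polls checks.
    """
    run_id = databrew_client.start_job_run(Name=job_name)["RunId"]
    for _ in range(max_polls):
        state = databrew_client.describe_job_run(Name=job_name, RunId=run_id)["State"]
        if state in TERMINAL_STATES:
            return run_id, state
        sleep(poll_seconds)
    raise TimeoutError(f"job {job_name!r} run {run_id} did not finish in time")


# Example usage (needs boto3 and AWS credentials; the job name is a placeholder):
# import boto3
# run_profile_job(boto3.client("databrew"), "my-profile-job")
```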

You should see all the statistics that help you understand your dataset, as follows:

Dataset preview

Data profile overview

Potential PII detection

Attribute Correlations

Individual column-level statistics

A summary of statistics for all columns, such as min, max, distribution, column type, and unique values

You can also drill down into the columns identified as PII in the all-columns statistics view.

You can also see the data lineage!

Enough data exploration. This post is about data quality, so let's look at the Data quality tab. It is empty, because you don't have any rules yet. However, once the data profile job has run, DataBrew suggests rules on the right-hand side, based on standard DQ dimensions such as uniqueness and completeness.

Create DQ ruleset

You can pick the suggestions that make sense for your business criteria. Again, no code is required! I chose to check the uniqueness of Id and SSN, the length of SSN, and so on.
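
To make it concrete what rules like "Id and SSN are unique" and "SSN has a fixed length" actually check, here is a small local sketch in plain Python. It is only an illustration of the idea, not how DataBrew evaluates rules. The sample rows are made up, and the expected SSN length of 11 characters assumes the NNN-NN-NNNN format.

```python
import csv
import io
from collections import Counter


def duplicate_values(rows, column):
    """Return the values that appear more than once in the given column."""
    counts = Counter(row[column] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def wrong_length_values(rows, column, expected_length):
    """Return the values whose string length differs from expected_length."""
    return [row[column] for row in rows if len(row[column]) != expected_length]


# Made-up sample data, standing in for patient.csv.
SAMPLE = """Id,SSN
1,999-12-3456
2,999-12-3456
3,999-1-22
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))

print(duplicate_values(rows, "Id"))          # [] -> the uniqueness rule passes
print(duplicate_values(rows, "SSN"))         # ['999-12-3456'] -> the rule fails
print(wrong_length_values(rows, "SSN", 11))  # ['999-1-22'] -> the length rule fails
```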

Again, these are recommendations from AWS Glue DataBrew; you can create your own DQ rules later. When I am done selecting, I click "Create ruleset".

I can add, remove, adjust, or review the DQ rules I have just added.

After I click "Create ruleset", I can see my DQ rulesets here

Click the DQ ruleset name (in my case, "DQ Rulesets for Patient data") to review all of your DQ rules.

Next, associate the DQ ruleset with the data profile job. Click here to see all your data profile jobs

In my case it is "Patients - Data Profile job". Select it, then click Actions and Edit.

Look for "Data quality rules" and click "Apply data quality ruleset".

Choose your DQ ruleset, click Apply selected rulesets, and then click Save.
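
The console steps above attach the ruleset to the profile job. As a rough scripted equivalent, the sketch below assumes the boto3 DataBrew call update_profile_job and its ValidationConfigurations parameter. The job name, ARNs, and bucket are placeholders you would replace with your own. I have not verified how the service treats optional job settings that are left out of an update, so review the job afterwards.

```python
def build_profile_job_update(job_name, ruleset_arn, role_arn, output_bucket,
                             output_key=None):
    """Build keyword arguments for UpdateProfileJob that attach one ruleset."""
    output_location = {"Bucket": output_bucket}
    if output_key:
        output_location["Key"] = output_key
    return {
        "Name": job_name,
        "RoleArn": role_arn,
        "OutputLocation": output_location,
        # CHECK_ALL asks the profile job to evaluate every rule in the ruleset.
        "ValidationConfigurations": [
            {"RulesetArn": ruleset_arn, "ValidationMode": "CHECK_ALL"}
        ],
    }


def apply_ruleset(databrew_client, job_name, ruleset_arn, role_arn, output_bucket,
                  output_key=None):
    """Attach the ruleset to an existing profile job and return the job name."""
    request = build_profile_job_update(job_name, ruleset_arn, role_arn,
                                       output_bucket, output_key)
    return databrew_client.update_profile_job(**request)["Name"]
```

You would call apply_ruleset with boto3.client("databrew") and copy the ruleset ARN and role ARN from the console.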

Re-run data profile with DQ ruleset

Select your profile job and click "Run job".

View DQ dashboard

Wait a few minutes, and you should see the results of the data profile job, including the DQ results based on your DQ rules!

My dataset passed 4 rules and failed 6, so I obviously have to fix my data before any downstream process consumes it!
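
In an automated pipeline you would want this pass/fail outcome to stop downstream processing. The sketch below shows one way to express that gate. The list-of-dicts structure and the rule names are hypothetical, invented for this example; they are not the format of the DataBrew validation report, which you would need to parse into this shape first.

```python
def summarize(results):
    """Count passed and failed rules from a list of {"rule", "passed"} dicts."""
    failed = [r["rule"] for r in results if not r["passed"]]
    return {
        "passed": len(results) - len(failed),
        "failed": len(failed),
        "failed_rules": failed,
    }


def require_fit_for_use(results):
    """Return the summary, or raise ValueError if any rule failed."""
    summary = summarize(results)
    if summary["failed"]:
        raise ValueError(f"{summary['failed']} DQ rule(s) failed: {summary['failed_rules']}")
    return summary


# Hypothetical results mirroring the 4 passed / 6 failed outcome above.
example = (
    [{"rule": f"rule-{i}", "passed": True} for i in range(1, 5)]
    + [{"rule": f"rule-{i}", "passed": False} for i in range(5, 11)]
)
print(summarize(example)["passed"], summarize(example)["failed"])  # 4 6
```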

Conclusion

AWS Glue DataBrew is a visual data preparation tool that makes it easy to clean and normalize data using over 250 pre-built transformations, all without writing any code. With the new DQ and PII handling features, you can add your own data quality rules and PII tokenization to your automated data pipelines (Amazon Managed Workflows for Apache Airflow or AWS Step Functions) to make sure your data is clean and protected.

You can think about how to add PII obfuscation and DQ zones to your pipeline like this