In a previous post, I showed how AWS Glue DataBrew can help you handle PII data. That post is in Thai, but the screenshots speak for themselves, so English readers can easily follow along.
AWS recently announced that AWS Glue DataBrew users can now define data quality (DQ) rules: customized validation checks that encode the business requirements for a specific dataset. As a result, any data practitioner who does not want to pay for an expensive DQ product license, or to use an open-source framework that requires coding, can define their own quality rules and see the results in a data quality dashboard and validation report. That makes it quick to review rule outcomes and decide whether the data is fit for use. As a builder, I find that these two new DataBrew features, DQ rules and PII handling, help a lot when designing a DQ process at scale, because there are no compute servers to maintain. In this post I will walk you through how to validate your data quality with AWS Glue DataBrew.
What is AWS Glue DataBrew?
What do we mean by a data quality rule?
Table Of Contents
- Architecture Diagram
- Preparing dataset
- Run data profile without DQ ruleset
- Create DQ ruleset
- Re-run data profile with DQ ruleset
- View DQ dashboard
- AWS Account❗
- Download Data here
- Unzip it, and upload only patient.csv to Amazon S3.
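If you would rather script the upload than use the S3 console, a minimal sketch with boto3 could look like the following. The `raw/` prefix and the helper names are my own placeholders, not values from this walkthrough, and the upload itself assumes your AWS credentials are already configured.

```python
from pathlib import Path


def build_s3_key(local_path: str, prefix: str = "raw/") -> str:
    """Return the S3 object key for a local file, e.g. raw/patient.csv."""
    name = Path(local_path).name
    prefix = prefix.strip("/")
    return f"{prefix}/{name}" if prefix else name


def upload_dataset(local_path: str, bucket: str, prefix: str = "raw/") -> str:
    """Upload one local file to S3 and return the key it was stored under."""
    import boto3  # imported lazily so build_s3_key works without AWS access

    key = build_s3_key(local_path, prefix)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key
```

For example, `upload_dataset("patient.csv", "my-databrew-demo-bucket")` would store the file under `raw/patient.csv`, where `my-databrew-demo-bucket` is a hypothetical bucket name you would replace with your own.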
Go to the AWS Glue DataBrew console here
On the left, click Datasets to create a dataset for Glue DataBrew
Click Create new dataset
Choose the patient.csv file that you just uploaded. You also have the option to get data from the AWS Glue Data Catalog, Amazon Redshift, Amazon AppFlow, or Snowflake. Click Create dataset
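The same dataset can also be registered programmatically. This is a rough sketch using the boto3 `databrew` client and its `create_dataset` call; the helper names and the idea of passing bucket and key as arguments are my own assumptions, not part of the console flow above.

```python
def dataset_request(name: str, bucket: str, key: str) -> dict:
    """Build the keyword arguments for DataBrew's create_dataset call."""
    return {
        "Name": name,
        "Format": "CSV",
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }


def create_dataset(name: str, bucket: str, key: str) -> None:
    """Register a CSV file in S3 as a DataBrew dataset."""
    import boto3  # imported lazily; needs AWS credentials at call time

    boto3.client("databrew").create_dataset(**dataset_request(name, bucket, key))
```

Keeping the request in a separate function makes it easy to inspect or log what will be sent before touching your account.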
Run data profile
Under Datasets, choose the Patient dataset and click "Run data profile"
Wait a few minutes for the data profile job to populate the Data profile overview, Column statistics, Data quality rules suggestions, and Data lineage. Once it is done, click "View data profile"
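For readers who prefer scripting, the "Run data profile" step roughly corresponds to creating a profile job and starting a run. The sketch below uses the boto3 `create_profile_job` and `start_job_run` calls; the job name, output bucket, and IAM role ARN are all values you would have to supply yourself, and the function names are hypothetical.

```python
def profile_job_request(job_name: str, dataset_name: str,
                        output_bucket: str, role_arn: str) -> dict:
    """Build the keyword arguments for DataBrew's create_profile_job call."""
    return {
        "Name": job_name,
        "DatasetName": dataset_name,
        # Profile results are written as JSON to this bucket.
        "OutputLocation": {"Bucket": output_bucket},
        # IAM role that DataBrew assumes to read the data and write results.
        "RoleArn": role_arn,
    }


def create_and_start_profile_job(job_name: str, dataset_name: str,
                                 output_bucket: str, role_arn: str) -> str:
    """Create the profile job, start one run, and return the run id."""
    import boto3  # imported lazily; needs AWS credentials at call time

    databrew = boto3.client("databrew")
    databrew.create_profile_job(
        **profile_job_request(job_name, dataset_name, output_bucket, role_arn)
    )
    return databrew.start_job_run(Name=job_name)["RunId"]
```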
You should see all the statistics that are useful for understanding your dataset, as follows
Individual column-level statistics
A summary of statistics across all columns, such as min, max, distribution, column type, unique values, etc.
In the all-columns view, you can also drill down to the columns identified as PII
You can also see the data lineage!
Enough data exploration. This post is about data quality, so let's take a look at the Data quality tab. You will see nothing there, because you don't have any rules yet! However, after the data profile job has run, DataBrew suggests rules on the right-hand side based on standard DQ dimensions such as uniqueness and completeness
Create DQ ruleset
You can pick whichever suggestions make sense for your business criteria. Again, zero code is required! I chose to check uniqueness for Id and SSN, the length of SSN, and so on
Again, these are recommendations from AWS Glue DataBrew; you can create your own DQ rules later. Once I am done, I click "Create ruleset"
I can add, remove, adjust, or review the DQ rules I have just added
After clicking "Create ruleset", I can see my DQ rulesets here
Click the DQ ruleset name, in my case "DQ Rulesets for Patient data", to review all of its DQ rules.
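To make the rules concrete, here is what a uniqueness check and a fixed-length check amount to, written as plain Python over a tiny made-up sample. This is not how DataBrew evaluates rules internally; the sample rows and the expected SSN length of 11 characters (for a `123-45-6789` style value) are assumptions for illustration only.

```python
from collections import Counter


def is_unique(rows, column):
    """True when no value in `column` appears more than once."""
    counts = Counter(row[column] for row in rows)
    return all(n == 1 for n in counts.values())


def all_have_length(rows, column, length):
    """True when every value in `column` has exactly `length` characters."""
    return all(len(str(row[column])) == length for row in rows)


# Made-up sample rows, not taken from patient.csv.
sample = [
    {"Id": "a1", "SSN": "999-12-3456"},
    {"Id": "a2", "SSN": "999-65-4321"},
    {"Id": "a2", "SSN": "99-65-4321"},  # duplicate Id, SSN too short
]

results = {
    "Id is unique": is_unique(sample, "Id"),
    "SSN is unique": is_unique(sample, "SSN"),
    "SSN has length 11": all_have_length(sample, "SSN", 11),
}
for rule, passed in results.items():
    print(f"{rule}: {'PASS' if passed else 'FAIL'}")
```

In this sample the Id and length rules fail while SSN uniqueness passes, which is the kind of pass/fail outcome the DataBrew ruleset reports per rule.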
Associate the DQ ruleset with the data profile job. Click here to see all your data profile jobs
In my case it is "Patients - Data Profile job". Select it, then click Actions and Edit
Find the "Data quality rules" section and click "Apply data quality ruleset"
Choose your DQ ruleset, click Apply selected rulesets, and then click Save
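The association made in the console can also be sketched with boto3: `update_profile_job` accepts a `ValidationConfigurations` list that points at a ruleset ARN. The helper names are mine, and you would need to pass in your own job name, output bucket, role ARN, and ruleset ARN. Note that the update call requires the output location and role again, even if they have not changed.

```python
def validation_update_request(job_name: str, output_bucket: str,
                              role_arn: str, ruleset_arn: str) -> dict:
    """Build the keyword arguments for DataBrew's update_profile_job call."""
    return {
        "Name": job_name,
        "OutputLocation": {"Bucket": output_bucket},
        "RoleArn": role_arn,
        # Evaluate every rule in the attached ruleset on each profile run.
        "ValidationConfigurations": [
            {"RulesetArn": ruleset_arn, "ValidationMode": "CHECK_ALL"}
        ],
    }


def attach_ruleset(job_name: str, output_bucket: str,
                   role_arn: str, ruleset_arn: str) -> None:
    """Attach a DQ ruleset to an existing profile job."""
    import boto3  # imported lazily; needs AWS credentials at call time

    boto3.client("databrew").update_profile_job(
        **validation_update_request(job_name, output_bucket, role_arn, ruleset_arn)
    )
```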
Re-run data profile with DQ ruleset
Choose your profile job and click "Run job"
View DQ dashboard
Wait a few minutes, and you should be able to see the results of the data profile job, including the DQ results based on your DQ rules!
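If you automate this, the "run and wait a few minutes" step could be scripted as a simple polling loop. The sketch below assumes the boto3 `start_job_run` and `describe_job_run` calls; the polling interval, the maximum number of polls, and the function names are arbitrary choices of mine.

```python
import time

# Run states after which the job will not change any more.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def wait_for_job_run(databrew, job_name, run_id, poll_seconds=30, max_polls=60):
    """Poll describe_job_run until the run reaches a terminal state."""
    for _ in range(max_polls):
        state = databrew.describe_job_run(Name=job_name, RunId=run_id)["State"]
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} run {run_id} did not finish in time")


def rerun_profile_job(job_name):
    """Start a new run of an existing profile job and wait for it to end."""
    import boto3  # imported lazily; needs AWS credentials at call time

    databrew = boto3.client("databrew")
    run_id = databrew.start_job_run(Name=job_name)["RunId"]
    return wait_for_job_run(databrew, job_name, run_id)
```

Passing the client into `wait_for_job_run` keeps the loop easy to test with a stand-in object before pointing it at a real account.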
My dataset has 4 rules passing and 6 failing. Obviously, I have to fix my data before any downstream process consumes it!
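One way to act on that outcome in a pipeline is a small gate that blocks downstream steps while any rule fails. The sketch below is only an illustration: the rule names and the plain name-to-boolean mapping are hypothetical and do not reflect the actual format of the DataBrew validation report.

```python
def summarize(rule_results):
    """Return (passed, failed) counts for a mapping of rule name -> bool."""
    passed = sum(1 for ok in rule_results.values() if ok)
    return passed, len(rule_results) - passed


def ready_for_downstream(rule_results):
    """Only let downstream processing start when every rule passed."""
    return all(rule_results.values())


# Hypothetical outcome mirroring the 4 passed / 6 failed result above.
rule_results = {f"rule_{i:02d}": i <= 4 for i in range(1, 11)}

passed, failed = summarize(rule_results)
print(f"{passed} passed, {failed} failed")
print("Proceed" if ready_for_downstream(rule_results) else "Fix the data first")
```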
AWS Glue DataBrew is a visual data preparation tool that makes it easy to clean and normalize data using over 250 pre-built transformations, all without writing any code. With the new DQ and PII handling features, you can add your own data quality rules and PII tokenization to your automated data pipelines (Amazon Managed Workflows for Apache Airflow or AWS Step Functions) to make sure your data is clean and protected.
You can think about how to add PII obfuscation and DQ zones like this