DEV Community


Build your own data quality rules with AWS Glue DataBrew

In a previous post, I showed how AWS Glue DataBrew can help you handle PII data. That post is in Thai, but English readers should be able to follow along from the screenshots.

AWS recently announced that AWS Glue DataBrew users can now define data quality (DQ) rules: customized validation checks that encode business requirements for specific data. As a result, data practitioners who do not want to invest in a high-cost DQ product, or to adopt an open-source framework that requires coding knowledge, can define their own quality rules and surface the results in a data quality dashboard and validation report. That lets them quickly see rule outcomes and decide whether their data is fit for use.

As a builder, I find that these two new DataBrew features, data quality rules and PII handling, help a lot when designing a DQ process at scale, because there are no compute servers to maintain. In this post I will walk you through how to validate your data quality with AWS Glue DataBrew.

What is AWS Glue DataBrew?
What do you mean by Data Quality rule?


Architecture Diagram

Image description

Pre-requisites

  • AWS Account❗
  • Download Data here
  • Unzip the archive and upload only patient.csv to Amazon S3.
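If you would rather script the upload than use the S3 console, a minimal sketch could look like this. The bucket name and the `databrew/input` prefix are placeholders I made up, and the S3 client is passed in as a parameter so the function can be tried without AWS credentials.

```python
from pathlib import Path


def upload_patient_csv(s3_client, local_path, bucket, prefix="databrew/input"):
    """Upload one local CSV file to S3 and return its s3:// URI.

    `bucket` and `prefix` are placeholders; substitute your own values.
    """
    key = f"{prefix.strip('/')}/{Path(local_path).name}"
    # upload_file(Filename, Bucket, Key) is the standard boto3 S3 transfer call.
    s3_client.upload_file(str(local_path), bucket, key)
    return f"s3://{bucket}/{key}"


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   upload_patient_csv(boto3.client("s3"), "patient.csv", "my-example-bucket")
```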

Preparing dataset

Open the AWS Glue DataBrew console.

In the left navigation pane, click Datasets to create a dataset for Glue DataBrew.

Click Create new dataset.

Choose the patient.csv file that you just uploaded. You can also source data from the AWS Glue Data Catalog, Amazon Redshift, Amazon AppFlow, or Snowflake. Then click Create dataset.
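The same dataset can also be registered programmatically. The sketch below assumes a boto3 DataBrew client (`boto3.client("databrew")`) and a comma-delimited file with a header row; the dataset name, bucket, and key are placeholders rather than values from this walkthrough.

```python
def create_csv_dataset(databrew_client, name, bucket, key):
    """Register a CSV object in S3 as a DataBrew dataset.

    Assumes a comma-delimited file whose first row is a header.
    """
    request = {
        "Name": name,
        "Format": "CSV",
        "FormatOptions": {"Csv": {"Delimiter": ",", "HeaderRow": True}},
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }
    databrew_client.create_dataset(**request)
    return request


# Usage (needs boto3 and AWS credentials; names are placeholders):
#   import boto3
#   create_csv_dataset(boto3.client("databrew"), "Patient",
#                      "my-example-bucket", "databrew/input/patient.csv")
```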

Run data profile

Under Datasets, select the Patient dataset and click "Run data profile".

Wait a few minutes while the profile job populates the data profile overview, column statistics, data quality rule suggestions, and data lineage. Once it is done, click "View data profile".
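For readers who automate this step, here is a rough equivalent of "Run data profile" using the DataBrew API. It is a sketch under assumptions: the IAM role ARN and output bucket are values you must supply yourself, and the client is expected to be `boto3.client("databrew")`.

```python
def create_and_start_profile_job(databrew_client, job_name, dataset_name,
                                 role_arn, output_bucket):
    """Create a DataBrew profile job for a dataset and start one run.

    `role_arn` must be an IAM role DataBrew can assume; `output_bucket`
    is where the profile results are written. Both are caller-supplied.
    """
    databrew_client.create_profile_job(
        Name=job_name,
        DatasetName=dataset_name,
        RoleArn=role_arn,
        OutputLocation={"Bucket": output_bucket},
    )
    # start_job_run returns a RunId that identifies this particular run.
    return databrew_client.start_job_run(Name=job_name)["RunId"]
```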

You should now see statistics that help you understand your dataset:

  • Dataset preview
  • Data profile overview
  • Potential PII detection
  • Attribute correlations
  • Individual column-level statistics
  • Summary statistics for all columns, such as min, max, distribution, column type, and unique values
  • A drill-down into the columns identified as PII, from the all-columns statistics view
  • Data lineage

That's enough data exploration. This post is about data quality, so let's look at the Data quality tab. It is empty at first because you don't have any rules yet. However, once the data profile job has run, DataBrew suggests rules on the right-hand side based on standard DQ dimensions such as uniqueness and completeness.
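To make "uniqueness" and "completeness" concrete, here is a small local illustration in plain Python. It is not how DataBrew computes its statistics; it only shows what the two checks mean. The three sample rows are invented for the example and are not taken from patient.csv.

```python
import csv
import io

# Invented sample rows, only for illustration.
SAMPLE = """Id,SSN
1,999-11-1111
2,999-22-2222
2,
"""


def column_quality(rows, column):
    """Return completeness (share of non-empty values) and uniqueness for a column."""
    values = [row.get(column) or "" for row in rows]
    non_empty = [v for v in values if v != ""]
    total = len(values)
    return {
        "completeness": len(non_empty) / total if total else 0.0,
        "is_unique": len(set(non_empty)) == len(non_empty),
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
id_quality = column_quality(rows, "Id")    # complete, but Id 2 appears twice
ssn_quality = column_quality(rows, "SSN")  # unique, but one value is missing
```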

Create DQ ruleset

Pick the rules that make sense for your business criteria. Again, no code is required! I chose to check uniqueness for Id and SSN, the length of SSN, and so on.

These are recommendations from AWS Glue DataBrew; you can create your own DQ rules later. When I am done selecting, I click "Create ruleset".

I can add, remove, adjust, or review the DQ rules I have just added.

After clicking "Create ruleset", I can see my DQ rulesets listed.

Click the ruleset name, in my case "DQ Rulesets for Patient data", to review all of its DQ rules.
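A ruleset can also be created through the API instead of the console. The sketch below is hedged: it uses two dataset-level checks (no duplicate rows, no missing values) rather than the exact rules I picked in the console, and the check expression strings should be verified against the DataBrew documentation for available checks. The dataset ARN is a placeholder.

```python
def build_ruleset_request(name, dataset_arn):
    """Build a create_ruleset request with two illustrative rules.

    The rules here are examples, not the ones chosen in the console above.
    """
    return {
        "Name": name,
        "TargetArn": dataset_arn,  # ARN of the DataBrew dataset the rules apply to
        "Rules": [
            {
                "Name": "No duplicate rows",
                "CheckExpression": "AGG(DUPLICATE_ROWS_COUNT) == :val1",
                "SubstitutionMap": {":val1": "0"},
            },
            {
                "Name": "No missing values in any column",
                "CheckExpression": "AGG(MISSING_VALUES_PERCENTAGE) == :val1",
                "SubstitutionMap": {":val1": "0"},
                # Apply the check to every column via a regex selector.
                "ColumnSelectors": [{"Regex": ".*"}],
            },
        ],
    }


def create_ruleset(databrew_client, name, dataset_arn):
    """Send the request using a boto3 DataBrew client supplied by the caller."""
    request = build_ruleset_request(name, dataset_arn)
    databrew_client.create_ruleset(**request)
    return request
```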

Next, associate the DQ ruleset with the data profile job. Open the list of your data profile jobs.

In my case the job is "Patients - Data Profile job". Select it, then click Actions and Edit.

Search for "Data quality rules" and click "Apply data quality ruleset".

Choose your DQ ruleset, click Apply selected rulesets, and then click Save.
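The association can be scripted as well. This is a sketch that assumes a boto3 DataBrew client and that you re-supply the job's role ARN and output bucket, since the update call expects them; all ARNs and names are placeholders.

```python
def attach_ruleset(databrew_client, job_name, ruleset_arn, role_arn,
                   output_bucket, output_key=None):
    """Attach a DQ ruleset to an existing profile job.

    The update call needs the role and output location again, so the
    caller passes the job's current values for both.
    """
    output_location = {"Bucket": output_bucket}
    if output_key:
        output_location["Key"] = output_key
    request = {
        "Name": job_name,
        "RoleArn": role_arn,
        "OutputLocation": output_location,
        "ValidationConfigurations": [
            {"RulesetArn": ruleset_arn, "ValidationMode": "CHECK_ALL"}
        ],
    }
    databrew_client.update_profile_job(**request)
    return request
```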

Re-run data profile with DQ ruleset

Select your profile job and click "Run job".
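In an automated pipeline you would typically start the run and poll until it finishes. A minimal sketch, assuming a boto3 DataBrew client; the polling interval and the maximum number of polls are arbitrary choices of mine.

```python
import time

# Job run states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"}


def rerun_profile_job(databrew_client, job_name, poll_seconds=30,
                      max_polls=60, sleep=time.sleep):
    """Start a job run and poll until it reaches a terminal state."""
    run_id = databrew_client.start_job_run(Name=job_name)["RunId"]
    for _ in range(max_polls):
        run = databrew_client.describe_job_run(Name=job_name, RunId=run_id)
        if run["State"] in TERMINAL_STATES:
            return run["State"]
        sleep(poll_seconds)
    raise TimeoutError(f"Job {job_name} did not finish after {max_polls} polls")
```

The `sleep` parameter exists only so the loop can be exercised without real waiting.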

View DQ dashboard

After a few minutes, you should see the results of the data profile job, including the DQ results based on your rules!

My dataset passed 4 rules and failed 6. Clearly, I have to fix my data before any downstream process consumes it!
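One way to act on such a result in a pipeline is a simple gate that blocks downstream steps when any rule fails. The sketch below is illustrative only: the input is a plain mapping of rule name to pass/fail that I made up, not the format of the DataBrew validation report, so you would need to adapt the parsing to the real report.

```python
def summarize_rule_results(results):
    """Count passed and failed rules.

    `results` is a hypothetical mapping of rule name -> bool (True = passed).
    """
    failed = sorted(name for name, ok in results.items() if not ok)
    return {
        "passed": len(results) - len(failed),
        "failed": len(failed),
        "failed_rules": failed,
    }


def assert_fit_for_use(results):
    """Raise if any rule failed, so downstream steps do not consume bad data."""
    summary = summarize_rule_results(results)
    if summary["failed"]:
        raise ValueError(
            f"{summary['failed']} data quality rule(s) failed: "
            + ", ".join(summary["failed_rules"])
        )
    return summary
```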

Conclusion

AWS Glue DataBrew is a visual data preparation tool that makes it easy to clean and normalize data using over 250 pre-built transformations, all without writing any code. With the new DQ and PII handling features, you can add your own data quality rules and PII tokenization to your automated data pipelines (for example, Amazon Managed Workflows for Apache Airflow or AWS Step Functions) to make sure your data is clean and protected.

You can think about adding a PII obfuscation zone and a DQ zone to your pipeline like this:
Image description
