I recently got a chance to play with Glue DataBrew, so I want to share some tips on setting it up.
How Glue DataBrew differs from Glue:
Both perform ETL tasks, but in Glue DataBrew we don't need to write any code. Glue can run any kind of transformation, while DataBrew works with its built-in transformations. DataBrew also has a profiler and a handy UI/console for quick analysis, which Glue doesn't offer.
Prerequisites
- IAM role: create a DataBrew IAM role/policy with the right set of permissions, following the DataBrew IAM policy documentation (a boto3 sketch follows this list)
- Upload the data to an S3 bucket
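If you prefer to script the IAM prerequisite, here is a minimal boto3 sketch. The role name is a placeholder, and I'm assuming the AWS managed policy AWSGlueDataBrewServiceRole covers the service-side permissions; you still need to grant S3 access to your own buckets.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy that lets the DataBrew service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "databrew.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="databrew-tutorial-role",  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS managed service-role policy for DataBrew.
# Note: S3 read/write access to your input/output buckets must
# still be granted separately (inline or managed policy).
iam.attach_role_policy(
    RoleName="databrew-tutorial-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueDataBrewServiceRole",
)
```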
Steps
Create Dataset:
a) On the Datasets page of the DataBrew console, choose Connect new dataset.
b) For Dataset name, enter a name (Customer).
c) Enter the S3 bucket path where you uploaded the data files as part of the prerequisite steps.
d) For File type, select CSV and choose Comma (,) for CSV delimiter.
e) For Column header values, select Treat first row as header.
f) Choose Create dataset.
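The same dataset can also be registered programmatically. A minimal boto3 sketch, with the bucket name and object key as placeholders:

```python
import boto3

databrew = boto3.client("databrew")

# Register the CSV file in S3 as a DataBrew dataset,
# treating the first row as the header.
databrew.create_dataset(
    Name="Customer",
    Format="CSV",
    FormatOptions={"Csv": {"Delimiter": ",", "HeaderRow": True}},
    Input={"S3InputDefinition": {
        "Bucket": "my-databrew-input",    # placeholder bucket
        "Key": "customer/customers.csv",  # placeholder key
    }},
)
```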
Create Project:
a) On the DataBrew console, on the Projects page, choose Create project.
b) For Project Name, enter customer-privacy-treatment.
c) For Attached recipe, choose Create new recipe.
The recipe name is populated automatically (customer-privacy-treatment-recipe).
d) For Select a dataset, select My datasets.
e) Select the Customer dataset.
f) For Role name, choose the AWS Identity and Access Management (IAM) role to be used with DataBrew.
g) Choose Create project.
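The equivalent boto3 calls look roughly like this. Unlike the console, the API expects the recipe to exist before the project references it; I'm assuming an empty step list is accepted as a starting point, and the role ARN is a placeholder:

```python
import boto3

databrew = boto3.client("databrew")

# The console auto-creates the recipe; via the API we create it first.
databrew.create_recipe(
    Name="customer-privacy-treatment-recipe",
    Steps=[],  # filled in later with transformation steps
)

databrew.create_project(
    Name="customer-privacy-treatment",
    DatasetName="Customer",
    RecipeName="customer-privacy-treatment-recipe",
    RoleArn="arn:aws:iam::123456789012:role/databrew-tutorial-role",  # placeholder
)
```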
Basic transformations:
Add steps to the recipe (a boto3 sketch follows these steps):
a) Rename: select a source column and provide a new name
b) Filter: Filter the rows based on an 'age' value greater than 30 and add the condition as a recipe step.
c) Sort: choose the age column and select Sort descending
d) Choose Apply
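Recipe steps can also be pushed via boto3. Only the rename step is spelled out here, and the column names are placeholders; the exact operation names and parameters for the filter and sort steps are easiest to get by building the recipe once in the console and downloading it as JSON:

```python
import boto3

databrew = boto3.client("databrew")

# Replace the recipe's steps. Each step is an Action with an
# Operation name and its Parameters.
databrew.update_recipe(
    Name="customer-privacy-treatment-recipe",
    Steps=[
        {"Action": {
            "Operation": "RENAME",
            "Parameters": {
                "sourceColumn": "cust_name",      # placeholder column
                "targetColumn": "customer_name",  # placeholder new name
            },
        }},
        # Filter (age > 30) and sort-descending steps follow the same
        # shape; export the console-built recipe as JSON for their
        # exact Operation names and Parameters.
    ],
)

# Publish so that jobs can reference this version of the recipe.
databrew.publish_recipe(Name="customer-privacy-treatment-recipe")
```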
Check Lineage:
a) Check the lineage: Project -> Lineage
b) Verify the recipe steps
Create Job:
a) On the project details page, choose Create job -> enter job name "customer-privacy-treatment-job"
b) In Job output settings, change the file type to PARQUET with compression set to Snappy
c) For Output to, choose Amazon S3.
Enter the S3 path to store the output file.
d) For Role name, choose the IAM role to be used with DataBrew.
e) Choose Create and run job.
f) Check output S3 bucket to confirm that a single output file is stored there.
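Creating and starting the job can be scripted too. A sketch with placeholder bucket and role values:

```python
import boto3

databrew = boto3.client("databrew")

# Create a recipe job attached to the project, writing
# Snappy-compressed Parquet to S3.
databrew.create_recipe_job(
    Name="customer-privacy-treatment-job",
    ProjectName="customer-privacy-treatment",
    RoleArn="arn:aws:iam::123456789012:role/databrew-tutorial-role",  # placeholder
    Outputs=[{
        "Location": {
            "Bucket": "my-databrew-output",  # placeholder bucket
            "Key": "customer/treated/",      # placeholder prefix
        },
        "Format": "PARQUET",
        "CompressionFormat": "SNAPPY",
    }],
)

# Kick off a run and keep the run ID for status checks.
run = databrew.start_job_run(Name="customer-privacy-treatment-job")
print(run["RunId"])
```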
Advanced transformations to perform the privacy treatment:
Add steps to Recipe:
a) Replace/hide the 'email' column values with a dummy value. Choose 'Clean' -> 'Replace' -> 'Replace value or pattern'.
For the value to be replaced, select the Regex option, use the regex ^[a-zA-Z0-9+_.-]+@ and replace it with the value 'testing@' (a quick local check of this pattern follows the list).
b) Show only last 4 digits of credit card number:
Select the column -> select the Function option -> Text -> Right.
Source column - credit_card_num
Number of chars - 4
Destination column name - treated_credit_card_num
c) Delete the original column- 'credit_card_num'
d) Create a data profile job and enable PII statistics (they are disabled by default); a boto3 sketch follows this section
e) Verify the data profile job results
f) Add more privacy-treatment recipe steps:
Apply data masking transformations -> Redact, using '#' as the redact symbol
g) Apply data masking transformations -> Hash on a source data column.
h) Apply deterministic encryption on the 'lname' column (for this, you need a Secrets Manager secret beforehand; a sketch follows below).
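To sanity-check the email regex from step (a) before putting it in the recipe, you can try it locally. This is plain Python, not a DataBrew API:

```python
import re

# Same pattern as the recipe step: everything up to and including
# the '@' is replaced, leaving only the domain intact.
EMAIL_LOCAL_PART = re.compile(r"^[a-zA-Z0-9+_.-]+@")

sample = "jane.doe+news@example.com"
print(EMAIL_LOCAL_PART.sub("testing@", sample))  # -> testing@example.com
```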
After adding all the steps to the recipe, run the DataBrew job and verify the results.
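For completeness, a hedged boto3 sketch of the two supporting pieces above: the profile job with PII statistics enabled (step d) and the Secrets Manager secret needed for deterministic encryption (step h). The job name, bucket, role ARN, entity types, and secret value are all assumptions to adapt:

```python
import boto3

databrew = boto3.client("databrew")
secrets = boto3.client("secretsmanager")

# Profile job with PII detection enabled (it is off by default).
# The entity types listed are examples; pick the ones matching your data.
databrew.create_profile_job(
    Name="customer-profile-job",  # placeholder name
    DatasetName="Customer",
    RoleArn="arn:aws:iam::123456789012:role/databrew-tutorial-role",  # placeholder
    OutputLocation={
        "Bucket": "my-databrew-output",  # placeholder bucket
        "Key": "customer/profile/",
    },
    Configuration={"EntityDetectorConfiguration": {
        "EntityTypes": ["EMAIL", "CREDIT_CARD"],  # assumed entity types
    }},
)

# Secret referenced by the deterministic-encryption recipe step on 'lname'.
secrets.create_secret(
    Name="databrew-encryption-secret",            # placeholder name
    SecretString="a-long-random-base-secret",     # placeholder value
)
```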
Conclusion
Following best practices, it is always good to play with and wrangle privacy-treated (non-PII) data, and Glue DataBrew is a great managed, serverless solution to achieve this.