DEV Community

Cover image for Brew your data and perform privacy treatment using AWS Glue Databrew
Yogesh Sharma
Yogesh Sharma

Posted on • Edited on

Brew your data and perform privacy treatment using AWS Glue Databrew

I recently got a chance to play with Glue DataBrew so want to share some tips to setup it.

How Glue DataBrew is different to Glue:
Both perform ETL tasks. But we don't need to write any code to perform ETL in Glue DataBrew. Glue can use any kind of transformation but the DataBrew works with the build-in transformation. DataBrew has a profiler and cool UI/Console (for quick analysis) but glue doesn’t have something like this.

Prerequisites

  1. IAM Role- Create a DataBrew IAM role/policy with the right set of permissions- follow Databrew IAM Policy databrew-iam-policy
  2. Upload data to S3 bucket data-s3-bucket

Steps

  1. Create Dataset:
    a) On the Datasets page of the DataBrew console, choose Connect new dataset.
    b) For Dataset name, enter a name (Customer).
    c) Enter the S3 bucket path where you uploaded the data files as part of the prerequisite steps.
    d) For File type¸ select CSV and choose Comma (,) for CSV delimiter.
    e) For Column header values, select Treat first row as header.
    f) Choose Create dataset.
    create-dataset

  2. Create Project:
    a) On the DataBrew console, on the Projects page, choose Create project.
    b) For Project Name, enter customer-privacy-treatment.
    c) For Attached recipe, choose Create new recipe.
    The recipe name is populated automatically (customer-privacy-treatment-recipe).
    d) For Select a dataset, select My datasets.
    e) Select the order dataset.
    f) For Role name, choose the AWS Identity and Access Management (IAM) role to be used with DataBrew.
    g) Choose Create project.
    create-databrew-project
    Databrew-project

  3. Basic transformations:
    Add steps to Recipe:
    a) Rename: Add a source column and provide a new name
    b) Filter: Filter the rows based on an 'age' value greater than 30 and add the condition as a recipe step.
    c) Sort: choose the column age. Select Sort descending way
    d) Choose Apply
    Recipe

  4. Check Lineage:
    a) Check the lineage: Project -> Lineage
    open-lineage
    lineage-console
    b) Verify the recipe:
    recipe

  5. Create Job:
    a) On the project details page, choose Create job -> enter job name "customer-privacy-treatment-job"
    b) In Job output settings-> Change the job type to 'parquet' with compressions as 'snappy'
    c) For Output to, choose Amazon S3.
    Enter the S3 path to store the output file.
    d) For Role name, choose the IAM role to be used with DataBrew.
    e) Choose Create and run job.
    f) Check output S3 bucket to confirm that a single output file is stored there.
    create-job
    S3-output

  6. Advance transformations to perform the privacy treatment:
    Add steps to Recipe:
    a) Replace/Hide the 'email' column values with a dummy variable. Option 'Clean' -> 'Replace' -> 'Replace value or pattern'.
    From the value to be replaced, select the Regex option and use the following regex as ^[a-zA-Z0-9+_.-]+@ Replace with value 'testing@'
    replace-mail
    b) Show only last 4 digits of credit card number:
    Select the column -> select the Function option -> Text -> Right.
    Source column - credit_card_num
    Number of chars - 4
    Destination column name - treated_credit_card_num
    treated-credit-card-num
    c) Delete the original column- 'credit_card_num'
    d) Create Data Profile job and enable PII Statistice (by default it is default)
    enable pii
    e) Verify the data profiler job results:
    data-profiler
    f) Add more privacy treatment recipes:
    Apply data masking transformations -> Redact and use '#' as Redact symbol
    redact
    g) Apply data masking transformations -> Hash on a source data column.
    h) Apply Deterministic encryption on 'lname' column (for this, you need one secretmnagers secret beforehand.
    Deterministic encryption

After adding all the steps in Recipe, execute the databrew job and verify the results-
Image description

Conclusion
Following best practices, it is always good to play/wrangle on privacy treated (non-pii) data and Glue Databrew is the perfect managed serverless solution to achieve this.

Top comments (1)

Collapse
 
pbisht4 profile image
pbisht4

Great article