
Chirag (Srce Cde) for AWS Community Builders

Originally published at srcecde.me

AWS Glue | CSV to Parquet transformation | Getting started

AWS Glue is a fully managed, serverless ETL service. It makes it easy to discover, transform, and load data to be consumed by various processes and applications. If you want to learn more about AWS Glue, please refer to the AWS Glue Overview video.

Objective (CSV to Parquet)

In this article, we will walk through a basic end-to-end CSV to Parquet transformation using AWS Glue. We will use multiple services to implement the solution: IAM, S3, and AWS Glue. Within AWS Glue, we will use crawlers, the Data Catalog (databases and tables), and ETL jobs.

Architecture

Workflow

Let’s understand the above flow.

  1. Create a crawler, which will connect to the S3 data store
  2. After a successful connection, the crawler infers the structure of the CSV file using a built-in classifier
  3. The crawler writes the metadata in the form of a table in the AWS Glue Data Catalog
  4. After populating the Data Catalog, create the ETL job to transform CSV into Parquet
  5. The data source for the ETL job is the AWS Glue Data Catalog table, and as part of the transformation, we will apply mappings. After the transformation, the data is stored in S3 as a Parquet file

Hands-On

Enough text; let's jump to the AWS Management Console.

Create IAM role

As a first step, create the IAM role to give AWS Glue the necessary permissions to access various services (for example, S3).

  • Go to IAM Management Console → Roles → Create role
  • Select Glue as a service & add AWSGlueServiceRole permission
  • Create the role

With the AWSGlueServiceRole managed policy attached, AWS Glue can access S3 buckets whose names contain aws-glue, per the S3 permissions defined in that policy. The policy also grants additional permissions required by the service.
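If you prefer to script this step, the role can also be created with boto3. This is a minimal sketch; the role name glue-csv-to-parquet-role is a placeholder, not something used in the console walkthrough.

import json
import boto3

iam = boto3.client("iam")

# Trust policy that allows the AWS Glue service to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "glue.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Role name is a placeholder; pick any name you like
iam.create_role(
    RoleName="glue-csv-to-parquet-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the same AWS managed policy used in the console steps above
iam.attach_role_policy(
    RoleName="glue-csv-to-parquet-role",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)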

Create S3 Bucket

Create the S3 bucket with a name that contains aws-glue; otherwise, you will have to modify the policy from the above step.

After creating the bucket, create the following folder structure:



s3-bucket
├── data-store/
│   └── annual-reports/
│       └── csv_reports
└── target-data-store/
    └── parquet-reports/



We will create the database in Glue with the same name as the folder, annual_reports, and the table name will be csv_reports. The database and table do not have to share names with the folders; it simply makes things easier to map with this naming convention.

As a next step, upload the CSV file to the csv_reports folder. I have used the CSV data from here
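If you want to script the bucket setup instead, here is a rough boto3 equivalent. The bucket name and local CSV file name are placeholders (keep aws-glue in whatever name you choose), and outside us-east-1 you would also pass a CreateBucketConfiguration with your region.

import boto3

s3 = boto3.client("s3")
bucket = "aws-glue-csv-parquet-demo"  # placeholder; must contain "aws-glue"

# Create the bucket (add CreateBucketConfiguration={"LocationConstraint": "<region>"}
# if your default region is not us-east-1)
s3.create_bucket(Bucket=bucket)

# S3 has no real folders; uploading with these key prefixes creates the layout above
s3.upload_file(
    "annual_report.csv",  # placeholder local file name
    bucket,
    "data-store/annual-reports/csv_reports/annual_report.csv",
)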

Create database

Go to AWS Glue Console → Click Databases → Add database as follows

Create database
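For reference, the same database can be created with a single boto3 call; a minimal sketch:

import boto3

glue = boto3.client("glue")

# Creates the annual_reports database in the AWS Glue Data Catalog
glue.create_database(DatabaseInput={"Name": "annual_reports"})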

Create Crawler to populate Table

The table can be created manually or via a crawler. A crawler is a program that connects to the data store, infers the structure/schema using a built-in or custom classifier, and writes the metadata as a table.

To create a crawler → Click on Crawlers from the left panel → Configure the details as follows and create the crawler

Crawler config

After creating the crawler, run it; if everything is configured as above, it will populate the table under the selected database.
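For reference, a crawler with the same configuration can also be created and started with boto3. This is a sketch; the crawler name, role name, and bucket name are placeholders that match the earlier steps.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="csv-reports-crawler",  # placeholder name
    Role="glue-csv-to-parquet-role",  # role created earlier
    DatabaseName="annual_reports",
    Targets={
        "S3Targets": [
            {"Path": "s3://aws-glue-csv-parquet-demo/data-store/annual-reports/csv_reports/"}
        ]
    },
)

# Run the crawler; once it finishes, the csv_reports table appears under the database
glue.start_crawler(Name="csv-reports-crawler")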

Create ETL job

Once the table is populated, create the ETL job to transform & load the CSV file as a parquet file.

To create an ETL job → Click on Jobs from the left panel

To create a job, there are a couple of options; we will select Visual with a blank canvas. Apart from that, Glue also lets you write and upload Python shell scripts and Spark scripts, or develop interactively with a Jupyter notebook, so you can try those options as well.

Here we have two side-by-side views, Visual on the left side and its respective options on the right side.

Click on Source under Visual and select the data source; in our case, it is AWS Glue Data Catalog. Select the node and configure its properties on the right side, which are Database & Table.

Config Glue Job

The next step is the transformation: click on Transform → select Apply Mapping or any other transformation based on your requirements. For this article, Apply Mapping is selected.

Config Glue Job1

Finally, click on Target → select S3 as the target data store to load the data into, and configure its properties.

Config Glue Job2

After this configuration, you can check the generated script under the Script tab. The script can be modified as per your requirements. Next, click on Job details, update the job name, and select the IAM role. Save the job and click Run.

Job details
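For reference, the script generated for this kind of visual job typically looks like the sketch below. The database, table, S3 path, and column mappings here are placeholders; your generated script will reflect the schema the crawler actually inferred.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the Data Catalog table populated by the crawler
source = glueContext.create_dynamic_frame.from_catalog(
    database="annual_reports",
    table_name="csv_reports",
)

# Transform: Apply Mapping (column names and types are placeholders)
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("year", "string", "year", "int"),
        ("revenue", "string", "revenue", "double"),
    ],
)

# Target: write Parquet files to the target prefix in S3 (bucket name is a placeholder)
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://aws-glue-csv-parquet-demo/target-data-store/parquet-reports/"},
    format="parquet",
)

job.commit()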

Check the job run details under Runs, along with the CloudWatch logs.

Job run
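The run can also be triggered and checked programmatically; a small sketch (the job name is a placeholder for whatever you set under Job details):

import boto3

glue = boto3.client("glue")

# Start a run of the job created above (name is a placeholder)
run = glue.start_job_run(JobName="csv-to-parquet-job")

# Check the run status (e.g. RUNNING, SUCCEEDED, FAILED)
status = glue.get_job_run(JobName="csv-to-parquet-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])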

At this point, the CSV to Parquet transformation is successful. Check the output in the S3 Target Location configured in the above step.
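A quick way to confirm that the Parquet output landed in the target prefix (bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# List objects under the target prefix; you should see one or more .parquet files
resp = s3.list_objects_v2(
    Bucket="aws-glue-csv-parquet-demo",
    Prefix="target-data-store/parquet-reports/",
)
for obj in resp.get("Contents", []):
    print(obj["Key"])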

For a detailed step-by-step tutorial and the implementation of the above use case please refer to the below video.

If you have any questions, comments, or feedback then please leave them below. Subscribe to my channel for more.
