Data pipelines are widely used in data engineering and analytics to fetch data from external sources such as Amazon Redshift, Amazon S3 (Simple Storage Service), Google Cloud Platform (GCP), Oracle, Azure, and many other industry-standard technologies.
According to Snowflake, a data pipeline is concerned with moving data from a source to a destination (such as a data warehouse or large storage service) while simultaneously optimizing and transforming the data. As a result, the data arrives in a state that can be analyzed and used to develop business insights.
This article covers how to integrate AWS S3 with Snowflake, which is convenient when working with large data sources.
Requirements:
- Knowledge of SQL for working in Snowflake.
- A Snowflake account and an AWS account.
- A data file, preferably in CSV (Comma Separated Values) format. The S3 console supports uploads of up to 160 GB per file; beyond that, AWS requires other tools, which are beyond this article's scope.
Step 1
Set up an AWS account and sign in as the root user.
Step 2
After successfully logging in to your account, go to Services as shown. New users can locate the Storage category by scrolling down, then select S3.
Step 3
Upon selecting S3, you should see a display like the following (the design might change over time).
Click Create bucket, then give it a name; in our case, mybootcampbucket:
Click Create bucket to finish the process.
To upload the file, select the newly created bucket (highlighted).
Upload the file now as depicted below.
Step 4
After uploading the file/dataset, a policy needs to be set up. The policy defines the permissions that are granted to the external integration once it is associated with an identity (an IAM role) in the respective AWS account.
To set up the policy, click Services, then head to Security, Identity, & Compliance. See below:
Select IAM (Identity and Access Management), then open Policies.
After clicking Create policy, give the policy a name; in our case, Bootcamp2023.
In the policy editor, click the JSON tab, because policies are written in JavaScript Object Notation (JSON), a key-value data representation syntax.
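For reference, a minimal policy for this tutorial might look like the sketch below. It is only an illustration: it assumes the bucket is named mybootcampbucket and grants just the read and list permissions needed to load data into Snowflake; adjust the actions and resources to your own requirements.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectVersion"],
      "Resource": "arn:aws:s3:::mybootcampbucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::mybootcampbucket"
    }
  ]
}
```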
After setting up the policy, we next create the role.
Give your role a name; in this scenario, the role is named Bootcamp_2023.
Set up the permissions.
I have highlighted This account to note that we are trusting this particular AWS account.
Next, select the policy to attach to the role.
Finish up with the role creation.
Now copy the ARN (Amazon Resource Name), which uniquely identifies the role as a resource.
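Keep this ARN handy; it is what connects the role to Snowflake in Step 5. As a preview, a common way to make that connection is a storage integration along the lines of the sketch below. This is only a sketch: the integration name is made up, the account ID in the ARN is a placeholder, and the bucket and role names assume the ones used earlier in this article.

```sql
-- Sketch of a Snowflake storage integration that assumes the IAM role
-- created above. Replace the placeholder account ID with your own and
-- paste the ARN you just copied.
CREATE OR REPLACE STORAGE INTEGRATION s3_bootcamp_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/Bootcamp_2023'
  STORAGE_ALLOWED_LOCATIONS = ('s3://mybootcampbucket/');

-- DESC INTEGRATION reveals the AWS IAM user ARN and external ID that
-- need to be added to the role's trust relationship in IAM.
DESC INTEGRATION s3_bootcamp_integration;
```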
Step 5 – Setting Up a Snowflake Account.
To create a Snowflake account, head to Snowflake.
After creating an account, we need to create our warehouse, as shown below.
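The warehouse can be created through the UI or in a worksheet. A minimal SQL sketch is shown below; the warehouse name and sizing are assumptions, so adjust them to your workload.

```sql
-- Minimal warehouse for this tutorial; name and size are placeholders.
CREATE OR REPLACE WAREHOUSE bootcamp_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60          -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;
```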
Finally, we load the pipeline dataset from AWS S3 and display it in Snowflake.
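For context, the loading side in Snowflake looks roughly like the sketch below. It assumes the storage integration sketched in Step 4, a CSV file with a header row, and a table whose columns match the file; every object name here is a placeholder.

```sql
-- Database and schema to hold the loaded data (names are placeholders).
CREATE DATABASE IF NOT EXISTS bootcamp_db;
USE DATABASE bootcamp_db;
USE SCHEMA public;

-- File format describing the CSV file uploaded to S3.
CREATE OR REPLACE FILE FORMAT csv_format
  TYPE = 'CSV'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1;

-- External stage pointing at the bucket through the storage integration.
CREATE OR REPLACE STAGE bootcamp_stage
  URL = 's3://mybootcampbucket/'
  STORAGE_INTEGRATION = s3_bootcamp_integration
  FILE_FORMAT = (FORMAT_NAME = 'csv_format');

-- Target table; replace the columns with the ones in your dataset.
CREATE OR REPLACE TABLE pipeline_data (
  id INTEGER,
  name STRING,
  value NUMBER
);

-- Load everything in the stage into the table, then display it.
COPY INTO pipeline_data FROM @bootcamp_stage;
SELECT * FROM pipeline_data LIMIT 10;
```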
That is all for this article.