Data pipelines are widely used in data engineering and analytics to fetch data from external sources such as Amazon Redshift, Amazon S3 (Simple Storage Service), Google Cloud Platform (GCP), Oracle, Azure, and many other industry-standard technologies.
According to Snowflake, a data pipeline is concerned with moving data from a source to a destination (such as a data warehouse or large storage service) while simultaneously optimizing and transforming the data. As a result, the data arrives in a state that can be analyzed and used to develop business insights.
This article covers how to integrate AWS S3 with Snowflake, which is convenient when working with large data sources.
Requirements:
- Knowledge of SQL for working in Snowflake.
- A Snowflake account and an AWS account.
- A data file, preferably in CSV (Comma Separated Values) format. The S3 console supports uploads of up to 160 GB per file; beyond that, AWS requires other tools, which are beyond this article's scope.
Step 1
Set up an AWS account and sign in as the root user.
Step 2
After successfully logging in to your account, go to Services as shown. New users can locate the Storage category by scrolling down, then select S3.
Step 3
Upon selecting S3, you should see a display like the following (the design might change over time).
Click Create bucket, then give it a name; in our case, mybootcampbucket:
Click Create bucket to finish the process.
To upload the file, select the newly created bucket (highlighted).
Upload the file now as depicted below.
Step 4
After uploading the file/dataset, a policy needs to be set up. The policy defines the permissions that are granted to the external integration once it is associated with an identity (an IAM role) in the respective AWS account.
To set up the policy, click Services, then head to Security, Identity, & Compliance. See below:
Select IAM (Identity and Access Management), then open Policies.
After clicking Create policy, give the policy a name; in our case, Bootcamp2023.
In the policy editor, click the JSON tab, because policies are written in JavaScript Object Notation (JSON), a key-value data representation syntax.
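For reference, a minimal policy for this tutorial might look like the sketch below. It is only an illustration: it assumes the bucket is named mybootcampbucket and grants just the read and list permissions needed to load data into Snowflake; adjust the actions and resources to your own requirements.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectVersion"],
      "Resource": "arn:aws:s3:::mybootcampbucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::mybootcampbucket"
    }
  ]
}
```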
After setting up the policy, we next create the role.
Give your role a name; in this scenario, the role is named Bootcamp_2023.
Set up the permissions.
I have highlighted This account to note that we are trusting this particular AWS account.
Next, select the policy to attach to the role.
Finish up with the role creation.
Now copy the ARN (Amazon Resource Name), which uniquely identifies the role as a resource.
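Keep this ARN handy; it is what connects the role to Snowflake in Step 5. As a preview, a common way to make that connection is a storage integration along the lines of the sketch below. This is only a sketch: the integration name is made up, the account ID in the ARN is a placeholder, and the bucket and role names assume the ones used earlier in this article.

```sql
-- Sketch of a Snowflake storage integration that assumes the IAM role
-- created above. Replace the placeholder account ID with your own and
-- paste the ARN you just copied.
CREATE OR REPLACE STORAGE INTEGRATION s3_bootcamp_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/Bootcamp_2023'
  STORAGE_ALLOWED_LOCATIONS = ('s3://mybootcampbucket/');

-- DESC INTEGRATION reveals the AWS IAM user ARN and external ID that
-- need to be added to the role's trust relationship in IAM.
DESC INTEGRATION s3_bootcamp_integration;
```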
Step 5 – Setting Up a Snowflake Account.
To create a Snowflake account, head to Snowflake.
After creating an account, we need to create our warehouse, as shown below.
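The warehouse can be created through the UI or in a worksheet. A minimal SQL sketch is shown below; the warehouse name and sizing are assumptions, so adjust them to your workload.

```sql
-- Minimal warehouse for this tutorial; name and size are placeholders.
CREATE OR REPLACE WAREHOUSE bootcamp_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60          -- seconds of inactivity before suspending
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;
```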
Finally, we load the pipeline dataset from AWS S3 and display it in Snowflake.
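For context, the loading side in Snowflake looks roughly like the sketch below. It assumes the storage integration sketched in Step 4, a CSV file with a header row, and a table whose columns match the file; every object name here is a placeholder.

```sql
-- Database and schema to hold the loaded data (names are placeholders).
CREATE DATABASE IF NOT EXISTS bootcamp_db;
USE DATABASE bootcamp_db;
USE SCHEMA public;

-- File format describing the CSV file uploaded to S3.
CREATE OR REPLACE FILE FORMAT csv_format
  TYPE = 'CSV'
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  SKIP_HEADER = 1;

-- External stage pointing at the bucket through the storage integration.
CREATE OR REPLACE STAGE bootcamp_stage
  URL = 's3://mybootcampbucket/'
  STORAGE_INTEGRATION = s3_bootcamp_integration
  FILE_FORMAT = (FORMAT_NAME = 'csv_format');

-- Target table; replace the columns with the ones in your dataset.
CREATE OR REPLACE TABLE pipeline_data (
  id INTEGER,
  name STRING,
  value NUMBER
);

-- Load everything in the stage into the table, then display it.
COPY INTO pipeline_data FROM @bootcamp_stage;
SELECT * FROM pipeline_data LIMIT 10;
```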
That is all for this article.