Wendy Wong for AWS Heroes


Amazon Redshift ingest data with auto-copy from Amazon S3

Introduction

You may be a data scientist, business analyst, or data analyst who is familiar with loading data from Amazon S3 into Amazon Redshift using the COPY command. At AWS re:Invent 2022, to help AWS customers move towards a zero-ETL future, AWS announced auto-copy from Amazon S3 into Amazon Redshift, which simplifies data movement without the need for a data engineer to build an ETL pipeline.


Lesson Outcome

  • In this lesson you will learn how to ingest data with auto-copy from Amazon S3 into Amazon Redshift.

What is Amazon Redshift Auto-Copy from Amazon S3?

Amazon Redshift auto-copy from Amazon S3 simplifies data ingestion from Amazon S3.

How does this feature work?

Auto-copy from Amazon S3 is a simple, low-code data ingestion feature that automatically loads new files detected in your S3 bucket into Amazon Redshift.

New files can be ingested in CSV, JSON, Parquet, or Avro format.

You store a COPY command in a COPY job in Amazon Redshift, which then detects new files arriving in Amazon S3 and loads the data into your table.

What are the benefits?

  • You save time for your team by eliminating manual COPY statements every time new data lands in an S3 bucket.

  • COPY jobs keep track of files that were previously loaded during data ingestion (you can confirm this with the system-view query sketched after this list).

  • Because previously loaded files are excluded, COPY jobs prevent data duplication even when a job is run manually rather than automatically.

  • COPY jobs can also be created and run manually, so you can reuse the same COPY statement when automation is not required.
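If you want to confirm which files a COPY job has already processed, you can query the copy-job system views. Below is a minimal sketch, assuming the view names described in the developer guide during the preview (SYS_COPY_JOB and SYS_LOAD_HISTORY); check the current documentation for your Redshift version, as names and columns may change.

-- List the COPY jobs defined in the cluster (view name per the preview documentation).
SELECT * FROM sys_copy_job;

-- Inspect recent load activity to confirm previously loaded files are skipped.
SELECT * FROM sys_load_history LIMIT 20;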

Solution Architecture

In this high-level architecture, new files land in separate folders within a single Amazon S3 bucket, and a COPY job for each folder inserts the data into the corresponding table (Table A or Table B).
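As a rough sketch of this architecture, the two COPY jobs might look like the following. The bucket name, folder prefixes, table names, and IAM role ARN below are hypothetical placeholders, not names used later in this tutorial.

-- COPY job for folder A: new files under this prefix are loaded into table_a.
COPY table_a
FROM 's3://example-ingest-bucket/folder-a/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
FORMAT AS CSV
IGNOREHEADER 1
JOB CREATE job_table_a AUTO ON;

-- COPY job for folder B: new files under this prefix are loaded into table_b.
COPY table_b
FROM 's3://example-ingest-bucket/folder-b/'
IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
FORMAT AS CSV
IGNOREHEADER 1
JOB CREATE job_table_b AUTO ON;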


Prerequisites

  • Download an interesting open-source dataset and save it in your directory. This tutorial uses data from Kaggle at this link.

  • Log in to your AWS account as an IAM admin user. You may view this blog to learn how to create an account.

  • Create an Amazon S3 bucket. You may view this blog to create your first S3 bucket.

  • Use AWS Glue to create a crawler to inspect the metadata in the Data Catalog table. You may follow this blog to learn how to create a crawler.

Dataset

The World Happiness Report dataset is available to download from Kaggle here. There are five datasets, one for each year from 2015 to 2019.

Data Description

There are 12 columns in each CSV file:

  • Country
  • Region
  • Happiness Rank
  • Happiness Score
  • Standard Error
  • Economy - GDP per capita
  • Family
  • Health - Life Expectancy
  • Freedom
  • Trust - Government Corruption
  • Generosity
  • Dystopia

Use AWS Glue to create a crawler and inspect the schema of one file in the Data Catalog:


Tutorial 1: Getting started with auto-copy files from Amazon S3 (Preview) into Amazon Redshift

For this tutorial, we will follow the steps outlined in the Amazon Redshift Database Developer Guide.

Step 1: Sign into the AWS Management Console with your AWS IAM Admin User account details.

Step 2: Navigate to Amazon Redshift by typing 'Redshift' in the search bar.


Step 3: Create a cluster in preview by selecting the Provisioned clusters dashboard.

For this preview, I will use the AWS Region US East (Ohio) (us-east-2).


Step 4: Click Create preview cluster in the banner.


Step 5: Enter a unique name in lowercase letters to identify your cluster.

Next, select a preview track from the drop-down menu, e.g. Preview 2022, to allow you to test the new features.


Step 6: Inspect the size of your data or estimate the size.


Step 7: Enter the other configurations for your cluster. I selected Help me choose and chose additional options to indicate that my estimate is for compressed data in Amazon Redshift and that my data is not time-based.


The calculated configuration summary recommended 2 nodes of the dc2.large node type.


Step 8: Under Database configurations, create an admin user password that includes at least one uppercase letter and one number.


Step 9: Under Cluster permissions, I created a new IAM role with the AmazonRedshiftAllCommandsAccess policy attached, which allows you to run SQL commands such as COPY, UNLOAD, and SELECT. The new IAM role can access any S3 bucket, and it replaces the default IAM role.

Type 'IAM' in the navigation search bar, click Roles, and create a new role.

Add additional permissions for Query Editor v2, Amazon S3, and Amazon Redshift full access.

Provide a name for the new service role, e.g. 'redshift-autocopys3-custom-role'.


Add permissions and then click Create role.

Confirm that the new service role has been created in IAM.

Step 10: Click Create cluster in preview; it will take a few minutes to provision. (Note: do not use the preview cluster for your production workloads.)



The preview cluster has been created successfully.


Step 11: In the navigation pane on the left, click Query editor v2.


Step 12: Double-click the preview cluster (i.e. the database) and expand the arrow to inspect the tables.

You can inspect the new preview cluster that has been created.
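If you prefer SQL to the tree view, a quick way to list the tables in the database is to query the information schema; this is an optional check, not a required step in the tutorial.

-- List the tables currently defined in the public schema.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema = 'public';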


Step 13: Create an Amazon S3 bucket with a globally unique name, e.g. world-happiness-report, using the default settings, and click Create bucket.


Step 14: Create a single folder in the Amazon S3 bucket for automated file ingestion, e.g. 's3-autocopy-file'.


Create a subfolder, e.g. 'datasets'.

Upload the five CSV files.


Tutorial 2: Load data into Amazon Redshift and query the data

Step 1: Create a table in Amazon Redshift.

The following SQL code creates the world_happiness table:



CREATE TABLE world_happiness (
    country VARCHAR(100),
    region VARCHAR(100),
    happiness_rank INT,
    happiness_score FLOAT4,
    standard_error FLOAT4,
    economy FLOAT4,
    family FLOAT4,
    health FLOAT4,
    freedom FLOAT4,
    trust FLOAT4,
    generosity FLOAT4,
    dystopia FLOAT4
);


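To confirm that the table was created with the expected column names and data types, you can run a quick optional check against the information schema:

-- Verify the columns and data types of the new world_happiness table.
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'world_happiness'
ORDER BY ordinal_position;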

Step 2: Create a COPY job to automatically load CSV files from the Amazon S3 path into the world_happiness table in Amazon Redshift.



COPY world_happiness
FROM 's3://world-happiness-report/s3-autocopy-file/datasets/published_reports'
IAM_ROLE 'arn:aws:iam::***********:role/redshift-autocopys3-custom-role'
DELIMITER ','
IGNOREBLANKLINES
REGION 'us-east-1'
IGNOREHEADER 1
JOB CREATE job_world_happiness1 AUTO ON;

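Once the COPY job has picked up the files, you can query the table to confirm the load. For example, the following returns the ten countries with the highest happiness scores (column names follow the table definition above).

-- Preview the ten countries with the highest happiness scores.
SELECT country, region, happiness_rank, happiness_score
FROM world_happiness
ORDER BY happiness_score DESC
LIMIT 10;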




Final Notes

  • Auto-copy from Amazon S3 to Amazon Redshift should be used for test workloads only during the public preview.

  • After you have finished testing, please delete your provisioned Amazon Redshift cluster to avoid billing charges.

  • The default IAM role is not supported during the public preview. You may read other best-practice guidelines in this blog.

AWS re:Invent 2022

If you missed the keynote from Dr Swami Sivasubramanian, VP of Data and Machine Learning at Amazon Web Services, you may watch it on demand and keep informed of the other keynotes here.

If you would like to watch the announcement of auto-copy files from Amazon S3 (Preview) into Amazon Redshift, fast-forward to 1 hour and 18 minutes.


What's new next week? 🌎

AWS Innovate

Register for AWS Pi Day at this link


Until the next lesson, happy learning! 🙂

Top comments (2)

Benjamen Pyle: Nice article Wendy!

Wendy Wong: Thank you Benjamen for your kind words!