Dataform is a powerful tool for managing data transformations in your warehouse. With Dataform you can automatically manage dependencies, schedule queries, and easily adopt engineering best practices with built-in version control. Currently Dataform integrates with Google BigQuery, Amazon Redshift, Snowflake and Azure Data Warehouse. Often, however, the “root” of your data lives in an external source such as Amazon S3. If that’s the case and you’re considering a tool like Dataform to start building out your data stack, there are some simple scripts you can run in Dataform to import this data into your cloud warehouse.
We’re going to talk about how to import data from Amazon S3 into Amazon Redshift in just a few minutes, using the COPY command, which loads data in parallel from multiple data files. The COPY command can also load data from other sources, e.g. Amazon EMR or an Amazon DynamoDB table.
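As a quick illustration, a COPY from a DynamoDB table looks broadly like the sketch below. The table names and IAM role ARN here are hypothetical, and READRATIO caps how much of the table’s provisioned read throughput the load may use.
/* A minimal sketch of COPY from a DynamoDB table; names and ARN are hypothetical. */
COPY my_redshift_table
FROM 'dynamodb://my_dynamodb_table'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftDynamoDBReadRole'
READRATIO 50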
Before you begin you need to make sure you have:
- An Amazon Web Services (AWS) account. Signing up is free - click here.
- Permissions in AWS Identity Access Management (IAM) that allow you to create policies, create roles, and attach policies to roles. This is required to grant Dataform access to your S3 bucket.
- Verified that the column names in your CSV files in S3 adhere to your destination’s length limit for column names. If a column name is longer than the destination’s limit it will be rejected. In Redshift’s case, identifiers are limited to 127 bytes.
- An Amazon S3 bucket containing the CSV files that you want to import.
- A Redshift cluster. If you do not already have a cluster set up, see how to launch one here.
- A Dataform project set up which is connected to your Redshift warehouse. See how to do that here.
OK, now you’ve got all that sorted, let’s get started!
Once you’re in Dataform, create a new .sqlx file in your project under the definitions/ folder. Using Dataform’s enriched SQL, this is what the code should look like:
config {
type: "operations",
// Declare to Dataform that this script writes out a dataset.
hasOutput: true
}
js {
// Dataform’s name() function returns the name of the table (not inclusive
// of the schema) that Dataform expects this script to create.
const finalTableName = name();
// The resolve() function is used to compute a fully-qualified table name.
const fullyQualifiedTempTableName = resolve(`${finalTableName}_temp`);
// self() returns the fully-qualified name of the table that Dataform
// expects this script to create.
const fullyQualifiedFinalTableName = self();
}
/* Step 1: If the S3 tables already exist in Redshift (perhaps from a previous run),
you will need to drop the tables to remove them from the database before you
create them again in the next step. */
DROP TABLE IF EXISTS ${fullyQualifiedTempTableName} CASCADE
---
/* Step 2: Create a temporary table by executing the following CREATE TABLE command. */
CREATE TABLE ${fullyQualifiedTempTableName} (country VARCHAR(MAX), revenue FLOAT)
---
/* Step 3: Copy data from S3 into the temporary table using the COPY command. */
COPY ${fullyQualifiedTempTableName}
FROM
  's3://dataform-integration-tests-us-east-n-virginia/sample-data/sample_data'
IAM_ROLE 'arn:aws:iam::161427895535:role/RedshiftS3ReadRole'
IGNOREHEADER 1
DELIMITER ','
---
/* Step 4: Delete the output table (if it exists). */
DROP TABLE IF EXISTS ${fullyQualifiedFinalTableName} CASCADE
---
/* Step 5: Rename the temporary table to the output table. */
ALTER TABLE ${fullyQualifiedTempTableName} RENAME TO ${finalTableName}
To execute the COPY command you need to provide the following values:
- Table name: The target table for the COPY command, i.e. the table in Redshift that the data is loaded into. The table must already exist in the database, and it can be either temporary or persistent. The COPY command appends the new input data to any existing rows in the table.
COPY ${fullyQualifiedTempTableName}
- Data source: When loading from Amazon S3, you must provide the name of the bucket and the location of the data files, either as an object path (prefix) for the data files or as the location of a manifest file that explicitly lists each data file and its location.
FROM 's3://dataform-integration-tests-us-east-n-virginia/sample-data/sample_data'
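COPY also needs authorization to read from the bucket, which the script above supplies with the IAM_ROLE clause, using the role you granted access to in the prerequisites. If you’d rather list every file explicitly than load everything under an object prefix, you can point COPY at a manifest file instead. Here’s a minimal sketch, assuming a hypothetical bucket, manifest path and role ARN:
COPY ${fullyQualifiedTempTableName}
FROM
  's3://my-bucket/manifests/sample_data.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftS3ReadRole'
MANIFEST
IGNOREHEADER 1
DELIMITER ','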
Once you have your S3 import ready
Finally, you can push your changes to GitHub and then publish your table to Redshift. Alternatively, you can run this using the Dataform CLI: dataform run.
And voilà! Your S3 data is now available as a table in your Redshift warehouse and can be included in your larger Dataform dependency graph. This means you can run it alongside all your other code, add dependencies on top of it (so any datasets that rely on it will only run if it succeeds), use the ref() or resolve() functions on this dataset in another script, and document its data catalog entry with your own descriptions.
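For example, because the operation declares hasOutput: true, downstream datasets can select from it with ref() just like any other table. Here’s a minimal sketch, assuming the file above was saved as definitions/s3_import.sqlx (a hypothetical name):
config {
  type: "table"
}
/* Aggregate the imported S3 data; this only runs once the import succeeds. */
SELECT
  country,
  SUM(revenue) AS total_revenue
FROM
  ${ref("s3_import")}
GROUP BY
  country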
For more information about how to get set up on Dataform, please see our docs.