DEV Community

Paige Niedringhaus
Paige Niedringhaus

Posted on • Originally published at paigeniedringhaus.com on

Create Upsert Yourself for Amazon Redshift Databases

Digital sign boards displaying data hanging from the ceiling

Introduction

One of the many things I enjoy about working at a startup is the variety of people I get to work with across the company and the interesting problems I get to solve. A few months back, the marketing team was working on improving our analytics reporting and wanted to store data in our Amazon Redshift database that could then be used to build more detailed reports.

While this would normally be a relatively straightforward INSERT SQL request if all the data was brand new, one thing that made it a little tricky was that I needed to update certain rows in the data table AND insert new rows as well. Normally, I'd use an UPSERT SQL statement to do both things at once, but Redshift doesn't support UPSERT statements, so I had to get a little more creative to make this happen manually. With some research and a lot of testing, I learned that by using a temporary data table and a few SQL statements strung together it was possible, and it wasn't as complicated as I thought it would be.

Let's look at how to create our own version of an UPSERT SQL statement in a PostgreSQL-based Amazon Redshift database, updating data that already exists and adding new data as well.


Amazon Redshift

Redshift is Amazon Web Service's PostgreSQL-based cloud data warehouse built with added functionality to manage very large datasets and support high-performance analysis and reporting of those data.

It's similar to Google's BigQuery or Microsoft's Azure cloud data warehouses, if you've ever used either of those before.

And although Redshift is based on PostgreSQL it does have some differences, one of those being: a lack of UPSERT functionality.

No built in UPSERT functionality

Amazon acknowledges this shortcoming readily enough, and suggests using a staging table to perform a merge. This sounds sensible, but the AWS documentation on how exactly to do this is a little sparse.

So after I figured it out, I wanted to share it, with code examples, for each step. Let's get to it.

Create your own UPSERT

NOTE: This tutorial assumes you're familiar with basic SQL syntax. If you wish to reference any of the commands in the article, I recommend checking the Redshift SQL command docs.

If there's no native UPSERT, we'll make our own. There's more than one way to recreate UPSERT but the option I chose to go with essentially deletes all the existing rows in the production table that match rows in the temporary table, then inserts all the data from the temporary table into the production table.

It sounds complex (because it is a bit), but it's not as bad once each step in the process is broken down.

For this article, I'll set up a table (USER_UPDATES) that's a copy of a fictitious USERS table. In USER_UPDATES, I'll modify some existing data in the table (updates), and create some new data in the table (inserts), then we'll get to updating that newly modified data into the USERS table.

Prerequisite data modification setup

-- Create a sample table as a copy of the fake USERS table 

CREATE TABLE user_updates AS
SELECT * FROM users;

-- Change every third row so we have updated users

UPDATE user_updates
SET 
    firstname = upper(firstname),
    lastname = upper(lastname),
    lastseen = '2022-11-13'
AND mod(userid, 3) = 0;;

-- Add some new rows of users so we have insert examples 
-- This example creates a duplicate of every fifth row of user data

INSERT INTO user_updates
SELECT 
      (userid + 127) AS userid, 
      firstname, 
      lastname, 
      getdate() AS lastseen,
FROM user_updates
AND mod(userid, 5) = 0;
Enter fullscreen mode Exit fullscreen mode

Now that there's some updated user data, it's time to add it back to the USERS table, while ensuring the user data for users already in the table gets overwritten and updated, and the new user data gets added.

1. Make a staging table

The first thing we'll need to do is write a SQL command that creates a temporary table that is an exact copy of the production table the data will eventually be written into.

-- Create temporary table that's a duplicate of the production users table

CREATE TEMP TABLE temp_users (LIKE users);
Enter fullscreen mode Exit fullscreen mode

Creating a temporary table is almost the same as creating a regular table in SQL (just add TEMP), and instead of having to list out each column in detail, we can simply use LIKE <table_name> to tell SQL to make this table's columns a copy of the table it will eventually add its data to.

2. Insert the new data into the temp table

With our temp table now created, it's time to insert the modified data in the prerequisite setup step into that temp table. In this INSERT statement, we'll take all the data from the USER_UPDATES table and insert it into the temporary TEMP_USERS table.

-- Take all the data from the user_updates table and insert it into the temporary table

INSERT INTO temp_users
SELECT
       userid,
       firstname,
       lastname,
       lastseen,
FROM user_updates;
Enter fullscreen mode Exit fullscreen mode

Here, all the data from USER_UPDATES is put in the TEMP_USERS table for the moment.

3. Begin the transaction

A transaction is a unit of work done in a database - the work can be anything from creating tables to deleting them. The transaction taking place here, will be deleting particular rows of user data from the production table USERS (these are existing users who have updated data in the temp table) before adding their new data into the table.

-- Start the work of deleting existing user data from the table and adding the new data instead

BEGIN TRANSACTION;
Enter fullscreen mode Exit fullscreen mode

The thing about transactions in PostgreSQL is that a commit is implicitly performed at the end of the statement as long as the execution is successful. If the execution fails, a rollback is done instead. This becomes especially convenient because if new user data fails to get added to the USERS table, it will rollback and automatically replace the old user data back in the table.

4. Delete the rows in the permanent table that need to be replaced with new data

Once the transaction's begun, the first thing to do is delete any rows from the permanent table (USERS) where the user ID matches a user ID in the TEMP_USERS table.

-- Delete the users from the database whose data needs to be updated

DELETE FROM users 
USING 
    temp_users 
WHERE users.userid = temp_users.userid;
Enter fullscreen mode Exit fullscreen mode

By doing this DELETE first, we ensure that there will be no duplicate user data when the new data is added to the USERS table.

5. Insert all the data from the temp table to the permanent table

Now it's safe to insert everything from the TEMP_USERS table into the USERS table, without fear of duplicating existing user data.

-- Add all the user data from the temp table to the prod table 

INSERT INTO users
SELECT * FROM temp_users;
Enter fullscreen mode Exit fullscreen mode

6. End the transaction

Once all the parts of the transaction have completed successfully, it can be ended just like it began.

-- End the work

END TRANSACTION;
Enter fullscreen mode Exit fullscreen mode

There's one last thing to do now: a little clean up.

7. Delete the temp table

Finally, since the TEMP_USERS table has done its job and there's no further need for it, it's dropped (deleted).

-- Delete the temporary table now that its job is done

DROP TABLE temp_users;
Enter fullscreen mode Exit fullscreen mode

Creating our own UPSERT's not so bad once each of the steps are broken down and explained.


Conclusion

Amazon Redshift is a popular, powerful data warehouse, but even though it's based on PostgreSQL, it lacks some of the nicer features like UPSERT, which is exactly the feature I needed to help out one of the teams at work.

Since I needed to update a data table with new information and update existing data in that table, I needed something more than a simple SQL INSERT. Lucky for me, creating UPSERT functionality wasn't as difficult as I thought it would be with the help of a temporary table and a couple of targeted DELETE and INSERT queries. Not too shabby.

Check back in a few weeks — I’ll be writing more about JavaScript, React, IoT, or something else related to web development.

If you’d like to make sure you never miss an article I write, sign up for my newsletter here: https://paigeniedringhaus.substack.com

Thanks for reading. I hope you find it helpful to see how to create your own SQL UPSERT function. This probably won't be the first or the last time you need to insert and update data into a table, and it's nice to have options for how best to accomplish it.


References & Further Resources

Top comments (0)