Dynamic ETL from RDS to Redshift using AWS Glue

#aws #redshift #database #etl

Incent is a transactional based system that provides rewards for trackable actions. We can reward pretty much anything that our customers and clients want. We do this using a cryptocurrency as the reward. This article is not about cryptocurrency or blockchain, but rather how to set up a dynamic periodic ETL using existing AWS technologies.

For background, we implement RDS Postgres as parts of our database layer. We run a variety of databases to store and manage various aspects of our data. For transactional stuff it's really good and meets our needs. However running analytical queries across these multiple databases for use outside of the product is not efficient and also they literally don't connect with each other.

So we set up Redshift as a data warehouse that handles all the datasources across our platforms (including other 3rd party services). We ran several optimisations to ensure that Redshift was processing data and queries efficiently and provides an analytical layer that we don't get with a transactional approach to our data.

On the frontend, we have implemented Metabase as the analytical BI tool of choice for the team. Mainly because its simple and easy to use for both novice and power analytical users. It also connects to Redshift really nicely.

So to the ETL part...Getting data into Redshift can be tricky. There are various 3rd party solutions that can do it for you, however the costs can be quite high as you start to ingest more data. Also I'm not a fan of providing database access to services outside of our VPCs.

After digging around and playing with both AWS Data Pipeline and AWS Glue, I settled with AWS Glue mainly because it takes a per use cost approach that can be scaled up or down quickly, thus managing cost expectations and ensuring that data is migrated timely.

However out of the box, AWS Glue surprisingly doesn't support dynamic data loads. What do I mean by dynamic, well, as a startup, the tables in our backend have changes that occur due to introduction of features etc. You know the usual startup stuff. AWS Glue does have cataloging features that help figure out what has changed but their default scripts do not provide the ability to easily utilise the AWS Glue catalog features. Also the AWS Glue documentation, like many of AWS documentation, usually misses some key aspects.

So I've created this script that does the following:

Looks at all the AWS Glue Catalogs that you have
Grabs the Data Catalog that you are interested in and recursively grabs the data from those tables and migrates them to the destination tables.

Some notes:

You will need to set up AWS Glue catalog
You can use the crawler settings to setup periodic updates. This will check to see if your DBs have been updated with new tables etc
You can then add this script as a job and modify/duplicate it as you desire to suit your needs
You can then set up a trigger or workflow that goes through a set of jobs in a standard process on regular intervals that suit your needs (noting that increased frequency will increase cost.)

Important Quirks of AWS Glue Jobs

For the job scripts to work properly make sure the following are set
Job bookmarks - Disable
Job metrics - Disable
Continuous logging - Disable
Server-side encryption - Disabled

Link to script - https://gitlab.com/incent/aws-glue-etl