Fady khallaf

AWS DMS and Prefect: The Key to Building a Robust Data Warehouse

AWS DMS (Database Migration Service) is a fully managed service that makes it easy to migrate data from multiple sources to a variety of targets, including Amazon RDS, Aurora, and Redshift, as well as other databases on premises or in the cloud. One of the key features of DMS is the ability to capture data changes in real time and replicate them to the target database.

There are two main ways to use DMS to capture data changes:

Full load and change data capture (CDC)

This method involves performing a full load of data from the source to the target, and then subsequently capturing and replicating only the changes made to the source data. This method can be used when migrating data from a source database that supports CDC, such as Oracle, SQL Server, or MySQL.

To set up full load and CDC in DMS, you need to create a replication instance, a source endpoint, and a target endpoint. You can then create a replication task and configure it to use the full load and CDC method. DMS will automatically capture and replicate changes made to the source data in real time.
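For illustration, here is a minimal boto3 sketch of creating such a replication task. The ARNs are placeholders for a replication instance and endpoints created beforehand, and the table mapping simply includes every table.

```python
import json

import boto3

dms = boto3.client("dms")

# All ARNs below are placeholders for the replication instance and
# endpoints created beforehand in the DMS console or via the API.
dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-full-load-and-cdc",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # full load first, then ongoing CDC
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-tables",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```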

Replication using CDC-enabled API

This method involves capturing data changes at the application layer, for example with an API built on AWS AppSync or AWS Lambda, and sending them to DMS. It can be useful when log-based CDC is not available or not enabled on the source database.

Using DMS to capture data changes in real time can save time and effort when migrating data, as it eliminates the need to manually track and replicate changes. Additionally, it ensures that the target database is always up to date with the latest changes in the source database.

In the following sections, I will share how we used AWS DMS to capture data changes from our MySQL database's binlog and store them in an S3 bucket.

First, to capture the binlog from a MySQL database, check that the binlog format is set to ROW in the RDS parameter group, so that every INSERT, UPDATE, and DELETE appears as row-level change data in the binlogs.
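As a minimal sketch, the parameter can be set with boto3; the parameter group name below is a placeholder for the one attached to your RDS instance. Since binlog_format is a dynamic parameter, the change can be applied without a reboot.

```python
import boto3

# Placeholder name; replace with the parameter group attached to your RDS instance.
PARAMETER_GROUP = "my-mysql-params"

rds = boto3.client("rds")

# Set binlog_format to ROW so the binlog records full row images
# for every INSERT, UPDATE, and DELETE.
rds.modify_db_parameter_group(
    DBParameterGroupName=PARAMETER_GROUP,
    Parameters=[
        {
            "ParameterName": "binlog_format",
            "ParameterValue": "ROW",
            "ApplyMethod": "immediate",
        }
    ],
)
```

You can confirm the setting from the database itself by running `SHOW VARIABLES LIKE 'binlog_format';`.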

Second, use AWS DMS to move the binlog changes from RDS to an S3 bucket by pointing the DMS source endpoint at the RDS instance and the DMS target endpoint at the S3 bucket. Now all your change data is collected in the S3 bucket, and you can process it with any data orchestration tool, such as Prefect, Dagster, Airflow, or any other data engineering pipeline tool.
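A rough sketch of those two endpoints with boto3 might look like the following; every identifier, hostname, bucket name, and role ARN here is a placeholder.

```python
import boto3

dms = boto3.client("dms")

# Source endpoint pointing at the MySQL RDS instance (all values are placeholders).
dms.create_endpoint(
    EndpointIdentifier="mysql-source",
    EndpointType="source",
    EngineName="mysql",
    ServerName="my-rds-instance.abc123.us-east-1.rds.amazonaws.com",
    Port=3306,
    Username="dms_user",
    Password="********",
)

# Target endpoint writing the change records into an S3 bucket.
dms.create_endpoint(
    EndpointIdentifier="s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-binlog-bucket",
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
        "DataFormat": "parquet",  # CSV is the default; Parquet is also supported
    },
)
```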

In our use case, we used Prefect, which is a modern, open-source workflow management system that can help you schedule, execute, and monitor your data pipeline tasks.

With Prefect, you can easily create workflows that include tasks such as data extraction, cleaning, and loading. You can set up your data pipeline to run on a schedule, such as hourly, daily, or weekly. It also offers built-in monitoring and alerting, so you can easily keep track of the status of your pipeline and troubleshoot any issues that may arise.
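To give a feel for the API, here is a minimal Prefect 2 flow with made-up extract, clean, and load tasks; in practice you would attach an hourly or daily schedule to it through a Prefect deployment.

```python
from prefect import flow, task


@task
def extract() -> list[dict]:
    # Placeholder: read the latest DMS change files from S3.
    return [{"id": 1, "op": "insert"}]


@task
def clean(records: list[dict]) -> list[dict]:
    # Placeholder: drop malformed rows, normalise types, deduplicate, etc.
    return [r for r in records if "id" in r]


@task
def load(records: list[dict]) -> None:
    # Placeholder: copy the cleaned records into the warehouse.
    print(f"loaded {len(records)} records")


@flow(log_prints=True)
def binlog_pipeline():
    records = extract()
    load(clean(records))


if __name__ == "__main__":
    # Run once locally; a Prefect deployment would attach the cron schedule.
    binlog_pipeline()
```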

To use DMS and Prefect together, you can create a Prefect task that triggers a DMS migration, and then chain other tasks in the workflow to perform additional actions on the data (Binlogs in the S3), such as cleaning or transforming it before loading it into Redshift as a staging schema.
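A hedged sketch of that chaining, assuming an existing replication task ARN and bucket name, and with placeholder transform and load steps, could look like this:

```python
import boto3
from prefect import flow, task


@task(retries=2)
def start_dms_task(task_arn: str) -> None:
    # Resume the existing full-load-and-CDC replication task.
    dms = boto3.client("dms")
    dms.start_replication_task(
        ReplicationTaskArn=task_arn,
        StartReplicationTaskType="resume-processing",
    )


@task
def transform_binlogs(bucket: str) -> str:
    # Placeholder: parse the change files in S3, clean and transform them,
    # and return the location of the transformed output.
    return f"s3://{bucket}/cleaned/"


@task
def load_to_staging(path: str) -> None:
    # Placeholder: COPY the transformed files into the Redshift staging schema.
    print(f"COPY staging tables FROM {path}")


@flow
def dms_to_staging(task_arn: str, bucket: str):
    start_dms_task(task_arn)
    cleaned = transform_binlogs(bucket)
    load_to_staging(cleaned)
```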

We then used another Prefect workflow to run data validation and transformation on the staging schema and load the results into the data warehouse presentation layer, so that analytics tools and data science models can rely on this data.
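As an illustration only, such a flow might check a hypothetical staging.orders table and rebuild a presentation table from it; the sketch below uses psycopg2, since Redshift speaks the PostgreSQL wire protocol, and all table and column names are made up.

```python
import psycopg2
from prefect import flow, task


@task
def validate_staging(dsn: str) -> None:
    # Hypothetical check: fail the run if the staging table is empty.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM staging.orders")
        if cur.fetchone()[0] == 0:
            raise ValueError("staging.orders is empty; aborting the load")


@task
def load_presentation(dsn: str) -> None:
    # Hypothetical transformation: append the cleaned rows to the presentation table.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO presentation.orders (order_id, customer_id, amount, updated_at)
            SELECT order_id, customer_id, amount, updated_at
            FROM staging.orders
            """
        )


@flow
def staging_to_presentation(dsn: str):
    validate_staging(dsn)
    load_presentation(dsn)
```

You would call `staging_to_presentation` with a DSN such as `"host=<cluster-endpoint> port=5439 dbname=dwh user=etl_user password=..."`.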

In summary, AWS DMS and Prefect can be significant players in your journey to build your own data warehouse.

Data warehouse Architecture
