I am preparing to migrate 1B records from AWS RDS PostgreSQL to AWS DynamoDB. Before moving 1B records I want to build some POCs using a smaller data set (101M records) to find the most optimal way of getting those records into AWS DynamoDB.
My first choice was obvious: AWS Data Migration Service. AWS DMS supports AWS RDS PostgreSQL as a source and AWS DynamoDB as a target. This can be setup very easily.
However, it took me a few iterations, trying out different instance types, different migration task parameters, to finally get to a stage where I was fully using 20000 WCU when writing data to DynamoDB.
I also noticed that AWS DMS was using
PutItem operation rather than more efficient
The fastest of my migrations tasks was:
I watched a few videos from the online AWS re:Invent about Glue and how super efficient it was in transforming and moving data around. I also knew that Glue added a support for writing to DynamoDB.
I wanted to try it out and see if it can be of any use for my migration project.
I used AWS Data Migration Service to export data to AWS S3 in Parquet format. Exporting table that has 101M records was quite fast and took 54m:
I haven't used Glue before. In one of the AWS re:Invent videos I saw that AWS recently added Glue Studio where you can build jobs using visual editor and for more complex jobs you can add custom code.
I built a simple job in Glue Studio, but noticed that DynamoDB is not yet supported (in Studio). Doing a quick Google search I found a sample code right from the Glue documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb.
I pasted that code into my "Transformation - Custom code" block. I only had to change variables' names as Glue Studio generates code using camelCase and the documentation was using snake_case. Anyway it was just 1 operation: a call to
glue_context.write_dynamic_frame_from_options() (I hope that in the near future it will be supported out of the box in Glue Studio).
And that was it. I ran the job and it took 3h 7m to complete:
In DynamoDB I saw all my 101M records:
The environment looked like this:
- VPC with VPC Endpoints to AWS S3 and AWS DynamoDB
- AWS RDS PostgreSQL running in private subnet, db.m5.large
- AWS DMS instance running in private subnet, dms.c5.2xlarge
- DynamoDB table provisioned with 20000 WCU
- AWS DMS migration task with
ParallelLoadThreadsset to 130 and
ParallelLoadBufferSizeset to 1000 was running at 74% CPU and was generating 20000 WCU
- AWS Glue job had 10 workers,
dynamodb.throughput.write.percentset to 1.0 was maxing 20000 WCU
The results were the following:
RDS to DynamoDB
RDS to S3
S3 to DynamoDB
|RDS Reads IOPS||100||350||-|
|Time||2h 51m||54m||3h 7m|
Direct migration between AWS RDS to AWS DynamoDB using AWS DMS was 2h 51m.
Staged migration with RDS -> S3 Parquet -> DynamoDB was: 54m + 3h 7m = 5h 1m.
After all DMS was faster than the DMS + S3 + Glue trio.
As you can see in the table above, in both approaches the pressure on AWS RDS PostgreSQL instance was minimal.
There are still some optimisations I want to try, like speeding up the import to DynamoDB by setting WCU even higher. That will require some AWS DMS task parameters tuning and probably increasing the DMS instance size as well.
Pros of using DMS are:
- it's fast, even with
- one-stop shop for your migration tasks
- can run additional validation logic for your migration
Cons of using DMS are:
- need to rightsize the instance and tune task parameters to max DynamoDB writes
Pros of using S3 Parquet and Glue:
- serverless solution - no need to manage the infrastructure
- didn't have to do any job tuning, setting
dynamodb.throughput.write.percentto 1.0 was all I had to do to max DynamoDB writes
- you get some bonuses, think - data lake!
Cons of using S3 Parquet and Glue:
- slower than native DMS migration
Since I had the data already in AWS S3 in Parquet format I wanted to play around with AWS Athena. I used Athena before and already knew it was super fast, but I never had a chance to play around with 20 columns wide 101M records data set. The experience was mind blowing. Complex queries running in just a few seconds. The same queries running on the original AWS RDS PostgreSQL instance were taking 20 minutes to complete!
Since I already had the data in AWS Athena database I wanted to go one step further and easily visualise the data I had using AWS QuickSight. It took me a few minutes to set up some Pie Charts and Heat maps and publish my dashboard. All by drag and dropping columns in the QuickSight UI.