I am preparing to migrate 1B records from AWS RDS PostgreSQL to AWS DynamoDB. Before moving all 1B records I want to build some POCs using a smaller data set (101M records) to find the most efficient way of getting those records into AWS DynamoDB.
My first choice was obvious: AWS Data Migration Service. AWS DMS supports AWS RDS PostgreSQL as a source and AWS DynamoDB as a target, and it can be set up very easily.
However, it took me a few iterations of trying different instance types and migration task parameters to finally get to a stage where I was fully using the 20000 WCU when writing data to DynamoDB.
I also noticed that AWS DMS was using the `PutItem` operation rather than the more efficient `BatchWriteItem`.
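For context, the difference matters because `BatchWriteItem` can pack up to 25 items into a single request, cutting the number of HTTP round trips (the WCU cost per item stays the same). Below is a minimal boto3 sketch of the two call patterns; the table name and item shape are made up for illustration:

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Single-item write - one HTTP request per record, which is what DMS was doing.
dynamodb.put_item(
    TableName="my-table",  # placeholder table name
    Item={"id": {"S": "record-1"}, "payload": {"S": "..."}},
)

# Batched write - up to 25 items per request, far fewer round trips.
# A production loop would also retry anything returned in UnprocessedItems.
dynamodb.batch_write_item(
    RequestItems={
        "my-table": [
            {"PutRequest": {"Item": {"id": {"S": f"record-{i}"}, "payload": {"S": "..."}}}}
            for i in range(2, 27)  # 25 items, the per-request maximum
        ]
    }
)
```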
The fastest of my migration tasks was:
I watched a few videos from the online AWS re:Invent about Glue and how efficient it is at transforming and moving data around. I also knew that Glue had added support for writing to DynamoDB.
I wanted to try it out and see whether it could be of any use for my migration project.
Export tables to AWS S3 in Parquet format
I used AWS Data Migration Service to export the data to AWS S3 in Parquet format. Exporting the table with 101M records was quite fast and took 54m:
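For reference, the Parquet output format is configured on the DMS S3 target endpoint. Here is a hedged boto3 sketch of what such an endpoint definition could look like; the identifier, bucket, folder and role ARN are placeholders, not my actual configuration:

```python
import boto3

dms = boto3.client("dms")

# S3 target endpoint that tells DMS to write Parquet instead of the default CSV.
dms.create_endpoint(
    EndpointIdentifier="rds-to-s3-parquet-target",  # placeholder
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::XXX:role/dms-s3-access",  # placeholder
        "BucketName": "my-migration-bucket",  # placeholder
        "BucketFolder": "exports/postgres",   # placeholder
        "DataFormat": "parquet",
        "ParquetVersion": "parquet-2-0",
    },
)
```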
Glue & Glue Studio
I hadn't used Glue before. In one of the AWS re:Invent videos I saw that AWS had recently added Glue Studio, where you can build jobs using a visual editor and, for more complex jobs, add custom code.
I built a simple job in Glue Studio, but noticed that DynamoDB is not yet supported as a target in Glue Studio. A quick Google search led me to sample code straight from the Glue documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-dynamodb.
I pasted that code into my "Transformation - Custom code" block. I only had to change the variable names, as Glue Studio generates code using camelCase while the documentation uses snake_case. Anyway, it was just one operation: a call to `glue_context.write_dynamic_frame_from_options()`
(I hope that in the near future it will be supported out of the box in Glue Studio).
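For reference, stripped of the Glue Studio boilerplate, the documented call boils down to something like the sketch below; the S3 path and DynamoDB table name are placeholders:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the Parquet files that DMS exported to S3 (path is a placeholder).
dyf = glue_context.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-migration-bucket/exports/postgres/"]},
    format="parquet",
)

# The single call from the documentation: write the DynamicFrame to DynamoDB,
# letting the connector use 100% of the table's provisioned write throughput.
glue_context.write_dynamic_frame_from_options(
    frame=dyf,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "my-target-table",  # placeholder
        "dynamodb.throughput.write.percent": "1.0",
    },
)
```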
And that was it. I ran the job and it took 3h 7m to complete:
In DynamoDB I saw all my 101M records:
Environment setup and summary
The environment looked like this:
- VPC with VPC Endpoints to AWS S3 and AWS DynamoDB
- AWS RDS PostgreSQL running in private subnet, db.m5.large
- AWS DMS instance running in private subnet, dms.c5.2xlarge
- DynamoDB table provisioned with 20000 WCU
- AWS DMS migration task with `ParallelLoadThreads` set to 130 and `ParallelLoadBufferSize` set to 1000, running at 74% CPU and generating 20000 WCU (see the sketch after this list)
- AWS Glue job with 10 workers and `dynamodb.throughput.write.percent` set to 1.0, maxing out 20000 WCU
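For reference, the two DMS task settings mentioned above live under `TargetMetadata` in the task settings JSON. A rough boto3 sketch of how they could be applied when creating the task; all ARNs, identifiers and the table mapping are placeholders, not my actual setup:

```python
import json

import boto3

dms = boto3.client("dms")

# The relevant knobs live under TargetMetadata in the task settings JSON.
task_settings = {
    "TargetMetadata": {
        "ParallelLoadThreads": 130,
        "ParallelLoadBufferSize": 1000,
    }
}

# Minimal full-load task definition; ARNs and the table mapping are placeholders.
dms.create_replication_task(
    ReplicationTaskIdentifier="rds-to-dynamodb-full-load",
    SourceEndpointArn="arn:aws:dms:eu-west-1:XXX:endpoint/source",
    TargetEndpointArn="arn:aws:dms:eu-west-1:XXX:endpoint/target",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:XXX:rep/instance",
    MigrationType="full-load",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "1",
            "object-locator": {"schema-name": "public", "table-name": "my_table"},
            "rule-action": "include",
        }]
    }),
    ReplicationTaskSettings=json.dumps(task_settings),
)
```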
The results were the following:
| Metric | DMS RDS to DynamoDB | DMS RDS to S3 | Glue S3 to DynamoDB |
|---|---|---|---|
| RDS CPU | 4% | 11% | - |
| RDS Read IOPS | 100 | 350 | - |
| RDS Connections | 6 | 6 | - |
| DMS CPU | 74% | 11% | - |
| DynamoDB WCU | 20000 | - | 20000 |
| Time | 2h 51m | 54m | 3h 7m |
Direct migration from AWS RDS to AWS DynamoDB using AWS DMS took 2h 51m.
Staged migration with RDS -> S3 Parquet -> DynamoDB took 54m + 3h 7m = 4h 1m.
In the end, plain DMS was faster than the DMS + S3 + Glue trio.
As you can see in the table above, in both approaches the pressure on the AWS RDS PostgreSQL instance was minimal.
There are still some optimisations I want to try, like speeding up the import to DynamoDB by setting the WCU even higher. That will require some AWS DMS task parameter tuning and probably increasing the DMS instance size as well.
DMS
Pros of using DMS are:
- it's fast, even with `PutItem` operations
- one-stop shop for your migration tasks
- can run additional validation logic for your migration
Cons of using DMS are:
- need to rightsize the instance and tune task parameters to max DynamoDB writes
S3 Parquet and Glue
Pros of using S3 Parquet and Glue:
- serverless solution - no need to manage the infrastructure
- didn't have to do any job tuning; setting `dynamodb.throughput.write.percent` to 1.0 was all I had to do to max out DynamoDB writes
- you get some bonuses, think: data lake!
Cons of using S3 Parquet and Glue:
- slower than native DMS migration
Bonus #1
Since I already had the data in AWS S3 in Parquet format, I wanted to play around with AWS Athena. I had used Athena before and already knew it was super fast, but I had never had a chance to play around with a 101M-record, 20-column-wide data set. The experience was mind-blowing. Complex queries ran in just a few seconds, while the same queries on the original AWS RDS PostgreSQL instance took 20 minutes to complete!
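For completeness, assuming the exported Parquet files are already catalogued (e.g. via a Glue crawler), querying them with Athena from Python is just a few boto3 calls. The database, table, query and output bucket below are hypothetical:

```python
import time

import boto3

athena = boto3.client("athena")

# Hypothetical aggregate query over the catalogued Parquet data.
query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM migration_db.records GROUP BY status",
    QueryExecutionContext={"Database": "migration_db"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-migration-bucket/athena-results/"},
)

# Poll until the query finishes, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query["QueryExecutionId"])
    status = state["QueryExecution"]["Status"]["State"]
    if status in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if status == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query["QueryExecutionId"])["ResultSet"]["Rows"]
```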
Bonus #2
Since I already had the data in an AWS Athena database, I wanted to go one step further and easily visualise it using AWS QuickSight. It took me a few minutes to set up some pie charts and heat maps and publish my dashboard, all by dragging and dropping columns in the QuickSight UI.
Top comments (2)
Hi. Can you post your full job script here? Just erase or put dummy data in the details. Thank you!
Hi @kimsean ,
Sorry, I was on annual leave and wasn't checking my emails and notifications. No problem, here is my code; I have replaced any account-specific details with XXX. I also removed the automatically generated comments to keep it short.