Why You Can't (yet) Deprecate Write Sharding on DynamoDB

Disclaimer: I'm not affiliated with AWS and this is not technical advice.

Motivation

A couple of months ago, while revisiting DynamoDB's documentation, this piece caught my attention:

Very brief documentation

Immediately after that, I remembered the countless times I've watched Rick Houlihan's talks, especially the parts where he explains how to use write sharding in your DynamoDB table designs to maximize throughput by avoiding hot partitions.

It was clear to me that this piece of documentation described a feature that presumably addressed the same issue as write sharding (hot partitions), but in an adaptive way. Although the documentation doesn't tell us much, it does make a bold promise: to automatically repartition your items across as many nodes as needed, down to a single-item partition. It looked like a very handy feature, since hot partitions are a real concern for highly scalable applications, and something you really need to consider before committing to a table design. Also, not having to deal with write sharding means not pushing that complexity down the pipe and forcing the application layer to handle it.

Most importantly, write sharding strategies increase your throughput capacity consumption. Since this "auto-split" feature comes at no cost, making use of it means saving money.

Naturally, an adaptive solution at the service layer calls for a loosely coupled, fault-tolerant system (which is not necessarily the case for write sharding, where the splitting happens on your side, under your control). But then again, is that really a problem? The issue we're trying to address concerns applications running at high scale, and those are most likely built on top of such best practices anyway.

Too good to be true?

This whole thing got me intrigued. How come this feature exists and no one talks about it? So I started asking questions:

Aside from Kirk Kirkconnell's answers, which confirmed that the feature existed and that an upcoming documentation overhaul would make things clearer, no one else had an actual answer. In a recent tweet, I even tried to tease Rick Houlihan himself, but had no luck there.

edit: A while later, Rick attentively answered all my questions. Thanks, Rick!

At this point, I was already getting paranoid.

Paranoid

I needed to do something about it.

Tests

I finally dedicated some time to build a CDK app to test this. If you're interested, check out this blog post for a walkthrough.

Considerations

A few things to consider before load testing a DynamoDB table:

  • There's a default table-level limit of 40k WCUs and 40k RCUs.
  • Tables running in On-Demand capacity mode start with an initial "previous peak" value of 2k WCUs or 6k RCUs.
  • The above means the initial throughput limit for an On-Demand table is 4k WCUs or 12k RCUs, or a linear combination of the two (e.g. 0 WCUs and 12k RCUs; see the sketch below).
  • A new "previous peak" is established approximately 30 minutes after a new peak is reached, effectively doubling the throughput limit.
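To make the "linear combination" bullet concrete, here's a minimal sketch of how such a mix could be checked. The helper name and the normalized-shares model are my own assumptions, not anything exposed by the DynamoDB API:

```typescript
// Hypothetical helper: checks whether a read/write mix fits the initial
// On-Demand table limit (4k WCUs / 12k RCUs, linearly combined). The
// normalized-shares model below is my assumption, not a documented API.
const INITIAL_WCU_LIMIT = 4000;
const INITIAL_RCU_LIMIT = 12000;

function fitsInitialOnDemandLimit(wcu: number, rcu: number): boolean {
  // A mix fits if its normalized shares sum to at most 1,
  // e.g. 2000 WCUs + 6000 RCUs => 0.5 + 0.5 = 1.0 (exactly at the limit).
  return wcu / INITIAL_WCU_LIMIT + rcu / INITIAL_RCU_LIMIT <= 1;
}

console.log(fitsInitialOnDemandLimit(0, 12000));   // true: all reads
console.log(fitsInitialOnDemandLimit(2000, 6000)); // true: half and half
console.log(fitsInitialOnDemandLimit(4000, 6000)); // false: over the limit
```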

It's also worth spelling out exactly what we are testing. We want to test the hypothesis that we can perform read and write operations on a single table, using a single DynamoDB Partition Key, at a sustained throughput higher than the advertised per-partition limit (1k WCUs / 3k RCUs).
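The actual load generator is in the CDK app linked above; as a rough idea of what a single write worker does, here's a minimal sketch using the AWS SDK v3. The table name, key names, and burst loop are assumptions for illustration:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { randomUUID } from "crypto";

// Hypothetical names: the real table and key schema live in the CDK app.
const TABLE_NAME = "load-test-table";
const HOT_PK = "hot-partition";

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fire `items` writes against the same Partition Key. Every item shares
// the PK, so all the load lands on a single DynamoDB partition (or on
// whatever set of nodes it gets split into).
async function writeBurst(items: number): Promise<void> {
  await Promise.all(
    Array.from({ length: items }, () =>
      client.send(
        new PutCommand({
          TableName: TABLE_NAME,
          Item: { pk: HOT_PK, sk: randomUUID(), payload: "x" },
        })
      )
    )
  );
}

// One worker: a burst of writes every `intervalMs` until the duration ends.
async function runWorker(durationSecs: number, items: number, intervalMs: number) {
  const deadline = Date.now() + durationSecs * 1000;
  while (Date.now() < deadline) {
    await writeBurst(items); // throttled writes surface as rejected promises here
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

runWorker(600, 300, 1000).catch(console.error);
```

A read worker would look the same, issuing `GetCommand`s against items under the same Partition Key.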

Smoke Testing

The first thing I wanted to test was whether this feature existed at all. And I figured that if it did exist, it would have to at least support reads.

In order to make read requests, I needed to populate the table. So, I triggered the execution of an insert batch, making sure the load didn't reach the throughput limit of the partition.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 3 units | 600 secs | 300 items | 1 sec | 900 WCU | 540k WCU |

Insert Batch Completed in Step Functions

DynamoDB Write Capacity Widget

With the items in place, I ran another batch, this time with a target throughput of about 6k RCUs, which is twice the partition limit:

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 20 units | 600 secs | 300 items | 0.8 sec | 6000 RCU | 3600k RCU |

DynamoDB Read Capacity Widget

As you can see, we could reach the target throughput. This is our custom CloudWatch Dashboard Widget after performing both steps of this test:

CloudWatch Dashboard Widget

However, we did experience some throttling: the DynamoDB metrics show that around 1.40% of reads were throttled.

DynamoDB Throttled Read Events Widget

Since the node limit is 3k RCU, populating the table at around 900 WCU might have split our data into two nodes, allowing us to reach 6k RCUs.

Knowing how DynamoDB throughput limits work at the table level, I thought that maybe we had reached a new plateau at 6k RCUs. That would explain the marginal throttling rate.

Smoke Test: Higher Load

Then I ran another test, starting again with a new table populated within the partition limits. This time, though, I tried to reach something just below 10k RCUs. My goal here was to find another plateau and establish a pattern in the feature's behaviour.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 33 units | 600 secs | 300 items | 0.9 sec | 9900 RCU | 5940k RCU |

DynamoDB Widgets Reads and Throttles

Throttled at 6k RCU, then reached the new peak.

As suspected, we had indeed reached a plateau at 6k RCUs. It took around 8 minutes for the auto-split to kick in and repartition our data, after which we could reach another peak and run the remaining 2 minutes without throttling.

CloudWatch Dashboard

Surpassing the 6k plateau, peaking at around 9k RCU throttle-free.

Now let's see what happens when trying to read at 12k RCUs on the same table. The idea is to test for a new plateau and see if the feature can handle a sustained throughput at around the previous peak without throttling.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 42 units | 600 secs | 300 items | 0.9 sec | 12600 RCU | 7200k RCU |

Just as we had guessed, whatever the "auto-split" feature did, it stuck around. A relatively small throttle count occurred when the throughput surpassed the 12k plateau, either because of the On-Demand initial table limit or because we had reached yet another plateau:

CloudWatch Dashboard

No throttling at 6k RCU

Testing the Plateau Pattern in a Real-World Scenario

Now that we have a better hypothesis about how this feature works, let's try a real-world scenario. Imagine that an application needs to support the case where a new Partition Key suddenly starts receiving a large amount of reads and writes. Since it's a new partition, there's no chance it was already re-partitioned.

Our write workload will target 2.5k WCUs, and our read workload will target 6k RCUs:

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 5 units | 600 secs | 500 items | 1 sec | 2500 WCU | 1500k WCU |
| 20 units | 600 secs | 300 items | 0.9 sec | 6000 RCU | 3600k RCU |

Throttling fest!

As you can see, many bad things happened here. As we might have guessed, the re-partitioning isn't instantaneous, and it's unclear what exactly triggers it.

Throttled Events vs Consumed WCUs

Throttled Events vs Consumed RCUs

It's also interesting to see that the throughput capacity reaches new plateaus at seemingly random moments, as if the feature didn't follow a time-based pattern.

Another Test Case

Now let's imagine an application that needs its partitions' throughput capacity to grow steadily over time. Based on the tests performed so far, I'll run a test that simultaneously reads and writes to a new partition at an initial velocity of 100 WCUs / 300 RCUs, accelerating by 15% every minute.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 5 units | 1200 secs | 20-277 items | 1 sec | 100-1385 WCU | 600k WCU |
| 20 units | 1200 secs | 15 items | 1 sec | 300-4230 RCU | 1830k RCU |
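For reference, here's the ramp schedule reconstructed as code. The step timing is my own assumption, which is probably why the endpoint doesn't exactly match the table above:

```typescript
// My reconstruction of the ramp: start at an initial rate and grow 15%
// every minute, compounding.
function targetThroughput(initial: number, elapsedSecs: number): number {
  const minutes = Math.floor(elapsedSecs / 60);
  return Math.round(initial * Math.pow(1.15, minutes));
}

console.log(targetThroughput(100, 0));    // 100 WCUs at the start
console.log(targetThroughput(300, 0));    // 300 RCUs at the start
console.log(targetThroughput(100, 1140)); // ~1423 WCUs in the final minute
```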

Still throttling...

Much better, but still some throttling.

Looking at all the tests so far, besides the fact that the splits seem to happen at random points within the test duration, we can see that the capacity always increases by 1k WCUs each time a new plateau is reached. This behaviour makes me think that, regardless of how much throttling is happening, the feature acts by adding a single node to the partition.

Running another batch on the same table and partition, now steadily loading 6000 RCU + 2000 WCU (which is actually 4 times the initial partition capacity, since the operations run simultaneously), we can see that there's no throttling.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 5 units | 600 secs | 400 items | 1 sec | 2000 WCU | 1200k WCU |
| 20 units | 600 secs | 300 items | 1 sec | 6000 RCU | 3600k RCU |

No throttling!

It looks like the repartitioning is somewhat persistent.

Settling down

In the next test, we'll apply the exact same load as the previous batch, which is again 4 times the initial partition capacity. But this time, we'll sustain it for twice as long (20 minutes) against a new partition. The idea is to validate the behaviour we inferred from the last test. We'll use the same table, so the table limits don't mask the results.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 5 units | 1200 secs | 400 items | 1 sec | 2000 WCU | 1200k WCU |
| 20 units | 1200 secs | 300 items | 1 sec | 6000 RCU | 3600k RCU |

Random plateau changes

Again, as you can see in the graph above, the time it takes for the partition to be split is pretty random, and it certainly depends on many variables we cannot control.

To make things even clearer, we'll run another batch on the same table against a new partition, but now with a load of 4k WCUs for over an hour.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 10 units | 4800 secs | 400 items | 1 sec | 4000 WCU | 19200k WCU |

Plateau changes

Random intervals, 1k WCUs per change.

Test Conclusions

It looks like DynamoDB does, in fact, have a working auto-split feature for hot partitions. There are also reasons to believe that the split happens in response to high usage of throughput capacity on a single partition, and that it always happens by adding a single node, so the capacity increases by 1k WCUs / 3k RCUs each time. The split also appears to be persistent over time.
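To put that inference into code (this is my reading of the graphs above, not documented behaviour):

```typescript
// My inferred capacity model, based purely on the tests above (this is
// not documented DynamoDB behaviour): each split adds one node, and each
// node contributes 1k WCUs / 3k RCUs.
function inferredPartitionCapacity(nodes: number): { wcu: number; rcu: number } {
  return { wcu: nodes * 1000, rcu: nodes * 3000 };
}

console.log(inferredPartitionCapacity(1)); // { wcu: 1000, rcu: 3000 }: a fresh partition
console.log(inferredPartitionCapacity(2)); // { wcu: 2000, rcu: 6000 }: the first plateau we hit
```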

Final Thoughts

The "auto-split" feature seems to be a best effort to accommodate unpredicted loads that require more throughput capacity than available in the partition.

It's important to note that the feature is only very briefly described in the official docs, and there's no promise about how it actually performs. If I had to guess, I'd say it's still being worked on. My instinct says it'll be improved and launched as an "auto-sharding" feature once it's mature enough.

In the meantime, though, since this clearly isn't a replacement for write sharding, and considering that the auto-split has proven to be persistent over time, is there any other way we could make use of it?

For some applications that have to make use of write sharding to increase the throughput capacity of their partitions, here is an architecture suggestion:

Lazy Sharding
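To sketch the idea: write to the natural, unsharded key by default, and only fan out to sharded keys while DynamoDB is throttling, giving the auto-split time to catch up. Everything below (names, shard count, fallback policy) is a hypothetical illustration, not the exact design from the diagram:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

// Hypothetical sketch of "lazy sharding": write to the natural key by
// default and only fall back to sharded keys while DynamoDB is throttling,
// giving the auto-split time to absorb the load. Names, shard count, and
// fallback policy are all my assumptions.
const SHARD_COUNT = 10;
const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function lazyShardedPut(
  tableName: string,
  pk: string,
  item: Record<string, unknown>
): Promise<void> {
  try {
    // Happy path: a single unsharded key; auto-split handles sustained load.
    await client.send(
      new PutCommand({ TableName: tableName, Item: { pk, ...item } })
    );
  } catch (err) {
    if ((err as { name?: string }).name !== "ProvisionedThroughputExceededException") {
      throw err;
    }
    // Throttled: retry against a random shard suffix until the split catches up.
    // Readers would then have to fan out across `pk` and all `pk#<n>` keys.
    const shard = Math.floor(Math.random() * SHARD_COUNT);
    await client.send(
      new PutCommand({ TableName: tableName, Item: { pk: `${pk}#${shard}`, ...item } })
    );
  }
}
```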

For more details and hopefully a discussion about this suggestion, check this post.

Thanks for reading!

--
André
