Why You Can't (yet) Deprecate Write Sharding on DynamoDB

Disclaimer: I'm not affiliated with AWS and this is not technical advice.

Motivation

A couple of months ago, while revisiting DynamoDB's documentation, this piece caught my attention:

Very brief documentation

Immediately after that, I remembered the countless times I've watched Rick Houlihan's talks, especially the parts where he explains how to use write sharding in your DynamoDB table designs to maximize throughput by avoiding hot partitions.

It was clear to me that this piece of documentation described a feature that presumably addressed the same issue as write sharding (hot partitions), but in an adaptive way. Although the documentation doesn't tell us much, it does make a bold promise: to automatically repartition your items across as many nodes as needed, down to a single-item partition. It looked like a very handy feature, since hot partitions are a real concern for highly scalable applications, and something you really need to consider before committing to a table design. Also, not having to deal with write sharding means not pushing that complexity down the pipe and forcing the application layer to handle it.

Most importantly, write sharding strategies increase your throughput capacity consumption. Since this "auto-split" feature comes at no cost, making use of it means saving money.

Naturally, an adaptive solution at the service layer calls for a loosely coupled, fault-tolerant system (which is not necessarily the case for write sharding, where the splitting happens on your side, under your control). But then again, is that really a problem? The issue we're trying to address concerns applications running at high scale, and those are most likely built on top of such best practices anyway.

Too good to be true?

This whole thing got me intrigued. How come this feature exists and no one talks about it? So I started asking questions:

Aside from Kirk Kirkconnell's answers, which confirmed that the feature existed and that an upcoming documentation overhaul would make things clearer, no one else had an actual answer. In a recent tweet, I even tried to tease Rick Houlihan himself, but had no luck there.

edit: A while later, Rick attentively answered all my questions. Thanks, Rick!

At this point, I was already getting paranoid.

Paranoid

I needed to do something about it.

Tests

I finally dedicated some time to build a CDK app to test this. If you're interested, check out this blog post for a walkthrough.

Considerations

A few things to consider before load testing a DynamoDB table:

  • There's a default table-level limit of 40k WCUs and 40k RCUs.
  • Tables running in On-Demand capacity mode start with an initial "previous peak" value of 2k WCUs or 6k RCUs.
  • The above means the initial throughput limit for an On-Demand table is 4k WCUs or 12k RCUs, or a linear combination of the two (e.g. 0 WCUs and 12k RCUs; see the sketch below).
  • A new "previous peak" is established approximately 30 minutes after a new peak is reached, effectively doubling the throughput limit.
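To make the "linear combination" bullet concrete, here's a minimal sketch of how such a mix could be checked. The helper name and the normalized-shares model are my own assumptions, not anything exposed by the DynamoDB API:

```typescript
// Hypothetical helper: checks whether a read/write mix fits the initial
// On-Demand table limit (4k WCUs / 12k RCUs, linearly combined). The
// normalized-shares model below is my assumption, not a documented API.
const INITIAL_WCU_LIMIT = 4000;
const INITIAL_RCU_LIMIT = 12000;

function fitsInitialOnDemandLimit(wcu: number, rcu: number): boolean {
  // A mix fits if its normalized shares sum to at most 1,
  // e.g. 2000 WCUs + 6000 RCUs => 0.5 + 0.5 = 1.0 (exactly at the limit).
  return wcu / INITIAL_WCU_LIMIT + rcu / INITIAL_RCU_LIMIT <= 1;
}

console.log(fitsInitialOnDemandLimit(0, 12000));   // true: all reads
console.log(fitsInitialOnDemandLimit(2000, 6000)); // true: half and half
console.log(fitsInitialOnDemandLimit(4000, 6000)); // false: over the limit
```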

It's also worth spelling out exactly what we are testing. We want to test the hypothesis that we can perform read and write operations on a single table, using a single DynamoDB Partition Key, at a sustained throughput higher than the advertised per-partition limit (1k WCUs / 3k RCUs).
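The actual load generator is in the CDK app linked above; as a rough idea of what a single write worker does, here's a minimal sketch using the AWS SDK v3. The table name, key names, and burst loop are assumptions for illustration:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";
import { randomUUID } from "crypto";

// Hypothetical names: the real table and key schema live in the CDK app.
const TABLE_NAME = "load-test-table";
const HOT_PK = "hot-partition";

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Fire `items` writes against the same Partition Key. Every item shares
// the PK, so all the load lands on a single DynamoDB partition (or on
// whatever set of nodes it gets split into).
async function writeBurst(items: number): Promise<void> {
  await Promise.all(
    Array.from({ length: items }, () =>
      client.send(
        new PutCommand({
          TableName: TABLE_NAME,
          Item: { pk: HOT_PK, sk: randomUUID(), payload: "x" },
        })
      )
    )
  );
}

// One worker: a burst of writes every `intervalMs` until the duration ends.
async function runWorker(durationSecs: number, items: number, intervalMs: number) {
  const deadline = Date.now() + durationSecs * 1000;
  while (Date.now() < deadline) {
    await writeBurst(items); // throttled writes surface as rejected promises here
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

runWorker(600, 300, 1000).catch(console.error);
```

A read worker would look the same, issuing `GetCommand`s against items under the same Partition Key.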

Smoke Testing

The first thing I wanted to test was whether this feature existed at all. And I figured that if it did exist, it would have to at least support reads.

In order to make read requests, I needed to populate the table. So, I triggered the execution of an insert batch, making sure the load didn't reach the throughput limit of the partition.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 3 units | 600 secs | 300 items | 1 sec | 900 WCU | 540k WCU |

Insert Batch Completed in Step Functions

DynamoDB Write Capacity Widget

With the items in place, I ran another batch, this time with a target throughput of about 6k RCUs, which is twice the partition limit:

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 20 units | 600 secs | 300 items | 0.8 sec | 6000 RCU | 3600k RCU |

DynamoDB Read Capacity Widget

As you can see, we could reach the target throughput. This is our custom CloudWatch Dashboard Widget after performing both steps of this test:

CloudWatch Dashboard Widget

However, we did experience some throttling: the DynamoDB metrics show that around 1.40% of reads were throttled.

DynamoDB Throttled Read Events Widget

Since the node limit is 3k RCU, populating the table at around 900 WCU might have split our data into two nodes, allowing us to reach 6k RCUs.

Knowing how DynamoDB throughput limits work at the table level, I thought that maybe we had reached a new plateau at 6k RCUs. That would explain the marginal throttling rate.

Smoke Test: Higher Load

Then I ran another test, starting again with a new table populated within the partition limits. This time, though, I tried to reach something just below 10k RCUs. My goal here was to find another plateau and establish a pattern in the feature's behaviour.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 33 units | 600 secs | 300 items | 0.9 sec | 9900 RCU | 5940k RCU |

DynamoDB Widgets Reads and Throttles

Throttled at 6k RCU, then reached the new peak.

As suspected, we had indeed reached a plateau at 6k RCUs. It took around 8 minutes for the auto-split to kick in and repartition our data, after which we could reach another peak and run the remaining 2 minutes without throttling.

CloudWatch Dashboard

Surpassing the 6k plateau, peaking at around 9k RCU throttle-free.

Now let's see what happens when trying to read at 12k RCUs on the same table. The idea is to test for a new plateau and see if the feature can handle a sustained throughput at around the previous peak without throttling.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 42 units | 600 secs | 300 items | 0.9 sec | 12600 RCU | 7200k RCU |

Just as we had guessed, whatever the "auto-split" feature did, it stuck around. A relatively small throttle count occurred when the throughput surpassed the 12k plateau, either because of the On-Demand initial table limit or because we had reached yet another plateau:

CloudWatch Dashboard

No throttling at 6k RCU

Testing the Plateau Pattern in a Real-World Scenario

Now that we have a better hypothesis about how this feature works, let's try a real-world scenario. Imagine that an application needs to support the case where a new Partition Key suddenly starts receiving a large amount of reads and writes. Since it's a new partition, there's no chance it was already re-partitioned.

Our write workload will target 2.5k WCUs, and our read workload will target 6k RCUs:

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 5 units | 600 secs | 500 items | 1 sec | 2500 WCU | 1500k WCU |
| 20 units | 600 secs | 300 items | 0.9 sec | 6000 RCU | 3600k RCU |

Throttling fest!

As you can see, many bad things happened here. As we might have guessed, the re-partitioning isn't instantaneous, and it's unclear what exactly triggers it.

Throttled Events vs Consumed WCUs

Throttled Events vs Consumed RCUs

It's also interesting to see that the throughput capacity reaches new plateaus at seemingly random moments, as if the feature didn't follow a time-based pattern.

Another Test Case

Now let's imagine an application that needs its partitions' throughput capacity to grow steadily over time. Based on the tests performed so far, I'll run a test that simultaneously reads and writes to a new partition at an initial velocity of 100 WCUs / 300 RCUs, accelerating by 15% every minute.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 5 units | 1200 secs | 20-277 items | 1 sec | 100-1385 WCU | 600k WCU |
| 20 units | 1200 secs | 15 items | 1 sec | 300-4230 RCU | 1830k RCU |
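For reference, here's the ramp schedule reconstructed as code. The step timing is my own assumption, which is probably why the endpoint doesn't exactly match the table above:

```typescript
// My reconstruction of the ramp: start at an initial rate and grow 15%
// every minute, compounding.
function targetThroughput(initial: number, elapsedSecs: number): number {
  const minutes = Math.floor(elapsedSecs / 60);
  return Math.round(initial * Math.pow(1.15, minutes));
}

console.log(targetThroughput(100, 0));    // 100 WCUs at the start
console.log(targetThroughput(300, 0));    // 300 RCUs at the start
console.log(targetThroughput(100, 1140)); // ~1423 WCUs in the final minute
```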

Still throttling...

Much better, but still some throttling.

Looking at all the tests so far, besides the fact that the splits seem to happen at random points within the test duration, we can see that the capacity always increases by 1k WCUs each time a new plateau is reached. This behaviour makes me think that, regardless of how much throttling is happening, the feature acts by adding a single node to the partition.

Running another batch on the same table and partition, now steadily loading 6000 RCU + 2000 WCU (which is actually 4 times the initial partition capacity, since the operations run simultaneously), we can see that there's no throttling.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 5 units | 600 secs | 400 items | 1 sec | 2000 WCU | 1200k WCU |
| 20 units | 600 secs | 300 items | 1 sec | 6000 RCU | 3600k RCU |

No throttling!

It looks like the repartitioning is somewhat persistent.

Settling down

In the next test, we'll apply the exact same load as the previous batch, which is again 4 times the initial partition capacity. But this time, we'll sustain it for twice as long (20 minutes) against a new partition. The idea is to validate the behaviour we inferred from the last test. We'll use the same table, so the table limits don't mask the results.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 5 units | 1200 secs | 400 items | 1 sec | 2000 WCU | 1200k WCU |
| 20 units | 1200 secs | 300 items | 1 sec | 6000 RCU | 3600k RCU |

Random plateau changes

Again, as you can see in the graph above, the time it takes for the partition to be split is pretty random, and it certainly depends on many variables we cannot control.

To make things even clearer, we'll run another batch on the same table against a new partition, but now with a load of 4k WCUs for over an hour.

| Workers | Duration | Load | Interval | Target Throughput | Total Capacity |
| --- | --- | --- | --- | --- | --- |
| 10 units | 4800 secs | 400 items | 1 sec | 4000 WCU | 19200k WCU |

Plateau changes

Random intervals, 1k WCUs per change.

Test Conclusions

It looks like DynamoDB does, in fact, have a working auto-split feature for hot partitions. There are also reasons to believe that the split happens in response to high usage of throughput capacity on a single partition, and that it always happens by adding a single node, so the capacity increases by 1k WCUs / 3k RCUs each time. The split also appears to be persistent over time.
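To put that inference into code (this is my reading of the graphs above, not documented behaviour):

```typescript
// My inferred capacity model, based purely on the tests above (this is
// not documented DynamoDB behaviour): each split adds one node, and each
// node contributes 1k WCUs / 3k RCUs.
function inferredPartitionCapacity(nodes: number): { wcu: number; rcu: number } {
  return { wcu: nodes * 1000, rcu: nodes * 3000 };
}

console.log(inferredPartitionCapacity(1)); // { wcu: 1000, rcu: 3000 }: a fresh partition
console.log(inferredPartitionCapacity(2)); // { wcu: 2000, rcu: 6000 }: the first plateau we hit
```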

Final Thoughts

The "auto-split" feature seems to be a best effort to accommodate unpredicted loads that require more throughput capacity than available in the partition.

It's important to note that the feature is only very briefly described in the official docs, and there's no promise about how it actually performs. If I had to guess, I'd say it's still being worked on. My instinct says it'll be improved and launched as an "auto-sharding" feature once it's mature enough.

In the meantime, though, since this clearly isn't a replacement for write sharding, and considering that the auto-split has proven to be persistent over time, is there any other way we could make use of it?

For some applications that have to make use of write sharding to increase the throughput capacity of their partitions, here is an architecture suggestion:

Lazy Sharding
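To sketch the idea: write to the natural, unsharded key by default, and only fan out to sharded keys while DynamoDB is throttling, giving the auto-split time to catch up. Everything below (names, shard count, fallback policy) is a hypothetical illustration, not the exact design from the diagram:

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, PutCommand } from "@aws-sdk/lib-dynamodb";

// Hypothetical sketch of "lazy sharding": write to the natural key by
// default and only fall back to sharded keys while DynamoDB is throttling,
// giving the auto-split time to absorb the load. Names, shard count, and
// fallback policy are all my assumptions.
const SHARD_COUNT = 10;
const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

async function lazyShardedPut(
  tableName: string,
  pk: string,
  item: Record<string, unknown>
): Promise<void> {
  try {
    // Happy path: a single unsharded key; auto-split handles sustained load.
    await client.send(
      new PutCommand({ TableName: tableName, Item: { pk, ...item } })
    );
  } catch (err) {
    if ((err as { name?: string }).name !== "ProvisionedThroughputExceededException") {
      throw err;
    }
    // Throttled: retry against a random shard suffix until the split catches up.
    // Readers would then have to fan out across `pk` and all `pk#<n>` keys.
    const shard = Math.floor(Math.random() * SHARD_COUNT);
    await client.send(
      new PutCommand({ TableName: tableName, Item: { pk: `${pk}#${shard}`, ...item } })
    );
  }
}
```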

For more details and hopefully a discussion about this suggestion, check this post.

Thanks for reading!

--
André
