
swyx

Originally published at swyx.io

5 Things I Learned from The DynamoDB Book

I've written before that reading books cover to cover is one of the best ways for intermediate devs to discover gaps in their knowledge and strengthen their grasp of design patterns.

Over the past couple of weeks I've had the opportunity to review Alex DeBrie's new The DynamoDB Book. It is excellent, and it sets a very high bar for accessible technical writing on an abstract and complex subject.

My Context

I've never used DynamoDB in production, so I am not qualified to assess technical merit - there are other experts to do that. I write in order to share my experience of the book as a DynamoDB beginner.

I'm not new to NoSQL - I've actually spent more time with Fauna, Firebase and MongoDB than I have with Postgres, MySQL and SQLite - but NoSQL kind of follows a perversion of the Anna Karenina principle:

"All SQL DBs are alike; each NoSQL DB is NoSQL in its own way."

My interest is personal though - my next job will use DynamoDB extensively, and in any case I expect that knowing foundational AWS services will be relevant for the rest of my life, and therefore a very good use of time.

Structure

This 450-page book has 22 chapters. There's no explicit grouping of them, but here are my unofficial groupings:

I write these details down to emphasize the ridiculous amount of thought put into this book. A lesser book would stop at Chapter 9 - mostly explaining factual statements in a more accessible way than the official docs, and then ending with some advice.

But, in fact, the majority of the book comes in Chapters 10-22 - chock-full of hard-won advice and worked examples for you to apply what you just learned, designed to prepare you for every real-world scenario. This truly goes above and beyond, and turns the book from a read-once-and-discard deal into a reusable reference tome you will consult for the entirety of your DynamoDB usage.

5 Things I Learned

There's way too much to write down, but I figured I should force myself to make some notes to process in public for myself and others.

1. Why Generic PK and SK names

DynamoDB is marketed as a "key value store". The keys are split into Partition Keys (PK) and Sort Keys (SK) which helps DynamoDB scale behind the scenes, but also opens up some query patterns that let you do a lot more than simple key value lookup.

Because of the benefits of Single Table Design, we overload the meaning of PKs and SKs to accommodate multiple domain objects, and use {TYPE}#{ID} conventions within the key rather than designating a fixed type for each key. A single item collection can contain SKs of both ORG#ORGNAME and USER#USERNAME, which lets you fetch both in a single query.
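Here's a minimal sketch of what that looks like, assuming a hypothetical table named AppTable with generic PK and SK attributes, using the AWS SDK v2 DocumentClient:

```ts
import { DynamoDB } from "aws-sdk";

const client = new DynamoDB.DocumentClient();

// One item collection (everything sharing PK = "ORG#ACME"):
//   PK         SK           ...attributes
//   ORG#ACME   ORG#ACME     the organization itself
//   ORG#ACME   USER#ALICE   a user in that organization
//   ORG#ACME   USER#BOB     another user
//
// A single Query on the PK returns the org item AND its user items.
async function getOrgWithUsers(orgName: string) {
  const result = await client
    .query({
      TableName: "AppTable", // hypothetical table name
      KeyConditionExpression: "PK = :pk",
      ExpressionAttributeValues: { ":pk": `ORG#${orgName}` },
    })
    .promise();
  return result.Items; // [org, alice, bob]
}
```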

If this makes you queasy, I haven't fully made peace with it either. It's basically the developer contorting themselves to fit DynamoDB's abstraction leak. I rationalize it by regarding DynamoDB as a low-level tool - it is closer to a linear memory address register than to a DB.

Think about it - DynamoDB promises single-digit millisecond latency, but in exchange you have to be hyperaware of which address you are slotting your data into, and manage it carefully. That's just like dealing with low-level memory!

With a low-level tool, you understand that you need to manage more of its details, but in exchange it gives you more performance than anything else can. This is backed up by Alex quoting Forrest Brazeal:

A well-optimized single-table DynamoDB layout looks more like machine code than a simple spreadsheet.

You could build a DB layer atop DynamoDB that abstracts away these PK/SK shenanigans, but you run the risk of making very inefficient queries because you have no control over (or knowledge of) the data modeling. Maybe you care, maybe you don't (maybe you're no worse off than with the inefficient queries you'd make in SQL). There's a whole section in the book called "Don't use an ORM", though Jeremy Daly's DynamoDB Toolbox and AWS' Document Client are OK. But it is clear that for stable data access patterns (e.g. you intend to run Amazon.com until the heat death of the universe), taking over the low-level PK/SK modeling details of DynamoDB will yield the best possible results.

2. Why Global Secondary Indexes

There are two types of Secondary Indexes in DynamoDB - Local and Global (aka LSI and GSI). LSIs can use the same PKs as the main table, but a different SK, whereas GSIs can use any attributes for both PK and SK.

  • LSI - Pros: option for strongly-consistent reads. Cons: must use the same PK as the table; must be created when the table is created.
  • GSI - Pros: PK flexibility; creation-time flexibility. Cons: eventual consistency; needs additional throughput provisioning.
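As a rough sketch, here's what a table with one GSI might look like using the AWS SDK v2 (hypothetical table name and hypothetical GSI1PK/GSI1SK attributes; the same index could also be added later with UpdateTable):

```ts
import { DynamoDB } from "aws-sdk";

const dynamodb = new DynamoDB();

// The GSI keys on any attributes you choose (here GSI1PK/GSI1SK), unlike
// an LSI, which must reuse the table's PK.
dynamodb
  .createTable({
    TableName: "AppTable",
    BillingMode: "PAY_PER_REQUEST",
    AttributeDefinitions: [
      { AttributeName: "PK", AttributeType: "S" },
      { AttributeName: "SK", AttributeType: "S" },
      { AttributeName: "GSI1PK", AttributeType: "S" },
      { AttributeName: "GSI1SK", AttributeType: "S" },
    ],
    KeySchema: [
      { AttributeName: "PK", KeyType: "HASH" },
      { AttributeName: "SK", KeyType: "RANGE" },
    ],
    GlobalSecondaryIndexes: [
      {
        IndexName: "GSI1",
        KeySchema: [
          { AttributeName: "GSI1PK", KeyType: "HASH" },
          { AttributeName: "GSI1SK", KeyType: "RANGE" },
        ],
        Projection: { ProjectionType: "ALL" },
      },
    ],
  })
  .promise();
```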

Given the importance of flexibility over strong consistency, it's clear why GSIs are so much more popular than LSIs. I don't have any numbers, but I also vaguely recall seeing on the Twitterverse that GSI replication delays are very rarely a problem.

I wonder if AWS publishes p99 GSI replication numbers.

3. KSUIDs for Unique, Sortable IDs

K-Sortable Unique Identifiers (KSUIDs) are a modification of the UUID standard by the fine folks at Segment that encodes a timestamp while retaining chronological ordering when sorted as a string.

  • Here's a UUIDv4: 96fb6bdc-7507-4879-997f-8978e0ba0e68
  • Here's a KSUID: 1YnlHOfSSk3DhX4BR6lMAceAo1V

The benefit of using a KSUID compared to a UUID is that KSUIDs are lexicographically sortable. KSUIDs embed a timestamp, which you can decode and sort if you have a KSUID implementation handy - but even if you simply sort the generated IDs as strings, they come out in chronological order (without any knowledge of how to decode KSUIDs!).

This feature makes KSUIDs ideal as unique identifiers in DynamoDB keys, where you can use a key condition expression like #id BETWEEN :start AND :end, with :start and :end representing the starting and ending IDs of the range you want to query.
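A quick sketch of both properties, assuming the `ksuid` npm package and a hypothetical table where order items have sort keys like ORDER#<ksuid>:

```ts
import KSUID from "ksuid"; // Segment's KSUID spec, via the `ksuid` npm package
import { DynamoDB } from "aws-sdk";

const client = new DynamoDB.DocumentClient();

// A plain string sort on KSUIDs is also a chronological sort - no decoding needed.
const ids = [KSUID.randomSync().string, KSUID.randomSync().string];
console.log([...ids].sort()); // oldest first

// Query all orders in a time window by generating boundary KSUIDs
// from the window's start and end timestamps.
async function ordersBetween(customerId: string, start: Date, end: Date) {
  return client
    .query({
      TableName: "AppTable", // hypothetical table name
      KeyConditionExpression: "PK = :pk AND SK BETWEEN :start AND :end",
      ExpressionAttributeValues: {
        ":pk": `CUSTOMER#${customerId}`,
        ":start": `ORDER#${KSUID.randomSync(start).string}`,
        ":end": `ORDER#${KSUID.randomSync(end).string}`,
      },
    })
    .promise();
}
```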

I don't know how widely KSUIDs are known given this idea was only released in 2017, but I think this is useful even beyond DynamoDB.

4. Sparse Indexes for Filtering

A sparse (secondary) index intentionally leaves certain items of your table out of the index in order to help satisfy a query (i.e. not merely as a side effect of PK/SK overloading).

When you write an item, DynamoDB only copies it to a secondary index if the item contains the attributes of that index's key schema. This is useful in two ways:

  • Using sparse indexes to provide a global filter on an item type
    • Example: You have a list of Organizations (PK), each Organization has Members (SK), and a few Members are Admins (their role represented as an attribute). You want to query for Admins without pulling ALL members of every organization. Setting up a sparse index that only includes Admins lets you quickly and efficiently query them - see the sketch after this list.
  • Using sparse indexes to project a single type of entity
    • Example: You have customers, orders, and inventory laid out in PK/SKs in a single table. You want to query for all customers only. Setting up a sparse index that only includes customers helps you do that.
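Here's a minimal sketch of the Admins example, assuming a hypothetical GSI1 index keyed on GSI1PK/GSI1SK attributes that only admin items carry:

```ts
import { DynamoDB } from "aws-sdk";

const client = new DynamoDB.DocumentClient();

async function putMember(org: string, user: string, role: string) {
  const item: Record<string, unknown> = {
    PK: `ORG#${org}`,
    SK: `USER#${user}`,
    Role: role,
  };
  if (role === "Admin") {
    // Only admins get the index key attributes, so only admins are copied
    // into GSI1; that is what makes the index sparse.
    item.GSI1PK = "ADMIN";
    item.GSI1SK = `ORG#${org}#USER#${user}`;
  }
  return client.put({ TableName: "AppTable", Item: item }).promise();
}

// Querying the sparse index returns admins only, never regular members.
async function getAllAdmins() {
  return client
    .query({
      TableName: "AppTable",
      IndexName: "GSI1",
      KeyConditionExpression: "GSI1PK = :pk",
      ExpressionAttributeValues: { ":pk": "ADMIN" },
    })
    .promise();
}
```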

I think you can regard this as a "filter" or "projection", depending on your functional/linear/relational algebra preferences :)

5. Normalization is OK

You'd think that normalization is anathema in NoSQL, but it is actually recommended as one of the many-to-many strategies in the book! The given example is Twitter - a user follows other users, who follow yet more users. A social network. If you denormalized and copied followed users' data everywhere, every tweet or profile change would cause a huge fan-out of writes, aka "write thrashing".

The recommended solution is to store the User as the PK, and then the SKs in that item collection include both the User item itself (again) and a FOLLOWING#<Username> item for each followed user. When a user wants to view the users they follow, we:

  • use the Query API to fetch the User's info and initial few users they follow
  • use the BatchGetItem API to fetch detailed User info for each user followed.

This means making multiple requests, but there is no way around it. This pattern is also applicable to e-commerce shopping carts.
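A rough sketch of the two-step read, assuming the same hypothetical AppTable and that a user's own profile item lives at PK = SK = USER#<Username>:

```ts
import { DynamoDB } from "aws-sdk";

const client = new DynamoDB.DocumentClient();

async function getFollowedUsers(username: string) {
  // Step 1: one Query returns the user's own item plus its FOLLOWING# items,
  // because they share the same partition key.
  const { Items = [] } = await client
    .query({
      TableName: "AppTable",
      KeyConditionExpression: "PK = :pk",
      ExpressionAttributeValues: { ":pk": `USER#${username}` },
    })
    .promise();

  const followed = Items.filter((i) => String(i.SK).startsWith("FOLLOWING#"));
  if (followed.length === 0) return [];

  // Step 2: BatchGetItem fetches the full profile item of each followed user.
  const { Responses } = await client
    .batchGet({
      RequestItems: {
        AppTable: {
          Keys: followed.map((i) => {
            const name = String(i.SK).replace("FOLLOWING#", "");
            return { PK: `USER#${name}`, SK: `USER#${name}` };
          }),
        },
      },
    })
    .promise();

  return Responses?.AppTable ?? [];
}
```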

Conclusion

I mean. The premium package costs $249 ($199 at launch) for now. If you were to hire Alex for even a consultation phone call or workshop, you'd pay at least 10x that. If you use or will use DynamoDB in any serious way, you will save a ton of time, pain, and money by going through this book.

Disclaimers

I reviewed a free prepublication draft of the book, which was provided unconditionally to me because I happened to respond to Alex's email newsletter about the book. I receive no other compensation for doing this.

Top comments (2)

Yan Cui (theburningmonk)

For no. 5, DynamoDB is not the right solution for storing relationships; it's just far too cost-inefficient as you scale up, especially as you usually fetch ALL of someone's followers every time (whenever they post, retweet, etc.). For social networks, 90% of your users won't have many followers, but there are always a few that have 1000s, 10s of 1000s or even millions of followers. That is as true for Twitter as for early-stage social networks. At the social network I worked at, we only had about 1m users, and at that point we already had users with over 50,000 followers.

If you have to use DynamoDB, you're better off putting the IDs of the users they follow into a list to maximize the utility of those read units - if you use KSUIDs as IDs, then a 4KB read unit can return ~150 followers with a single get request. And when you approach the 400KB item size limit, split the list across multiple items and use the SK to store some sort of hashing range. But this only gets you so far; at some point you just have to use a different data store for those power users with a huge number of followers to improve the cost efficiency of those read patterns.
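A rough sketch of that layout (hypothetical names; follower KSUIDs packed into a list attribute and sharded across items as they approach the 400KB limit):

```ts
import { DynamoDB } from "aws-sdk";

const client = new DynamoDB.DocumentClient();

// Hypothetical layout:
//   PK           SK            followerIds
//   USER#ALICE   FOLLOWERS#0   ["1YnlHOf...", ...]  <- one shard of follower KSUIDs
//   USER#ALICE   FOLLOWERS#1   [...]                <- next shard once the first fills up
async function getFollowerShard(username: string, shard: number) {
  const { Item } = await client
    .get({
      TableName: "AppTable",
      Key: { PK: `USER#${username}`, SK: `FOLLOWERS#${shard}` },
    })
    .promise();
  return (Item?.followerIds ?? []) as string[];
}
```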

swyx

right, thank you! I recall that in Martin Kleppmann's Designing Data-Intensive Applications he discussed a hybrid approach for Twitter, where it is "pull on demand" for power users and "fan-out" for normal users. So it's a different access pattern.

I don't know if that means abandoning DDB/NoSQL entirely - since if you tried to do this entirely in SQL, you also have a different set of issues! After all, wasn't Twitter partly responsible for the rise of NoSQL in the first place?