Single table design is a data modelling approach for NoSQL where you group different data items by partition key on the same table. When you use this approach with DynamoDB, Global Secondary Indexes (GSIs) are an important component to ensure data can be accessed efficiently.
I have put together a simple example to illustrate how and why you should use a GSI with a single table design. I will use a hands on example to show how they can improve latency and reduce cost.
My sample application will track the players of different sport teams and a count of players in each team. These are 2 access patterns that I will support:
1) List all sports teams and the count of players per team
2) List all players of a team plus a count of players in that team
In a traditional DBMS, you would store these items in two separate tables and run two operations to retrieve the data.
With a single table design, you can use the same table to store these items and retrieve both of them in one operation. To get the most from this article, it would be good to understand the 3 options DynamoDB supports for retrieving data. These options are:
getItem - returns a set of attributes for a given primary key. If no item is found, nothing is returned. Only returns one item at a time.
Query - returns all items for a given partition key. You can optionally filter down items by providing a sort key and a filterexpresion.
Scan - returns one or more items and their attributes by accessing every item in a table or secondary index.
These operations can be executed against both a table and any indexes associated with a table. As a rule of thumb, you want to avoid scan operations as they are very costly. Each item scanned counts towards your read capacity unit (RCU) quota. The more targeted you can make the operation, the less items will be read and the more efficient it will be. Therefore, getItem will always be the most efficient, followed by Query.
Without a GSI, this is how the data could look within our DynamoDB table. As you can see, there are 2 types of items in here. One item per player per team and then an aggregate item per team that stores the count of players.
Each team is grouped by the partition key, pk1, and all records for a team will be stored together allowing them to be retrieved quickly in a single query operation. This addresses our second access pattern.
However, to address our first access pattern without a scan, we would need to swap the values in the partition key and sort key fields.
But in this scenario you would need a scan operation to address the second access pattern.
To avoid using a scan for either operation, a GSI can be implemented. A GSI effectively creates a sidecar table with the data rewritten in a way to best support your secondary access pattern. GSIs don't need to be maintained separately and are populated by the same action that populates the table. Items are inserted, updated and deleted asynchronously to the table and GSIs have their own RCUs and WCUs specified separately. This means that operations executed against a GSI are isolated from the table operations.
For our example, creating a GSI with the sk1 and pk1 fields from scenario 1 reversed will group all of the count records (where sk1=0 from the table) on the same partition key and ensure the items can be accessed efficiently.
I can optimize this further by creating separate fields for the GSI partition key and sort key values on the table. I can then create a GSI based on these fields.
Now the GSI only contains items where the gsi_pk1 and gsi_sk1 fields are populated. This reduces the size of index to hold only the count records.
This gives us 4 possible physical table options to address both access patterns
|Table 1||Table 2||Table 3||Table 4|
|No Index||No index||GSI||Filtered GSI|
|Access Pattern 1||Scan||Query||Query||Query|
|Access Pattern 2||Query||Scan||Query||Query|
To further illustrate the usefulness of GSIs, I created the four tables and performance tested them to show the results. I loaded 999 items for 5 teams giving 5000 items in total when aggregation items were inserted. I monitored the following two metrics in Cloudwatch during the tests.
SuccessfulRequestLatency - this is the response time in milliseconds of successful requests
ConsumedReadCapacityUnits - number of read capacity units consumed
I used an average of these metrics for each run in the graphs below.
Here we can see how inefficient a Scan operation can be. To retrieve the count of players per team, a scan operation had to be run against all 5000 items in the table and then filtered down to just 5 records.
Taking out the Scan query for table 1, we can see that the table 4 design with the filtered index is the most efficient for latency. RCUs consumed are consistent across tables as the 5 items are retrieved with a query.
Here again we see the inefficiency of the Scan operation but this time against table 2. We are retrieving a larger dataset, 1000 items per team, 999 players plus 1 aggregation item.
Taking out the Scan on table 2, we see that the RCUs consumed are consistent due to the same query operation being executed. Latency is slightly lower for table 4 but nothing substantial.
I hope this helps you better understand how a GSI can be implemented to support a successful single table design. All the code I used for my tests is available in this repo. It's not production ready, just good enough for my tests but there might be something useful in there for you.