
Franck Pachot for AWS Heroes


Multi-Region Distributed SQL Transaction Latency

All distributed systems have trade-offs. Aurora DSQL uses Optimistic Concurrency Control and, combined with Multi-Version Concurrency Control, it doesn't synchronize the transaction intents before the final commit. This has drawbacks, like the need to retry the transaction in case of conflict or failure. However, it has a significant advantage: it reduces cross-region synchronization.
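
For example, when a conflict is detected at commit, the application must retry. Here is a minimal retry-loop sketch, not the benchmark code, assuming the connection settings are in the PG* environment variables:

# Minimal sketch: re-run a single-statement transaction until the
# optimistic commit succeeds (psql returns a non-zero exit code on failure)
until psql -c "update demo set value = value + 1 where id = 42"
do
  echo "commit conflict or failure, retrying..."
  sleep 0.1
done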


To expose these trade-offs, I'll compare two Distributed SQL databases, Aurora DSQL and YugabyteDB, deployed in the same regions: North Virginia (us-east-1), Ohio (us-east-2), and Oregon (us-west-2). However, the deployments are different:

  • Aurora DSQL defines two regions as 'linked', both containing data replicas, while a third region is a witness only, to which you cannot connect.
  • YugabyteDB has multiple multi-region deployment options, but I've chosen the most common for those regions: preferred leaders in one region (us-east-1) and followers in the others (a configuration sketch follows this list).
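
As an illustration of the YugabyteDB leader preference, it can be set with yb-admin; the master addresses and zone names below are placeholders, not necessarily this cluster's values:

# Illustrative sketch: prefer tablet leaders in the us-east-1 zones
yb-admin --master_addresses master1:7100,master2:7100,master3:7100 \
  set_preferred_zones aws.us-east-1.us-east-1a aws.us-east-1.us-east-1b aws.us-east-1.us-east-1c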

I've used this configuration in the previous blog posts of this series. The preferred region reflects the two main reasons for multi-region deployments:

  • High availability and resilience against regional failures. In this context, connecting to a primary region with low read and write latencies makes sense.
  • Geo-partitioning, which localizes data to a region: it tolerates zone failures without replicating to another region, to comply with data sovereignty regulations.

Aurora DSQL doesn't differentiate the two linked regions regarding data placement, and both offer the same performance.


Understanding the difference is crucial when connecting to another region, such as us-east-2, which has a round-trip time of 10 milliseconds to us-east-1. In summary, while both regions are equivalent in Aurora DSQL, they differ in YugabyteDB when leader preference is assigned to one region.
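
To check which nodes, regions, and zones a YugabyteDB cluster spans, the yb_servers() function can be queried from any connection:

select host, cloud, region, zone from yb_servers();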

I have created a demo table in Aurora DSQL and YugabyteDB, populated with the following SQL script:

drop table if exists demo;
create table demo (
 id    int primary key,
 value int
);
\d demo
-- repeat the insert below ten times with \watch, adding 1000 rows each time
with start as ( select count(*) from demo)
 insert into demo 
 select count+n, count+n 
 from start, generate_series(1,1000) n
\watch c=10

I'll run PostgreSQL PgBench from a host in us-east-1, connecting either to us-east-1 (client-server round trip of 1 millisecond) or to us-east-2 (client-server round trip of 10 milliseconds).
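
Only the connection endpoint changes between runs. As a sketch, the endpoints can be passed to PgBench through the libpq environment variables (the host names and credentials below are placeholders, not this cluster's values):

# Placeholders only: substitute the actual regional endpoints and credentials
export PGDATABASE=postgres PGUSER=<user> PGSSLMODE=require
PGHOST=<endpoint-in-us-east-1> pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql
PGHOST=<endpoint-in-us-east-2> pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql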

Here is the PgBench script that updates one row randomly:

-- random_update.sql
\set id random(1, 1000)
\set new_value random(1, 1000000)
update demo set value=:new_value where id=:id;

Here is the PgBench command to run it from ten connections:

 pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

I'll run the same test on Aurora DSQL and YugabyteDB, connected to the local and remote regions.

Single-Shard Transaction

Let's start with the table defined above, which has only a primary key: no secondary indexes, no foreign keys, and no triggers. Its single-row DML is therefore a single-shard transaction.

Aurora DSQL

Connected to us-east-1:

pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

pgbench (16.4, server 16.5)
progress: 60.0 s, 419.2 tps, lat 23.079 ms stddev 4.445, 321 failed
progress: 120.0 s, 429.2 tps, lat 23.228 ms stddev 3.407, 292 failed
progress: 180.0 s, 433.2 tps, lat 23.006 ms stddev 3.238, 326 failed
progress: 240.0 s, 435.5 tps, lat 22.885 ms stddev 3.117, 343 failed
progress: 300.0 s, 435.2 tps, lat 22.904 ms stddev 3.288, 305 failed
transaction type: random_update.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 300 s
number of transactions actually processed: 129145
number of failed transactions: 1587 (1.214%)
latency average = 23.019 ms
latency stddev = 3.527 ms
initial connection time = 1760.508 ms
tps = 432.997686 (without initial connection time)
statement latencies in milliseconds and failures:
         0.001           0  \set id random(1, 1000)
         0.001           0  \set new_value random(1, 1000000)
        23.017        1587  update demo set value=:new_value where id=:id;

Connected to us-east-2:

pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

pgbench (16.4, server 16.5)
progress: 60.0 s, 252.6 tps, lat 37.312 ms stddev 7.074, 258 failed
progress: 120.0 s, 264.8 tps, lat 37.374 ms stddev 3.093, 238 failed
progress: 180.0 s, 265.9 tps, lat 37.193 ms stddev 2.806, 271 failed
progress: 240.0 s, 265.6 tps, lat 37.196 ms stddev 2.821, 285 failed
progress: 300.0 s, 265.6 tps, lat 37.209 ms stddev 2.830, 274 failed
transaction type: random_update.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 300 s
number of transactions actually processed: 78882
number of failed transactions: 1326 (1.653%)
latency average = 37.256 ms
latency stddev = 4.045 ms
initial connection time = 2761.693 ms
tps = 265.359983 (without initial connection time)
statement latencies in milliseconds and failures:
         0.001           0  \set id random(1, 1000)
         0.001           0  \set new_value random(1, 1000000)
        37.254        1326  update demo set value=:new_value where id=:id;

PgBench measures the elapsed time from the client, which runs in us-east-1 in both cases. When connected to the remote region, the response time increases by 14 milliseconds, although the statement's execution on the server remains the same in both regions.

YugabyteDB

Connected to us-east-1:

pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

pgbench (16.4, server 11.2-YB-2024.1.3.1-b0)
progress: 60.0 s, 487.3 tps, lat 20.346 ms stddev 3.066, 0 failed
progress: 120.0 s, 489.3 tps, lat 20.382 ms stddev 2.990, 0 failed
progress: 180.0 s, 490.7 tps, lat 20.327 ms stddev 2.664, 0 failed
progress: 240.0 s, 492.2 tps, lat 20.254 ms stddev 2.658, 0 failed
progress: 300.0 s, 492.0 tps, lat 20.268 ms stddev 2.806, 0 failed
transaction type: random_update.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 300 s
number of transactions actually processed: 147104
number of failed transactions: 0 (0.000%)
latency average = 20.315 ms
latency stddev = 2.841 ms
initial connection time = 331.403 ms
tps = 490.863364 (without initial connection time)
statement latencies in milliseconds and failures:
         0.001           0  \set id random(1, 1000)
         0.001           0  \set new_value random(1, 1000000)
        20.313           0  update demo set value=:new_value where id=:id;

Connected to us-east-2:

pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

pgbench (16.4, server 11.2-YB-2024.1.3.1-b0)
progress: 60.0 s, 276.3 tps, lat 35.464 ms stddev 15.942, 0 failed
progress: 120.0 s, 285.0 tps, lat 35.044 ms stddev 4.955, 0 failed
progress: 180.0 s, 283.6 tps, lat 35.222 ms stddev 4.916, 0 failed
progress: 240.0 s, 282.8 tps, lat 35.328 ms stddev 4.925, 0 failed
progress: 300.0 s, 281.1 tps, lat 35.536 ms stddev 4.937, 0 failed
transaction type: random_update.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 300 s
number of transactions actually processed: 84535
number of failed transactions: 0 (0.000%)
latency average = 35.318 ms
latency stddev = 8.332 ms
initial connection time = 1128.837 ms
tps = 282.808689 (without initial connection time)
statement latencies in milliseconds and failures:
         0.001           0  \set id random(1, 1000)
         0.001           0  \set new_value random(1, 1000000)
        35.316           0  update demo set value=:new_value where id=:id;

When comparing performance, it is essential to understand what is run and how it is run. In this case, YugabyteDB looks faster, with PostgreSQL concurrency-control behavior, even when the Raft leaders are remote. But I've created the table without any secondary index, so the transaction is single-shard. In this case, there's only one multi-region synchronization: the write and the commit go to the same node, and the write is buffered and flushed only at commit time.

Multi-Shard Transaction

I'll create a secondary index. As mentioned in the previous post of this series, the index entries may be on a different node than the table rows. Such a multi-shard transaction requires a transaction table to coordinate the commit in one atomic place, so higher latency can be expected, especially in YugabyteDB.

Let's create the secondary index in YugabyteDB:

yugabyte=> create index on demo(value);
CREATE INDEX


By default, YugabyteDB creates indexes CONCURRENTLY, with backfill, and waits for completion.

In the Aurora DSQL preview, you must specify ASYNC so that the index is backfilled in the background, and monitor the job to know when it completes:

dsql=> create index async on demo(value);
           job_id
----------------------------
 qk45qq2lyve6jexjokbi34ek6q
(1 row)

dsql=> select * from sys.jobs;
           job_id           |  status   | details
----------------------------+-----------+---------
 qk45qq2lyve6jexjokbi34ek6q | completed |
(1 row)

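To wait for the backfill, the sys.jobs view can be polled, for example with \watch and the job_id returned by CREATE INDEX ASYNC (a sketch):

select job_id, status
 from sys.jobs
 where job_id = 'qk45qq2lyve6jexjokbi34ek6q'
\watch 10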

Aurora DSQL

In Aurora DSQL, the response time for multi-shard transactions remains largely unaffected since coordination occurs solely at commit. Both reading and writing operations happen within the same region.

Connected to us-east-1:

pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

pgbench (16.4, server 16.5)
progress: 60.0 s, 432.9 tps, lat 22.379 ms stddev 3.946, 375 failed
progress: 120.0 s, 449.9 tps, lat 22.165 ms stddev 1.202, 346 failed
progress: 180.0 s, 451.1 tps, lat 22.100 ms stddev 1.224, 370 failed
progress: 240.0 s, 453.1 tps, lat 22.006 ms stddev 1.693, 379 failed
progress: 300.0 s, 454.0 tps, lat 21.964 ms stddev 1.267, 368 failed
transaction type: random_update.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 300 s
number of transactions actually processed: 134475
number of failed transactions: 1838 (1.348%)
latency average = 22.120 ms
latency stddev = 2.127 ms
initial connection time = 1683.871 ms
tps = 450.751089 (without initial connection time)
statement latencies in milliseconds and failures:
         0.001           0  \set id random(1, 1000)
         0.001           0  \set new_value random(1, 1000000)
        22.118        1838  update demo set value=:new_value where id=:id;

Connected to us-east-2:

pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

pgbench (16.4, server 16.5)
progress: 60.0 s, 251.5 tps, lat 37.580 ms stddev 6.357, 233 failed
progress: 120.0 s, 261.5 tps, lat 37.841 ms stddev 1.207, 262 failed
progress: 180.0 s, 261.5 tps, lat 37.833 ms stddev 1.144, 259 failed
progress: 240.0 s, 261.5 tps, lat 37.808 ms stddev 1.064, 272 failed
progress: 300.0 s, 261.3 tps, lat 37.846 ms stddev 1.221, 271 failed
transaction type: random_update.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 300 s
number of transactions actually processed: 77853
number of failed transactions: 1297 (1.639%)
latency average = 37.783 ms
latency stddev = 2.988 ms
initial connection time = 2665.704 ms
tps = 261.805967 (without initial connection time)
statement latencies in milliseconds and failures:
         0.001           0  \set id random(1, 1000)
         0.001           0  \set new_value random(1, 1000000)
        37.781        1297  update demo set value=:new_value where id=:id;

YugabyteDB

YugabyteDB needs to synchronize more frequently to ensure transaction resiliency (the transaction continues rather than being rolled back in the event of a failure) and PostgreSQL-compatible concurrency control (where reads see the committed changes and must account for the provisional writes of ongoing transactions). This leads to increased latency for connections that are far from the leader-preference region.

Connected to us-east-1:

pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

pgbench (16.4, server 11.2-YB-2024.1.3.1-b0)
progress: 60.0 s, 279.9 tps, lat 35.518 ms stddev 6.884, 0 failed
progress: 120.0 s, 281.2 tps, lat 35.556 ms stddev 6.437, 0 failed
progress: 180.0 s, 282.0 tps, lat 35.448 ms stddev 6.576, 0 failed
progress: 240.0 s, 282.3 tps, lat 35.420 ms stddev 6.408, 0 failed
progress: 300.0 s, 283.0 tps, lat 35.329 ms stddev 6.398, 0 failed
transaction type: random_update.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 300 s
number of transactions actually processed: 84512
number of failed transactions: 0 (0.000%)
latency average = 35.454 ms
latency stddev = 6.545 ms
initial connection time = 320.726 ms
tps = 281.942544 (without initial connection time)
statement latencies in milliseconds and failures:
         0.001           0  \set id random(1, 1000)
         0.001           0  \set new_value random(1, 1000000)
        35.452           0  update demo set value=:new_value where id=:id;

Connected to us-east-2:

pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

pgbench (16.4, server 11.2-YB-2024.1.3.1-b0)
progress: 60.0 s, 153.4 tps, lat 63.912 ms stddev 24.717, 0 failed
progress: 120.0 s, 158.8 tps, lat 62.984 ms stddev 8.729, 0 failed
progress: 180.0 s, 159.0 tps, lat 62.886 ms stddev 8.546, 0 failed
progress: 240.0 s, 159.0 tps, lat 62.877 ms stddev 8.742, 0 failed
progress: 300.0 s, 158.8 tps, lat 62.973 ms stddev 8.770, 0 failed
transaction type: random_update.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 300 s
number of transactions actually processed: 47350
number of failed transactions: 0 (0.000%)
latency average = 63.121 ms
latency stddev = 13.411 ms
initial connection time = 1129.718 ms
tps = 158.405517 (without initial connection time)
statement latencies in milliseconds and failures:
         0.002           0  \set id random(1, 1000)
         0.001           0  \set new_value random(1, 1000000)
        63.119           0  update demo set value=:new_value where id=:id;

There is extra latency due to the transaction-status update at commit and the flush of the buffered writes before it.

Note that EXPLAIN ANALYZE shows some details:

yugabyte=> explain (analyze, dist, costs off) 
           update demo set value=2 where id=42
;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Update on demo (actual time=10.431..10.431 rows=0 loops=1)
   ->  Index Scan using demo_pkey on demo (actual time=10.348..10.350 rows=1 loops=1)
         Index Cond: (id = 42)
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 10.232 ms
         Storage Table Rows Scanned: 1
         Storage Table Write Requests: 1
         Storage Index Write Requests: 2
 Planning Time: 0.076 ms
 Execution Time: 44.949 ms
 Storage Read Requests: 1
 Storage Read Execution Time: 10.232 ms
 Storage Rows Scanned: 1
 Storage Write Requests: 3
 Catalog Read Requests: 0
 Catalog Write Requests: 0
 Storage Flush Requests: 1
 Storage Flush Execution Time: 34.440 ms
 Storage Execution Time: 44.672 ms
 Peak Memory Usage: 24 kB
(20 rows)

Time: 65.691 ms

Without Optimistic Concurrency Control, the update statement must first read the row to check for its existence and for potential intents from concurrent transactions. This read (Storage Table Read Requests) incurs latency to the leader region. The three writes (updating the table row, deleting the old index entry, and inserting the new one) are buffered and sent in a single flush request. Because it is a multi-shard transaction, they go to the IntentsDB as provisional records. Finally, a commit (not accounted for in EXPLAIN ANALYZE) updates the transaction status to make the changes visible to others.
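
Since this commit is not included in EXPLAIN ANALYZE, a simple way to observe it is to time the statement and the commit separately in an explicit transaction:

\timing on
begin;
update demo set value=2 where id=42;
commit; -- the transaction-status update and its synchronization show up in this timing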

With YugabyteDB and Replication Factor 3, connecting to the third region, us-west-2, is possible, but the latency is higher. Here is the same workload:

pgbench -c 10 -P 60 -T 300 -r -nf random_update.sql

pgbench (16.4, server 11.2-YB-2024.1.3.1-b0)
progress: 60.0 s, 30.5 tps, lat 294.202 ms stddev 253.955, 0 failed
progress: 120.0 s, 36.4 tps, lat 274.675 ms stddev 24.144, 0 failed
progress: 180.0 s, 36.4 tps, lat 274.973 ms stddev 23.572, 0 failed
progress: 240.0 s, 36.4 tps, lat 274.678 ms stddev 24.214, 0 failed
progress: 300.0 s, 36.5 tps, lat 274.306 ms stddev 23.158, 0 failed
transaction type: random_update.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 300 s
number of transactions actually processed: 10581
number of failed transactions: 0 (0.000%)
latency average = 278.035 ms
latency stddev = 108.140 ms
initial connection time = 5915.564 ms
tps = 35.948550 (without initial connection time)
statement latencies in milliseconds and failures:
         0.003           0  \set id random(1, 1000)
         0.001           0  \set new_value random(1, 1000000)
       278.031           0  update demo set value=:new_value where id=:id;


In practice, tablespaces and partitions determine the placement of leaders, keeping data located near its users. Additional techniques, such as duplicate covering indexes (ideal for rarely updated tables like lookup tables) and follower reads (appropriate for read-only transactions), improve read latency from distant regions.
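
As an illustration of leader placement with tablespaces, here is a sketch with an illustrative replica placement, not the configuration used for these tests:

-- Illustrative sketch: keep all replicas, and therefore the leaders,
-- of this table in us-east-1 zones
create tablespace us_east_1_ts with (replica_placement='{
  "num_replicas": 3,
  "placement_blocks": [
    {"cloud":"aws","region":"us-east-1","zone":"us-east-1a","min_num_replicas":1},
    {"cloud":"aws","region":"us-east-1","zone":"us-east-1b","min_num_replicas":1},
    {"cloud":"aws","region":"us-east-1","zone":"us-east-1c","min_num_replicas":1}
  ]
}');
create table demo_east (id int primary key, value int) tablespace us_east_1_ts;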

Another essential output from PgBench is the initial connection time, which may affect you depending on connection pool usage and ongoing DDL. Establishing a SQL session requires reading metadata from the cluster, such as the SQL catalog, which can involve cross-region reads. Some metadata can be cached, but it must be re-read when DDL updates the schema. This impacts connection time, depending on the region you connect to.
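
To see how much the connection time matters without a connection pool, PgBench's -C option opens a new connection for each transaction, so the connection cost becomes part of every measured latency (a sketch, not run in these tests):

pgbench -C -c 10 -P 60 -T 300 -r -nf random_update.sql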

Conclusion

Design choices involve trade-offs between availability, performance, and PostgreSQL compatibility. Results from Aurora DSQL and YugabyteDB show how architectural decisions, like concurrency control and replication strategies, impact these trade-offs.

Aurora DSQL reduces cross-region synchronization, improving latency and providing a response time that is less subject to the complexity of the transaction. However, it requires conflict retry handling.

YugabyteDB offers PostgreSQL-compatible concurrency control and transactional resilience but may have higher latencies due to the extra synchronization.

There is no "best" database for all scenarios. Each shows better numbers in specific situations based on workload, latency, fault tolerance needs, and schema design. Recognizing these trade-offs enables appropriate database use. Performance tests are tools for understanding, not absolute measures dictating which system is better.

Here is a summary of the PgBench output from these tests:

Initial Connection Time

 Connected from us-east-1 | Aurora DSQL   | YugabyteDB
--------------------------+---------------+------------
 connected to us-east-1   | 1683 ms       | 320 ms
 connected to us-east-2   | 2665 ms       | 1129 ms
 connected to us-west-2   | (no endpoint) | 5915 ms

Primary Key Only (No Secondary Indexes)

 Connected from us-east-1 | Aurora DSQL | YugabyteDB
--------------------------+-------------+------------
 connected to us-east-1   | 23 ms       | 20 ms
 connected to us-east-2   | 37 ms       | 35 ms

One Secondary Index (Multi-Shard Transaction)

 Connected from us-east-1 | Aurora DSQL   | YugabyteDB
--------------------------+---------------+------------
 connected to us-east-1   | 22 ms         | 35 ms
 connected to us-east-2   | 37 ms         | 63 ms
 connected to us-west-2   | (no endpoint) | 278 ms

This post addressed multi-region deployments. The latency between Availability Zones in an AWS region is similar to cross-node latency within a zone, resulting in minor differences. I might explore single-region deployments in a future post of this series.
