Franck Pachot for YugabyteDB

Posted on Jun 27, 2022 • Edited on May 23, 2023

Local reads from 🚀YugabyteDB Read Replicas

#postgres #yugabytedb #distributed #sql

This is an alternative to the previous post. Most users are in eu, where we want fast writes and consistent reads. We allow local reads from other regions, accepting a bounded staleness. The previous post, with Replication Factor RF=5, had to wait for a remote region to acknowledge writes. Here, I'll set up an RF=3 cluster and still be able to read locally in every region.

Set up the YugabyteDB universe

I'll create a 3-node RF=3 cluster, to avoid cross-region waits for the quorum, and extend it to a 5-node universe where the additional nodes hold replicas but do not participate in the quorum.

I create a network for my lab and start two nodes in Europe (region eu, zones 1 and 2):

docker network create -d bridge yb

docker run -d --network yb --name yb-eu-1 -p5001:5433 -p7001:7000 -p9001:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-eu-1.yb \
 --master_flags="placement_zone=1,placement_region=eu,placement_cloud=cloud" \
--tserver_flags="placement_zone=1,placement_region=eu,placement_cloud=cloud"

docker run -d --network yb --name yb-eu-2 -p5002:5433 -p7002:7000 -p9002:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-eu-2.yb --join yb-eu-1 \
 --master_flags="placement_zone=2,placement_region=eu,placement_cloud=cloud" \
--tserver_flags="placement_zone=2,placement_region=eu,placement_cloud=cloud"

This is where I'll have most of my users. I've created two nodes there so that every tablet can have its leader in this region and one of its followers in the other zone of the same region, which keeps reads and writes fast from Europe.

For Replication Factor RF=3, I need another node. I choose the region with the lowest latency to Europe. If one node fails in Europe, the database will still be available, but writes will incur this additional latency to reach the quorum. I choose us:

docker run -d --network yb --name yb-us-1 -p5003:5433 -p7003:7000 -p9003:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-us-1.yb --join yb-eu-1 \
 --master_flags="placement_zone=1,placement_region=us,placement_cloud=cloud" \
--tserver_flags="placement_zone=1,placement_region=us,placement_cloud=cloud"

I add two more nodes for read replicas, in ap and au, naming this placement RO for "Read Only":

docker run -d --network yb --name yb-ap-1 -p5004:5433 -p7004:7000 -p9004:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-ap-1.yb --join yb-eu-1 \
 --master_flags="placement_uuid=RO,placement_zone=1,placement_region=ap,placement_cloud=cloud" \
--tserver_flags="placement_uuid=RO,placement_zone=1,placement_region=ap,placement_cloud=cloud"

docker run -d --network yb --name yb-au-1 -p5005:5433 -p7005:7000 -p9005:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-au-1.yb --join yb-eu-1 \
 --master_flags="placement_uuid=RO,placement_zone=1,placement_region=au,placement_cloud=cloud" \
--tserver_flags="placement_uuid=RO,placement_zone=1,placement_region=au,placement_cloud=cloud"

For the moment, this is very similar to the previous post, except that I started the ap and au nodes with a different placement name (placement_uuid=RO) to extend the 3-node cluster to a 5-node universe. Those two additional nodes form the Read Replica set, and I'll use the RO name to define their placement.
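The five docker run commands above differ only in the node name, the host ports, the region and zone, and the optional placement_uuid. As a minimal sketch, the shared flag string could be built by a small helper. The name placement_flags is hypothetical and not part of yugabyted; it only reproduces the strings used above.

```shell
# Hypothetical helper (not part of yugabyted): builds the value passed to
# both --master_flags and --tserver_flags in the commands above.
# Usage: placement_flags REGION ZONE [PLACEMENT_UUID]
placement_flags() (
  region=$1
  zone=$2
  uuid=${3:-}
  flags="placement_zone=${zone},placement_region=${region},placement_cloud=cloud"
  # Read replica nodes carry an extra placement_uuid (RO in this lab).
  if [ -n "$uuid" ]; then
    flags="placement_uuid=${uuid},${flags}"
  fi
  printf '%s\n' "$flags"
)

placement_flags eu 1      # primary node in eu, zone 1
placement_flags au 1 RO   # read replica node in au
```

The function body runs in a subshell, so its variables do not leak into the calling shell.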

Preferred tablet leaders

I define the placement with 3 replicas and a minimum of 1 in each of the three zones (two in eu, one in us):

docker exec -i yb-eu-1 yb-admin -master_addresses yb-eu-1:7100,yb-eu-2:7100,yb-us-1:7100 \
modify_placement_info cloud.eu.1:1,cloud.eu.2:1,cloud.us.1:1 3

The format is a comma-separated list of cloud.region.zone:min_replicas followed by the total replication factor.

I set the Europe zones as the preferred ones for leaders:

docker exec -i yb-eu-1 yb-admin -master_addresses yb-eu-1:7100,yb-eu-2:7100,yb-us-1:7100 \
set_preferred_zones cloud.eu.1 cloud.eu.2

The format is a space-separated list of cloud.region.zone.

Masters

At this point, the difference from the previous post is that the Replication Factor is 3 here, instead of 5, and two nodes do not participate in the write quorum.

See the previous post about the master placement. This post focuses on tablet placement.

Read Replicas

I define the placement in the RO Read Replica extension: RF=2 with one replica in each region, so that I have a full copy of the database in ap and in au:

docker exec -i yb-eu-1 yb-admin -master_addresses yb-eu-1:7100,yb-eu-2:7100,yb-us-1:7100 \
add_read_replica_placement_info cloud.ap.1:1,cloud.au.1:1 2 RO

The format is a comma-separated list of cloud.region.zone:min_replicas followed by the total replication factor. As they don't participate in the quorum, there's no constraint on the replication factor, except that the sum of the minimums must be lower than or equal to the total.
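The constraint on the minimums can be checked before calling yb-admin. This is only a sketch: check_placement is a hypothetical helper, not a yb-admin feature, and it assumes the cloud.region.zone:min_replicas format described above.

```shell
# Hypothetical pre-flight check (not a yb-admin feature): verifies that the
# per-zone minimums in a placement string do not exceed the replication factor.
# Usage: check_placement "cloud.region.zone:min,..." RF
check_placement() (
  placement=$1
  rf=$2
  sum=0
  IFS=,                       # split the placement string on commas
  for block in $placement; do
    min=${block##*:}          # text after the last colon is min_replicas
    sum=$((sum + min))
  done
  if [ "$sum" -le "$rf" ]; then
    echo "ok: sum of minimums $sum <= RF $rf"
  else
    echo "invalid: sum of minimums $sum > RF $rf"
    false                     # make the function return a non-zero status
  fi
)

check_placement "cloud.eu.1:1,cloud.eu.2:1,cloud.us.1:1" 3
check_placement "cloud.ap.1:1,cloud.au.1:1" 2
```

Both placements used in this post pass the check, since each asks for exactly one replica per zone.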

The placement information can be read from http://localhost:7001/cluster-config.

The primary nodes, which serve reads and writes and hold the leaders and followers, are referred to as live_replicas in this configuration.

Here is the compact form:

curl -s http://localhost:7001/api/v1/cluster-config
{"version":6,"replication_info":{"live_replicas":{"num_replicas":3,"placement_blocks":[{"cloud_info":{"placement_cloud":"cloud","placement_region":"us","placement_zone":"1"},"min_num_replicas":1},{"cloud_info":{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"2"},"min_num_replicas":1},{"cloud_info":{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"1"},"min_num_replicas":1}]},"read_replicas":[{"num_replicas":2,"placement_blocks":[{"cloud_info":{"placement_cloud":"cloud","placement_region":"ap","placement_zone":"1"},"min_num_replicas":1},{"cloud_info":{"placement_cloud":"cloud","placement_region":"au","placement_zone":"1"},"min_num_replicas":1}],"placement_uuid":"RO"}],"affinitized_leaders":[{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"1"},{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"2"}]},"cluster_uuid":"0ab16ad8-ccba-4469-81a9-83531981ce91"}
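The compact JSON is hard to read, so here is a rough sketch that maps each placement block back to the cloud.region.zone notation used with yb-admin. The function name zones_from_config is hypothetical, and the grep/sed pair assumes the key order shown in the output above; a real JSON parser would be more robust.

```shell
# Sketch: print every placement block found on stdin as cloud.region.zone.
# Assumes the keys appear in the order cloud, region, zone, as in the
# output above.
zones_from_config() {
  grep -o '"placement_cloud":"[^"]*","placement_region":"[^"]*","placement_zone":"[^"]*"' |
    sed 's/"placement_cloud":"\([^"]*\)","placement_region":"\([^"]*\)","placement_zone":"\([^"]*\)"/\1.\2.\3/'
}

# Excerpt of the affinitized_leaders section from the output above.
leaders='"affinitized_leaders":[{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"1"},{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"2"}]'
printf '%s\n' "$leaders" | zones_from_config
```

On the excerpt, this prints cloud.eu.1 and cloud.eu.2, one per line. Against the running lab, the same function could be fed from curl -s http://localhost:7001/api/v1/cluster-config, in which case it lists the live_replicas, read_replicas, and affinitized_leaders blocks together.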

I can also check the topology from SQL with the yb_servers() function. This is how the cluster-aware drivers know how to balance the connections:

psql -p 5005 -c "
> select host,port,node_type,cloud,region,zone
> from yb_servers()
> order by node_type,cloud,region,zone
> "
  host   | port |  node_type   | cloud | region | zone
---------+------+--------------+-------+--------+------
 yb-eu-1 | 5433 | primary      | cloud | eu     | 1
 yb-eu-2 | 5433 | primary      | cloud | eu     | 2
 yb-us-1 | 5433 | primary      | cloud | us     | 1
 yb-ap-1 | 5433 | read_replica | cloud | ap     | 1
 yb-au-1 | 5433 | read_replica | cloud | au     | 1
(5 rows)

You see the node type primary for each node participating in the quorum. When both zones in eu are available, the quorum (2 of the 3 replicas for an RF=3 cluster: the leader and one follower) will be local. If one eu zone is down, the database will still be available, with all leaders in the remaining eu zone, but writes will wait for us to reach the quorum.
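The quorum sizes follow from the usual Raft majority rule. The arithmetic below is a small sketch, with hypothetical function names, comparing the RF=3 cluster of this post with the RF=5 cluster of the previous one.

```shell
# Majority quorum for a Raft group with RF replicas: floor(RF/2) + 1.
quorum() { echo $(( $1 / 2 + 1 )); }

# Number of replicas that can be lost while a majority remains.
tolerated_failures() { echo $(( $1 - ($1 / 2 + 1) )); }

for rf in 3 5; do
  echo "RF=$rf quorum=$(quorum "$rf") tolerated_failures=$(tolerated_failures "$rf")"
done
```

With RF=3 the quorum is 2, which the two eu zones can satisfy on their own. With RF=5 it is 3, so with only two nodes in eu at least one remote region has to acknowledge each write.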

Writes

I connect to the last node and run some DDL and DML. Even though this node is a read replica, DDL and DML are redirected to the read-write nodes:

psql -p 5005 -e <<SQL
drop table if exists demo;
create table demo as select generate_series(1,1000) n;
update demo set n=n+1;
\watch 0.01
SQL

The http://localhost:7001/tablet-servers page shows that the Read ops/sec and Write ops/sec from my update query are on eu only.

The difference from the previous post (which is not visible in this lab) is that the writes are faster, without waiting for a remote region.

The http://localhost:7001/tables endpoint shows leaders on eu only, followers in eu (the other zone) and us, with additional read replicas in ap and au.

Local reads

Now, when connected to au with follower reads accepted (a read-only transaction with yb_read_from_followers enabled), I can see that all reads are from this region:

psql -p 5005 -e <<SQL
set yb_read_from_followers=on;
set default_transaction_read_only = on;
explain analyze select * from demo;
\watch 0.01
SQL

All reads are local, in the region I'm connected to (port 5005 is the exposed 5433 for yb-au-1).

This is a cluster where the Replication Factor (RF=3 here, as set by modify_placement_info) can be defined independently of the number of additional replicas that are added for local reads. The preferred leader zones are in the main region (set by set_preferred_zones) for this cluster. Reads from eu are consistent and fast. Reads from other regions are fast if they accept a staleness of yb_follower_read_staleness_ms (default 30 seconds). High availability tolerates one node down. Writes are also fast when both zones in eu are up, because the quorum is local.

The system of one cluster plus read replicas is called a YugabyteDB universe. It allows the same "follower reads" but with more control over which nodes participate in the quorum.

Staleness

Let's explain a bit more about this staleness. This is not Eventual Consistency. Everything read is consistent, but as of a previous point in time, 30 seconds earlier by default. This is probably OK for reports. You can reduce this staleness, but the minimum is bound by the defined maximum clock skew:

set yb_read_from_followers=on;
SET
set default_transaction_read_only = on;
SET
set yb_follower_read_staleness_ms=10;
ERROR:  cannot enable yb_read_from_followers with a staleness of less than 2 * (max_clock_skew = 500000 usec)

To guarantee consistency as of a point in time, we must be sure that the Hybrid Logical Clock can order the changes received through the Raft protocol. So, with a 500 millisecond maximum clock skew (which is a reasonable setting for geo-distribution), the minimum staleness is 1 second:

set yb_read_from_followers=on;
SET
set default_transaction_read_only = on;
SET
set yb_follower_read_staleness_ms=1001;
SET
explain analyze select * from demo;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..100.00 rows=1000 width=4) (actual time=1.263..3.510 rows=1000 loops=1)
 Planning Time: 0.595 ms
 Execution Time: 3.767 ms
(3 rows)

One second is probably not a problem. In all cases, you will never get a result that is older than the accepted staleness, so allowing a larger value increases the availability of follower reads.
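The 1 second floor comes directly from the error message shown earlier, which rejects a staleness of less than 2 * max_clock_skew. As a sketch, with a hypothetical function name, the conversion from the skew in microseconds to the staleness in milliseconds is:

```shell
# Lower bound for yb_follower_read_staleness_ms, per the error message:
# 2 * max_clock_skew. Input is in microseconds, output in milliseconds.
min_staleness_ms() { echo $(( 2 * $1 / 1000 )); }

min_staleness_ms 500000   # the 500000 usec skew shown in the error message
```

This prints 1000. The example above sets 1001, just over that bound; whether exactly 1000 is accepted is not shown here, so treat the computed value as the limit rather than a tested setting.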

This configuration, compared with the one in the previous post, has the advantage of not impacting the write latency on the primary. But the read replicas cannot be used for Disaster Recovery. You have the choice. In the next post, we will see another option: using tablespaces to set the placement for specific tables, or table partitions, regardless of the default set by the cluster configuration.
