Franck Pachot for YugabyteDB

Posted on Jun 27, 2022 • Edited on May 23, 2023

Local reads from 🚀YugabyteDB raft followers

#yugabytedb #distributed #sql #postgres

In a YugabyteDB cluster, the connections, SQL processing, and data storage are distributed over the cluster. In a single region, with low latency between Availability Zones, everything is distributed for scalability and high availability. Queries and transactions read and write tables, indexes, partitions, and transaction status on all nodes. Your users in this region will not suffer from latency.

However, even when the main application connects from one region, you may have users worldwide. They accept the latency for their writes, because the speed of light and clock skew are physical constraints. They accept the same for reads when they need the current state, including the latest writes. But for reporting, a local copy that is a few seconds stale is acceptable. On a monolithic database, this is achieved with a standby database refreshed in async mode. YugabyteDB is distributed, and this requirement can be met with a simple placement preference.

Let's reproduce, in a lab, a deployment where most of the users are in Europe. I will ensure that the Raft leaders are there, for fast consistent reads and writes. With two regions in Europe, the quorum could be local, and I'll show this in the next post. Here, I'll run with Replication Factor RF=5 to get one follower in each region, for local reads. The quorum will then involve a roundtrip to the nearest region.

My lab is on Docker. Of course, I'll not look at the latency, as everything runs on my laptop, but it is easy to see where read and write operations occur in the web console. The regions are tagged with placement info flags: eu with zones 1 and 2, and us, ap, au with zone 1 only.

Set up the YugabyteDB cluster

I create a network for my lab and start two nodes in Europe (region eu, zones 1 and 2):

docker network create -d bridge yb

docker run -d --network yb --name yb-eu-1 -p5001:5433 -p7001:7000 -p9001:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-eu-1.yb \
 --master_flags="placement_zone=1,placement_region=eu,placement_cloud=cloud" \
--tserver_flags="placement_zone=1,placement_region=eu,placement_cloud=cloud"

docker run -d --network yb --name yb-eu-2 -p5002:5433 -p7002:7000 -p9002:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-eu-2.yb --join yb-eu-1 \
 --master_flags="placement_zone=2,placement_region=eu,placement_cloud=cloud" \
--tserver_flags="placement_zone=2,placement_region=eu,placement_cloud=cloud"

This is where I'll have most of my users. I've created two nodes here, with the goal of having all leaders in Europe in normal conditions, plus one follower in the other eu zone, so that reads from Europe are fast and writes involve only one call to another region.

I add more nodes in other regions (us, ap and au), for a total of 5 nodes:

docker run -d --network yb --name yb-us-1 -p5003:5433 -p7003:7000 -p9003:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-us-1.yb --join yb-eu-1 \
 --master_flags="placement_zone=1,placement_region=us,placement_cloud=cloud" \
--tserver_flags="placement_zone=1,placement_region=us,placement_cloud=cloud"

docker run -d --network yb --name yb-ap-1 -p5004:5433 -p7004:7000 -p9004:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-ap-1.yb --join yb-eu-1 \
 --master_flags="placement_zone=1,placement_region=ap,placement_cloud=cloud" \
--tserver_flags="placement_zone=1,placement_region=ap,placement_cloud=cloud"

docker run -d --network yb --name yb-au-1 -p5005:5433 -p7005:7000 -p9005:9000 \
yugabytedb/yugabyte:latest \
yugabyted start --daemon=false --listen yb-au-1.yb --join yb-eu-1 \
 --master_flags="placement_zone=1,placement_region=au,placement_cloud=cloud" \
--tserver_flags="placement_zone=1,placement_region=au,placement_cloud=cloud"
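
The five docker run commands above differ only in the region, the zone, and whether the node joins yb-eu-1. As a sketch (this is not how the lab was started), a small helper can print them from those parameters. The function name yb_node_cmd is hypothetical, and the host ports simply follow the 5001..5005 pattern used above. It only prints each command, so that it can be reviewed before being run.

```shell
# Hypothetical helper: print the docker run command for one node.
# Arguments: index (1..5, used for the host ports), region, zone.
yb_node_cmd() {
  idx=$1
  region=$2
  zone=$3
  name="yb-${region}-${zone}"
  flags="placement_zone=${zone},placement_region=${region},placement_cloud=cloud"
  join=""
  # every node except the first one joins yb-eu-1
  if [ "$name" != "yb-eu-1" ]; then
    join=" --join yb-eu-1"
  fi
  echo "docker run -d --network yb --name ${name}" \
    "-p500${idx}:5433 -p700${idx}:7000 -p900${idx}:9000" \
    "yugabytedb/yugabyte:latest" \
    "yugabyted start --daemon=false --listen ${name}.yb${join}" \
    "--master_flags=\"${flags}\" --tserver_flags=\"${flags}\""
}

# Print the commands for the five nodes of this lab
yb_node_cmd 1 eu 1
yb_node_cmd 2 eu 2
yb_node_cmd 3 us 1
yb_node_cmd 4 ap 1
yb_node_cmd 5 au 1
```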

Preferred tablet leaders

I define the placement with 5 replicas and a minimum of 1 in each zone:

docker exec -i yb-eu-1 yb-admin -master_addresses yb-eu-1:7100,yb-eu-2:7100,yb-us-1:7100 \
modify_placement_info \
cloud.eu.1:1,cloud.eu.2:1,cloud.us.1:1,cloud.ap.1:1,cloud.au.1:1 \
5

The format is a comma-separated list of cloud.region.zone:min_replicas followed by the total replication factor.
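
To make the format concrete, here is a minimal sketch that builds the modify_placement_info arguments from a list of zones, with a minimum of 1 replica in each and a replication factor equal to the number of zones, as in this lab. The variable names (zones, placement, rf) are only illustrative.

```shell
# Zones of this lab, as cloud.region.zone
zones="cloud.eu.1 cloud.eu.2 cloud.us.1 cloud.ap.1 cloud.au.1"

placement=""
rf=0
for z in $zones; do
  # append "zone:1" (min_replicas=1), comma-separated
  placement="${placement:+$placement,}$z:1"
  rf=$((rf + 1))
done

# The two arguments expected after modify_placement_info
echo "$placement $rf"
```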

I set the Europe zones as the preferred ones for leaders:

docker exec -i yb-eu-1 yb-admin -master_addresses yb-eu-1:7100,yb-eu-2:7100,yb-us-1:7100 \
set_preferred_zones cloud.eu.1 cloud.eu.2

The format is a space-separated list of cloud.region.zone.

Master leaders

Note that the way I've created the cluster here, with yugabyted starting a Replication Factor RF=3 cluster and then changing the placement info for tables to RF=5, is a shortcut for this lab, which focuses on the tablet servers. That's why you see a Replication Factor RF=5 with 5 yb-tserver nodes and 3 yb-master nodes.

The masters are still in RF=3, and the master leader could be in us, which would increase latency for connections from eu. If this is the case, for example after a node failure, you can relocate the master leader with:

# loop until the master LEADER is one of the "yb-eu-" nodes
until
docker exec -i yb-eu-1 yb-admin -master_addresses yb-eu-1:7100,yb-eu-2:7100,yb-us-1:7100 \
list_all_masters | grep "yb-eu-" | grep LEADER
do
# step down the current leader so that another master gets elected
docker exec -i yb-eu-1 yb-admin -master_addresses yb-eu-1:7100,yb-eu-2:7100,yb-us-1:7100 \
master_leader_stepdown
done

I'll focus on the placement of the tables in this post and the following ones, without looking more at the masters.

Cluster config

The placement information can be read from http://localhost:7001/cluster-config, or in a compact form:

curl -s http://localhost:7001/api/v1/cluster-config

{"version":3,"replication_info":{"live_replicas":{"num_replicas":5,"placement_blocks":[
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"au","placement_zone":"1"},"min_num_replicas":1},
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"ap","placement_zone":"1"},"min_num_replicas":1},
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"us","placement_zone":"1"},"min_num_replicas":1},
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"2"},"min_num_replicas":1},
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"1"},"min_num_replicas":1}
]},"affinitized_leaders":[
 {"placement_cloud":"cloud","placement_region":"eu","placement_zone":"1"},
 {"placement_cloud":"cloud","placement_region":"eu","placement_zone":"2"}
]},"cluster_uuid":"af68f56c-de35-456b-b279-fabdc299960c"}
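
As a quick sanity check on this output, the following sketch compares num_replicas with the sum of min_num_replicas over the placement blocks. It uses plain grep and cut on a sample copied from the output above, which is good enough for this compact format but is not a real JSON parser. The variable names are illustrative.

```shell
# Sample copied from the output above. On the lab, use instead:
#   config=$(curl -s http://localhost:7001/api/v1/cluster-config)
config=$(cat <<'EOF'
{"version":3,"replication_info":{"live_replicas":{"num_replicas":5,"placement_blocks":[
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"au","placement_zone":"1"},"min_num_replicas":1},
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"ap","placement_zone":"1"},"min_num_replicas":1},
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"us","placement_zone":"1"},"min_num_replicas":1},
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"2"},"min_num_replicas":1},
 {"cloud_info":{"placement_cloud":"cloud","placement_region":"eu","placement_zone":"1"},"min_num_replicas":1}
]},"affinitized_leaders":[
 {"placement_cloud":"cloud","placement_region":"eu","placement_zone":"1"},
 {"placement_cloud":"cloud","placement_region":"eu","placement_zone":"2"}
]},"cluster_uuid":"af68f56c-de35-456b-b279-fabdc299960c"}
EOF
)

# Total replication factor (the leading quote excludes "min_num_replicas")
num_replicas=$(printf '%s\n' "$config" | grep -o '"num_replicas":[0-9]*' | cut -d: -f2)

# Sum of the per-zone minimums
min_total=0
for n in $(printf '%s\n' "$config" | grep -o '"min_num_replicas":[0-9]*' | cut -d: -f2); do
  min_total=$((min_total + n))
done

echo "num_replicas=$num_replicas, sum of min_num_replicas=$min_total"
if [ "$min_total" -le "$num_replicas" ]; then
  echo "the per-zone minimums fit within the replication factor"
fi
```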

I can also check the topology from SQL with yb_servers() - this is how the cluster-aware drivers know how to balance the connections:

psql -p 5005 -c "
 select host,port,node_type,cloud,region,zone
 from yb_servers()
 order by node_type,cloud,region,zone
 "
  host   | port | node_type | cloud | region | zone
---------+------+-----------+-------+--------+------
 yb-ap-1 | 5433 | primary   | cloud | ap     | 1
 yb-au-1 | 5433 | primary   | cloud | au     | 1
 yb-eu-1 | 5433 | primary   | cloud | eu     | 1
 yb-eu-2 | 5433 | primary   | cloud | eu     | 2
 yb-us-1 | 5433 | primary   | cloud | us     | 1
(5 rows)

The node type primary means that each node participates in the quorum. When both zones in eu are available, the quorum (3 replicas out of 5 for an RF=5 cluster, the leader included) needs only one other region to acknowledge. If one eu zone is down, the database will still be available, with all leaders in the remaining eu zone, but writes will wait for two other regions to reach the quorum.
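
The arithmetic behind this can be sketched in a few lines of shell, assuming a simple majority quorum in which the leader counts as one replica. The function names quorum and remote_acks are hypothetical and only serve this illustration.

```shell
# Majority quorum for a Raft group of rf replicas (leader included).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Acknowledgements the leader must wait for from outside its own region.
# Arguments: rf, and the number of replicas available in the leader's
# region (leader included).
remote_acks() {
  need=$(( $(quorum "$1") - $2 ))
  if [ "$need" -lt 0 ]; then
    need=0
  fi
  echo "$need"
}

echo "RF=5, both eu zones up:  quorum=$(quorum 5), remote acks=$(remote_acks 5 2)"
echo "RF=5, one eu zone down:  quorum=$(quorum 5), remote acks=$(remote_acks 5 1)"
```

With two replicas in eu, one remote acknowledgement is enough. With a single eu replica left, two are needed.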

Writes

I connect to the last node and run some DDL and DML:

psql -p 5005 -e <<SQL
drop table if exists demo;
create table demo
 as select generate_series(1,1000) n;
update demo set n=n+1;
\watch 0.01
SQL

The http://localhost:7001/tablet-servers page shows that Read ops/sec and Write ops/sec, from my update query, are on eu only.

The http://localhost:7001/tables endpoint shows that all leaders are in eu, with one follower in eu (in the other zone) and one follower in each of the other regions: us, ap and au.

Local reads

Now, when connected to au with follower reads accepted (read-only transaction with yb_read_from_followers enabled), I can see that all reads are from this region:

psql -p 5005 -e <<SQL
set yb_read_from_followers=on;
set default_transaction_read_only = on;
explain analyze select * from demo;
\watch 0.01
SQL

All reads are local, in the region I'm connected to (port 5005 is the exposed 5433 for yb-au-1).
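
For a reporting session, the settings can be grouped in a small helper that also makes the accepted staleness explicit. This is only a sketch: the function name follower_read_settings is hypothetical, and 30000 ms is the default of yb_follower_read_staleness_ms. Whether much smaller values are accepted is something to check in the documentation.

```shell
# Hypothetical helper: print the session settings for follower reads.
# Argument: accepted staleness in milliseconds (defaults to 30000).
follower_read_settings() {
  staleness_ms=${1:-30000}
  echo "set yb_read_from_followers=on;"
  echo "set yb_follower_read_staleness_ms=${staleness_ms};"
  echo "set default_transaction_read_only=on;"
}

# Print the settings for the default staleness
follower_read_settings 30000
```

On the lab, the output can be sent to the au node together with a query, for example: { follower_read_settings 30000; echo "select count(*) from demo;"; } | psql -p 5005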

This is a simple cluster with a Replication Factor equal to the number of zones (RF=5 here, as set by modify_placement_info) and the preferred leaders in the main region (set by set_preferred_zones). Reads from eu are consistent and fast. Reads from other regions are fast if they accept a staleness of yb_follower_read_staleness_ms (default 30 seconds). High availability is good, as the cluster can tolerate 2 nodes down. However, the writes, even from eu, will wait for the nearest region, probably us in my example, to acknowledge.

This configuration, with RF=5, is good for Disaster Recovery, but will incur some write latency. The next post will explain how to get higher performance with fewer DR possibilities, if this is your goal.

Top comments (1)

appdistributed • Aug 13 '22

Great article - Franck - so well written. It clearly shows that

The RF for masters - 3 in this case - can be different from the copy of tables - 5 ( modify_placement_info )
You can choose to step down a master till you get a local master
With follower_read it does read from follower.