The cost and benefit of synchronous replication in PostgreSQL and YugabyteDB

#postgres #yugabytedb #database #ha

I have seen a people comparing YugabyteDB and PostgreSQL, and surprised by the different throughput when running a simple test on a from a single session. The purpose of a distributed database is to scale out. When running on a single node without the need for High-Availability-without-data-loss (this is a tautology), a monolith database will always perform with lower latency. Because a distributed DB is designed to ensure the persistence (the D in ACID) though RPC (remote procedure calls) rather than local writes.

Here is a simple workload:

drop table if exists demo;
create table demo(
 i int primary key,
 t timestamp default clock_timestamp()
);

\timing on

do $$ 
 begin
  truncate demo; 
  for i in 1..1e4 loop
   insert into demo(i) values(i);
   commit; 
  end loop; 
 end; 
$$;

select
 count(*)/extract(epoch from max(t)-min(t)) "rows/s",
 count(*),max(t)-min(t) "duration"
 from demo;

YugabyteDB

Here is the run in the current production (stable) release, a RF=3 with all nodes on the same VM (for this test, in order to be independent on network latency, you don't do that in production):

[postgres@yb0 ~]$ psql -p 5433

yugabyte=# select version();
                                                  version
------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.2-YB-2.6.1.0-b0 on x86_64-pc-linux-gnu, compiled by gcc (Homebrew gcc 5.5.0_4) 5.5.0, 64-bit

yugabyte=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
DO
Time: 37130.036 ms (00:37.130)

yugabyte=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;

      rows/s      | count |    duration
------------------+-------+-----------------
 270.115078207229 | 10000 | 00:00:37.021258

The number by itself is not important. It is a lab on one VM but I'll run everything in the same machine to compare the thoughput.

PostgreSQL no HA

I'm starting PostgreSQL on the same server:

/usr/pgsql-13/bin/initdb -D /var/tmp/pgdata
echo "port=5432" >> /var/tmp/pgdata/postgresql.conf
/usr/pgsql-13/bin/pg_ctl -D /var/tmp/pgdata -l logfile start

And run the same:

[postgres@yb0 ~]$ psql -p 5432

postgres=# select version();

                                                version
--------------------------------------------------------------------------------------------------------
 PostgreSQL 13.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit

postgres=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
DO
Time: 5533.086 ms (00:05.533)

postgres=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;

       rows/s       | count |    duration
--------------------+-------+-----------------
 1809.0359900474075 | 10000 | 00:00:05.527806

This is what make you think that PostgreSQL is faster. Yes there is a 1:7 factor here in transactions per second.

But we are comparing apples and oranges in term of resilience. YugabyteDB was running with Replication Factor RF=3 so that each write is propagated to a quorum of 2 out of 3 replicas. In a Yugabyte cluster with RF=3, you can kill a node and:

2/3 of reads and writes continue to operate as if nothing happens. Thanks to the sharding of tables into tablets.
1/3 of reads and writes, those which had their leader on the dead node, have to wait a few seconds to get one of the followers, on the surviving nodes, to be elected new leader (Raft protocol)

And all continues because we have the quorum. And all is consistent. And no committed transaction has been lost. The only consequence is that, until the first node is back, or a new node added, loosing a second node will stop the database. Still with no data loss. But RF=3 can tolerate only one node down, by definition.

This protection involves remote procedure calls. Let's see how PostgreSQL would behave with some higher availability

PostgreSQL with standby

I'll add two standby databases to my PostgreSQL cluster:

/usr/pgsql-13/bin/pg_basebackup -p 5432 -D /var/tmp/pgsby1 -R --slot=sby1 -C
echo "port=5441" >> /var/tmp/pgsby1/postgresql.conf
touch /var/tmp/pgsby1/recovery.signal
/usr/pgsql-13/bin/pg_ctl -D /var/tmp/pgsby1 -l logfile start

/usr/pgsql-13/bin/pg_basebackup -p 5432 -D /var/tmp/pgsby2 -R --slot=sby2 -C
echo "port=5442" >> /var/tmp/pgsby2/postgresql.conf
touch /var/tmp/pgsby2/recovery.signal
/usr/pgsql-13/bin/pg_ctl -D /var/tmp/pgsby2 -l logfile start

Here they are visible from the primary:

[postgres@yb0 ~]$ psql -p 5432 -c "select * from pg_replication_slots"
 slot_name | plugin | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn | wal_status | safe_wal_size
-----------+--------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------+------------+---------------
 sby1      |        | physical  |        |          | f         | t      |     298963 |      |              | 0/5000060   |                     | reserved   |
 sby2      |        | physical  |        |          | f         | t      |     299054 |      |              | 0/5000060   |                     | reserved   |
(2 rows)

[postgres@yb0 ~]$ psql -p 5432 -c "select * from pg_stat_replication"
  pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_la
g | sync_priority | sync_state |          reply_time
--------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+----------
--+---------------+------------+-------------------------------
 298963 |       10 | postgres | walreceiver      |             |                 |          -1 | 2021-09-28 13:03:45.30363+00  |              | streaming | 0/5000060 | 0/5000060 | 0/5000060 | 0/5000060  |           |           |
  |             0 | async      | 2021-09-28 13:04:23.662693+00
 299054 |       10 | postgres | walreceiver      |             |                 |          -1 | 2021-09-28 13:03:48.668362+00 |              | streaming | 0/5000060 | 0/5000060 | 0/5000060 | 0/5000060  |           |           |
  |             0 | async      | 2021-09-28 13:04:23.663322+00
(2 rows)

Those are two asynchronous standbys receiving the streamed WAL.

I run the same mini-workload:

postgres=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;
DO
Time: 6437.456 ms (00:06.437)

postgres=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;

       rows/s       | count |    duration
--------------------+-------+-----------------
 1554.5772428235664 | 10000 | 00:00:06.432617

This is still quite fast. But is this High Availability? Not at all. Yes in case of total failure of the primary database, I don't need to restore a backup and can failover to one of the standby databases. But:

I'll lose some committed transactions because I'm in ASYNC replication. Recovery Point Objective is RPO>0
Because of the preceding, this cannot be automated. You need a human decision to evaluate the risk of data loss, and the probability to get the failed primary site back at least to get the WAL with the latest transactions, before opening the standby. Human decision means, in practice, a Recovery Time Objective in hours: RTO>0

This cannot be compared with YugabyteDB replication where all is automated within seconds, without data loss.

PostgreSQL with synchronous standby

We can reduce the RPO with synchronous replication:

echo "synchronous_standby_names = '*'" >> /var/tmp/pgdata/postgresql.conf
/usr/pgsql-13/bin/pg_ctl -D /var/tmp/pgdata reload

This configuration ensures that one standby received the WAL to cover the transaction before we return a successful commit to the user:

[postgres@yb0 ~]$ psql -p 5432 -c "select * from pg_stat_replication"

  pid   | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_la
g | sync_priority | sync_state |          reply_time
--------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+----------
--+---------------+------------+-------------------------------

 298963 |       10 | postgres | walreceiver      |             |                 |          -1 | 2021-09-28 13:03:45.30363+00  |              | streaming | 0/538E3F0 | 0/538E3F0 | 0/538E3F0 | 0/538E3F0  |           |           |
  |             1 | sync       | 2021-09-28 13:14:12.307231+00
 299054 |       10 | postgres | walreceiver      |             |                 |          -1 | 2021-09-28 13:03:48.668362+00 |              | streaming | 0/538E3F0 | 0/538E3F0 | 0/538E3F0 | 0/538E3F0  |           |           |
  |             1 | potential  | 2021-09-28 13:14:12.307294+00
(

Now, if I run my small workload again:

postgres=# do $$ begin truncate demo; for i in 1..1e4 loop insert into demo(i) values(i); commit; end loop; end; $$;
DO
Time: 13613.487 ms (00:13.613)

postgres=# select count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration" from demo;

      rows/s      | count |    duration
------------------+-------+-----------------
 734.861683966413 | 10000 | 00:00:13.608003

The throughput has been divided by two.

Are we in High Availability here? This SYNC configuration requires complex monitoring and management. Because, even in sync, the persistence of WAL does not happen at the same time: first written and (fsync'd) then send (and acknowledged) to the standby, then returning "commit successful" to the user. There is no two-phase commit here. This is very different, in case of failure, from a consensus protocol as we can find in a distributed database. PostgreSQL databases are often used with ASYNC, and this is a very good DR (Disaster Recovery) solution where the data loss is minimal after a manual failover. SYNC replication is possible, but doesn't qualify as the same High Availability than distributed databases.

The numbers are not important here. They will depend on your machine and your network. Distributed databases can be in sync in a multi-AZ cluster, even multi-region. The point is that the thoughput is lower for a single session. But, because all nodes are active, this scales-out when having multiple sessions load-balanced over all nodes. You cannot do that with PostgreSQL standby that are read only.

YugabyteDB scale-out

I'm adding a "j" column for job number in my table:

drop table if exists demo;
create table demo(
 j int, i int,
 t timestamp default clock_timestamp(),
 primary key(j,i)
);

And run 3 parallel jobs doing the same inserts:

for i in {0..2} ; do
psql -h yb$i -p 5433 -c 'do $$ begin for i in 1..1e4 loop insert into demo(j,i) values('$i',i); commit; end loop; end; $$ ; ' &
done ; wait

Here is the result:

postgres=# select j, count(*)/extract(epoch from max(t)-min(t)) "rows/s",count(*),max(t)-min(t) "duration"
           from demo group by j;

 j |      rows/s      | count |    duration
---+------------------+-------+-----------------
 0 | 181.329577731828 | 10000 | 00:00:55.148201
 2 | 180.088033513951 | 10000 | 00:00:55.528398
 1 | 188.044672569376 | 10000 | 00:00:53.178853
(3 rows)

With my 3 concurrent sessions I have inserted at 550 transaction per second. Again, this is a small lab. While single session short transaction have limited rate because of the distributed nature of commits, it can scale to many nodes. If you stay on one VM without synchronous replication to another site, PostgreSQL will be faster. Where distributed databases show all their power is when you add nodes, for high availability and load balance, without adding complexity because all the distributed protocol is already there.

There are additional things that you can't see in this short test. PostgreSQL cannot sustain those inserts indefinitely. The shared buffers are filling, checkpoint will occur, the filesystem cache will be synced to disk. And the most important: at some point you will need to VACUUM the table before the transaction id wraps around, or the database will hang. The first minutes of insert are very optimistic in PostgreSQL, which is fine with short peaks of activity.

Note that I've written something similar in the past about RDS PostgreSQL vs. Aurora. Even if both cannot scale out the writes, the HA in Aurora relies on remote WAL sync for better HA (RPO=0 / RTO in few minutes).

Too many parameters to consider? Don't panic. Because YugabyteDB has the same API as PostgreSQL - it uses the same SQL and PL/pgSQL layer and similar open source license - you are not locked in your initial decision. You can start with PostgreSQL and scale with YugabyteDB, or vice-versa.