Franck Pachot for YugabyteDB

YugabyteDB auto-sharding (showing tablet size with Grafana)

In Distributed SQL Sharding: How Many Tablets, and at What Size? I mentioned how auto-split keeps the number of tablets high enough to distribute the load, but each tablet small enough to be easily rebalanced.

Here is a demo on a 12-node RF=3 YugabyteDB cluster where I play with the table information exposed by the web console, store it in the database, and visualize it with Grafana.

12-pod cluster

This demo is on the cluster I created in this previous post, on Oracle Cloud Kubernetes (OKE).

By default, the auto-split activity is limited, to avoid overloading the system, with:

--outstanding_tablet_split_limit=1
--outstanding_tablet_split_limit_per_tserver=1

Aggressive auto-split

I've increased this aggressively, changing the overrides.yaml from the previous post to:

...
gflags:
  master:
...
    # aggressive auto-split
    outstanding_tablet_split_limit: "48"
    outstanding_tablet_split_limit_per_tserver: "4"
...

and did a rolling restart of the masters with:

helm upgrade yb-demo yugabytedb/yugabyte --namespace yb-demo -f overrides.yaml

Default thresholds

All other parameters are the defaults for version 2.15.3.1, and I'll mention the values directly to keep this blog simple to read. Here are those defaults:

--enable_automatic_tablet_splitting=true
--tablet_split_low_phase_shard_count_per_node=8
--tablet_split_low_phase_size_threshold_bytes=536870912
--tablet_split_high_phase_shard_count_per_node=24
--tablet_split_high_phase_size_threshold_bytes=10737418240
--tablet_split_limit_per_table=256
--tablet_force_split_threshold_bytes=107374182400
--tablet_split_size_threshold_bytes=0
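Those byte values are round power-of-two sizes; here is a quick sketch (my own helper, not part of YugabyteDB) to double-check the conversion:

```python
# Convert the raw gflag byte values into human-readable units,
# to verify the thresholds quoted in the text below.
def pretty(b):
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if b < 1024:
            return f"{b:g} {unit}"
        b /= 1024
    return f"{b:g} PiB"

print(pretty(536870912))     # low-phase threshold  → 512 MiB
print(pretty(10737418240))   # high-phase threshold → 10 GiB
print(pretty(107374182400))  # force-split threshold → 100 GiB
```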

Split phases

We have 3 phases:

  • Low phase, to distribute quickly even when the tables are not too large, for performance
  • High phase, to keep tablets at a small size, easy to re-balance, but without too many tablets
  • Final phase, where the goal is to avoid tablets that are too large

In each phase, the number of tablets first stays constant while they grow with data ingestion. Then they start to split on a size threshold, until a maximum number of tablets per table per server is reached. In Distributed SQL Sharding: How Many Tablets, and at What Size? I named the phases starting from this threshold, because this is where splitting starts. But to be consistent with the documentation, each phase is defined by the number of tablets (shards) rather than the size threshold. This is easier because the number of tablets is per table, while the size threshold can be reached at different times when tablets are not perfectly balanced. In the following description, I distinguish the two sub-phases of each phase: the tablets growing, and then splitting.

If the description is too long, there are pictures at the end of the post to illustrate this.

Low phase:

  • With an n-node cluster, new tables or indexes, in case of hash sharding, start with n tablets (one per tablet server) that can grow up to 512MB (the low-phase threshold). With RF=3 this means 3 tablet peers per node (leaders are the most active, serving reads and writes; followers apply writes only, or serve reads with follower reads).

With a balanced table, this applies to a table size from 0MB to 512MB * n. In my 12-pod example: 12 tablets when the table is between 0GB and 6GB.

  • The tablets that grow beyond 512MB are split, increasing the number of tablets, until there are 8 tablets per server. With RF=3 this means 3 * 8 tablet peers per node.

With a balanced table, this applies to a table size from 512MB * n to 512MB * n * 8. In my 12-pod example: from 12 to 96 tablets when the table is between 6GB and 48GB.

High phase:

  • The tablets can grow up to 10GB. With RF=3 this is still 3 * 8 tablet peers per node.

With a balanced table, this applies to a table size from 512MB * n * 8 to 10GB * n * 8. In my 12-pod example: 96 tablets from 48GB to 960GB.

  • The tablets that grow beyond 10GB are split, increasing the number of tablets, until there are 24 tablets per server. With RF=3 this means 3 * 24 tablet peers per node.

With a balanced table, this applies to a table size from 10GB * n * 8 to 10GB * n * 24. In my 12-pod example: from 96 to 256 tablets (the per-table maximum) when the table is between 960GB and 2.5TB.

Final phase:

  • The tablets can grow up to 100GB. With RF=3 this is still 3 * 24 tablet peers per node. But remember the limit of 256 tablets in total, so this is at maximum 768 tablet peers.

With a balanced table, this applies to a table size from 10GB * min(n * 24, 256) to 100GB * min(n * 24, 256). In my 12-pod example: 256 tablets from 2.5TB to 25TB.

  • The tablets that grow beyond 100GB are split, increasing the number of tablets, if still possible within the limit of 256. They can then grow with no size limit, but you should never reach this point: 100GB is already quite large for tablets to stay efficient when re-balancing. If your table reaches this size (25TB), you should have added more nodes to the cluster well before reaching this tablet size, putting you back in the previous phase. Very large tables can also be partitioned with PostgreSQL declarative partitioning, and each partition counts as a separate table.
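The phase arithmetic above can be condensed into a small model. This is my own sketch, assuming a perfectly balanced hash-sharded table and the default thresholds quoted earlier:

```python
import math

# Defaults for version 2.15.3.1, as quoted above
LOW_THRESHOLD = 512 * 2**20    # tablet_split_low_phase_size_threshold_bytes
HIGH_THRESHOLD = 10 * 2**30    # tablet_split_high_phase_size_threshold_bytes
FORCE_THRESHOLD = 100 * 2**30  # tablet_force_split_threshold_bytes
LOW_PER_NODE = 8               # tablet_split_low_phase_shard_count_per_node
HIGH_PER_NODE = 24             # tablet_split_high_phase_shard_count_per_node
TABLE_LIMIT = 256              # tablet_split_limit_per_table

def expected_tablets(nodes: int, size: int) -> int:
    """Approximate tablet count for a perfectly balanced table of `size` bytes."""
    t = nodes  # start with one tablet per tablet server
    # low phase: split at 512MB each, up to 8 tablets per node
    t = max(t, min(LOW_PER_NODE * nodes, TABLE_LIMIT,
                   math.ceil(size / LOW_THRESHOLD)))
    # high phase: split at 10GB each, up to 24 tablets per node
    t = max(t, min(HIGH_PER_NODE * nodes, TABLE_LIMIT,
                   math.ceil(size / HIGH_THRESHOLD)))
    # final phase: force-split at 100GB, within the 256-tablet limit
    t = max(t, min(TABLE_LIMIT, math.ceil(size / FORCE_THRESHOLD)))
    return t

GB = 2**30
for size in (6, 48, 960, 2560, 25600):
    print(size, "GB →", expected_tablets(12, size * GB))
```

For the 12-node cluster, this reproduces the numbers above: 12 tablets up to 6GB, 96 at 48GB and through the high-phase growth, then 256 from 2.5TB on.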

Tablet metrics

To verify the math above on my 12-pod cluster, I've run the following to gather tablet information from all tablet servers, reading from the /tablets web console page:

--drop table if exists ybwr_tablets;
create table if not exists 
ybwr_tablets(
 host text default ''
, ts timestamptz default now()
, database_name text
, table_name text
, tablet_id uuid
, key_range text
, state text
, num_sst_files bigint
, wal_files numeric
, sst_files numeric
, sst_uncompressed numeric
, primary key (ts asc, host, tablet_id));

DO $DO$
declare i record; 
begin
for i in (select host from yb_servers()) loop 
 execute format(
  $COPY$
  copy ybwr_tablets(host, database_name, table_name, tablet_id, key_range, state, num_sst_files, wal_files, sst_files, sst_uncompressed) from program
   $BASH$
   exec 5<>/dev/tcp/%s/9000 ; awk '
function bytes(h){
 if(sub(/T/,"",h)>0) h=h*1024*1024*1024*1024
 if(sub(/G/,"",h)>0) h=h*1024*1024*1024
 if(sub(/M/,"",h)>0) h=h*1024*1024
 if(sub(/K/,"",h)>0) h=h*1024
 if(sub(/B/,"",h)>0) h=h
 return h
}
$0 ~ tserver_tablets {
print server,gensub(tserver_tablets,"\\1",1), gensub(tserver_tablets,"\\2",1), gensub(tserver_tablets,"\\3",1), gensub(tserver_tablets,"\\4",1), gensub(tserver_tablets,"\\5",1), gensub(tserver_tablets,"\\6",1), bytes(gensub(tserver_tablets,"\\7",1)), bytes(gensub(tserver_tablets,"\\8",1)), bytes(gensub(tserver_tablets,"\\9",1))
}
' OFS='<' OFMT="%%f" server="%s" \
tserver_tablets='^<tr><td>([^<]*)<[/]td><td>([^<]*)<[/]td><td>0000[0-9a-f]{4}00003000800000000000[0-9a-f]{4}<[/]td><td><a href="[/]tablet[?]id=([0-9a-f]{32})">[0-9a-f]{32}</a></td><td>([^<]*)<[/]td><td>([^<]*)<[/]td><td>false<[/]td><td>([0-9])<[/]td><td><ul><li>Total: [^<]*<li>Consensus Metadata: [^<]*<li>WAL Files: ([^<]*)<li>SST Files: ([^<]*)<li>SST Files Uncompressed: ([^<]*)<[/]ul><[/]td><td><ul>' <&5 & printf "GET /tablets HTTP/1.0\r\n\r\n" >&5
   $BASH$ (format csv, delimiter $DELIMITER$<$DELIMITER$)
  $COPY$
 ,i.host,i.host); 
end loop; 
--return clock_timestamp(); 
end; 
$DO$
\;
delete from ybwr_tablets where table_name!='demo'
\;
select ts
,database_name, table_name
, count(distinct host) as tservers 
, count(distinct tablet_id) as tablets, pg_size_pretty(sum(sst_files)/count(distinct tablet_id)) avg_tablet_size
, sum(num_sst_files) as sst_files
, pg_size_pretty(sum(sst_files)) as file_size
, pg_size_pretty(sum(sst_uncompressed)) data_size
from ybwr_tablets
where state='RUNNING' and table_name='demo'
group by ts, database_name, table_name
order by ts desc, database_name, table_name, sum(sst_files)
limit 5
;
\watch 60

You must run this as a superuser to COPY FROM PROGRAM, and have bash and awk installed on the tablet servers.
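If COPY FROM PROGRAM is not an option, the same page can be scraped from any client. Here is a minimal sketch of the size conversion done by the awk bytes() function above, plus a hypothetical fetch helper (the host and port 9000 match the script above; the HTML parsing itself is left out):

```python
import re
import urllib.request

def to_bytes(h: str) -> float:
    # Mirror the awk bytes() function: "512.00M" → 536870912.0
    units = {"T": 2**40, "G": 2**30, "M": 2**20, "K": 2**10, "B": 1}
    m = re.fullmatch(r"([0-9.]+)([TGMKB]?)", h.strip())
    num, unit = m.groups()
    return float(num) * units.get(unit, 1)

def fetch_tablets_html(host: str, port: int = 9000) -> str:
    # Read the raw HTML of the tserver /tablets web console page
    with urllib.request.urlopen(f"http://{host}:{port}/tablets") as r:
        return r.read().decode()
```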

Inserts

To keep the table growing, I've run the following, which continuously inserts large rows into a demo table.

drop table if exists demo;
create extension if not exists orafce;
create table if not exists demo (id bigint, value text);
set yb_disable_transactional_writes=on;
set yb_enable_upsert_mode=on;
insert into demo
 select generate_series(1,100),dbms_random.string('p',1024*1024)
\; 
select 
 pg_size_pretty(pg_table_size('demo'::regclass)) "size",
 num_tablets,
 tservers.count as tservers
 from 
   (select count(*) from yb_servers()) tservers,
   yb_table_properties('demo'::regclass)
 order by pg_table_size('demo'::regclass)
;
\watch 15

I didn't declare any primary key here (a bad practice, but this is a lab with no indexes, so it doesn't matter), which means hash sharding on an internal key.

It shows the size from pg_table_size():

This is the same as what I do above, adding up the SST files from each tablet server, and is sufficient if you don't need the detail per tablet. Note that pg_table_size() also adds the WAL size.

Grafana

I've displayed the gathered metrics in Grafana. This is why I ran my script to get all the details: to build colorful graphs from them.

The following shows the table size (the sum of SST files divided by the replication factor), per tablet (labeled by their split boundaries):

select $__time(ts),sum(sst_files)/(
select avg(c) as rf from (select count(distinct host) as c from ybwr_tablets group by ts,tablet_id) v
) as tablet_size,format('%s',key_range) table_name
from ybwr_tablets
where state='RUNNING' and table_name='demo' and $__timeFilter(ts)
group by ts, database_name, table_name, key_range 
order by ts, database_name, table_name, key_range , sum(sst_files)
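The division by the replication factor in this query can be checked with a small sketch (my own example data, not from the cluster):

```python
from collections import defaultdict

def logical_size(peers):
    """peers: (tablet_id, host, sst_bytes) rows, one per tablet peer.
    Divide the replicated total by the observed replication factor,
    as the Grafana query does."""
    hosts = defaultdict(set)
    total = 0
    for tablet_id, host, sst_bytes in peers:
        hosts[tablet_id].add(host)
        total += sst_bytes
    rf = sum(len(h) for h in hosts.values()) / len(hosts)
    return total / rf

# 2 tablets, RF=3: 6 peers of 100MB each → logical size 200MB
peers = [(t, h, 100) for t in ("t1", "t2") for h in ("a", "b", "c")]
print(logical_size(peers))  # → 200.0
```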

The tablet size is about 512MB. We are in the low phase while the total size is below 48GB (the high-phase threshold).

Here is the query I used to count the tablets:

select $__time(ts),count(distinct tablet_id) as "num tablets"
from ybwr_tablets
where state='RUNNING' and table_name='demo' and $__timeFilter(ts)
group by ts, database_name, table_name --, host
order by 1,2

The phases are clearly visible here: we are in the low phase with active splitting from 12 to 96 tablets, and then the high phase starts, without reaching its splitting threshold yet.

The tablet peers per server, showing that all is well balanced:

select $__time(ts),count(distinct tablet_id) as tablets, regexp_replace(host,'tservers..*','tservers.') as tserver
from ybwr_tablets
where state='RUNNING' and table_name='demo' and $__timeFilter(ts)
group by ts, database_name, table_name, host 
order by 1,2

This is the number of tablets multiplied by the replication factor, balanced across all tablet servers.

Finally, to check that the storage is well balanced, the sum of SST files on each server:

select $__time(ts),sum(sst_files) as SST_size, regexp_replace(host,'tservers..*','tservers.') as tserver
from ybwr_tablets
where state='RUNNING' and table_name='demo' and $__timeFilter(ts)
group by ts, database_name, table_name, host 
order by 1,2

There are, of course, little hiccups because of SST file compactions happening in the background.

Performance

The run above waited 15 minutes between inserts, and the screenshots were from a first run without increasing the auto-split aggressiveness.

I've run the same with continuous inserts from one session (\watch 0.01) and the settings mentioned above. Here is the beginning of the low phase, where in 5 minutes the number of tablets went from 12 to 96.

The tablet peers were balanced over all servers.

The tablet size started to increase at a higher rate as soon as we had more tablets, which is the goal of this low phase.

The colors here are defined by the size of the tablet each SST file belongs to.

For the next screenshot, I filtered on key_range <= 'hash_split: [0x1024' to show only 1024/65536 ≈ 1.6% of the split range, for better visualization when one tablet becomes two.

Here is the same without filtering on range: too detailed for proper analysis, but great from an aesthetic point of view.

Finally, the whole picture of the high phase, ending before 24 tablets per server because the table already reached 256 tablets.

To learn more about Automatic Tablet Splitting, I recommend the Yugabyte Friday Tech Talk with Timur Yusupov.
