Franck Pachot for YugabyteDB


🚀Batched Nested Loop to reduce read requests to the distributed storage

The Nested Loop join method is great when the join condition gives the query a high selectivity, fetching a few rows from a large inner table. However, with many rows from the outer table, performance degrades: Nested Loop is the only join method where the inner table is accessed multiple times, once per outer row. This is even worse in a Distributed SQL database, where each access can be a remote call involving network latency.

YugabyteDB addresses the problem with a new join method: Batched Nested Loop. The idea is the same technique that I've described in previous posts with two statements, or by generating them with jOOQ: read multiple rows from the outer table and access the inner table with an array of values for the join condition. The YugabyteDB storage layer is very efficient at reading many points or ranges in one call.
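As a reminder, here is a minimal sketch of that manual two-statement technique, with hypothetical tables and values:

-- first statement: read the join keys from the outer table
select order_id from orders where customer_id = 42;
-- second statement: suppose it returned 101, 205, 309;
-- probe the inner table with the whole array in one call
select * from order_items where order_id = any(array[101, 205, 309]);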

With arrays, you want to control their size, and that's why you enable this new join method by setting a batch size.

yb_bnl_batch_size

The feature was added in YugabyteDB 2.15 and is still in beta in 2.17, where Batched Nested Loop is enabled by setting yb_bnl_batch_size to a value higher than the default of 1:

yugabyte=# create table demo (id int primary key);
CREATE TABLE

yugabyte=# show yb_bnl_batch_size;
 yb_bnl_batch_size
-------------------
 1
(1 row)

yugabyte=# explain (costs off) /*+ nestloop(a b) */ 
           select * from demo a join demo b using(id);

                 QUERY PLAN
--------------------------------------------
 Nested Loop
   ->  Seq Scan on demo a
   ->  Index Scan using demo_pkey on demo b
         Index Cond: (id = a.id)
(4 rows)

yugabyte=# set yb_bnl_batch_size=5;
SET

yugabyte=# explain (costs off) /*+ nestloop(a b) */ 
           select * from demo a join demo b using(id);

                          QUERY PLAN
--------------------------------------------------------------
 YB Batched Nested Loop Join
   Join Filter: (a.id = b.id)
   ->  Seq Scan on demo a
   ->  Index Scan using demo_pkey on demo b
         Index Cond: (id = ANY (ARRAY[a.id, $1, $2, $3, $4]))
(5 rows)

The plan node changed from Nested Loop to YB Batched Nested Loop Join, and the Index Cond shows an array of 5 values when yb_bnl_batch_size is set to 5. They are displayed as parameters, except the first one, which shows the outer column.

I've set a value of 5 here but you probably want a larger batch size. Which value is the best? As always... it depends.

As with many parameters controlling the fetch size over network calls, values between 100 and 1000 rows are probably good: a reasonable compromise between network latency and buffer size. Higher values don't give a significant gain, and some side effects may outweigh the benefit. Small variations don't really matter once the value is large enough to reduce the network latency to a small part of the response time.
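One practical way to compare values on your own queries is to keep the statement unchanged and vary the batch size with the pg_hint_plan Set() hint, as the reproduction script below does:

explain (analyze, dist, costs off)
/*+ nestloop(a b) Set(yb_bnl_batch_size 100) */
select * from demo a join demo b using(id);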

There is another reason why higher is not useful: the goal is to reduce the remote calls, but those are themselves limited by the number of operations in one call: --ysql_request_limit, which defaults to 1024 in YugabyteDB.

yb_bnl_batch_size and ysql_request_limit

To verify this, I've run a simple query joining two tables of 2000 rows, with various values for ysql_request_limit (set at cluster level) and yb_bnl_batch_size (set at session level, or at query level with the Set() hint).

If you want to reproduce it, here is how I did with docker on my laptop:

for p in 1024 ; do                                 # ysql_prefetch_limit
for r in 100 1024 2000 ; do                        # ysql_request_limit
for b in 1 10 100 500 1023 1024 1025 2000; do      # yb_bnl_batch_size
docker run --rm yugabytedb/yugabyte:latest bash -c '
yugabyted start --listen 0.0.0.0 --tserver_flags="'ysql_prefetch_limit=$p,ysql_request_limit=$r'"
cat /root/var/logs/{master,tserver}.err
until postgres/bin/pg_isready ; do sleep 0.1 ; done | uniq &&
curl -s $(hostname):9000/varz?raw | grep -E "ysql_.*_limit"
ysqlsh -e <<SQL
create table demo ( id bigint primary key, x int) split into 10 tablets;
insert into demo select generate_series(1,2000), 0 ;
set yb_enable_expression_pushdown to on;
explain (analyze, dist, buffers, costs off)
/*+ nestloop(a b) leading((a b)) Set(yb_bnl_batch_size '$b') */
with a as (select * from demo)
select * from a join demo b using(id) where b.x=0;
SQL
'
done
done 
done | tee log.txt
grep -E -- "ysql_prefetch_limit|ysql_request_limit|yb_bnl_batch_size|Index Scan .* loops|Index Read Requests" log.txt

awk '
# "re" captures the last number on a line; remember the latest value per metric
/^INSERT / {t=gensub(re,"\\1","1")}
/ysql_prefetch_limit/ {p=gensub(re,"\\1","1")}
/ysql_request_limit/  {r=gensub(re,"\\1","1")}
/yb_bnl_batch_size/   {b=gensub(re,"\\1","1")}
/Index Scan .* loops/ {l=gensub(re,"\\1","1")}
/Index Read Requests/ {i=gensub(re,"\\1","1")}
/Table Read Requests/ {t=gensub(re,"\\1","1")}
# each test ends with an Execution Time line: print one summary line,
# starting with the total inner-table read requests (requests per loop * loops)
/^ *Execution Time/{
printf "%5d read requests=%5d loops=%5d yb_bnl_batch_size=%5d %6.2fms ysql_request_limit=%5d ysql_prefetch_limit=%5d\n",l*i,i,l,b,$(NF-1),r,p
}
' re="^.*[= ]([0-9.]+)[^0-9]*$" log.txt | sort -n

The result displays one line per test, gathering values from the settings and from the execution plan (explain (analyze, dist)), parsed with awk.

The values of read requests and loops come from the execution plan and must be multiplied together to get the total read requests for the inner-table access over all loops. This is what I've put in the first column, on which I sorted the result:

    1 read requests=    1 loops=    1 yb_bnl_batch_size= 2000 495.11ms ysql_request_limit= 2000 ysql_prefetch_limit= 1024
    2 read requests=    1 loops=    2 yb_bnl_batch_size= 1024 207.83ms ysql_request_limit= 1024 ysql_prefetch_limit= 1024
    2 read requests=    1 loops=    2 yb_bnl_batch_size= 1024 274.37ms ysql_request_limit= 2000 ysql_prefetch_limit= 1024
    2 read requests=    2 loops=    1 yb_bnl_batch_size= 2000 259.81ms ysql_request_limit= 1024 ysql_prefetch_limit= 1024
   20 read requests=    1 loops=   20 yb_bnl_batch_size=  100  58.44ms ysql_request_limit= 2000 ysql_prefetch_limit= 1024
   20 read requests=    1 loops=   20 yb_bnl_batch_size=  100  60.40ms ysql_request_limit= 1024 ysql_prefetch_limit= 1024
   20 read requests=    1 loops=   20 yb_bnl_batch_size=  100  64.33ms ysql_request_limit=  100 ysql_prefetch_limit= 1024
   20 read requests=   10 loops=    2 yb_bnl_batch_size= 1024  61.67ms ysql_request_limit=  100 ysql_prefetch_limit= 1024
   20 read requests=   20 loops=    1 yb_bnl_batch_size= 2000  79.35ms ysql_request_limit=  100 ysql_prefetch_limit= 1024
  200 read requests=    1 loops=  200 yb_bnl_batch_size=   10  83.23ms ysql_request_limit=  100 ysql_prefetch_limit= 1024
  200 read requests=    1 loops=  200 yb_bnl_batch_size=   10  94.62ms ysql_request_limit= 1024 ysql_prefetch_limit= 1024
  200 read requests=    1 loops=  200 yb_bnl_batch_size=   10 120.66ms ysql_request_limit= 2000 ysql_prefetch_limit= 1024
 2000 read requests=    1 loops= 2000 yb_bnl_batch_size=    1 572.09ms ysql_request_limit=  100 ysql_prefetch_limit= 1024
 2000 read requests=    1 loops= 2000 yb_bnl_batch_size=    1 645.31ms ysql_request_limit= 2000 ysql_prefetch_limit= 1024
 2000 read requests=    1 loops= 2000 yb_bnl_batch_size=    1 812.39ms ysql_request_limit= 1024 ysql_prefetch_limit= 1024


The number of loops depends only on the number of outer rows and on yb_bnl_batch_size: 2000 loops with no batching (batch size of 1), 200 with a batch size of 10, 20 with a batch size of 100, 2 with a batch size of 1024, and 1 loop only when all 2000 rows fit in one batch. This is simple maths: the number of outer rows divided by the batch size, rounded up.

The thing I wanted to verify was how the read requests follow the number of loops. They are equal when yb_bnl_batch_size is smaller than ysql_request_limit. But when it is above, one loop is split into multiple read calls because of the request limit. This is why the batch size of 2000 shows 2 read requests for 1 loop with the default ysql_request_limit of 1024.

The effectiveness of a large yb_bnl_batch_size at reducing read requests is limited by the number of read operations that can be sent in one request (ysql_request_limit) and by the number of rows that can be fetched in one request (ysql_prefetch_limit). Both default to 1024.
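To make the arithmetic explicit: with 2000 outer rows, the number of loops is ceil(2000 / batch size), and the total read requests for the inner table are roughly ceil(2000 / least(batch size, request limit)). A quick sketch in SQL, hardcoding the default ysql_request_limit of 1024:

select b as batch_size,
       ceil(2000.0 / b)::int              as loops,
       ceil(2000.0 / least(b, 1024))::int as total_read_requests
from unnest(array[1, 10, 100, 1024, 2000]) as b;

This matches the measured lines above for the default request limit.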

More about ysql_request_limit

YugabyteDB re-uses PostgreSQL for the SQL layer (called YSQL). This one is stateless and calls the distributed transaction and storage layer, DocDB, to read and write data. DocDB is a transactional Key-Value database where SQL rows and index entres are stored into documents (and sub-documents for individual column changes). The allows Horizontal Scalability, where the DocDB nodes can be in different data centers, even in distant regions. To keep High Performance, the read/write requests are grouped as much as possible. This can be done at two levels: requests in PgGate and RPC in YBClient.

What we see as IN (ARRAY[...]) in one Index Cond is actually translated to multiple read requests here. You can see those as Applying operation with yb_debug_log_docdb_requests. This is where ysql_request_limit applies. Note that, in this example, I've created the demo table with 10 tablets, with HASH sharding. There is still one read operation per value (until #7836 is implemented). With range sharding, they are batched, visible as Buffering operation with yb_debug_log_docdb_requests.
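If you want to see those operations yourself, the flag is a regular session parameter (a quick sketch; the operations are written to the PostgreSQL log of the node you are connected to):

set yb_debug_log_docdb_requests = on;
-- run the batched join; "Applying operation" (hash sharding) or
-- "Buffering operation" (range sharding) lines appear in the postgres log
explain (costs off) /*+ nestloop(a b) */
select * from demo a join demo b using(id);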

Cost of batching vs. latency

I have displayed the elapsed time. Batched Nested Loop is always faster than one loop per outer row. However, the query is faster with a batch size of 100 than of 1000, even if it involves 10 times more read requests. This is because I'm running this on a single node, where the latency of read requests is low. In this special case, the small overhead of batching becomes significant.

I have run similar tests on a single-region cluster with 0.8 millisecond latency between nodes.


The results show the same pattern, with the best response time at a batch size of 100:

    2 read requests=    1 loops=    2 yb_bnl_batch_size= 1000     208.24ms
    3 read requests=    1 loops=    3 yb_bnl_batch_size=  800     162.11ms
    4 read requests=    1 loops=    4 yb_bnl_batch_size=  600     134.91ms
    4 read requests=    2 loops=    2 yb_bnl_batch_size= 1200     216.53ms
    5 read requests=    1 loops=    5 yb_bnl_batch_size=  400      93.12ms
   10 read requests=    1 loops=   10 yb_bnl_batch_size=  200      71.44ms
   20 read requests=    1 loops=   20 yb_bnl_batch_size=  100      61.44ms
  200 read requests=    1 loops=  200 yb_bnl_batch_size=   10     164.11ms
 2000 read requests=    1 loops= 2000 yb_bnl_batch_size=    1    1175.69ms

Then I did the same in a multi-region cluster, with two nodes 90 milliseconds apart.


Of course, such a deployment is not ideal for joins, and you should consider geo-partitioning. But I did this to show how the number of requests matters more than the small overhead of batching:

    2 read requests=    1 loops=    2 yb_bnl_batch_size= 1000    1474.44ms
    3 read requests=    1 loops=    3 yb_bnl_batch_size=  800    1828.48ms
    4 read requests=    1 loops=    4 yb_bnl_batch_size=  600    1861.53ms
    4 read requests=    2 loops=    2 yb_bnl_batch_size= 1200    1551.61ms
    5 read requests=    1 loops=    5 yb_bnl_batch_size=  400    2075.12ms
   10 read requests=    1 loops=   10 yb_bnl_batch_size=  200    2485.99ms
   20 read requests=    1 loops=   20 yb_bnl_batch_size=  100    3879.42ms
  200 read requests=    1 loops=  200 yb_bnl_batch_size=   10   13376.93ms
 2000 read requests=    1 loops= 2000 yb_bnl_batch_size=    1  126277.28ms

Having a large batch size, like 1024 to match the request limit, is the best you can do in this situation. The response time is still long, 1.5 seconds to join 2000 rows, and another join method (Hash or Merge) may be better. But the great thing is that, thanks to batching, the response time is still acceptable: the PostgreSQL non-batched Nested Loop takes 2 minutes here, and the YugabyteDB batched one is down to 2 seconds.
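For comparison, the same pg_hint_plan syntax can force the alternative methods, for example a hash join (just swapping the hint name, with nothing else tuned):

explain (costs off) /*+ hashjoin(a b) */
select * from demo a join demo b using(id);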

That illustrates how YugabyteDB re-uses PostgreSQL: get all features from it rather than developing a new database from scratch, but optimize what must be done differently to provide high performance when (geo-)distributed.

In summary, and counting in powers of two to sound geekier, it is probably good to set yb_bnl_batch_size between 128 and 1024 in most cases. When you join between regions with high latency, 1024 is probably the best. In a single region with low latency between AZs, 128 may be better. Or set it to 512 as a good generic default.
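Because yb_bnl_batch_size is exposed as a regular PostgreSQL parameter, you should be able to persist such a default at database or role level rather than setting it in every session (a sketch, with hypothetical names):

alter database app set yb_bnl_batch_size = 512;
-- or for one role only:
alter role app_user set yb_bnl_batch_size = 512;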

This feature is still in preview and will be GA soon, when the query planner estimates its cost (to avoid hinting as I did above). Your feedback will help improve it and maybe set a better default in the future. Note: the result depends on the sharding method, the number of tablets, the number of rows, and the request and prefetch limits. The numbers here are for one specific case, but you can reproduce it with variations.
