Could Batched Nested Loop improve PostgreSQL like it does for YugabyteDB?

Franck Pachot for YugabyteDB

I like the Batched Nested Loop feature in YugabyteDB's execution plans. It has solved many problems by making complex queries with many joins scalable, without compromising the flexibility to distribute table rows and index entries transparently. Someone asked me if this feature would be useful for PostgreSQL. To find out, we need to test it.

Essentially, this feature can replace a thousand loops with one. It pushes the values from the outer rows down to the inner scan to reduce cross-shard calls: the thousand values from the outer table are batched into a single array.
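
A minimal sketch of the idea, written as plain SQL for illustration (the actual rewrite happens inside the executor, as the execution plans below will show):

-- A plain Nested Loop runs one inner index lookup per outer-table row:
select id from t2 where t2.id = 1;
select id from t2 where t2.id = 2;
-- ... one statement per outer row, 1000 in total.
-- A Batched Nested Loop sends a whole batch of outer values in one inner scan:
select id from t2 where t2.id = any('{1,2,3}'::bigint[]);  -- up to yb_bnl_batch_size values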

Let's build a minimal example where the outer table has one thousand rows built with VALUES, and we want to fetch the matching rows from a table of one million rows. I'll show what happens with an ARRAY and with a JOIN, in PostgreSQL and YugabyteDB.

PostgreSQL

I create the following table with one million rows:

create table t2 ( id bigint, primary key(id));
insert into t2 select generate_series(1,1000000);
vacuum t2;

I will query it for 1000 values in two ways: with a nested loop from VALUES, and with a single scan using = ANY(). The purpose is to understand the performance complexity by examining the execution plan with runtime statistics.

As I'm too lazy to type 1000 values, I generate both statements and run them with \gexec:

\pset pager off
select
 format('
explain (analyze, buffers)
 select id from t2 
 where t2.id = ANY (''{ %s }''::bigint[]) 
 order by id
',string_agg(n::text,',')) as arr,
 format('
explain (analyze, buffers)
 with t1(id) as (values( %s )) 
 select id from t1 join t2 using(id)
 order by id
',string_agg(n::text,'),(')) as val
from generate_series(1,1000) n;
\gexec

I'll show the execution plans with explain (analyze, buffers) to see how many pages were read to get those 1000 rows.

PostgreSQL = ANY ( ARRAY )

Let's start without a join to see how our solution, with an array, accesses the thousand rows in the inner table.

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using t2_pkey on t2  (cost=0.42..3826.50 rows=1000 width=8) (actual time=0.038..0.889 rows=1000 loops=1)
   Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909
,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000}'::bigint[]))
   Heap Fetches: 0
   Buffers: local hit=3001
 Planning Time: 0.238 ms
 Execution Time: 0.942 ms
(6 rows)

This is an Index Only Scan because my index covers all the columns I need, and with Heap Fetches: 0 because the table was freshly vacuumed. To read 1000 values, it has read 3001 buffers. It is easy to guess that each index lookup had to read the B-Tree root, then one branch and one leaf. There's no scan optimization for multiple points or ranges in PostgreSQL: each lookup is a separate index access, and I can already guess that it is not faster than 1000 loops.
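
Rather than guessing, the B-Tree depth can be checked with the pageinspect extension (assuming it is installed and you have the privileges to use it):

create extension if not exists pageinspect;
-- bt_metap() shows the root level of the B-Tree: a root level of 2 means
-- three levels (root, internal, leaf), i.e. 3 buffer reads per index lookup
select level from bt_metap('t2_pkey');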

PostgreSQL Nested Loop Join

To use a join, I've defined those thousand values as a set of rows with the VALUES constructor.

                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=3888.83..3891.33 rows=1000 width=8) (actual time=1.426..1.479 rows=1000 loops=1)
   Sort Key: t2.id
   Sort Method: quicksort  Memory: 25kB
   Buffers: local hit=3001
   ->  Nested Loop  (cost=0.42..3839.00 rows=1000 width=8) (actual time=0.013..1.358 rows=1000 loops=1)
         Buffers: local hit=3001
         ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.163 rows=1000 loops=1)
         ->  Index Only Scan using t2_pkey on t2  (cost=0.42..3.83 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=1000)
               Index Cond: (id = "*VALUES*".column1)
               Heap Fetches: 0
               Buffers: local hit=3001
 Planning Time: 0.461 ms
 Execution Time: 1.541 ms
(13 rows)

With loops=1000, the performance of the Index Only Scan is exactly the same, reading 3001 pages for 1000 lookups into a 3-level B-Tree.

An additional Sort increases the response time for my ORDER BY. It was not necessary with the array because the Index Only Scan already returns rows in index order, but a Nested Loop's output follows the order of the outer table. There's a solution for that: if I replace from t1 with from (select * from t1 order by id) t1, the Nested Loop will be aware of the order and can bypass the Sort operation.
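
Here is a sketch of that rewrite on a shortened version of the generated query; the only change is the ordered derived table wrapping t1:

explain (analyze, buffers)
 with t1(id) as (values (1),(2),(3))  -- shortened; the generated query lists 1000 values
 select id
 from (select * from t1 order by id) t1
 join t2 using(id)
 order by id;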

YugabyteDB

To run it in YugabyteDB, I've used range sharding so that the primary key index is ordered as in PostgreSQL (primary key(id asc)), and I've gathered the distributed read/write and LSM-Tree metrics with explain (analyze, buffers, dist, debug).

There's no need to vacuum, as there's no bloat, no stale visibility map, and no xid wraparound in YugabyteDB.
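
For reference, the YugabyteDB setup is the same as the PostgreSQL one except for the range-sharded primary key (the same DDL is used again later in this post):

create table t2 ( id bigint, primary key(id asc));
insert into t2 select generate_series(1,1000000);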

YugabyteDB = ANY ( ARRAY )

 Index Scan using t2_pkey on t2  (cost=0.00..5.36 rows=1 width=8) (actual time=5.187..5.411 rows=1000 loops=1)
   Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909
,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000}'::bigint[]))
   Storage Table Read Requests: 1
   Storage Table Read Execution Time: 2.361 ms
   Metric rocksdb_number_db_seek: 1000.000
   Metric rocksdb_number_db_next: 1000.000
   Metric rocksdb_number_db_seek_found: 1000.000
   Metric rocksdb_number_db_next_found: 1000.000
   Metric rocksdb_iter_bytes_read: 82980.000
   Metric docdb_keys_found: 1000.000
 Planning Time: 8.063 ms
 Execution Time: 6.272 ms
 Storage Read Requests: 1
 Storage Read Execution Time: 2.361 ms
 Storage Write Requests: 0.000
 Catalog Read Requests: 4
 Catalog Read Execution Time: 3.153 ms
 Catalog Write Requests: 0.000
 Storage Flush Requests: 0
 Metric rocksdb_number_db_seek: 1000
 Metric rocksdb_number_db_next: 1000
 Metric rocksdb_number_db_seek_found: 1000
 Metric rocksdb_number_db_next_found: 1000
 Metric rocksdb_iter_bytes_read: 82980
 Metric docdb_keys_found: 1000
 Storage Execution Time: 5.514 ms
 Peak Memory Usage: 30 kB
(27 rows)

The Index Scan involves searching for 1000 values in the LSM-Tree, which is a RocksDB instance for each YugabyteDB table or index shard. It differs from a B-Tree: one rocksdb_number_db_seek positions the iterator on the key (in the MemTable and Level 0 SST files), and rocksdb_number_db_next reads the row. It is worth noting that the 1000 searches are carried out by a single Read Request, which is one distributed read. This means that even if the index tablet is located in another zone, the query will still take only a few milliseconds. Unlike PostgreSQL, YugabyteDB can perform an Index Skip Scan when reading from the LSM-Tree, fetching multiple values in one scan.

YugabyteDB Nested Loop Join

Given the previous observation, I've replaced from t1 with from (select * from t1 order by id) t1 to show that the final Sort operation is not needed in this case:

 Nested Loop  (cost=82.33..208.72 rows=1000 width=8) (actual time=1.547..441.056 rows=1000 loops=1)
   CTE t1
     ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.158 rows=1000 loops=1)
   ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual time=0.840..1.443 rows=1000 loops=1)
         Sort Key: t1.id
         Sort Method: quicksort  Memory: 71kB
         ->  CTE Scan on t1  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.004..0.547 rows=1000 loops=1)
   ->  Index Scan using t2_pkey on t2  (cost=0.00..0.11 rows=1 width=8) (actual time=0.422..0.422 rows=1 loops=1000)
         Index Cond: (id = t1.id)
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 0.377 ms
         Metric rocksdb_block_cache_hit: 3.000
         Metric rocksdb_block_cache_index_hit: 1.000
         Metric rocksdb_block_cache_filter_hit: 1.000
         Metric rocksdb_block_cache_data_hit: 1.000
         Metric rocksdb_block_cache_bytes_read: 117565.000
         Metric rocksdb_number_db_seek: 1.000
         Metric rocksdb_number_db_next: 1.000
         Metric rocksdb_number_db_seek_found: 1.000
         Metric rocksdb_number_db_next_found: 1.000
         Metric rocksdb_iter_bytes_read: 82.980
         Metric rocksdb_block_cache_multi_touch_hit: 3.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 117565.000
         Metric docdb_keys_found: 1.000
 Planning Time: 0.800 ms
 Execution Time: 442.253 ms
 Storage Read Requests: 1000
 Storage Read Execution Time: 377.414 ms
 Storage Write Requests: 0.000
 Catalog Read Requests: 0
 Catalog Write Requests: 0.000
 Storage Flush Requests: 0
 Metric rocksdb_block_cache_hit: 3000
 Metric rocksdb_block_cache_index_hit: 1000
 Metric rocksdb_block_cache_filter_hit: 1000
 Metric rocksdb_block_cache_data_hit: 1000
 Metric rocksdb_block_cache_bytes_read: 117565000
 Metric rocksdb_number_db_seek: 1000
 Metric rocksdb_number_db_next: 1000
 Metric rocksdb_number_db_seek_found: 1000
 Metric rocksdb_number_db_next_found: 1000
 Metric rocksdb_iter_bytes_read: 82980
 Metric rocksdb_block_cache_multi_touch_hit: 3000
 Metric rocksdb_block_cache_multi_touch_bytes_read: 117565000
 Metric docdb_keys_found: 1000
 Storage Execution Time: 377.414 ms
 Peak Memory Usage: 232 kB
(47 rows)

The key distinction is that the 1000 lookups (rocksdb_number_db_seek: 1000) in the LSM-Tree were performed via 1000 separate read requests (Storage Read Requests: 1000). This approach adds network latency for each iteration, making it non-scalable. Our goal is to optimize the process by batching values and reducing the number of loops required, utilizing the Index Skip Scan mentioned above.

YB Batched Nested Loop Join

I'm running this in YugabyteDB 2.19, where the Batched Nested Loop is not yet enabled by default. I enabled it by increasing the batch size from 1 to 1024:

set yb_bnl_batch_size to 1024;

Here is the execution plan for the same query:

                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=144.67..147.17 rows=1000 width=8) (actual time=7.384..7.430 rows=1000 loops=1)
   Sort Key: t2.id
   Sort Method: quicksort  Memory: 71kB
   CTE t1
     ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.153 rows=1000 loops=1)
   ->  YB Batched Nested Loop Join  (cost=69.83..82.34 rows=1000 width=8) (actual time=6.163..7.053 rows=1000 loops=1)
         Join Filter: (t1.id = t2.id)
         ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual time=0.861..0.908 rows=1000 loops=1)
               Sort Key: t1.id
               Sort Method: quicksort  Memory: 71kB
               ->  CTE Scan on t1  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.006..0.576 rows=1000 loops=1)
         ->  Index Scan using t2_pkey on t2  (cost=0.00..4.11 rows=1 width=8) (actual time=3.853..4.111 rows=1000 loops=1)
               Index Cond: (id = ANY (ARRAY[(t1.id)::bigint, ($2)::bigint, ($3)::bigint, ..., ($1024)::bigint]))
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 2.413 ms
               Metric rocksdb_block_cache_hit: 2.000
               Metric rocksdb_block_cache_index_hit: 1.000
               Metric rocksdb_block_cache_data_hit: 1.000
               Metric rocksdb_block_cache_bytes_read: 52088.000
               Metric rocksdb_number_db_seek: 1000.000
               Metric rocksdb_number_db_next: 1000.000
               Metric rocksdb_number_db_seek_found: 1000.000
               Metric rocksdb_number_db_next_found: 1000.000
               Metric rocksdb_iter_bytes_read: 82980.000
               Metric rocksdb_block_cache_multi_touch_hit: 2.000
               Metric rocksdb_block_cache_multi_touch_bytes_read: 52088.000
               Metric docdb_keys_found: 1000.000
 Planning Time: 1.756 ms
 Execution Time: 10.993 ms
 Storage Read Requests: 1
 Storage Read Execution Time: 2.413 ms
 Storage Write Requests: 0.000
 Catalog Read Requests: 0
 Catalog Write Requests: 0.000
 Storage Flush Requests: 0
 Metric rocksdb_block_cache_hit: 2
 Metric rocksdb_block_cache_index_hit: 1
 Metric rocksdb_block_cache_data_hit: 1
 Metric rocksdb_block_cache_bytes_read: 52088
 Metric rocksdb_number_db_seek: 1000
 Metric rocksdb_number_db_next: 1000
 Metric rocksdb_number_db_seek_found: 1000
 Metric rocksdb_number_db_next_found: 1000
 Metric rocksdb_iter_bytes_read: 82980
 Metric rocksdb_block_cache_multi_touch_hit: 2
 Metric rocksdb_block_cache_multi_touch_bytes_read: 52088
 Metric docdb_keys_found: 1000
 Storage Execution Time: 2.413 ms
 Peak Memory Usage: 5714 kB
(49 rows)

YugabyteDB replaced 1000 Index Cond: (id = t1.id) loops with one Index Cond: (id = ANY (ARRAY[(t1.id)::bigint, ($2)::bigint, ($3)::bigint, ..., ($1024)::bigint])) loop. The join executes with only one Read Request, providing high performance even when distributed.

One potential issue to note is that the batching process does not retain the order of the rows. To address this, a Sort operation has been added on top. However, when dealing with a large amount of data, the execution time of the Sort operation can be a concern, especially when using LIMIT or FETCH FIRST ... ROWS, where the non-batched version may perform better. This will be improved soon.
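
To compare a given query with the non-batched behavior, the batching can be switched off for the session by setting the batch size back to 1, which, as noted above, is what disables it:

set yb_bnl_batch_size to 1;  -- back to a plain, row-by-row Nested Loop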

YugabyteDB Order-Preserving Batched Nested Loop

Batched Nested Loop was disabled by default during the preview phase, but it has been tested extensively by numerous customers and open-source users. It is an improvement over Nested Loop for almost all queries, with the exception of those that use ORDER BY LIMIT, as mentioned above, because all rows must be sorted before fetching the Top-N ones. This has been addressed in issue #19589 by sorting each batch, and the fix will be available in the next release.

I'm too impatient to wait for that release to show you the execution plan, so I've built YugabyteDB from source with the commit for #19589. You can do the same because it is fully open source (instructions here):

yugabyte=# select version();
                                                                                         version
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 11.2-YB-2.21.0.0-b0 on x86_64-pc-linux-gnu, compiled by clang version 16.0.6 (https://github.com/yugabyte/llvm-project.git 1e6329f40e5c531c09ade7015278078682293ebd), 64-bit
(1 row)

yugabyte=# \! curl -s $(hostname):7000 | awk -F">" '/<pre>version/{print $NF}'
version 2.21.0.0 build 376 revision 6aef7ad88c9575c480b13be8f23ecce91bcdd08c build_type RELEASE built at 19 Dec 2023 06:42:35 UTC

yugabyte=# select name, setting from pg_settings where name like '%bnl%';
            name             | setting
-----------------------------+---------
 yb_bnl_batch_size           | 1024
 yb_bnl_enable_hashing       | on
 yb_bnl_optimize_first_batch | on
 yb_prefer_bnl               | on
(4 rows)

I created the same table again and ran the same query on it:

create table t2 ( id bigint, primary key(id asc));
insert into t2 select generate_series(1,1000000);
\pset pager off
select
 format('explain (analyze, buffers, dist, debug) select id from t2 where t2.id = ANY (''{ %s }''::bigint[]) order by id',string_agg(n::text,',')) as arr ,
 format('explain (analyze, buffers, dist, debug) with t1(id) as (values(%s)) select id from (select * from t1 order by id) t1 join t2 using(id) order by id',string_agg(n::text,'),(')) as val
from generate_series(1,1000) n;
\gexec

In this version, Batched Nested Loop is enabled by default. Here is the execution plan for the join:

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 YB Batched Nested Loop Join  (cost=82.33..193.16 rows=1000 width=8) (actual time=8.085..8.158 rows=1000 loops=1)
   Join Filter: (t1.id = t2.id)
   Sort Keys: t2.id
   CTE t1
     ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.107 rows=1000 loops=1)
   ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual time=1.071..1.100 rows=1000 loops=1)
         Sort Key: t1.id
         Sort Method: quicksort  Memory: 71kB
         ->  CTE Scan on t1  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.005..0.439 rows=1000 loops=1)
   ->  Index Scan using t2_pkey on t2  (cost=0.00..0.11 rows=1 width=8) (actual time=5.030..5.196 rows=1000 loops=1)
         Index Cond: (id = ANY (ARRAY[(t1.id)::bigint, ($2)::bigint, ($3)::bigint, ..., ($1024)::bigint]))
         Storage Table Read Requests: 1
         Storage Table Read Execution Time: 1.822 ms
         Metric rocksdb_block_cache_hit: 2.000
         Metric rocksdb_block_cache_index_hit: 1.000
         Metric rocksdb_block_cache_data_hit: 1.000
         Metric rocksdb_block_cache_bytes_read: 52088.000
         Metric rocksdb_number_db_seek: 1000.000
         Metric rocksdb_number_db_next: 1000.000
         Metric rocksdb_number_db_seek_found: 1000.000
         Metric rocksdb_number_db_next_found: 1000.000
         Metric rocksdb_iter_bytes_read: 82980.000
         Metric rocksdb_block_cache_multi_touch_hit: 2.000
         Metric rocksdb_block_cache_multi_touch_bytes_read: 52088.000
         Metric docdb_keys_found: 1000.000
         Metric ql_read_latency: sum: 935.000, count: 1.000
 Planning Time: 15.052 ms
 Execution Time: 10.667 ms
 Storage Read Requests: 1
 Storage Read Execution Time: 1.822 ms
 Storage Write Requests: 0
 Catalog Read Requests: 40
 Catalog Read Execution Time: 26.263 ms
 Catalog Write Requests: 0
 Storage Flush Requests: 0
 Metric rocksdb_block_cache_hit: 2
 Metric rocksdb_block_cache_index_hit: 1
 Metric rocksdb_block_cache_data_hit: 1
 Metric rocksdb_block_cache_bytes_read: 52088
 Metric rocksdb_number_db_seek: 1000
 Metric rocksdb_number_db_next: 1000
 Metric rocksdb_number_db_seek_found: 1000
 Metric rocksdb_number_db_next_found: 1000
 Metric rocksdb_iter_bytes_read: 82980
 Metric rocksdb_block_cache_multi_touch_hit: 2
 Metric rocksdb_block_cache_multi_touch_bytes_read: 52088
 Metric docdb_keys_found: 1000
 Metric ql_read_latency: sum: 935, count: 1
 Storage Execution Time: 28.085 ms
 Peak Memory Usage: 5751 kB
(50 rows)

The change is that the Sort Keys: t2.id has been moved under the YB Batched Nested Loop Join. Currently, since I am reading fewer rows than the batch size, there is no difference in the response time. However, with more rows, sorting small batches is quicker and requires less memory. This performance improvement will be more noticeable when a LIMIT is added as it won't have to sort all the rows.

YugabyteDB High Performance Top-N Join

There's an additional optimization in this latest version, visible through the new yb_bnl_optimize_first_batch=on parameter listed above.

I query the Top-42 rows by adding a fetch first 42 rows only (or limit 42) clause to the order by:

explain (analyze, buffers, dist, debug) with t1(id) as (
values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250),(251),(252),(253),(254),(255),(256),(257),(258),(259),(260),(261),(262),(263),(264),(265),(266),(267),(268),(269),(270),(271),(272),(273),(274),(275),(276),(277),(278),(279),(280),(281),(282),(283),(284),(285),(286),(287),(288),(289),(290),(291),(292),(293),(294),(295),(296),(297),(298),(299),(300),(301),(302),(303),(304),(305),(306),(307),(308),(309),(310),(311),(312),(313),(314),(315),(316),(317),(318),(319),(320),(321),(322),(323),(324),(325),(326),(327),(328),(329),(330),(331),(332),(333),(334),(335),(336),(337),(338),(339),(340),(341),(342),(343),(344),(345),(346),(347),(348),(349),(350),(351),(352),(353),(354),(355),(356),(357),(358),(359),(360),(361),(362),(363),(364),(365),(366),(367),(368),(369),(370),(371),(372),(373),(374),(375),(376),(377),(378),(379),(380),(381),(382),(383),(384),(385),(386),(387),(388),(389),(390),(391),(392),(393),(394),(395),(396),(397),(398),(399),(400),(401),(402),(403),(404),(405),(406),(407),(408),(409),(410),(411),(412),(413),(414),(415),(416),(417),(418),(419),(420),(421),(422),(423),(424),(425),(426),(427),(428),(429),(430),(431),(432),(433),(434),(435),(436),(437),(438),(439),(440),(441),(442),(443),(444),(445),(446),(447),(448),(449),(450),(451),(452),(453),(454),(455),(456),(457),(458),(459),(460),(461),(462),(463),(464),(465),(466),(467),(468),(469),(470),(471),(472),(473),(474),(475),(476),(477),(478),(479),(480),(481),(482),(483),(484),(485),(486),(487),(488),(489),(490),(491),(492),(493),(494),(495),(496),(497),(498),(499),(500),(501),(502),(503),(504),(505),(506),(507),(508),(509),(510),(511),(512),(513),(514),(515),(516),(517),(518),(519),(520),(521),(522),(523),(524),(525),(526),(527),(528),(529),(530),(531),(532),(533),(534),(535),(536),(537),(538),(539),(540),(541),(542),(543),(544),(545),(546),(547),(548),(549),(550),(551),(552),(553),(554),(555),(556),(557),(558),(559),(560),(561),(562),(563),(564),(565),(566),(567),(568),(569),(570),(571),(572),(573),(574),(575),(576),(577),(578),(579),(580),(581),(582),(583),(584),(585),(586),(587),(588),(589),(590),(591),(592),(593),(594),(595),(596),(597),(598),(599),(600),(601),(602),(603),(604),(605),(606),(607),(608),(609),(6
10),(611),(612),(613),(614),(615),(616),(617),(618),(619),(620),(621),(622),(623),(624),(625),(626),(627),(628),(629),(630),(631),(632),(633),(634),(635),(636),(637),(638),(639),(640),(641),(642),(643),(644),(645),(646),(647),(648),(649),(650),(651),(652),(653),(654),(655),(656),(657),(658),(659),(660),(661),(662),(663),(664),(665),(666),(667),(668),(669),(670),(671),(672),(673),(674),(675),(676),(677),(678),(679),(680),(681),(682),(683),(684),(685),(686),(687),(688),(689),(690),(691),(692),(693),(694),(695),(696),(697),(698),(699),(700),(701),(702),(703),(704),(705),(706),(707),(708),(709),(710),(711),(712),(713),(714),(715),(716),(717),(718),(719),(720),(721),(722),(723),(724),(725),(726),(727),(728),(729),(730),(731),(732),(733),(734),(735),(736),(737),(738),(739),(740),(741),(742),(743),(744),(745),(746),(747),(748),(749),(750),(751),(752),(753),(754),(755),(756),(757),(758),(759),(760),(761),(762),(763),(764),(765),(766),(767),(768),(769),(770),(771),(772),(773),(774),(775),(776),(777),(778),(779),(780),(781),(782),(783),(784),(785),(786),(787),(788),(789),(790),(791),(792),(793),(794),(795),(796),(797),(798),(799),(800),(801),(802),(803),(804),(805),(806),(807),(808),(809),(810),(811),(812),(813),(814),(815),(816),(817),(818),(819),(820),(821),(822),(823),(824),(825),(826),(827),(828),(829),(830),(831),(832),(833),(834),(835),(836),(837),(838),(839),(840),(841),(842),(843),(844),(845),(846),(847),(848),(849),(850),(851),(852),(853),(854),(855),(856),(857),(858),(859),(860),(861),(862),(863),(864),(865),(866),(867),(868),(869),(870),(871),(872),(873),(874),(875),(876),(877),(878),(879),(880),(881),(882),(883),(884),(885),(886),(887),(888),(889),(890),(891),(892),(893),(894),(895),(896),(897),(898),(899),(900),(901),(902),(903),(904),(905),(906),(907),(908),(909),(910),(911),(912),(913),(914),(915),(916),(917),(918),(919),(920),(921),(922),(923),(924),(925),(926),(927),(928),(929),(930),(931),(932),(933),(934),(935),(936),(937),(938),(939),(940),(941),(942),(943),(944),(945),(946),(947),(948),(949),(950),(951),(952),(953),(954),(955),(956),(957),(958),(959),(960),(961),(962),(963),(964),(965),(966),(967),(968),(969),(970),(971),(972),(973),(974),(975),(976),(977),(978),(979),(980),(981),(982),(983),(984),(985),(986),(987),(988),(989),(990),(991),(992),(993),(994),(995),(996),(997),(998),(999),(1000)
) select id 
 from (select * from t1 order by id) t1
 join t2 using(id) order by id fetch first 42 rows only
;

Here is the execution plan:

                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=82.33..86.98 rows=42 width=8) (actual time=3.619..3.634 rows=42 loops=1)
   CTE t1
     ->  Values Scan on "*VALUES*"  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.109 rows=1000 loops=1)
   ->  YB Batched Nested Loop Join  (cost=69.83..180.66 rows=1000 width=8) (actual time=3.617..3.620 rows=42 loops=1)
         Join Filter: (t1.id = t2.id)
         Sort Keys: t2.id
         ->  Sort  (cost=69.83..72.33 rows=1000 width=4) (actual time=0.695..0.697 rows=42 loops=1)
               Sort Key: t1.id
               Sort Method: quicksort  Memory: 71kB
               ->  CTE Scan on t1  (cost=0.00..20.00 rows=1000 width=4) (actual time=0.006..0.427 rows=1000 loops=1)
         ->  Index Scan using t2_pkey on t2  (cost=0.00..0.11 rows=1 width=8) (actual time=0.978..0.991 rows=42 loops=1)
               Index Cond: (id = ANY (ARRAY[(t1.id)::bigint, ($2)::bigint, ($3)::bigint, ..., ($1024)::bigint]))
               Storage Table Read Requests: 1
               Storage Table Read Execution Time: 0.698 ms
               Metric rocksdb_block_cache_hit: 2.000
               Metric rocksdb_block_cache_index_hit: 1.000
               Metric rocksdb_block_cache_data_hit: 1.000
               Metric rocksdb_block_cache_bytes_read: 52088.000
               Metric rocksdb_number_db_seek: 42.000
               Metric rocksdb_number_db_next: 42.000
               Metric rocksdb_number_db_seek_found: 42.000
               Metric rocksdb_number_db_next_found: 42.000
               Metric rocksdb_iter_bytes_read: 3358.000
               Metric rocksdb_block_cache_multi_touch_hit: 2.000
               Metric rocksdb_block_cache_multi_touch_bytes_read: 52088.000
               Metric docdb_keys_found: 42.000
               Metric ql_read_latency: sum: 140.000, count: 1.000
 Planning Time: 1.215 ms
 Execution Time: 4.370 ms

Not only is the Sort operation restricted to the batch size, but the LIMIT has also been pushed down, so the first batch reads only rows=42. This query's performance is optimal, with only one read request and one seek per output row.

YugabyteDB enables scalable distributed joins, which is an advantage over sharded databases like Citus or Aurora Limitless that require co-locating joined shards.

Would it improve PostgreSQL?

Batching the nested loop join has a significant advantage in YugabyteDB as it reduces the number of remote calls, making joins scalable.

I attempted to simulate an array parameter in PostgreSQL, but it didn't result in any performance improvement. This is because PostgreSQL doesn't support Index Skip Scan, which means that every index lookup, whether driven by an outer table or an inner array, takes the same amount of time.

Although this approach may become more efficient if Index Skip Scan is implemented in a future version, the benefits would be minimal. It could avoid re-reading the B-Tree root and branch pages, but most of the time those are shared buffer hits in a monolithic database, which is already fast.

Batching can still be a useful strategy for optimizing a PostgreSQL Index Scan if the array is sorted by CTID, the physical location of rows in the heap table. This improves table access through the caches, particularly when the index has a poor correlation factor. Oracle has TABLE ACCESS BY INDEX ROWID BATCHED to accomplish this, but it does not bring all the advantages because, like a PostgreSQL Bitmap Heap Scan, it does not preserve the row ordering from the outer table.
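
To get a rough idea of what batched heap access already looks like in PostgreSQL on the same test table, we can nudge the planner toward the bitmap path; this is only a sketch, and the plan choice depends on your version and statistics:

-- Disable plain index scans for the session so the planner considers a bitmap scan
set enable_indexscan = off;
set enable_indexonlyscan = off;
-- The Bitmap Index Scan collects matching TIDs and the Bitmap Heap Scan then visits
-- heap pages in physical order: batched access, but the ordering from the array is
-- lost, so the ORDER BY needs an explicit Sort.
explain (analyze, buffers)
select id from t2
where id = any('{1,2,3,4,5}'::bigint[])
order by id;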


The YugabyteDB Batched Nested Loop is an excellent example of the benefits of its architecture. It uses a fork of PostgreSQL to offer comprehensive SQL features, inheriting PostgreSQL's maturity. However, unlike an extension, it is not restricted in what it can change, and in terms of performance it can provide Distributed SQL scalability, including for joins and pagination.
