I like the Batched Nested Loop feature in YugabyteDB's execution plans. It has solved many problems by making complex queries, with many joins, scalable without compromising the agility to distribute table rows and index entries transparently. Someone asked me if this feature would be useful for PostgreSQL. To find out, we need to test it.
Essentially, this feature can replace a thousand loops with a single one: the thousand values from the outer table are batched into one array that is pushed down to the inner scan, reducing the number of cross-shard calls.
Let's build a minimal example where the outer table has one thousand rows built with VALUES, and we want to fetch the matching rows from a one-million-row table. I'll show what happens with an ARRAY and with a JOIN, in PostgreSQL and in YugabyteDB.
PostgreSQL
I create the following table with one million rows:
create table t2 ( id bigint, primary key(id));
insert into t2 select generate_series(1,1000000);
vacuum t2;
I will query it for 1000 values in two ways: with a nested loop from VALUES, and with a single scan using = ANY(). The purpose is to understand the performance complexity by examining the execution plan with runtime statistics. As I'm too lazy to type 1000 values, I generate the statements and run them with \gexec:
\pset pager off
select
format('
explain (analyze, buffers)
select id from t2
where t2.id = ANY (''{ %s }''::bigint[])
order by id
',string_agg(n::text,',')) as arr,
format('
explain (analyze, buffers)
with t1(id) as (values( %s ))
select id from t1 join t2 using(id)
order by id
',string_agg(n::text,'),(')) as val
from generate_series(1,1000) n;
\gexec
I'll show the execution plans with explain (analyze, buffers) to see how many pages were read to get those 1000 rows.
PostgreSQL = ANY ( ARRAY )
Let's start without a join to see how our solution, with an array, accesses the thousand rows in the inner table.
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Index Only Scan using t2_pkey on t2 (cost=0.42..3826.50 rows=1000 width=8) (actual time=0.038..0.889 rows=1000 loops=1)
Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909,91
0,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000}'::bigint[]))
Heap Fetches: 0
Buffers: local hit=3001
Planning Time: 0.238 ms
Execution Time: 0.942 ms
(6 rows)
This is an Index Only Scan because the index covers all the columns I need, and Heap Fetches: 0 because the table was freshly vacuumed. To read 1000 values, it has read 3001 buffers. It is easy to guess that each index lookup had to read the B-Tree root, then one branch and one leaf. There's no scan optimization for multiple points or ranges in PostgreSQL: each lookup is a separate index access, and I can already guess that this is not faster than 1000 loops.
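If you want to double-check that guess about the index depth, here is a quick sketch, assuming the pageinspect extension can be installed:
create extension if not exists pageinspect;
-- level=2 means root + one branch level + leaves, i.e. 3 pages per index descent
select level from bt_metap('t2_pkey');
With one million sequential bigint values, the metapage should report level 2, which matches the three buffers read per lookup.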
PostgreSQL Nested Loop Join
To use a join, I've defined those thousand values as a set of rows with the VALUES constructor.
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Sort (cost=3888.83..3891.33 rows=1000 width=8) (actual time=1.426..1.479 rows=1000 loops=1)
Sort Key: t2.id
Sort Method: quicksort Memory: 25kB
Buffers: local hit=3001
-> Nested Loop (cost=0.42..3839.00 rows=1000 width=8) (actual time=0.013..1.358 rows=1000 loops=1)
Buffers: local hit=3001
-> Values Scan on "*VALUES*" (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.163 rows=1000 loops=1)
-> Index Only Scan using t2_pkey on t2 (cost=0.42..3.83 rows=1 width=8) (actual time=0.001..0.001 rows=1 loops=1000)
Index Cond: (id = "*VALUES*".column1)
Heap Fetches: 0
Buffers: local hit=3001
Planning Time: 0.461 ms
Execution Time: 1.541 ms
(13 rows)
With loops=1000, the performance of the Index Only Scan is exactly the same: 3001 pages read for 1000 lookups into a 3-level B-Tree. An additional Sort increases the response time for my order by clause. It was not necessary with the array because the Index Only Scan returns the rows in index order, but a Nested Loop is ordered by the outer table. There's a solution for that: if I replace from t1 with from (select * from t1 order by id) t1, the Nested Loop is aware of the order and can bypass the Sort operation.
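Here is what that rewrite looks like, as a sketch with the VALUES list abbreviated (the generated one goes up to 1000):
explain (analyze, buffers)
with t1(id) as (values (1),(2),(3) /* ..., (1000) */)
-- the ordered subquery lets the Nested Loop keep the order and skip the final Sort
select id from (select * from t1 order by id) t1 join t2 using(id)
order by id;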
YugabyteDB
To run this in YugabyteDB, I've defined range sharding to get the index ordered like in PostgreSQL: primary key(id asc). I've also gathered distributed read/write and LSM-Tree metrics with explain (analyze, buffers, dist, debug). There's no need to vacuum, as there's no bloat, no stale visibility map, and no xid wraparound in YugabyteDB.
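For reference, the setup is the same as before except for the explicit range sharding (the same DDL appears again later in this post):
-- range-sharded primary key: entries are stored in ascending order of id
create table t2 ( id bigint, primary key(id asc));
insert into t2 select generate_series(1,1000000);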
YugabyteDB = ANY ( ARRAY )
Index Scan using t2_pkey on t2 (cost=0.00..5.36 rows=1 width=8) (actual time=5.187..5.411 rows=1000 loops=1)
Index Cond: (id = ANY ('{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271,272,273,274,275,276,277,278,279,280,281,282,283,284,285,286,287,288,289,290,291,292,293,294,295,296,297,298,299,300,301,302,303,304,305,306,307,308,309,310,311,312,313,314,315,316,317,318,319,320,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374,375,376,377,378,379,380,381,382,383,384,385,386,387,388,389,390,391,392,393,394,395,396,397,398,399,400,401,402,403,404,405,406,407,408,409,410,411,412,413,414,415,416,417,418,419,420,421,422,423,424,425,426,427,428,429,430,431,432,433,434,435,436,437,438,439,440,441,442,443,444,445,446,447,448,449,450,451,452,453,454,455,456,457,458,459,460,461,462,463,464,465,466,467,468,469,470,471,472,473,474,475,476,477,478,479,480,481,482,483,484,485,486,487,488,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,566,567,568,569,570,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631,632,633,634,635,636,637,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,664,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,742,743,744,745,746,747,748,749,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,770,771,772,773,774,775,776,777,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,794,795,796,797,798,799,800,801,802,803,804,805,806,807,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833,834,835,836,837,838,839,840,841,842,843,844,845,846,847,848,849,850,851,852,853,854,855,856,857,858,859,860,861,862,863,864,865,866,867,868,869,870,871,872,873,874,875,876,877,878,879,880,881,882,883,884,885,886,887,888,889,890,891,892,893,894,895,896,897,898,899,900,901,902,903,904,905,906,907,908,909,91
0,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943,944,945,946,947,948,949,950,951,952,953,954,955,956,957,958,959,960,961,962,963,964,965,966,967,968,969,970,971,972,973,974,975,976,977,978,979,980,981,982,983,984,985,986,987,988,989,990,991,992,993,994,995,996,997,998,999,1000}'::bigint[]))
Storage Table Read Requests: 1
Storage Table Read Execution Time: 2.361 ms
Metric rocksdb_number_db_seek: 1000.000
Metric rocksdb_number_db_next: 1000.000
Metric rocksdb_number_db_seek_found: 1000.000
Metric rocksdb_number_db_next_found: 1000.000
Metric rocksdb_iter_bytes_read: 82980.000
Metric docdb_keys_found: 1000.000
Planning Time: 8.063 ms
Execution Time: 6.272 ms
Storage Read Requests: 1
Storage Read Execution Time: 2.361 ms
Storage Write Requests: 0.000
Catalog Read Requests: 4
Catalog Read Execution Time: 3.153 ms
Catalog Write Requests: 0.000
Storage Flush Requests: 0
Metric rocksdb_number_db_seek: 1000
Metric rocksdb_number_db_next: 1000
Metric rocksdb_number_db_seek_found: 1000
Metric rocksdb_number_db_next_found: 1000
Metric rocksdb_iter_bytes_read: 82980
Metric docdb_keys_found: 1000
Storage Execution Time: 5.514 ms
Peak Memory Usage: 30 kB
(27 rows)
The Index Scan involves searching for 1000 values in the LSM-Tree, which is a RocksDB instance for each YugabyteDB table or index shard. It works differently from a B-Tree: one rocksdb_number_db_seek positions on the key (in the MemTable and Level 0 SST files), and one rocksdb_number_db_next reads the row. It is worth noting that the 1000 searches are carried out by a single Read Request, which is one distributed read. This means that even if the index tablet is located in another zone, the query will still take only a few milliseconds. Unlike PostgreSQL, YugabyteDB can perform an Index Skip Scan when reading from the LSM-Tree, fetching multiple values in one scan.
YugabyteDB Nested Loop Join
Given the previous observation, I've replaced from t1 with from (select * from t1 order by id) t1 to show that the final Sort operation is not needed in this case.
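Here is a sketch of the query, with the VALUES list abbreviated (the generated one goes up to 1000), followed by its execution plan:
explain (analyze, buffers, dist, debug)
with t1(id) as (values (1),(2),(3) /* ..., (1000) */)
select id from (select * from t1 order by id) t1 join t2 using(id)
order by id;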
Nested Loop (cost=82.33..208.72 rows=1000 width=8) (actual time=1.547..441.056 rows=1000 loops=1)
CTE t1
-> Values Scan on "*VALUES*" (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.158 rows=1000 loops=1)
-> Sort (cost=69.83..72.33 rows=1000 width=4) (actual time=0.840..1.443 rows=1000 loops=1)
Sort Key: t1.id
Sort Method: quicksort Memory: 71kB
-> CTE Scan on t1 (cost=0.00..20.00 rows=1000 width=4) (actual time=0.004..0.547 rows=1000 loops=1)
-> Index Scan using t2_pkey on t2 (cost=0.00..0.11 rows=1 width=8) (actual time=0.422..0.422 rows=1 loops=1000)
Index Cond: (id = t1.id)
Storage Table Read Requests: 1
Storage Table Read Execution Time: 0.377 ms
Metric rocksdb_block_cache_hit: 3.000
Metric rocksdb_block_cache_index_hit: 1.000
Metric rocksdb_block_cache_filter_hit: 1.000
Metric rocksdb_block_cache_data_hit: 1.000
Metric rocksdb_block_cache_bytes_read: 117565.000
Metric rocksdb_number_db_seek: 1.000
Metric rocksdb_number_db_next: 1.000
Metric rocksdb_number_db_seek_found: 1.000
Metric rocksdb_number_db_next_found: 1.000
Metric rocksdb_iter_bytes_read: 82.980
Metric rocksdb_block_cache_multi_touch_hit: 3.000
Metric rocksdb_block_cache_multi_touch_bytes_read: 117565.000
Metric docdb_keys_found: 1.000
Planning Time: 0.800 ms
Execution Time: 442.253 ms
Storage Read Requests: 1000
Storage Read Execution Time: 377.414 ms
Storage Write Requests: 0.000
Catalog Read Requests: 0
Catalog Write Requests: 0.000
Storage Flush Requests: 0
Metric rocksdb_block_cache_hit: 3000
Metric rocksdb_block_cache_index_hit: 1000
Metric rocksdb_block_cache_filter_hit: 1000
Metric rocksdb_block_cache_data_hit: 1000
Metric rocksdb_block_cache_bytes_read: 117565000
Metric rocksdb_number_db_seek: 1000
Metric rocksdb_number_db_next: 1000
Metric rocksdb_number_db_seek_found: 1000
Metric rocksdb_number_db_next_found: 1000
Metric rocksdb_iter_bytes_read: 82980
Metric rocksdb_block_cache_multi_touch_hit: 3000
Metric rocksdb_block_cache_multi_touch_bytes_read: 117565000
Metric docdb_keys_found: 1000
Storage Execution Time: 377.414 ms
Peak Memory Usage: 232 kB
(47 rows)
The key distinction is that the 1000 lookups in the LSM-Tree (rocksdb_number_db_seek: 1000) were performed via 1000 separate read requests (Storage Read Requests: 1000). This adds network latency to each iteration, which does not scale. The goal is to batch the values and reduce the number of loops, taking advantage of the Index Skip Scan mentioned above.
YB Batched Nested Loop Join
I'm running this in YugabyteDB 2.19, where the Batched Nested Loop is not yet enabled by default. I enabled it by increasing the batch size from 1 to 1024:
set yb_bnl_batch_size to 1024;
Here is the execution plan for the same query:
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Sort (cost=144.67..147.17 rows=1000 width=8) (actual time=7.384..7.430 rows=1000 loops=1)
Sort Key: t2.id
Sort Method: quicksort Memory: 71kB
CTE t1
-> Values Scan on "*VALUES*" (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.153 rows=1000 loops=1)
-> YB Batched Nested Loop Join (cost=69.83..82.34 rows=1000 width=8) (actual time=6.163..7.053 rows=1000 loops=1)
Join Filter: (t1.id = t2.id)
-> Sort (cost=69.83..72.33 rows=1000 width=4) (actual time=0.861..0.908 rows=1000 loops=1)
Sort Key: t1.id
Sort Method: quicksort Memory: 71kB
-> CTE Scan on t1 (cost=0.00..20.00 rows=1000 width=4) (actual time=0.006..0.576 rows=1000 loops=1)
-> Index Scan using t2_pkey on t2 (cost=0.00..4.11 rows=1 width=8) (actual time=3.853..4.111 rows=1000 loops=1)
Index Cond: (id = ANY (ARRAY[(t1.id)::bigint, ($2)::bigint, ($3)::bigint, ..., ($1024)::bigint]))
Storage Table Read Requests: 1
Storage Table Read Execution Time: 2.413 ms
Metric rocksdb_block_cache_hit: 2.000
Metric rocksdb_block_cache_index_hit: 1.000
Metric rocksdb_block_cache_data_hit: 1.000
Metric rocksdb_block_cache_bytes_read: 52088.000
Metric rocksdb_number_db_seek: 1000.000
Metric rocksdb_number_db_next: 1000.000
Metric rocksdb_number_db_seek_found: 1000.000
Metric rocksdb_number_db_next_found: 1000.000
Metric rocksdb_iter_bytes_read: 82980.000
Metric rocksdb_block_cache_multi_touch_hit: 2.000
Metric rocksdb_block_cache_multi_touch_bytes_read: 52088.000
Metric docdb_keys_found: 1000.000
Planning Time: 1.756 ms
Execution Time: 10.993 ms
Storage Read Requests: 1
Storage Read Execution Time: 2.413 ms
Storage Write Requests: 0.000
Catalog Read Requests: 0
Catalog Write Requests: 0.000
Storage Flush Requests: 0
Metric rocksdb_block_cache_hit: 2
Metric rocksdb_block_cache_index_hit: 1
Metric rocksdb_block_cache_data_hit: 1
Metric rocksdb_block_cache_bytes_read: 52088
Metric rocksdb_number_db_seek: 1000
Metric rocksdb_number_db_next: 1000
Metric rocksdb_number_db_seek_found: 1000
Metric rocksdb_number_db_next_found: 1000
Metric rocksdb_iter_bytes_read: 82980
Metric rocksdb_block_cache_multi_touch_hit: 2
Metric rocksdb_block_cache_multi_touch_bytes_read: 52088
Metric docdb_keys_found: 1000
Storage Execution Time: 2.413 ms
Peak Memory Usage: 5714 kB
(49 rows)
YugabyteDB replaced the 1000 Index Cond: (id = t1.id) loops with a single Index Cond: (id = ANY (ARRAY[(t1.id)::bigint, ($2)::bigint, ($3)::bigint, ..., ($1024)::bigint])) loop. The join executes with only one Read Request, providing high performance even when distributed.
One potential issue to note is that the batching process does not retain the order of the rows, so a Sort operation has been added on top. With a large amount of data, the execution time of this Sort can be a concern, especially when using LIMIT or FETCH FIRST ... ROWS, where the non-batched version may perform better. This will be improved soon.
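Until that improvement lands, one possible workaround for such Top-N queries (an assumption based on the parameter shown above, not something I've measured here) is to fall back to the regular Nested Loop for the session by resetting the batch size:
-- a batch size of 1 (the default in this version) disables batching,
-- reverting to the plain, order-preserving Nested Loop
set yb_bnl_batch_size to 1;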
YugabyteDB Order-Preserving Batched Nested Loop
Batched Nested Loop was disabled by default during the preview phase, but it has been tested extensively by numerous customers and open-source users. It is an improvement over Nested Loop for almost all queries, except those that use ORDER BY ... LIMIT as mentioned above, because in that case all rows must be sorted before fetching the Top-N ones. This has been addressed in issue #19589 by sorting each batch, and it will be available in the next release.
I'm too impatient to wait for the release to show you the execution plan, so I've built YugabyteDB from source with the commit for #19589. You can do the same because it is fully Open Source (instructions here):
yugabyte=# select version();
version
--------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------
PostgreSQL 11.2-YB-2.21.0.0-b0 on x86_64-pc-linux-gnu, compiled by clang version 16.0.6 (https://github.com/yugabyte/llvm-project.git 1e6329f40e5c531c09ade70
15278078682293ebd), 64-bit
(1 row)
yugabyte=# \! curl -s $(hostname):7000 | awk -F">" '/<pre>version/{print $NF}'
version 2.21.0.0 build 376 revision 6aef7ad88c9575c480b13be8f23ecce91bcdd08c build_type RELEASE built at 19 Dec 2023 06:42:35 UTC
yugabyte=# select name, setting from pg_settings where name like '%bnl%';
name | setting
-----------------------------+---------
yb_bnl_batch_size | 1024
yb_bnl_enable_hashing | on
yb_bnl_optimize_first_batch | on
yb_prefer_bnl | on
(4 rows)
I created the same table again and ran the same query on it:
create table t2 ( id bigint, primary key(id asc));
insert into t2 select generate_series(1,1000000);
\pset pager off
select
format('explain (analyze, buffers, dist, debug) select id from t2 where t2.id = ANY (''{ %s }''::bigint[]) order by id',string_agg(n::text,',')) as arr ,
format('explain (analyze, buffers, dist, debug) with t1(id) as (values(%s)) select id from (select * from t1 order by id) t1 join t2 using(id) order by id',string_agg(n::text,'),(')) as val
from generate_series(1,1000) n;
\gexec
In this version, Batched Nested Loop is enabled by default. Here is the execution plan for the join:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
YB Batched Nested Loop Join (cost=82.33..193.16 rows=1000 width=8) (actual time=8.085..8.158 rows=1000 loops=1)
Join Filter: (t1.id = t2.id)
Sort Keys: t2.id
CTE t1
-> Values Scan on "*VALUES*" (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.107 rows=1000 loops=1)
-> Sort (cost=69.83..72.33 rows=1000 width=4) (actual time=1.071..1.100 rows=1000 loops=1)
Sort Key: t1.id
Sort Method: quicksort Memory: 71kB
-> CTE Scan on t1 (cost=0.00..20.00 rows=1000 width=4) (actual time=0.005..0.439 rows=1000 loops=1)
-> Index Scan using t2_pkey on t2 (cost=0.00..0.11 rows=1 width=8) (actual time=5.030..5.196 rows=1000 loops=1)
Index Cond: (id = ANY (ARRAY[(t1.id)::bigint, ($2)::bigint, ($3)::bigint, ..., ($1024)::bigint]))
Storage Table Read Requests: 1
Storage Table Read Execution Time: 1.822 ms
Metric rocksdb_block_cache_hit: 2.000
Metric rocksdb_block_cache_index_hit: 1.000
Metric rocksdb_block_cache_data_hit: 1.000
Metric rocksdb_block_cache_bytes_read: 52088.000
Metric rocksdb_number_db_seek: 1000.000
Metric rocksdb_number_db_next: 1000.000
Metric rocksdb_number_db_seek_found: 1000.000
Metric rocksdb_number_db_next_found: 1000.000
Metric rocksdb_iter_bytes_read: 82980.000
Metric rocksdb_block_cache_multi_touch_hit: 2.000
Metric rocksdb_block_cache_multi_touch_bytes_read: 52088.000
Metric docdb_keys_found: 1000.000
Metric ql_read_latency: sum: 935.000, count: 1.000
Planning Time: 15.052 ms
Execution Time: 10.667 ms
Storage Read Requests: 1
Storage Read Execution Time: 1.822 ms
Storage Write Requests: 0
Catalog Read Requests: 40
Catalog Read Execution Time: 26.263 ms
Catalog Write Requests: 0
Storage Flush Requests: 0
Metric rocksdb_block_cache_hit: 2
Metric rocksdb_block_cache_index_hit: 1
Metric rocksdb_block_cache_data_hit: 1
Metric rocksdb_block_cache_bytes_read: 52088
Metric rocksdb_number_db_seek: 1000
Metric rocksdb_number_db_next: 1000
Metric rocksdb_number_db_seek_found: 1000
Metric rocksdb_number_db_next_found: 1000
Metric rocksdb_iter_bytes_read: 82980
Metric rocksdb_block_cache_multi_touch_hit: 2
Metric rocksdb_block_cache_multi_touch_bytes_read: 52088
Metric docdb_keys_found: 1000
Metric ql_read_latency: sum: 935, count: 1
Storage Execution Time: 28.085 ms
Peak Memory Usage: 5751 kB
(50 rows)
The change is that Sort Keys: t2.id has moved under the YB Batched Nested Loop Join. Since I'm reading fewer rows than the batch size, there is no difference in response time here. However, with more rows, sorting small batches is quicker and requires less memory. The improvement is more noticeable when a LIMIT is added, as it avoids sorting all the rows.
YugabyteDB High Performance Top-N Join
There's an additional optimization in this latest version, which we have seen with the new yb_bnl_optimize_first_batch=on parameter. I query the Top-42 rows by adding a fetch first 42 rows only (or limit 42) clause to the order by:
explain (analyze, buffers, dist, debug) with t1(id) as (
values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250),(251),(252),(253),(254),(255),(256),(257),(258),(259),(260),(261),(262),(263),(264),(265),(266),(267),(268),(269),(270),(271),(272),(273),(274),(275),(276),(277),(278),(279),(280),(281),(282),(283),(284),(285),(286),(287),(288),(289),(290),(291),(292),(293),(294),(295),(296),(297),(298),(299),(300),(301),(302),(303),(304),(305),(306),(307),(308),(309),(310),(311),(312),(313),(314),(315),(316),(317),(318),(319),(320),(321),(322),(323),(324),(325),(326),(327),(328),(329),(330),(331),(332),(333),(334),(335),(336),(337),(338),(339),(340),(341),(342),(343),(344),(345),(346),(347),(348),(349),(350),(351),(352),(353),(354),(355),(356),(357),(358),(359),(360),(361),(362),(363),(364),(365),(366),(367),(368),(369),(370),(371),(372),(373),(374),(375),(376),(377),(378),(379),(380),(381),(382),(383),(384),(385),(386),(387),(388),(389),(390),(391),(392),(393),(394),(395),(396),(397),(398),(399),(400),(401),(402),(403),(404),(405),(406),(407),(408),(409),(410),(411),(412),(413),(414),(415),(416),(417),(418),(419),(420),(421),(422),(423),(424),(425),(426),(427),(428),(429),(430),(431),(432),(433),(434),(435),(436),(437),(438),(439),(440),(441),(442),(443),(444),(445),(446),(447),(448),(449),(450),(451),(452),(453),(454),(455),(456),(457),(458),(459),(460),(461),(462),(463),(464),(465),(466),(467),(468),(469),(470),(471),(472),(473),(474),(475),(476),(477),(478),(479),(480),(481),(482),(483),(484),(485),(486),(487),(488),(489),(490),(491),(492),(493),(494),(495),(496),(497),(498),(499),(500),(501),(502),(503),(504),(505),(506),(507),(508),(509),(510),(511),(512),(513),(514),(515),(516),(517),(518),(519),(520),(521),(522),(523),(524),(525),(526),(527),(528),(529),(530),(531),(532),(533),(534),(535),(536),(537),(538),(539),(540),(541),(542),(543),(544),(545),(546),(547),(548),(549),(550),(551),(552),(553),(554),(555),(556),(557),(558),(559),(560),(561),(562),(563),(564),(565),(566),(567),(568),(569),(570),(571),(572),(573),(574),(575),(576),(577),(578),(579),(580),(581),(582),(583),(584),(585),(586),(587),(588),(589),(590),(591),(592),(593),(594),(595),(596),(597),(598),(599),(600),(601),(602),(603),(604),(605),(606),(607),(608),(609),(6
10),(611),(612),(613),(614),(615),(616),(617),(618),(619),(620),(621),(622),(623),(624),(625),(626),(627),(628),(629),(630),(631),(632),(633),(634),(635),(636),(637),(638),(639),(640),(641),(642),(643),(644),(645),(646),(647),(648),(649),(650),(651),(652),(653),(654),(655),(656),(657),(658),(659),(660),(661),(662),(663),(664),(665),(666),(667),(668),(669),(670),(671),(672),(673),(674),(675),(676),(677),(678),(679),(680),(681),(682),(683),(684),(685),(686),(687),(688),(689),(690),(691),(692),(693),(694),(695),(696),(697),(698),(699),(700),(701),(702),(703),(704),(705),(706),(707),(708),(709),(710),(711),(712),(713),(714),(715),(716),(717),(718),(719),(720),(721),(722),(723),(724),(725),(726),(727),(728),(729),(730),(731),(732),(733),(734),(735),(736),(737),(738),(739),(740),(741),(742),(743),(744),(745),(746),(747),(748),(749),(750),(751),(752),(753),(754),(755),(756),(757),(758),(759),(760),(761),(762),(763),(764),(765),(766),(767),(768),(769),(770),(771),(772),(773),(774),(775),(776),(777),(778),(779),(780),(781),(782),(783),(784),(785),(786),(787),(788),(789),(790),(791),(792),(793),(794),(795),(796),(797),(798),(799),(800),(801),(802),(803),(804),(805),(806),(807),(808),(809),(810),(811),(812),(813),(814),(815),(816),(817),(818),(819),(820),(821),(822),(823),(824),(825),(826),(827),(828),(829),(830),(831),(832),(833),(834),(835),(836),(837),(838),(839),(840),(841),(842),(843),(844),(845),(846),(847),(848),(849),(850),(851),(852),(853),(854),(855),(856),(857),(858),(859),(860),(861),(862),(863),(864),(865),(866),(867),(868),(869),(870),(871),(872),(873),(874),(875),(876),(877),(878),(879),(880),(881),(882),(883),(884),(885),(886),(887),(888),(889),(890),(891),(892),(893),(894),(895),(896),(897),(898),(899),(900),(901),(902),(903),(904),(905),(906),(907),(908),(909),(910),(911),(912),(913),(914),(915),(916),(917),(918),(919),(920),(921),(922),(923),(924),(925),(926),(927),(928),(929),(930),(931),(932),(933),(934),(935),(936),(937),(938),(939),(940),(941),(942),(943),(944),(945),(946),(947),(948),(949),(950),(951),(952),(953),(954),(955),(956),(957),(958),(959),(960),(961),(962),(963),(964),(965),(966),(967),(968),(969),(970),(971),(972),(973),(974),(975),(976),(977),(978),(979),(980),(981),(982),(983),(984),(985),(986),(987),(988),(989),(990),(991),(992),(993),(994),(995),(996),(997),(998),(999),(1000)
) select id
from (select * from t1 order by id) t1
join t2 using(id) order by id fetch first 42 rows only
;
Here is the execution plan:
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Limit (cost=82.33..86.98 rows=42 width=8) (actual time=3.619..3.634 rows=42 loops=1)
CTE t1
-> Values Scan on "*VALUES*" (cost=0.00..12.50 rows=1000 width=4) (actual time=0.001..0.109 rows=1000 loops=1)
-> YB Batched Nested Loop Join (cost=69.83..180.66 rows=1000 width=8) (actual time=3.617..3.620 rows=42 loops=1)
Join Filter: (t1.id = t2.id)
Sort Keys: t2.id
-> Sort (cost=69.83..72.33 rows=1000 width=4) (actual time=0.695..0.697 rows=42 loops=1)
Sort Key: t1.id
Sort Method: quicksort Memory: 71kB
-> CTE Scan on t1 (cost=0.00..20.00 rows=1000 width=4) (actual time=0.006..0.427 rows=1000 loops=1)
-> Index Scan using t2_pkey on t2 (cost=0.00..0.11 rows=1 width=8) (actual time=0.978..0.991 rows=42 loops=1)
Index Cond: (id = ANY (ARRAY[(t1.id)::bigint, ($2)::bigint, ($3)::bigint, ..., ($1024)::bigint]))
Storage Table Read Requests: 1
Storage Table Read Execution Time: 0.698 ms
Metric rocksdb_block_cache_hit: 2.000
Metric rocksdb_block_cache_index_hit: 1.000
Metric rocksdb_block_cache_data_hit: 1.000
Metric rocksdb_block_cache_bytes_read: 52088.000
Metric rocksdb_number_db_seek: 42.000
Metric rocksdb_number_db_next: 42.000
Metric rocksdb_number_db_seek_found: 42.000
Metric rocksdb_number_db_next_found: 42.000
Metric rocksdb_iter_bytes_read: 3358.000
Metric rocksdb_block_cache_multi_touch_hit: 2.000
Metric rocksdb_block_cache_multi_touch_bytes_read: 52088.000
Metric docdb_keys_found: 42.000
Metric ql_read_latency: sum: 140.000, count: 1.000
Planning Time: 1.215 ms
Execution Time: 4.370 ms
Not only is the Sort operation restricted to the batch size, but the LIMIT has been pushed down, so the first batch reads only rows=42. This query's performance is optimal, with only one read request and one seek per output row.
YugabyteDB enables scalable distributed joins, which is an advantage over sharded databases like Citus or Aurora Limitless that require co-locating joined shards.
Would it improve PostgreSQL?
Batching the nested loop join has a significant advantage in YugabyteDB as it reduces the number of remote calls, making joins scalable.
We attempted to simulate it with an array parameter in PostgreSQL, but it didn't bring any performance improvement, because PostgreSQL doesn't support Index Skip Scan: every index lookup, whether driven by an outer table or by an inner array, takes the same amount of time.
This approach may become more efficient if Index Skip Scan is implemented in a future version, but the benefit would be minimal: it would avoid re-reading the B-Tree root and branches, which are usually shared buffer hits in a monolithic database and already fast.
Batching can still be a useful strategy to optimize a PostgreSQL Index Scan if the array is sorted by CTID, the physical location of the rows in the heap table. This improves table access through the caches, particularly when the index has a poor correlation factor. Oracle has TABLE ACCESS BY INDEX ROWID BATCHED to accomplish this, but it doesn't bring all the advantages because, like the PostgreSQL Bitmap Heap Scan, it does not preserve the row ordering from the outer table.
The YugabyteDB Batched Nested Loop is an excellent example of the benefits of its innovative architecture. It uses a fork of PostgreSQL to offer comprehensive SQL features, inherited from PostgreSQL's maturity. However, unlike an extension, it is not restricted, and in terms of performance it can provide Distributed SQL scalability, including for joins and pagination.