Franck Pachot for YugabyteDB Distributed PostgreSQL Database

Posted on Jan 31, 2023 • Edited on Aug 14, 2023

SELECT COUNT(*) FROM 🚀WHERE some conditions ;

#yugabytedb #pushdown #offloading #sql

YugabyteDB does not re-use the PostgreSQL code as-is. To get high performance in a distributed environment, many operations have been pushed down and offloaded to the storage. The SQL processing happens in YSQL (the PostgreSQL backend) which gets rows from DocDB (the distributed storage and transaction).

Some of those optimizations were introduced in previous releases:

pushdowns for aggregates in release 2.0
pushdown for Seq Scan filters in 2.13
pushdown for Index Scan conditions in 2.15

How do they combine where running a select COUNT(*) with a WHERE clause?

I'm running this in version 2.17 and create a 1 million rows table with a range-sharded index:

create table demo ( id bigint primary key, val int );
insert into demo select n, n from generate_series(1,1000000) n;
create index demoi on demo(val asc, id) 
 split at values ( (333333),(666666) );
analyze demo;
set yb_enable_optimizer_statistics = on;

Count a few rows

I'll start with the most common case: the WHERE close is selective and that's why we have an index for it.

yugabyte=> explain (analyze, dist)
 select count(*) from demo where val between 1 and 42;

                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4.01..4.02 rows=1 width=8) (actual time=0.805..0.805 rows=1 loops=1)
   ->  Index Only Scan using demoi on demo  (cost=0.00..4.01 rows=1 width=0) (actual time=0.764..0.781 rows=42 loops=1)
         Index Cond: ((val >= 1) AND (val <= 42))
         Heap Fetches: 0
         Storage Index Read Requests: 1
         Storage Index Execution Time: 4.000 ms
 Planning Time: 0.072 ms
 Execution Time: 0.847 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 4.000 ms
 Peak Memory Usage: 24 kB
(12 rows)

This plan is efficient. The query planner estimated a few rows to be counted and choose the Index Only Scan. 42 rows were fetched from DocDB to YSQL in one read request, in 4 milliseconds, and then counted there for an additional 4 millisecond.

Index Scan and push-down

What is pushed down here? The answer is in the rows=42 actual output of the Scan operation. The index condition was pushed down, to scan a range of the index. But as it returned all 42 rows rather than the count, the Aggregate was not pushed down.

This is acceptable for a few rows. We will see in a second part what happens with a million of rows that you definitely don't want to transport to count them on another node.

Seq Scan and push-down

What would happen if, for any reason, an index scan was not possible?

I disable enable_indexscan for the query with a hint:

yugabyte=> explain (analyze, dist) 
           /*+  Set(enable_indexscan off) */
 select count(*) from demo where val between 1 and 42;

                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=104697.50..104697.51 rows=1 width=8) (actual time=3767.276..3767.276 rows=1 loops=1)
   ->  Seq Scan on demo  (cost=0.00..104697.50 rows=1 width=0) (actual time=26.033..3767.161 rows=42 loops=1)
         Filter: ((val >= 1) AND (val <= 42))
         Rows Removed by Filter: 999958
         Storage Table Read Requests: 978
         Storage Table Execution Time: 3652.016 ms
 Planning Time: 0.110 ms
 Execution Time: 3767.337 ms
 Storage Read Requests: 978
 Storage Write Requests: 0
 Storage Execution Time: 3652.016 ms
 Peak Memory Usage: 30 kB
(12 rows)

Now it takes more than 3 seconds because it has to scan all rows (one million) in DocDB, send all of them to the YSQL layer that applies the filter (Rows Removed by Filter: 999958), gets the remaining ones (rows=42) and finally count them to get one aggregated result (rows=1)

This is where we want to push down the filter by enabling the yb_enable_expression_pushdown session parameter:

yugabyte=> explain (analyze, dist) 
           /*+  Set(enable_indexscan off) Set(yb_enable_expression_pushdown on) */
 select count(*) from demo where val between 1 and 42;

                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=104697.50..104697.51 rows=1 width=8) (actual time=1233.028..1233.028 rows=1 loops=1)
   ->  Seq Scan on demo  (cost=0.00..104697.50 rows=1 width=0) (actual time=1233.015..1233.018 rows=3 loops=1)
         Remote Filter: ((val >= 1) AND (val <= 42))
         Partial Aggregate: true
         Storage Table Read Requests: 1
         Storage Table Execution Time: 1232.005 ms
 Planning Time: 0.110 ms
 Execution Time: 1233.096 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 1232.005 ms
 Peak Memory Usage: 30 kB
(12 rows)

This is a bit faster because the rows have been filtered in DocDB (Remote Filter) and also counted there (Partial Aggregate: true). Only the partial count has been fetched from the 3 tablets (rows=3) and has added to finalize the aggregate to get one count (rows=1).

When there is no index, this is the best we can do: 1.2 seconds here instead of the previous 3.7 seconds (without the push-down of filter and aggregate).

This is great but, unfortunately too simple if you have to use pg_hint_plan to force a full table scan. The Seq Scan you see in the execution plan is not a real Seq Scan. The aggregate push-down has been implemented in early versions of Yugabyte using the Foreign Scan node of PostgreSQL. You see Seq Scan because Foreign Scan was renamed in explain.c to hide this implementation detail.

When you think about it, it means that the SeqScan() hint forces another one, the real Seq Scan which is, in YugabyteDB, actually a subclass of it that is called YBSeqScan. I know you get lost, but don't panic, I've summarized that in a table at the end of this post. This scan is visible in the execution plan as YB Seq Scan and, in this version (2.17), it does not support the aggregate push down:

yugabyte=> explain (analyze, dist) 
           /*+  SeqScan(demo) Set(yb_enable_expression_pushdown on) */
 select count(*) from demo where val between 1 and 42;

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=104697.50..104697.51 rows=1 width=8) (actual time=1222.116..1222.116 rows=1 loops=1)
   ->  YB Seq Scan on demo  (cost=0.00..104697.50 rows=1 width=0) (actual time=1222.093..1222.105 rows=42 loops=1)
         Remote Filter: ((val >= 1) AND (val <= 42))
 Planning Time: 0.114 ms
 Execution Time: 1222.160 ms
 Storage Read Requests: 0
 Storage Write Requests: 0
 Storage Execution Time: 0.000 ms
 Peak Memory Usage: 30 kB
(9 rows)

The expression filter was pushed down, as a Remote Filter in this YB Seq Scan but the 42 rows have been fetched to be aggregated in YSQL.

Here, because there are only 42 rows, it doesn't show a difference in the response time.

Count many rows

I'll now count a lot of rows by changing the predicate:

yugabyte=> explain (analyze, dist) 
 select count(*) from demo where val between 42 and 1e6 ;

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=108021.14..108021.15 rows=1 width=8) (actual time=3790.036..3790.036 rows=1 loops=1)
   ->  Seq Scan on demo  (cost=0.00..107190.29 rows=332340 width=0) (actual time=4.047..3705.883 rows=999959 loops=1)
         Filter: ((val >= 42) AND ((val)::numeric <= '1000000'::numeric))
         Rows Removed by Filter: 41
         Storage Table Read Requests: 978
         Storage Table Execution Time: 3416.015 ms
 Planning Time: 0.072 ms
 Execution Time: 3790.096 ms
 Storage Read Requests: 978
 Storage Write Requests: 0
 Storage Execution Time: 3416.015 ms
 Peak Memory Usage: 30 kB
(12 rows)

Here, without any push down, the scan takes nearly 4 seconds, one million rows were fetched by YSQL to be filtered (removing only 41 rows) and counted.

An index is not better, even with expression pushdown, because there are many rows and the aggregate is not pushed down:

yugabyte=> explain (analyze, dist) 
 /*+  IndexOnlyScan(demo) Set(yb_enable_expression_pushdown on) */
 select count(*) from demo where val between 42 and 1e6 ;

                                                              QUERY PLAN                                                       
-----------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=124631.41..124631.42 rows=1 width=8) (actual time=5546.335..5546.336 rows=1 loops=1)
   ->  Index Only Scan using demoi on demo  (cost=0.00..122138.86 rows=997019 width=0) (actual time=5.589..5451.448 rows=999959 loops=1)
         Index Cond: (val >= 42)
         Remote Filter: ((val)::numeric <= '1000000'::numeric)
         Heap Fetches: 0
         Storage Index Read Requests: 978
         Storage Index Execution Time: 4928.021 ms
 Planning Time: 0.126 ms
 Execution Time: 5546.383 ms
 Storage Read Requests: 978
 Storage Write Requests: 0
 Storage Execution Time: 4928.021 ms
 Peak Memory Usage: 24 kB
(13 rows)

With the Seq Scan, the aggregate can be pushed down by enabling yb_enable_expression_pushdown:

yugabyte=> explain (analyze, dist) 
 /*+  Set(yb_enable_expression_pushdown on) */
 select count(*) from demo where val between 42 and 1e6 ;

                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=108021.14..108021.15 rows=1 width=8) (actual time=1359.976..1359.976 rows=1 loops=1)
   ->  Seq Scan on demo  (cost=0.00..107190.29 rows=332340 width=0) (actual time=1359.964..1359.967 rows=3 loops=1)
         Remote Filter: ((val >= 42) AND ((val)::numeric <= '1000000'::numeric))
         Partial Aggregate: true
         Storage Table Read Requests: 1
         Storage Table Execution Time: 1360.006 ms
 Planning Time: 0.104 ms
 Execution Time: 1360.037 ms
 Storage Read Requests: 1
 Storage Write Requests: 0
 Storage Execution Time: 1360.006 ms
 Peak Memory Usage: 30 kB
(12 rows)

Here, because the query planner estimations were good, a Seq Scan, which is actually the Foreign Scan, has been chosen. If you force a SeqScan() with a hint, the YB Seq Scan comes, without pushing down the aggregate:

yugabyte=> explain (analyze, dist) 
 /*+  SeqScan(demo) Set(yb_enable_expression_pushdown on) */
 select count(*) from demo where val between 42 and 1e6 ;

                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=108021.14..108021.15 rows=1 width=8) (actual time=2561.376..2561.376 rows=1 loops=1)
   ->  YB Seq Scan on demo  (cost=0.00..107190.29 rows=332340 width=0) (actual time=5.691..2403.967 rows=999959 loops=1)
         Remote Filter: ((val >= 42) AND ((val)::numeric <= '1000000'::numeric))
 Planning Time: 0.112 ms
 Execution Time: 2561.419 ms
 Storage Read Requests: 0
 Storage Write Requests: 0
 Storage Execution Time: 0.000 ms
 Peak Memory Usage: 30 kB

It is the same with the NoIndexScan() hint:

yugabyte=> explain (analyze, dist) 
 /*+  NoIndexScan(demo) Set(yb_enable_expression_pushdown on) */
 select count(*) from demo where val between 42 and 1e6 ;

                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=108021.14..108021.15 rows=1 width=8) (actual time=2763.550..2763.551 rows=1 loops=1)
   ->  YB Seq Scan on demo  (cost=0.00..107190.29 rows=332340 width=0) (actual time=6.953..2631.524 rows=999959 loops=1)
         Remote Filter: ((val >= 42) AND ((val)::numeric <= '1000000'::numeric))
 Planning Time: 0.115 ms
 Execution Time: 2763.651 ms
 Storage Read Requests: 0
 Storage Write Requests: 0
 Storage Execution Time: 0.000 ms
 Peak Memory Usage: 30 kB
(9 rows)

Then, if you need to force Full Table Scan that enables all push downs, you must use the query parameter planner, set at session level, or with the Set() hint, but not the Scan hints.

A summary as of YugabyteDB 2.17

Here is a summary:

	Seq Scan	YB Seq Scan	Index Scan
In EXPLAIN as	`Seq Scan`	`YB Seq Scan`	`Index Scan`
In code (`/src/postgres/`)	ForeignScan	YBSeqScan	IndexScan
Aggregate Push Down	`Partial Aggregate`✅
Expression Push Down	`Remote Filter`✅	`Remote Filter`✅	`Index Cond.`✅
Forced by hint	Set()	SeqScan()	IndexScan()

Note:

SeqScan in nodeSeqscan.c is not used
ForeignScan in ybc_fdw.c whill not be used anymore for YugabyteDB tables after https://github.com/yugabyte/yugabyte-db/issues/18082

Don't forget to enable the features in your session

set yb_enable_expression_pushdown = on;
set yb_enable_optimizer_statistics = on;

and have the tables analyzed. You shouldn't need to hint if the estimations are good.

DEV Community

SELECT COUNT(*) FROM 🚀WHERE some conditions ;

Count a few rows

Index Scan and push-down

Seq Scan and push-down

Count many rows

A summary as of YugabyteDB 2.17

Top comments (0)

Read next

Domina el arte del análisis de datos junto a AWS

LeetCode Meditations: Number of 1 Bits

The Definitive Guide to Maximizing Digital Marketing Approaches

Tips On How To Generate Income Online For Newbies: 21 Easy Ideas 2024