Franck Pachot for YugabyteDB

Out of Range statistics with PostgreSQL & YugabyteDB

There can be issues with optimizer statistics when a query contains a predicate with a value that is out of range, such as higher than the maximum value gathered during the last ANALYZE.

If statistics were updated in real time, it would be easy for the planner to estimate the number of rows as zero or one (the planner never estimates zero rows). However, statistics are gathered by ANALYZE or Auto-Analyze and can become stale until the next run.
Typically, there are always a few rows above the known maximum on columns that are constantly increasing, such as sequences or current timestamps. Oracle Database uses linear decay for such out-of-range values. PostgreSQL leverages the index to get a quick, accurate maximum and adjust the estimation. YugabyteDB is currently implementing its cost-based optimizer, which is available in preview. Here is a small test I did to compare the estimates.

I created a table with a ts timestamp column, defaulting to the system time at insert, and a ty type column. I disabled Auto-Vacuum on it to prevent Auto-Analyze from running during the experiment:

postgres=# create table demo ( id bigserial primary key, ts timestamptz default clock_timestamp(), ty text )
           with (autovacuum_enabled = off);
CREATE TABLE

postgres=# insert into demo (ty,ts)
            select 'a', timestamp '2024-02-08 00:00:00'-generate_series(1,86400)*interval'1 second'
;
INSERT 0 86400

I inserted one day of data, February 7, 2024, with one row per second. I will query ranges of timestamps, including ranges that start after the existing maximum (a future date). Additionally, I will query with equality on ty for both an existing and a non-existing value.

Here is an example without statistics. Look at the estimated number of rows:

postgres=# explain (analyze, summary off) select * from demo
 where ts > timestamp '2024-02-07 14:00:00' - interval '1 hour';

                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=19652 width=48) (actual time=0.013..10.890 rows=39599 loops=1)
   Filter: (ts > '2024-02-07 13:00:00'::timestamp without time zone)
   Rows Removed by Filter: 46801
(3 rows)

postgres=# explain (analyze, summary off) select * from demo
 where ts > timestamp '2024-02-08 14:00:00' - interval '1 hour';

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=19652 width=48) (actual time=9.285..9.285 rows=0 loops=1)
   Filter: (ts > '2024-02-08 13:00:00'::timestamp without time zone)
   Rows Removed by Filter: 86400
(3 rows)

postgres=# explain (analyze, summary off) select * from demo
postgres-#  where ts > timestamp '2024-02-09 14:00:00' - interval '1 hour';

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=19652 width=48) (actual time=9.385..9.385 rows=0 loops=1)
   Filter: (ts > '2024-02-09 13:00:00'::timestamp without time zone)
   Rows Removed by Filter: 86400
(3 rows)

postgres=# explain (analyze, summary off) select * from demo
 where ty='a'
;

                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=295 width=48) (actual time=0.014..11.199 rows=86400 loops=1)
   Filter: (ty = 'a'::text)
(2 rows)

postgres=# explain (analyze, summary off) select * from demo
 where ty='b'
;

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=295 width=48) (actual time=7.595..7.595 rows=0 loops=1)
   Filter: (ty = 'b'::text)
   Rows Removed by Filter: 86400
(3 rows)

My objective is to analyze the accuracy of the estimated rows (cost=... rows=) compared to the actual ones (actual... rows=), through various scenarios: with and without ANALYZE, with inserts after the last ANALYZE, and with an index on the columns.

This has been run on PostgreSQL 16, and on YugabyteDB 2.19.3 with and without yb_enable_optimizer_statistics=on.
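
For the YugabyteDB runs, the parameter was toggled at the session level before each EXPLAIN. A minimal example, assuming a YSQL session:

set yb_enable_optimizer_statistics = on;  -- use the column statistics gathered by ANALYZE
set yb_enable_optimizer_statistics = off; -- fall back to the heuristic estimates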

Here is the summary from the above execution plans:

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |      19652 |         1000 |           1 |  39599
 ts>2024-02-08 |      19652 |         1000 |           1 |      0
 ts>2024-02-09 |      19652 |         1000 |           1 |      0
 ty='a'        |        295 |         1000 |           1 |  86400
 ty='b'        |        295 |         1000 |           1 |      0

I have rows only for the day of 2024-02-07, but without any statistics, the query planner doesn't know that and uses a generic calculation to estimate the number of rows.
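
To confirm that the planner has nothing to work with at this point, you can check the table-level and column-level statistics; this is a quick look I'm adding here, not part of the original test:

-- table-level statistics maintained in pg_class
select relname, reltuples, relpages from pg_class where relname = 'demo';
-- column-level statistics are populated by ANALYZE, so pg_stats has nothing yet
select attname, n_distinct, null_frac from pg_stats where tablename = 'demo';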

Let's ANALYZE the table and run the same queries:

postgres=# analyze demo;
ANALYZE

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |      39275 |        86400 |       39752 |  39599
 ts>2024-02-08 |          9 |        86400 |           9 |      0
 ts>2024-02-09 |          9 |        86400 |           9 |      0
 ty='a'        |      86400 |        86400 |       86400 |  86400
 ty='b'        |          1 |        86400 |           1 |      0

With the table and column statistics, the estimated number of rows is correct with PostgreSQL. For the values out of range, the planner estimates a small number, higher than zero, to account for stale statistics. With yb_enable_optimizer_statistics=off, YugabyteDB uses the number of rows (pg_class.reltuples) but not the column statistics. With yb_enable_optimizer_statistics=on, it uses them to estimate the selectivity with the PostgreSQL algorithm.
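
The column statistics behind these estimates can be inspected in pg_stats; for example (a quick check, not part of the original output), the upper histogram bound of ts shows the maximum value known at the time of the last ANALYZE:

select attname, n_distinct, most_common_vals, histogram_bounds
from pg_stats
where tablename = 'demo' and attname in ('ts', 'ty');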

Let's insert another day to see what happens when we query above the values gathered by the last ANALYZE:

postgres=# insert into demo (ty,ts)
select 'a', timestamp '2024-02-09 00:00:00'-generate_series(1,86400)*interval'1 second'
;
INSERT 0 86400

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |     125723 |        86400 |       39752 | 125999
 ts>2024-02-08 |      39422 |        86400 |           9 |  39599
 ts>2024-02-09 |         17 |        86400 |           9 |      0
 ty='a'        |     172800 |        86400 |       86400 | 172800
 ty='b'        |          1 |        86400 |           1 |      0

Even though the statistics were not gathered again, PostgreSQL has knowledge of the newly inserted rows (used by Auto-Vacuum and visible in pg_stat_all_tables) and provides accurate estimations.
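
Those counters are part of the cumulative statistics and can be checked directly; a quick query added for reference:

select n_tup_ins, n_live_tup, n_mod_since_analyze, last_analyze, last_autoanalyze
from pg_stat_all_tables
where relname = 'demo';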

Let's ANALYZE the table again:

postgres=# analyze demo;
ANALYZE

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |     125834 |       172800 |      125782 | 125999
 ts>2024-02-08 |      39053 |       172800 |       38853 |  39599
 ts>2024-02-09 |         17 |       172800 |          17 |      0
 ty='a'        |     172800 |       172800 |      172800 | 172800
 ty='b'        |          1 |       172800 |           1 |      0

The estimations are good with PostgreSQL and YugabyteDB when using the column statistics. However, the out-of-range 2024-02-09 is still over-estimated.

Let's create an index on the ts column and run the queries again:

postgres=# create index demots on demo(ts);
CREATE INDEX

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |     125834 |       172800 |      125782 | 125999
 ts>2024-02-08 |      39053 |       172800 |       38853 |  39599
 ts>2024-02-09 |          1 |       172800 |          17 |      0
 ty='a'        |     172800 |       172800 |      172800 | 172800
 ty='b'        |          1 |       172800 |           1 |      0

Thanks to the index, the out-of-range estimation is now correct with PostgreSQL (the planner always puts 1 instead of 0), because it can cheaply read the actual maximum value from the last block of the B-Tree index.
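
You can see the difference between what the statistics recorded and what the index can provide; a quick check added here, using the demots index created above:

-- the statistics' idea of the maximum (from the last ANALYZE)
select histogram_bounds from pg_stats where tablename = 'demo' and attname = 'ts';
-- the actual maximum, which an Index Only Scan on demots can return cheaply
select max(ts) from demo;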

Now that I have this index, I insert rows for an additional day and run the queries again:

postgres=# insert into demo (ty,ts)
            select 'a', timestamp '2024-02-10 00:00:00'-generate_series(1,86400)*interval'1 second'
;
INSERT 0 86400

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |     188694 |       172800 |      125782 | 212399
 ts>2024-02-08 |      58563 |       172800 |       38853 | 125999
 ts>2024-02-09 |       1165 |       172800 |          17 |  39599
 ty='a'        |     259122 |       172800 |      172800 | 259200
 ty='b'        |          1 |       172800 |           1 |      0

Having the actual maximum is not sufficient for an exact estimation but, without any new ANALYZE, PostgreSQL at least raised the estimation for the out-of-range values.

I insert ten more days and run the queries again:

postgres=# insert into demo (ty,ts)
            select 'b', timestamp '2024-02-20 00:00:00'-generate_series(1,864000)*interval'1 second'
;
INSERT 0 864000

 Query         | PostgreSQL | YB stats=off | YB stats=on |  Actual
---------------+------------+--------------+-------------+---------
 ts>2024-02-07 |     817749 |       172800 |      125782 | 1076399
 ts>2024-02-08 |     253794 |       172800 |       38853 |  989999
 ts>2024-02-09 |      10658 |       172800 |          17 |  903599
 ty='a'        |    1122965 |       172800 |      172800 |  259200
 ty='b'        |          1 |       172800 |           1 |  864000

The estimation error is now larger. This mechanism helps with slightly stale statistics, when Auto-Analyze didn't run recently, but it is not a solution if you stop Auto-Analyze.
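
If a table receives continuous inserts like this one, a common mitigation is to make Auto-Analyze trigger more often for it. Here is a sketch with per-table storage parameters (the values are arbitrary examples, and they only take effect once autovacuum is enabled again for the table):

alter table demo set (
 autovacuum_analyze_scale_factor = 0.01, -- analyze when modified rows exceed 1% of reltuples...
 autovacuum_analyze_threshold    = 1000  -- ...plus this fixed number of rows
);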

Note that I've inserted those new rows with a different ty value, but, without ANALYZE, the query planner doesn't see them and still under-estimates the number of rows for ty='b'.

We have seen that an index can help for increasing values; let's try the same here:

postgres=# create index demo_b on demo(ty asc);
CREATE INDEX

 Query         | PostgreSQL | YB stats=off | YB stats=on |  Actual
---------------+------------+--------------+-------------+---------
 ts>2024-02-07 |     817749 |       172800 |      125782 | 1076399
 ts>2024-02-08 |     253794 |       172800 |       38853 |  989999
 ts>2024-02-09 |      10658 |       172800 |          17 |  903599
 ty='a'        |    1123200 |       172800 |      172800 |  259200
 ty='b'        |          1 |       172800 |           1 |  864000

This didn't change. When you add rows with a new value in a column that is used by predicates, you must have it analyzed. In this example, the misestimation can have severe consequences on joins: when the estimation is 1 row, the PostgreSQL planner can choose a Nested Loop and end up running it with thousands of loops. YugabyteDB lowers the consequences of this with Batched Nested Loops.
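
For illustration, Batched Nested Loops can be tuned with a session parameter; assuming yb_bnl_batch_size is available in your YugabyteDB version, a minimal sketch:

set yb_bnl_batch_size = 1024; -- batch the outer rows so a misestimated Nested Loop does fewer round trips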

I disabled Auto Vacuum for the experiment. In PostgreSQL you should have it enabled:

postgres=# alter table demo set ( autovacuum_enabled = on );
ALTER TABLE

YugabyteDB doesn't need VACUUM: it is not subject to bloat, heap fetches, or transaction ID wraparound. If you rely on statistics (yb_enable_optimizer_statistics=on), you should schedule ANALYZE.


In summary, either you run with no statistics at all and rely on generic plans, which is not recommended in PostgreSQL, or you must be sure that Auto-Analyze runs regularly. PostgreSQL has some nice ways to avoid misestimates for out-of-range predicates, but they only work when the statistics are slightly stale. With YugabyteDB, until the Cost-Based Optimizer is GA with a cluster-aware cost model, you can use the rule-based optimizer for OLTP queries and help the planner with hints for more complex queries. If you use the Cost-Based Optimizer, it is important to analyze all tables and set yb_enable_optimizer_statistics to ensure that the query planner uses all available statistics. When you revert back, you must de-analyze the tables.
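
As an illustration of the hint option, YugabyteDB bundles pg_hint_plan, so hints are written as comments in front of the query. A minimal sketch, assuming the demo table and demots index from this post (object and alias names must match your query):

set pg_hint_plan.enable_hint = on;  -- make sure hints are processed
/*+ IndexScan(demo demots) */
explain select * from demo where ts > timestamp '2024-02-09 14:00:00' - interval '1 hour';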

Additionally, it is recommended to have a data model that is not too sensitive to small changes in cardinality. This can be achieved by creating the right indexes for your access patterns so that the best Scan and Join methods become obvious.
