Franck Pachot for YugabyteDB

Out of Range statistics with PostgreSQL & YugabyteDB

There can be issues with optimizer statistics when a query contains a predicate with a value that is out of range, such as higher than the maximum value gathered during the last ANALYZE.

If statistics were updated in real time, it would be easy for the planner to estimate the number of rows as zero or one (the planner never estimates zero rows). However, statistics are gathered by ANALYZE or Auto-Analyze and can become stale until the next run.
Typically, there are always a few rows above the known maximum on columns that are constantly increasing, such as sequences or current timestamps. Oracle Database uses linear decay for such out-of-range values. PostgreSQL leverages the index to get a quick, accurate maximum and adjust the estimation. YugabyteDB is currently implementing its cost-based optimizer, which is available in preview. Here is a small test I did to compare the estimates.

I created a table with a ts timestamp column, defaulting to the system time at insert, and a ty type column. I disabled Auto-Vacuum on it to prevent Auto-Analyze from running during the experiment:

postgres=# create table demo ( id bigserial primary key, ts timestamptz default clock_timestamp(), ty text )
           with (autovacuum_enabled = off);
CREATE TABLE

postgres=# insert into demo (ty,ts)
            select 'a', timestamp '2024-02-08 00:00:00'-generate_series(1,86400)*interval'1 second'
;
INSERT 0 86400

I inserted one day of data, February 7, 2024, with one row per second. I will query ranges of timestamps, including ranges that start after the existing maximum (a future date). Additionally, I will query with equality on ty for both an existing and a non-existing value.

Here is an example without statistics. Look at the estimated number of rows:

postgres=# explain (analyze, summary off) select * from demo
 where ts > timestamp '2024-02-07 14:00:00' - interval '1 hour';

                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=19652 width=48) (actual time=0.013..10.890 rows=39599 loops=1)
   Filter: (ts > '2024-02-07 13:00:00'::timestamp without time zone)
   Rows Removed by Filter: 46801
(3 rows)

postgres=# explain (analyze, summary off) select * from demo
 where ts > timestamp '2024-02-08 14:00:00' - interval '1 hour';

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=19652 width=48) (actual time=9.285..9.285 rows=0 loops=1)
   Filter: (ts > '2024-02-08 13:00:00'::timestamp without time zone)
   Rows Removed by Filter: 86400
(3 rows)

postgres=# explain (analyze, summary off) select * from demo
postgres-#  where ts > timestamp '2024-02-09 14:00:00' - interval '1 hour';

                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=19652 width=48) (actual time=9.385..9.385 rows=0 loops=1)
   Filter: (ts > '2024-02-09 13:00:00'::timestamp without time zone)
   Rows Removed by Filter: 86400
(3 rows)

postgres=# explain (analyze, summary off) select * from demo
 where ty='a'
;

                                               QUERY PLAN
---------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=295 width=48) (actual time=0.014..11.199 rows=86400 loops=1)
   Filter: (ty = 'a'::text)
(2 rows)

postgres=# explain (analyze, summary off) select * from demo
 where ty='b'
;

                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1287.96 rows=295 width=48) (actual time=7.595..7.595 rows=0 loops=1)
   Filter: (ty = 'b'::text)
   Rows Removed by Filter: 86400
(3 rows)

My objective is to analyze the accuracy of the estimated rows (cost=... rows=) compared to the actual ones (actual... rows=), through various scenarios: with and without ANALYZE, with inserts after the last ANALYZE, and with an index on the columns.

This has been run on PostgreSQL 16, and on YugabyteDB 2.19.3 with and without yb_enable_optimizer_statistics=on.
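
For the YugabyteDB runs, the parameter was toggled at the session level before each EXPLAIN. A minimal example, assuming a YSQL session:

set yb_enable_optimizer_statistics = on;  -- use the column statistics gathered by ANALYZE
set yb_enable_optimizer_statistics = off; -- fall back to the heuristic estimates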

Here is the summary from the above execution plans:

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |      19652 |         1000 |           1 |  39599
 ts>2024-02-08 |      19652 |         1000 |           1 |      0
 ts>2024-02-09 |      19652 |         1000 |           1 |      0
 ty='a'        |        295 |         1000 |           1 |  86400
 ty='b'        |        295 |         1000 |           1 |      0

I have rows only for the day of 2024-02-07, but without any statistics, the query planner doesn't know that and uses a generic calculation to estimate the number of rows.
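
To confirm that the planner has nothing to work with at this point, you can check the table-level and column-level statistics; this is a quick look I'm adding here, not part of the original test:

-- table-level statistics maintained in pg_class
select relname, reltuples, relpages from pg_class where relname = 'demo';
-- column-level statistics are populated by ANALYZE, so pg_stats has nothing yet
select attname, n_distinct, null_frac from pg_stats where tablename = 'demo';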

Let's ANALYZE the table and run the same queries:

postgres=# analyze demo;
ANALYZE

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |      39275 |        86400 |       39752 |  39599
 ts>2024-02-08 |          9 |        86400 |           9 |      0
 ts>2024-02-09 |          9 |        86400 |           9 |      0
 ty='a'        |      86400 |        86400 |       86400 |  86400
 ty='b'        |          1 |        86400 |           1 |      0

With the table and column statistics, the estimated number of rows is correct with PostgreSQL. For the values out of range, the planner estimates a small number, higher than zero, to account for stale statistics. With yb_enable_optimizer_statistics=off, YugabyteDB uses the number of rows (pg_class.reltuples) but not the column statistics. With yb_enable_optimizer_statistics=on, it uses them to estimate the selectivity with the PostgreSQL algorithm.
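
The column statistics behind these estimates can be inspected in pg_stats; for example (a quick check, not part of the original output), the upper histogram bound of ts shows the maximum value known at the time of the last ANALYZE:

select attname, n_distinct, most_common_vals, histogram_bounds
from pg_stats
where tablename = 'demo' and attname in ('ts', 'ty');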

Let's insert another day to see what happens when we query above the values gathered by the last ANALYZE:

postgres=# insert into demo (ty,ts)
select 'a', timestamp '2024-02-09 00:00:00'-generate_series(1,86400)*interval'1 second'
;
INSERT 0 86400

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |     125723 |        86400 |       39752 | 125999
 ts>2024-02-08 |      39422 |        86400 |           9 |  39599
 ts>2024-02-09 |         17 |        86400 |           9 |      0
 ty='a'        |     172800 |        86400 |       86400 | 172800
 ty='b'        |          1 |        86400 |           1 |      0

Even though the statistics were not gathered again, PostgreSQL has knowledge of the newly inserted rows (used by Auto-Vacuum and visible in pg_stat_all_tables) and provides accurate estimations.
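
Those counters are part of the cumulative statistics and can be checked directly; a quick query added for reference:

select n_tup_ins, n_live_tup, n_mod_since_analyze, last_analyze, last_autoanalyze
from pg_stat_all_tables
where relname = 'demo';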

Let's ANALYZE the table again:

postgres=# analyze demo;
ANALYZE

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |     125834 |       172800 |      125782 | 125999
 ts>2024-02-08 |      39053 |       172800 |       38853 |  39599
 ts>2024-02-09 |         17 |       172800 |          17 |      0
 ty='a'        |     172800 |       172800 |      172800 | 172800
 ty='b'        |          1 |       172800 |           1 |      0

The estimations are good with PostgreSQL and YugabyteDB when using the column statistics. However, the out-of-range 2024-02-09 is still over-estimated.

Let's create an index on the ts column and run the queries again:

postgres=# create index demots on demo(ts);
CREATE INDEX

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |     125834 |       172800 |      125782 | 125999
 ts>2024-02-08 |      39053 |       172800 |       38853 |  39599
 ts>2024-02-09 |          1 |       172800 |          17 |      0
 ty='a'        |     172800 |       172800 |      172800 | 172800
 ty='b'        |          1 |       172800 |           1 |      0

Thanks to the index, the out-of-range estimation is now correct with PostgreSQL (the planner always puts 1 instead of 0), because it can cheaply read the actual maximum value from the last block of the B-Tree index.
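
You can see the difference between what the statistics recorded and what the index can provide; a quick check added here, using the demots index created above:

-- the statistics' idea of the maximum (from the last ANALYZE)
select histogram_bounds from pg_stats where tablename = 'demo' and attname = 'ts';
-- the actual maximum, which an Index Only Scan on demots can return cheaply
select max(ts) from demo;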

Now that I have this index, I insert rows for an additional day and run the queries again:

postgres=# insert into demo (ty,ts)
            select 'a', timestamp '2024-02-10 00:00:00'-generate_series(1,86400)*interval'1 second'
;
INSERT 0 86400

 Query         | PostgreSQL | YB stats=off | YB stats=on | Actual
---------------+------------+--------------+-------------+--------
 ts>2024-02-07 |     188694 |       172800 |      125782 | 212399
 ts>2024-02-08 |      58563 |       172800 |       38853 | 125999
 ts>2024-02-09 |       1165 |       172800 |          17 |  39599
 ty='a'        |     259122 |       172800 |      172800 | 259200
 ty='b'        |          1 |       172800 |           1 |      0

Having the actual maximum is not sufficient for an exact estimation but, without any new ANALYZE, PostgreSQL at least raised the estimation for the out-of-range values.

I insert ten more days and run the queries again:

postgres=# insert into demo (ty,ts)
            select 'b', timestamp '2024-02-20 00:00:00'-generate_series(1,864000)*interval'1 second'
;
INSERT 0 864000

 Query         | PostgreSQL | YB stats=off | YB stats=on |  Actual
---------------+------------+--------------+-------------+---------
 ts>2024-02-07 |     817749 |       172800 |      125782 | 1076399
 ts>2024-02-08 |     253794 |       172800 |       38853 |  989999
 ts>2024-02-09 |      10658 |       172800 |          17 |  903599
 ty='a'        |    1122965 |       172800 |      172800 |  259200
 ty='b'        |          1 |       172800 |           1 |  864000

The estimation error is now larger. This mechanism helps with slightly stale statistics, when Auto-Analyze didn't run recently, but it is not a solution if you stop Auto-Analyze.
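
If a table receives continuous inserts like this one, a common mitigation is to make Auto-Analyze trigger more often for it. Here is a sketch with per-table storage parameters (the values are arbitrary examples, and they only take effect once autovacuum is enabled again for the table):

alter table demo set (
 autovacuum_analyze_scale_factor = 0.01, -- analyze when modified rows exceed 1% of reltuples...
 autovacuum_analyze_threshold    = 1000  -- ...plus this fixed number of rows
);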

Note that I've inserted those new rows with a different ty value, but, without ANALYZE, the query planner doesn't see them and still under-estimates the number of rows for ty='b'.

We have seen that an index can help for increasing values; let's try the same here:

postgres=# create index demo_b on demo(ty asc);
CREATE INDEX

 Query         | PostgreSQL | YB stats=off | YB stats=on |  Actual
---------------+------------+--------------+-------------+---------
 ts>2024-02-07 |     817749 |       172800 |      125782 | 1076399
 ts>2024-02-08 |     253794 |       172800 |       38853 |  989999
 ts>2024-02-09 |      10658 |       172800 |          17 |  903599
 ty='a'        |    1123200 |       172800 |      172800 |  259200
 ty='b'        |          1 |       172800 |           1 |  864000

This didn't change. When you add rows with a new value in a column that is used by predicates, you must have it analyzed. In this example, the misestimation can have severe consequences on joins: when the estimation is 1 row, the PostgreSQL planner can choose a Nested Loop and end up running it with thousands of loops. YugabyteDB lowers the consequences of this with Batched Nested Loops.
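
For illustration, Batched Nested Loops can be tuned with a session parameter; assuming yb_bnl_batch_size is available in your YugabyteDB version, a minimal sketch:

set yb_bnl_batch_size = 1024; -- batch the outer rows so a misestimated Nested Loop does fewer round trips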

I disabled Auto Vacuum for the experiment. In PostgreSQL you should have it enabled:

postgres=# alter table demo set ( autovacuum_enabled = on );
ALTER TABLE

YugabyteDB doesn't need VACUUM: it is not subject to bloat, heap fetches, or transaction ID wraparound. If you rely on statistics (yb_enable_optimizer_statistics=on), you should schedule ANALYZE.


In summary, either you run with no statistics at all and rely on generic plans, which is not recommended in PostgreSQL, or you must be sure that Auto-Analyze runs regularly. PostgreSQL has some nice ways to avoid misestimates for out-of-range predicates, but they only work when the statistics are slightly stale. With YugabyteDB, until the Cost-Based Optimizer is GA with a cluster-aware cost model, you can use the rule-based optimizer for OLTP queries and help the planner with hints for more complex queries. If you use the Cost-Based Optimizer, it is important to analyze all tables and set yb_enable_optimizer_statistics to ensure that the query planner uses all available statistics. When you revert back, you must de-analyze the tables.
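
As an illustration of the hint option, YugabyteDB bundles pg_hint_plan, so hints are written as comments in front of the query. A minimal sketch, assuming the demo table and demots index from this post (object and alias names must match your query):

set pg_hint_plan.enable_hint = on;  -- make sure hints are processed
/*+ IndexScan(demo demots) */
explain select * from demo where ts > timestamp '2024-02-09 14:00:00' - interval '1 hour';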

Additionally, it is recommended to have a data model that is not too sensitive to small changes in cardinality. This can be achieved by creating the right indexes for your access patterns so that the best Scan and Join methods become obvious.
