Here is a small example inspired by this comment on HackerNews:
SELECT * FROM FOO WHERE ORGANIZATION_ID = 10 ORDER BY CREATED_AT LIMIT 10 OFFSET 0;
Postgres sorts by created_at using an index, but applies the filter on organization_id afterwards. This organization has a million rows... without the ORDER BY, the query runs in milliseconds. With it, in seconds or minutes.
I did the same test on SQL Server and SQLite; they executed the query correctly, using the correct index. I've never had to create an index to fix a bad plan in SQL Server, and I've used SQL Server a lot more than Postgres.
I created a temporary table for this demo, because temporary tables are not processed by autovacuum and auto-analyze, which lets me control when the optimizer statistics are gathered and observe their effect.
create extension if not exists pgcrypto;
drop table if exists demo;
create temporary table demo ( id uuid default gen_random_uuid() primary key, a int, b int, c int default 0 );
create index demoa on demo(a asc);
create index demob on demo(b asc);
insert into demo (a, b)
select a, b
from generate_series(1, 10) a,
     lateral (select generate_series(1, 100*a) b) b;
vacuum analyze demo;
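As a sanity check on the generated data (my own addition, not part of the original transcript): each value of a should get 100*a rows, 5,500 in total, which this query shows:
-- expected: 100*a rows per value of a, 5500 rows overall
select a, count(*) from demo group by a order by a;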
I run a query that filters on column a and sorts on column b to get the Top-10. Because I have two indexes, the query planner has the choice between:
- reading all rows for the a value with an index range scan, sorting them, and returning the first 10 rows
- reading all rows in b order with a full index scan, filtering on the a value, and stopping when 10 rows have been returned
This choice is not easy as it involves estimating the cost of sorting in the first case and the selectivity of the filter in the second case.
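Note that you can preview the second strategy without touching the schema, by discouraging sorts for the session (a side sketch, not something I ran in the demo below):
-- with sorts penalized, the planner falls back to the full scan of demob
set enable_sort = off;
explain select * from demo where a=1 order by b limit 10;
reset enable_sort;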
Here, with the table freshly vacuumed and analyzed, the query planner decides on the first solution: range scan and sort:
postgres=# explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=12.19..12.22 rows=10 width=28) (actual time=0.048..0.050 rows=10 loops=1)
Buffers: local hit=3
-> Sort (cost=12.19..12.44 rows=100 width=28) (actual time=0.047..0.048 rows=10 loops=1)
Sort Key: b
Sort Method: top-N heapsort Memory: 25kB
Buffers: local hit=3
-> Index Scan using demoa on demo (cost=0.28..10.03 rows=100 width=28) (actual time=0.015..0.032 rows=100 loops=1)
Index Cond: (a = 1)
Buffers: local hit=3
Planning Time: 0.088 ms
Execution Time: 0.066 ms
The cost of the index scan is estimated at 10.03.
Update
I run an update on all the rows I'm interested in, then vacuum, but do not analyze yet:
postgres=# update demo set c=c+1 where a =1;
UPDATE 100
postgres=# vacuum demo;
VACUUM
Because the statistics are the same, the plan for my query is the same, with a similar cost:
postgres=# explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Limit (cost=12.15..12.18 rows=10 width=28) (actual time=0.052..0.054 rows=10 loops=1)
Buffers: local hit=4
-> Sort (cost=12.15..12.40 rows=99 width=28) (actual time=0.051..0.052 rows=10 loops=1)
Sort Key: b
Sort Method: top-N heapsort Memory: 25kB
Buffers: local hit=4
-> Index Scan using demoa on demo (cost=0.28..10.02 rows=99 width=28) (actual time=0.016..0.035 rows=100 loops=1)
Index Cond: (a = 1)
Buffers: local hit=4
Planning Time: 0.095 ms
Execution Time: 0.069 ms
(11 rows)
The estimated cost is 10.02. If you have an idea why it is not exactly the same (10.02 instead of 10.03) when planning with exactly the same statistics, please share it in the comments.
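For anyone who wants to investigate, one input to check is pg_class.reltuples, which a plain VACUUM refreshes even without ANALYZE, and which drives the estimated row counts (100 vs 99 above):
-- vacuum (without analyze) still refreshes relpages/reltuples used by the planner
select relpages, reltuples from pg_class where relname='demo';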
Look at and save the statistics
Here are the statistics from before the UPDATE:
postgres=# select schemaname,tablename,attname,correlation,* from pg_stats where tablename='demo';
schemaname | tablename | attname | correlation | schemaname | tablename | attname | inherited | null_frac | avg_width | n_distinct | most_common_vals | most_common_freqs | histogram_bounds | correlation | most_common_elems | most_common_elem_freqs | elem_count_histogram
------------+-----------+---------+--------------+------------+-----------+---------+-----------+-----------+-----------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------+-------------------+------------------------+----------------------
pg_temp_3 | demo | id | -0.017304245 | pg_temp_3 | demo | id | f | 0 | 16 | -1 | | | {00177a50-262f-44cb-8d1e-3d9559d08fd3,02a53508-2345-4d4f-8f4c-7548af2d0c28,050e297f-1d09-4d99-9700-4766e6f368a5,08076212-f6f0-4109-a9d4-05b28eec858a,0a1c98d4-c8b5-47c1-97e7-67fa9a8b4236,0bd09320-af97-474c-8b6f-5befee23787b,0e88109a-f281-4657-a7b5-a9bfc36d8fb2,111b3f37-7a43-4292-ae6a-fa19c8eb96bd,13d0df94-04f4-44d0-a931-bf084785aeca,16c67cc1-8893-4086-97f8-6331748e8e76,18fcf9a4-683a-41d6-964f-5c8290137f1e,1b4c73c7-2bc5-41cc-827c-c46897e212c9,1e47c3bc-d8d0-42cd-8cf5-cc7870221c59,2090ca69-4094-45f9-985f-a137bbb449b2,2347fa3d-c4af-4ec9-93ca-4ffbc7ce7a3e,259e5b31-dd0a-4c30-b1f7-7cb9294a047e,285b7cd9-8834-4ba4-99b1-02e665d53b8b,2ad15788-b68c-4f13-b96d-f8dc72ec8d3c,2d442a98-b977-47c5-b321-a28bec5bbee7,300b5f1c-24b9-42eb-8d70-c6e6b1f010b5,327121cf-8813-4eef-b769-273512ade3ec,353cd2b6-0bcd-4392-800b-398f1f95a26d,3771edc7-277f-451e-99ed-7dd2cc0358c8,3abae78e-9739-45bc-9d2a-b6a891cc8c5f,3d73e280-91c3-4bcd-ab1f-ad0e3c798a7c,40155190-712e-4a03-a717-537c10d0cbee,430755f5-efde-4adc-8166-bca0f6bd185a,457decf4-aff4-44ac-a527-9896339e9d78,4822960c-86d8-405f-92bb-704f12100797,4aa22a76-8796-46ef-9e50-8735913f246e,4cf59db9-d6a2-477e-a7fb-e7f1ff392345,50229955-60ee-461d-b810-6299e8ede8b4,530e85d3-fde4-4413-bc36-7b44925d6e9c,5657c3bc-9f9c-402b-b3b0-b46b57f91ee5,58f797ef-6bd0-4ea9-bad5-bdee1e1ea8b9,5b3f6fbe-b55f-4faa-82d8-c092d50db5d4,5df5155f-2015-42f2-a4f4-3faf086058b9,60728578-f62f-4aa8-9a8b-3ea897fa7681,637eb296-1dc8-41d7-949f-64e9d2309d82,65efdb61-0c0a-4e04-a0aa-e6cc2e68f95b,6882bef1-27a1-429c-b80e-e155836b7e25,6b150811-621f-4331-98e4-726c7487efca,6db90715-69ed-461d-b55f-7b0c163dae13,70ddbe21-332b-49c0-aad0-645577c7ed4a,730b1f18-6245-47f6-9c1e-c74ae23741d8,75450bab-d60b-43c3-a7eb-eb7aad0790df,77cf7a70-e1c7-46f2-846b-fc186e5800d1,7a60c937-700c-4872-bff3-b9ee7deeb4bb,7d413b35-b2b6-40e8-8b98-a69b4adf4873,7fb76dbe-8d52-414b-9baf-f8768f403c91,824c969f-9af6-4151-8462-a01856cd35c9,84bc8a53-2ee3-4d54-be7a-5849837f5c23,878ed8b4-f802-464d-865f-c394f611c254,898c5890-8e9e-402c-9500-93ddb716e54a,8bc7aa7d-9e25-444b-abdd-9f970c76d226,8e16229c-d6a3-4d6b-9d54-9063e44fee0d,9050e18c-55f6-4066-8a37-78f48ef990e0,9269e5a7-edd8-44cf-a850-9c5f6f0cf2ec,94d8c19c-afd2-4765-9aea-92aa2a7c50cf,978df8ba-90c8-4178-b46f-59a8414436a7,99d89759-5cdf-4fce-ad43-6f34bcae4291,9c284ad7-c3a2-47ef-a5d2-7a1fb840863b,9ed32aa3-eef8-4e2c-b279-b2a79f062802,a163bb22-b901-4303-85a9-62ab7b5a99a9,a3e03662-827a-40f1-a1c6-690716d45419,a6a625b7-5522-44d0-95f0-bf2bdbef932c,a95e51dc-0956-4475-83ae-a41d391f7169,ac906b78-acef-4c17-ba61-02e1514811ff,af2e92a8-61a0-4893-bed6-1fd319c01c09,b20dbd0d-4a9f-401c-88a4-25b84d215f1c,b4ceda62-7683-4b53-bb9e-46209dce724d,b759fd6f-c8ff-4411-9ccd-4ffd3dc26038,b9db83b4-52d9-4372-a30e-0e7fb305af8a,bc1f7aef-3a4e-4fa2-80c7-c70039d354fb,bddfd17f-6734-43a2-9e78-55f1eca74fb4,c03054ba-7873-4b96-af6c-be18e6c3f2f3,c29d0d3e-f52d-4613-97c6-f6f27c6f0776,c48eed2f-7d92-4934-8953-a40cd5ed4e83,c6df1a12-9da9-48b8-b03b-87bf855bda8d,c979d826-7143-41c0-9336-076d6ef693ec,cc3804b0-88d6-4c31-83e2-5f59a6dcc462,ced519b1-b856-4f00-b8f8-a0be4199affd,d1ab2aed-e2c8-42ad-a7bd-a687c0b5d5b2,d3f3bcb9-26c5-4681-9b85-96bb1267f69b,d65c5c7e-11c7-4cc6-98d2-90fc107035ca,d8ca7739-f8cf-4fd1-bc29-a9d6fdd44030,db4e11cf-e6e0-40ed-ba4c-a7551b9a7ab1,de7fa6de-90dc-49b8-b86a-afc2cbe11523,e0d8186a-11a8-4000-a630-caa86b5d5a16,e33eb335-ba8b-4aa1-aac1-538e1be9e006,e5c43d4f-f2bf-4027-bcaa-ba185f1f55fe,e81b88d8-4abb-4a03-9492-fd716fcb9def,eb774db0-7e61-4821-8543-b57138692fa8,ee2e8e26-20bc-4784-8177-14aa
f60081c8,f0b78a9f-ae6e-4e1d-89f8-ec253c9c572d,f359d614-ffb4-4c5d-9e66-518d70ffe26a,f5b9d45c-1388-4a26-8e16-5dd331b69715,f8340cbf-010b-4f58-bd3e-9864fbab28be,fad8492e-f12f-472e-9eb6-b70e4714d81e,fd5b3384-193c-4be3-a122-71fd6f4cedff,ffebdedd-d88e-4037-8942-dcc611d6c4c7} | -0.017304245 | | |
pg_temp_3 | demo | a | 1 | pg_temp_3 | demo | a | f | 0 | 4 | 10 | {10,9,8,7,6,5,4,3,2,1} | {0.18181819,0.16363636,0.14545454,0.12727273,0.10909091,0.09090909,0.07272727,0.054545455,0.036363635,0.018181818} | | 1 | | |
pg_temp_3 | demo | b | 0.57577896 | pg_temp_3 | demo | b | f | 0 | 4 | -0.18181819 | {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100} | {0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818,0.0018181818} | {101,105,110,115,120,125,130,135,140,145,150,155,160,165,170,175,180,185,190,195,200,206,212,217,223,229,234,240,245,251,257,262,268,274,279,285,290,296,302,308,315,321,328,334,340,347,353,360,366,373,379,385,392,398,405,413,420,428,435,443,450,458,465,473,480,488,495,503,512,521,530,539,548,557,566,575,584,593,603,614,625,637,648,659,670,682,693,705,720,735,750,765,780,795,815,838,860,883,910,955,1000} | 0.57577896 | | |
pg_temp_3 | demo | c | 1 | pg_temp_3 | demo | c | f | 0 | 4 | 1 | {0} | {1} | | 1 | | |
(4 rows)
postgres=# select * from pg_class where relname='demo';
oid | relname | relnamespace | reltype | reloftype | relowner | relam | relfilenode | reltablespace | relpages | reltuples | relallvisible | reltoastrelid | relhasindex | relisshared | relpersistence | relkind | relnatts | relchecks | relhasrules | relhastriggers | relhassubclass | relrowsecurity | relforcerowsecurity | relispopulated | relreplident | relispartition | relrewrite | relfrozenxid | relminmxid | relacl | reloptions | relpartbound
-------+---------+--------------+---------+-----------+----------+-------+-------------+---------------+----------+-----------+---------------+---------------+-------------+-------------+----------------+---------+----------+-----------+-------------+----------------+----------------+----------------+---------------------+----------------+--------------+----------------+------------+--------------+------------+--------+------------+--------------
18338 | demo | 18145 | 18340 | 0 | 10 | 2 | 18338 | 0 | 42 | 5428 | 42 | 0 | t | f | t | r | 4 | 0 | f | f | f | f | f | t | d | f | 0 | 2026 | 1 | | |
(1 row)
Because I already know which statistic will make the difference, I save the index correlation for column a by generating the statements that can set it back later:
postgres=# select string_agg(
format('update pg_statistic set stanumbers%s=%L where starelid=%s and staattnum=%s and stakind%s=3;'
,n,array[correlation],starelid,staattnum,n
),e'\n' order by staattnum) as restore_correlation
from pg_stats
natural join (select oid as starelid, relname as tablename, relnamespace from pg_class) c
natural join (select oid as relnamespace, nspname as schemaname from pg_namespace) n
natural join (select attrelid as starelid, attname, attnum as staattnum from pg_attribute) a
cross join (select generate_series(1,5) n) g
where tablename='demo' and attname='a';
restore_correlation
------------------------------------------------------------------------------------------------
update pg_statistic set stanumbers1='{1}' where starelid=18338 and staattnum=2 and stakind1=3;+
update pg_statistic set stanumbers2='{1}' where starelid=18338 and staattnum=2 and stakind2=3;+
update pg_statistic set stanumbers3='{1}' where starelid=18338 and staattnum=2 and stakind3=3;+
update pg_statistic set stanumbers4='{1}' where starelid=18338 and staattnum=2 and stakind4=3;+
update pg_statistic set stanumbers5='{1}' where starelid=18338 and staattnum=2 and stakind5=3;
(1 row)
postgres=# \gset
You may be surprised by my query, but pg_statistic is not very relational: the same value can go into different slots, which is why I generate multiple updates for one value.
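To see those slots, here is a quick inspection query (column a has attnum 2 in this table, as visible in the generated statements):
-- each pg_statistic row has five generic slots; the one whose stakindN=3 holds the correlation
select staattnum, stakind1, stakind2, stakind3, stakind4, stakind5
from pg_statistic
where starelid = 'demo'::regclass and staattnum = 2;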
The point of this article is that I do not expect a change when analyzing the table, because my updated rows are still well clustered: my query can get all of them from the same heap page:
postgres=# select ctid,* from demo where a=1 order by b limit 10;
ctid | id | a | b | c
---------+--------------------------------------+---+----+---
(40,61) | 163d93e7-ed84-4be8-9f87-f8771895490e | 1 | 1 | 1
(40,62) | a6fd2b75-7d96-4af5-a0fd-6deb8f03e2ed | 1 | 2 | 1
(40,63) | 8d730e02-a4c0-45bf-afc9-5a44f9241a64 | 1 | 3 | 1
(40,64) | 1c0d6e1c-e88e-47f0-bc8d-71daa85170dd | 1 | 4 | 1
(40,65) | 130ffe14-705a-476c-94b6-fa0354efe994 | 1 | 5 | 1
(40,66) | 3dbff06e-5126-4abb-87c7-a24e07a3e37a | 1 | 6 | 1
(40,67) | 92248fb9-10ba-47a7-b13f-c521e1e11a4b | 1 | 7 | 1
(40,68) | 291239e1-23a7-4398-b6af-a17352fab837 | 1 | 8 | 1
(40,69) | c9c997d1-0df4-4a81-bee3-989fe3d42cb9 | 1 | 9 | 1
(40,70) | d54c50c4-fb1a-4126-b15c-11de36a7316b | 1 | 10 | 1
(10 rows)
Looking at the CTIDs, the rows I'm interested in seem to be well clustered.
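To quantify this, a small helper query (my own addition, using the text-to-point cast to extract the block number from the ctid):
-- count the distinct heap blocks holding the 100 rows for a=1
select count(distinct (ctid::text::point)[0]) as heap_blocks
from demo where a=1;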
Analyze
I analyze the table, as auto-analyze would do on a non-temporary table, and run my query again:
postgres=# analyze demo;
ANALYZE
postgres=# explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..28.20 rows=10 width=28) (actual time=0.020..0.048 rows=10 loops=1)
Buffers: local hit=102
-> Index Scan using demob on demo (cost=0.28..279.41 rows=100 width=28) (actual time=0.019..0.046 rows=10 loops=1)
Filter: (a = 1)
Rows Removed by Filter: 90
Buffers: local hit=102
Planning:
Buffers: shared hit=16
Planning Time: 0.147 ms
Execution Time: 0.060 ms
(10 rows)
The query planner has chosen the other index to avoid a sort, at the price of reading more rows and discarding those that do not match the filter. Obviously, this was a bad choice: it reads 102 buffers instead of 4.
I can prove that it was a bad choice by removing the index, or by forcing the access path with the /*+ IndexScan(demo demoa) */ hint if pg_hint_plan is installed.
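For reference, the hinted variant would look like this (a sketch, assuming pg_hint_plan is loaded in the session, for example with LOAD 'pg_hint_plan'):
/*+ IndexScan(demo demoa) */
explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
Here, I take the simpler route and drop the index: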
postgres=# drop index demob;
DROP INDEX
postgres=# explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=45.44..45.47 rows=10 width=28) (actual time=0.043..0.045 rows=10 loops=1)
Buffers: local hit=4
-> Sort (cost=45.44..45.69 rows=100 width=28) (actual time=0.042..0.043 rows=10 loops=1)
Sort Key: b
Sort Method: top-N heapsort Memory: 25kB
Buffers: local hit=4
-> Index Scan using demoa on demo (cost=0.28..43.28 rows=100 width=28) (actual time=0.009..0.027 rows=100 loops=1)
Index Cond: (a = 1)
Buffers: local hit=4
Planning:
Buffers: shared hit=5 dirtied=1
Planning Time: 0.120 ms
Execution Time: 0.058 ms
(13 rows)
This plan is much better, but the PostgreSQL query planner didn't choose it: it estimated its cost at 45.47, higher than the 28.20 estimated with the other index. The reason for this higher estimate lies in the Index Scan, costed at 10.02 before the analyze and 43.28 after it. The factor is not huge, but it is enough to flip the access path.
Correlation
I suspect the correlation factor, which is used to estimate the random reads to the table when rows are fetched through index entries:
postgres=# select schemaname,tablename,attname,correlation from pg_stats where tablename='demo'
;
schemaname | tablename | attname | correlation
------------+-----------+---------+--------------
pg_temp_3 | demo | id | -0.023444278
pg_temp_3 | demo | a | 0.89289254
pg_temp_3 | demo | b | 0.48658225
pg_temp_3 | demo | c | 1
(4 rows)
It was 1 for column a before the analyze, but 0.89289254 after. That doesn't look very different, but it is enough for the query planner to choose another index.
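The reason a small drop matters so much: as far as I know, cost_index() in the planner's costsize.c interpolates the index scan I/O cost between a best case (heap ordered like the index) and a worst case (random heap order), weighted by the square of the correlation. A quick computation of that weight:
-- roughly: run_io_cost = max_io_cost + csquared * (min_io_cost - max_io_cost)
select 1.0^2 as csquared_before, 0.89289254^2 as csquared_after;
Going from 1 to 0.89 moves the weight from 1 to about 0.80, so the scan is suddenly costed 20% of the way toward the random-I/O worst case.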
This is why I saved the previous value, and I now run the generated statements to set it back:
postgres=# select :'restore_correlation';
?column?
------------------------------------------------------------------------------------------------
update pg_statistic set stanumbers1='{1}' where starelid=18338 and staattnum=2 and stakind1=3;+
update pg_statistic set stanumbers2='{1}' where starelid=18338 and staattnum=2 and stakind2=3;+
update pg_statistic set stanumbers3='{1}' where starelid=18338 and staattnum=2 and stakind3=3;+
update pg_statistic set stanumbers4='{1}' where starelid=18338 and staattnum=2 and stakind4=3;+
update pg_statistic set stanumbers5='{1}' where starelid=18338 and staattnum=2 and stakind5=3;
(1 row)
postgres=# :restore_correlation
UPDATE 0
UPDATE 1
UPDATE 0
UPDATE 0
UPDATE 0
postgres=# select schemaname,tablename,attname,correlation from
pg_stats where tablename='demo'
;
schemaname | tablename | attname | correlation
------------+-----------+---------+--------------
pg_temp_3 | demo | id | -0.023444278
pg_temp_3 | demo | a | 1
pg_temp_3 | demo | b | 0.48658225
pg_temp_3 | demo | c | 1
(4 rows)
The pg_statistic table is a bit weird: the values can sit in different slots, which is why I had several updates, each targeting statistic kind number 3, the correlation.
Using the previous correlation factor
With the previous correlation, the cost is back to 10.03, which means that the right index would have been picked even if I hadn't dropped the other one:
postgres=# explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=12.19..12.22 rows=10 width=28) (actual time=0.043..0.045 rows=10 loops=1)
Buffers: local hit=4
-> Sort (cost=12.19..12.44 rows=100 width=28) (actual time=0.042..0.043 rows=10 loops=1)
Sort Key: b
Sort Method: top-N heapsort Memory: 25kB
Buffers: local hit=4
-> Index Scan using demoa on demo (cost=0.28..10.03 rows=100 width=28) (actual time=0.008..0.026 rows=100 loops=1)
Index Cond: (a = 1)
Buffers: local hit=4
Planning:
Buffers: shared hit=3
Planning Time: 0.144 ms
Execution Time: 0.058 ms
(13 rows)
However, this is not a sustainable solution: a regular table will be analyzed again, the query will get a higher cost, and a bad index may be chosen again.
postgres=# vacuum full demo;
VACUUM
postgres=# analyze demo;
ANALYZE
postgres=# explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=44.63..44.66 rows=10 width=28) (actual time=0.039..0.041 rows=10 loops=1)
Buffers: local hit=4
-> Sort (cost=44.63..44.88 rows=100 width=28) (actual time=0.038..0.039 rows=10 loops=1)
Sort Key: b
Sort Method: top-N heapsort Memory: 25kB
Buffers: local hit=4
-> Index Scan using demoa on demo (cost=0.28..42.47 rows=100 width=28) (actual time=0.007..0.023 rows=100 loops=1)
Index Cond: (a = 1)
Buffers: local hit=4
Planning:
Buffers: shared hit=12
Planning Time: 0.138 ms
Execution Time: 0.054 ms
(13 rows)
postgres=# select ctid,* from demo where a=1 order by b limit 10
;
ctid | id | a | b | c
----------+--------------------------------------+---+----+---
(39,97) | 163d93e7-ed84-4be8-9f87-f8771895490e | 1 | 1 | 1
(39,98) | a6fd2b75-7d96-4af5-a0fd-6deb8f03e2ed | 1 | 2 | 1
(39,99) | 8d730e02-a4c0-45bf-afc9-5a44f9241a64 | 1 | 3 | 1
(39,100) | 1c0d6e1c-e88e-47f0-bc8d-71daa85170dd | 1 | 4 | 1
(39,101) | 130ffe14-705a-476c-94b6-fa0354efe994 | 1 | 5 | 1
(39,102) | 3dbff06e-5126-4abb-87c7-a24e07a3e37a | 1 | 6 | 1
(39,103) | 92248fb9-10ba-47a7-b13f-c521e1e11a4b | 1 | 7 | 1
(39,104) | 291239e1-23a7-4398-b6af-a17352fab837 | 1 | 8 | 1
(39,105) | c9c997d1-0df4-4a81-bee3-989fe3d42cb9 | 1 | 9 | 1
(39,106) | d54c50c4-fb1a-4126-b15c-11de36a7316b | 1 | 10 | 1
(10 rows)
I listed the CTIDs to show you that the query planner was wrong. My rows are well correlated: all in the same block, in the same order as the indexed value.
PageInspect
I can check with PageInspect that the index entries in the index leaf block are in the same order as the rows in the heap table:
postgres=# create extension if not exists pageinspect;
CREATE EXTENSION
postgres=# select * from bt_metap('demoa');
magic | version | root | level | fastroot | fastlevel | last_cleanup_num_delpages | last_cleanup_num_tuples | allequalimage
--------+---------+------+-------+----------+-----------+---------------------------+-------------------------+---------------
340322 | 4 | 3 | 1 | 3 | 1 | 0 | -1 | t
(1 row)
postgres=# select substr(data,1,30), itemoffset, ctid, itemlen, nulls, vars, dead , *
from bt_page_items(get_raw_page('demoa', 1))
where data < '02' order by data
;
substr | itemoffset | ctid | itemlen | nulls | vars | dead | itemoffset | ctid | itemlen | nulls | vars | data | dead | htid | tids
-------------------------+------------+-----------+---------+-------+------+------+------------+-----------+---------+-------+------+-------------------------+------+---------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
01 00 00 00 00 00 00 00 | 2 | (16,8292) | 616 | f | f | f | 2 | (16,8292) | 616 | f | f | 01 00 00 00 00 00 00 00 | f | (39,97) | {"(39,97)","(39,98)","(39,99)","(39,100)","(39,101)","(39,102)","(39,103)","(39,104)","(39,105)","(39,106)","(39,107)","(39,108)","(39,109)","(39,110)","(39,111)","(39,112)","(39,113)","(39,114)","(39,115)","(39,116)","(39,117)","(39,118)","(39,119)","(39,120)","(39,121)","(39,122)","(39,123)","(39,124)","(39,125)","(39,126)","(39,127)","(39,128)","(39,129)","(39,130)","(39,131)","(39,132)","(39,133)","(39,134)","(39,135)","(39,136)","(40,1)","(40,2)","(40,3)","(40,4)","(40,5)","(40,6)","(40,7)","(40,8)","(40,9)","(40,10)","(40,11)","(40,12)","(40,13)","(40,14)","(40,15)","(40,16)","(40,17)","(40,18)","(40,19)","(40,20)","(40,21)","(40,22)","(40,23)","(40,24)","(40,25)","(40,26)","(40,27)","(40,28)","(40,29)","(40,30)","(40,31)","(40,32)","(40,33)","(40,34)","(40,35)","(40,36)","(40,37)","(40,38)","(40,39)","(40,40)","(40,41)","(40,42)","(40,43)","(40,44)","(40,45)","(40,46)","(40,47)","(40,48)","(40,49)","(40,50)","(40,51)","(40,52)","(40,53)","(40,54)","(40,55)","(40,56)","(40,57)","(40,58)","(40,59)","(40,60)"}
(1 row)
This looks well correlated to me, but probably not enough according to the statistics.
Reorganize the table (CLUSTER)
Another solution is to reorganize the table to get the best correlation with the index, using CLUSTER:
postgres=# cluster demo using demoa;
CLUSTER
postgres=# analyze demo;
ANALYZE
postgres=# explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Limit (cost=12.19..12.22 rows=10 width=28) (actual time=0.042..0.044 rows=10 loops=1)
Buffers: local hit=1 read=2
-> Sort (cost=12.19..12.44 rows=100 width=28) (actual time=0.042..0.042 rows=10 loops=1)
Sort Key: b
Sort Method: top-N heapsort Memory: 25kB
Buffers: local hit=1 read=2
-> Index Scan using demoa on demo (cost=0.28..10.03 rows=100 width=28) (actual time=0.010..0.024 rows=100 loops=1)
Index Cond: (a = 1)
Buffers: local hit=1 read=2
Planning:
Buffers: shared hit=16, local read=2
Planning Time: 0.197 ms
Execution Time: 0.057 ms
(13 rows)
postgres=# select ctid,* from demo where a=1 order by b limit 10
;
ctid | id | a | b | c
--------+--------------------------------------+---+----+---
(0,1) | 163d93e7-ed84-4be8-9f87-f8771895490e | 1 | 1 | 1
(0,2) | a6fd2b75-7d96-4af5-a0fd-6deb8f03e2ed | 1 | 2 | 1
(0,3) | 8d730e02-a4c0-45bf-afc9-5a44f9241a64 | 1 | 3 | 1
(0,4) | 1c0d6e1c-e88e-47f0-bc8d-71daa85170dd | 1 | 4 | 1
(0,5) | 130ffe14-705a-476c-94b6-fa0354efe994 | 1 | 5 | 1
(0,6) | 3dbff06e-5126-4abb-87c7-a24e07a3e37a | 1 | 6 | 1
(0,7) | 92248fb9-10ba-47a7-b13f-c521e1e11a4b | 1 | 7 | 1
(0,8) | 291239e1-23a7-4398-b6af-a17352fab837 | 1 | 8 | 1
(0,9) | c9c997d1-0df4-4a81-bee3-989fe3d42cb9 | 1 | 9 | 1
(0,10) | d54c50c4-fb1a-4126-b15c-11de36a7316b | 1 | 10 | 1
(10 rows)
This is usually not an acceptable solution, because it requires downtime (unless you do it with pg_repack instead), may have side effects with other indexes (you can cluster on one index only), and is not sustainable (new updates will shuffle the rows again).
Provide the best index to avoid bad plans
My go-to solution for addressing plan instability is to use a more robust index that isn't sensitive to small fluctuations in query planner estimates. When you filter and sort, a single index should provide both:
postgres=# create index demoab on demo(a asc, b asc);
CREATE INDEX
postgres=# explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..8.26 rows=10 width=28) (actual time=0.012..0.015 rows=10 loops=1)
Buffers: local hit=1 read=2
-> Index Scan using demoab on demo (cost=0.28..80.03 rows=100 width=28) (actual time=0.011..0.013 rows=10 loops=1)
Index Cond: (a = 1)
Buffers: local hit=1 read=2
Planning:
Buffers: shared hit=18, local read=1
Planning Time: 0.166 ms
Execution Time: 0.027 ms
(9 rows)
The cost is smaller than with the index on b that I deleted. To be sure, let's re-create it and test again:
postgres=# create index demob on demo(b asc);
CREATE INDEX
postgres=# update demo set c=c+1 where a =1;
UPDATE 100
postgres=# vacuum analyze demo;
VACUUM
postgres=# select schemaname,tablename,attname,correlation from pg_stats where tablename='demo';
schemaname | tablename | attname | correlation
------------+-----------+---------+--------------
pg_temp_3 | demo | id | -0.023419052
pg_temp_3 | demo | a | 0.89289254
pg_temp_3 | demo | b | 0.4865387
pg_temp_3 | demo | c | 1
(4 rows)
postgres=# explain (analyze, buffers)
select * from demo where a=1
order by b limit 10
;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..10.30 rows=10 width=28) (actual time=0.010..0.014 rows=10 loops=1)
Buffers: local hit=3
-> Index Scan using demoab on demo (cost=0.28..100.49 rows=100 width=28) (actual time=0.010..0.012 rows=10 loops=1)
Index Cond: (a = 1)
Buffers: local hit=3
Planning:
Buffers: shared hit=14, local hit=1
Planning Time: 0.273 ms
Execution Time: 0.026 ms
(9 rows)
The right index is still used, but in my opinion the cost difference (10.30 vs. 28.20) is still too small to remove all risk. To be safe, we can:
- change to a covering index on demo (a, b) include (c, id) that will make the query cheaper when using this index (see the sketch after this list)
- use pg_hint_plan to force the index we have designed for this query: /*+ IndexScan(demo demoab) */
- use pg_hint_plan to increase the cost of reading too many tuples: /*+ Set(cpu_tuple_cost 1) */
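Here is a sketch of the first option (the index name demoab_cov is my own; with all selected columns present in the index, and the visibility map maintained by vacuum, the plan can even become an Index Only Scan):
-- covering index: keys for filter and order, included columns for the select list
create index demoab_cov on demo (a asc, b asc) include (c, id);
vacuum analyze demo;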
You may be surprised by the recommendation to use hints, but this is a limitation of the cost-based optimizer that we need to work around. For such a query, you are more concerned about scalability, and about keeping the response time proportional to the number of rows in the result. A plan that has to read more rows, sort them, and discard them should be avoided even if its estimated cost is lower.
Note that, as far as I know, extended statistics are not useful here because PostgreSQL doesn't gather a correlation for them. There is only the single-column correlation, which doesn't show the true correlation of a multi-column index. For comparison, Oracle stores the correlation (called the clustering factor) per index rather than per column.
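For the record, here is what extended statistics can capture (a sketch; the mcv kind needs PostgreSQL 12 or later). None of these kinds records physical ordering, which is the point above:
-- ndistinct, functional dependencies, and multi-column MCV lists are covered,
-- but there is no correlation (physical ordering) kind for a column group
create statistics demo_ab (ndistinct, dependencies, mcv) on a, b from demo;
analyze demo;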
Top comments (7)
Hi Frank. The difference between 10.02 and 10.03 comes from the estimated number of rows returned: 100 in the first plan and 99 in the second. And that difference comes from reltuples of the demo table: before the UPDATE it is 5500, and after it is 5428.
Good point: even if we do not ANALYZE, VACUUM updates pg_class.reltuples 👍
In case you are interested, cost calculations :)
Frank, I would argue with the claim that the plan with the index on column b is bad. In my opinion it's actually better, because actual time=0.020..0.048 vs actual time=0.043..0.045. Yes, I agree that it's 3 µs slower overall. But the first row is returned 23 µs faster. What do you think?
I look at buffers more than time because time depends on many parameters, like cache hits, which are always different in a test environment. The query reads ten rows, so do you care about the first row coming faster? It would be different if it were a cursor on a million rows. Also, it is not the cost of the first row but the cost before being able to return the first row. "Rows Removed by Filter" indicates that you read much more than you need. Depending on how rows are physically ordered, you may discard many rows before getting the first one.
Agreed on buffer reads, but there's also the sorting cost when using the index on the 'a' column.
Also, please notice that if we increase the limit from 10 to 15, the execution plan goes back to using the index on the 'a' column.