Franck Pachot for AWS Heroes

PostgreSQL with modern storage: what about a lower random_page_cost?

The topic was widely discussed last week, with many arguing in favor of a lower value for random_page_cost.

The random_page_cost parameter accounts for the latency incurred by random reads, particularly from indexes with poor correlation factors. Its default value is 4. The seq_page_cost parameter, on the other hand, accounts for the lower latency of contiguous page reads, where many pages are read without an additional seek. Its default value is 1. With modern storage, should you set a lower value for random_page_cost, as those discussions suggest?

As I'm more concerned with predictable performance and stable execution plans than with tuning magic global parameters, I'm in favor of not changing the default. Let me explain why.

HDD seek latency

It is commonly believed that default values were set based on disk latency when mechanical HDDs were prevalent.

Let's consider an example. If your storage has a seek latency of 10 milliseconds and a transfer speed of 4KB per millisecond, then reading one 8KB block takes 10+(8/4)=12 milliseconds. Reading 10 consecutive pages takes 10+10*(8/4)=30 milliseconds, which amortizes to 3 milliseconds per page. We can conclude that reading random pages is 4 times more expensive than sequential reads (12/3=4).
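
As a quick sanity check, that arithmetic can be written as a SQL one-liner (the figures are the hypothetical ones above, not measurements):

-- 10 ms seek, 4KB/ms transfer speed, 8KB pages
select 10 + 8/4.0                                as random_ms_per_page,  -- 12 ms
       (10 + 10*(8/4.0)) / 10                    as seq_ms_per_page,     -- 3 ms
       (10 + 8/4.0) / ((10 + 10*(8/4.0)) / 10)   as cost_ratio;          -- 4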

This could be the reason why default values were chosen in this manner:

postgres=# \dconfig *page_cost

List of configuration parameters
    Parameter     | Value
------------------+-------
 random_page_cost | 4
 seq_page_cost    | 1

These input/output (I/O) specifications may appear low, but they are not too far from what 3600 RPM disks could achieve at that time. Oracle still employs these numbers, namely a 10ms I/O seek time and an I/O transfer speed of 4KB per millisecond, when no system statistics are gathered, and this is the recommended practice. Oracle's cost model accounts for 8 contiguous blocks per multiblock I/O call, and random reads then come out as 3.7 times more expensive. This estimate is not much different from the default of 4 employed by PostgreSQL.
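
For reference, here is how the 3.7 comes out of Oracle's noworkload formulas (sreadtim = IOSEEKTIM + DB_BLOCK_SIZE/IOTFRSPEED and mreadtim = IOSEEKTIM + MBRC*DB_BLOCK_SIZE/IOTFRSPEED, with the defaults IOSEEKTIM=10ms, IOTFRSPEED=4096 bytes/ms, MBRC=8, and an 8KB block size):

select 10 + 8192/4096.0                     as sreadtim_ms,  -- 12 ms per random block
       10 + 8*8192/4096.0                   as mreadtim_ms,  -- 26 ms per 8-block read
       (10 + 8192/4096.0)
        / ((10 + 8*8192/4096.0) / 8)        as cost_ratio;   -- ~3.7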

I'm comparing with Oracle Database on purpose, as this has been discussed a lot there, with attempts to make index access more desirable by lowering optimizer_index_cost_adj or by gathering system statistics to calibrate for the storage. Eventually, a consensus emerged that this brought more problems than solutions.

SSD random reads

When people run on SSDs with no mechanical seek latency, they tend to want to change these values, since random reads are as fast as sequential reads.

They often consider reducing random_page_cost to bring it closer to seq_page_cost. However, I believe this is not a good practice, for three reasons:

  • Firstly, modern storage systems are complex and have multiple levels of cache and network, which may increase or decrease the latency experienced by the database.
  • Secondly, it is better to have a stable and consistent model rather than having a different model for each system.
  • Lastly, a better outcome can be achieved by looking at the index design to get predictable performance.

Storage hierarchy

In modern systems, it is incorrect to assume that the pages that your query reads are equivalent to read requests to your HDD or SSD. In fact, the process is much more complex and involves several layers:

  1. Firstly, the query reads pages from the PostgreSQL shared buffers.
  2. If there are cache misses, it reads from Linux filesystem buffers.
  3. If there are further cache misses, it prepares an I/O call.
  4. The kernel may read more data when it detects sequential reads (readahead).
  5. It sends an I/O call through the network (SAN, NAS, Cloud Storage).
  6. It reads from the storage cache.
  7. Finally, if there are still cache misses, it reads from the disk (HDD or SSD).
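
The planner has no detailed knowledge of this chain. Its only coarse model of the caching layers is effective_cache_size, an estimate of shared buffers plus the OS page cache, which it uses when costing index scans:

select name, setting, unit
  from pg_settings
 where name in ('shared_buffers', 'effective_cache_size');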

Is it reasonable to use the physical read latency factor of SSD or HDD at the lowest level of the chain to determine the random_page_cost applied to the estimated pages at the highest level?

There is no latency penalty when accessing blocks stored in memory at the various cache levels, which means that for them, random_page_cost should effectively equal seq_page_cost. In OLTP applications, a high cache hit ratio is expected, and this justifies a low value of random_page_cost regardless of the disk latency.
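
You can observe the shared-buffer hit ratio per table to see how relevant the disk latency actually is (a minimal sketch; the OS page cache and storage cache are invisible to PostgreSQL, so the real hit ratio is even higher):

select heap_blks_hit, heap_blks_read,
       round( 100.0 * heap_blks_hit
              / nullif(heap_blks_hit + heap_blks_read, 0), 2 ) as hit_pct
  from pg_statio_user_tables
 where relname = 'demo';   -- 'demo' is the table created later in this post

But there is more to consider.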

PostgreSQL reads data in single blocks. Reading multiple contiguous blocks in one I/O call has a benefit thanks to Linux read-ahead. This advantage becomes more significant when I/O latency is high, which historically was due to the rotation of HDDs directly attached to servers. Even when modern storage uses SSDs with lower latency, they often sit behind network calls (SAN, NAS, cloud storage) and cloud instance I/O limits. As a result, random_page_cost should still be higher than seq_page_cost.
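
Note that if part of your data really sits on a different storage tier, random_page_cost can be set per tablespace rather than globally, which is a more targeted change (the tablespace name below is hypothetical):

alter tablespace fast_ssd set (random_page_cost = 1.1);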

Index with good correlation factor

I have created a table where the primary key id has a strong correlation factor, but val, indexed by demo_val, has a weak one:

\c
drop table if exists demo;
create table demo ( id bigserial primary key , val int );
create index demo_val on demo (val asc);
insert into demo(val) select random()*100000 from generate_series(1,100000);
vacuum analyze demo;
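
The correlation that the planner relies on is visible in pg_stats after the VACUUM ANALYZE; by construction, id should show a value close to 1 and val a value close to 0:

select attname, correlation
  from pg_stats
 where tablename = 'demo';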

The table contains one hundred thousand rows, which are spread across 541 pages:

postgres=# select reltuples, relpages 
           from pg_class where relname = 'demo'
;
 reltuples | relpages
-----------+----------
    100000 |      541

When selecting up to 50% of the rows (50,000 out of 100,000), PostgreSQL uses an Index Scan:

postgres=# explain (analyze, buffers, summary off) select * from demo where id <= 50000;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_pkey on demo  (cost=0.29..1697.91 rows=49864 width=16) (actual time=0.013..8.382 rows=50000 loops=1)
   Index Cond: (id <= 50000)
   Buffers: shared hit=409
 Planning:
   Buffers: shared hit=3

This actually read 409 pages for a 541-page table (note: the 409 pages include both index and table pages - see the comment below). When reading more rows, it becomes equivalent to reading all the pages.

When my search criterion returns about 60% of the rows, the query planner selects a Sequential Scan instead of using the index:

postgres=# explain (analyze, buffers, summary off) select * from demo where id < 60000;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1791.00 rows=59968 width=16) (actual time=0.013..8.414 rows=59999 loops=1)
   Filter: (id < 60000)
   Rows Removed by Filter: 40001
   Buffers: shared hit=541

What changes can be expected if the value of random_page_cost is similar to that of seq_page_cost?

postgres=# set random_page_cost=1.1;
SET

postgres=# explain (analyze, buffers, summary off)
           select * from demo where id < 50000
;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_pkey on demo  (cost=0.29..1299.66 rows=50021 width=12) (actual time=0.022..8.115 rows=49999 loops=1)
   Index Cond: (id < 50000)
   Buffers: shared hit=409

postgres=# explain (analyze, buffers, summary off) 
           select * from demo where id < 60000
;
                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_pkey on demo  (cost=0.29..1554.88 rows=59879 width=12) (actual time=0.023..9.898 rows=59999 loops=1)
   Index Cond: (id < 60000)
   Buffers: shared hit=490

postgres=# explain (analyze, buffers, summary off) 
           select * from demo where id < 70000
;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1791.00 rows=70015 width=12) (actual time=0.013..8.799 rows=69999 loops=1)
   Filter: (id < 70000)
   Rows Removed by Filter: 30001
   Buffers: shared hit=541

By setting random_page_cost=1.1, the query planner chose an Index Scan over a Sequential Scan for a larger number of rows.

Does it matter? Probably not. I had to query tens of thousands of rows, more than half of the table, to see a change in the plan. Based on those numbers, I think it would be more scalable to add columns to the index to achieve an Index Only Scan with fewer pages to read, rather than switch to a full table scan on a large table that will keep growing. Alternatively, if the query reads a large portion of a table that isn't expected to grow, I would prefer a sequential scan even for a smaller result, to get predictable performance.
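
For illustration, a covering index is one way to do that here (a sketch; demo_id_val is a hypothetical name, and the INCLUDE clause requires PostgreSQL 11 or later):

-- store val in the index leaf pages so the heap is not visited
create index demo_id_val on demo (id) include (val);
vacuum analyze demo;  -- keep the visibility map fresh so heap fetches stay at 0

With both columns in the index, the range queries above can use an Index Only Scan regardless of how the table grows.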

Index with bad correlation factor

When there is a poor correlation between the index entries and the physical rows in the heap table, PostgreSQL uses a Bitmap Scan to avoid reading the same page multiple times. In my example, the inflection point where a Sequential Scan becomes preferable is between 38% and 39%:

postgres=# set random_page_cost=4;
SET
postgres=# explain (analyze, buffers, summary off)
           select * from demo where val < 38000
;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on demo  (cost=742.71..1765.04 rows=38506 width=12) (actual time=1.856..6.096 rows=38090 loops=1)
   Recheck Cond: (val < 38000)
   Heap Blocks: exact=541
   Buffers: shared hit=651
   ->  Bitmap Index Scan on demo_val  (cost=0.00..733.09 rows=38506 width=0) (actual time=1.792..1.792 rows=38090 loops=1)
         Index Cond: (val < 38000)
         Buffers: shared hit=110

postgres=# explain (analyze, buffers, summary off) 
           select * from demo where val < 39000
;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1791.00 rows=39502 width=12) (actual time=0.014..8.974 rows=39079 loops=1)
   Filter: (val < 39000)
   Rows Removed by Filter: 60921
   Buffers: shared hit=541

If random_page_cost is set close to seq_page_cost, a distinct pattern emerges. A Bitmap Index Scan is used when reading less than 16% of rows, and a Seq Scan is used when reading more than 57% of rows:

postgres=# set random_page_cost=1.1;
SET

postgres=# explain (analyze, buffers, summary off) 
           select * from demo where val < 16000
;
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on demo  (cost=178.02..922.28 rows=16261 width=12) (actual time=0.736..3.692 rows=16091 loops=1)
   Recheck Cond: (val < 16000)
   Heap Blocks: exact=541
   Buffers: shared hit=588
   ->  Bitmap Index Scan on demo_val  (cost=0.00..173.95 rows=16261 width=0) (actual time=0.677..0.677 rows=16091 loops=1)
         Index Cond: (val < 16000)
         Buffers: shared hit=47

postgres=# explain (analyze, buffers, summary off) 
           select * from demo where val < 17000
;
                                                        QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Index Scan using demo_val on demo  (cost=0.29..953.18 rows=17302 width=12) (actual time=0.022..8.620 rows=17082 loops=1)
   Index Cond: (val < 17000)
   Buffers: shared hit=17103

postgres=# explain (analyze, buffers, summary off) 
                   select * from demo where val < 57000
;
                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..1791.00 rows=57467 width=12) (actual time=0.012..9.650 rows=57286 loops=1)
   Filter: (val < 57000)
   Rows Removed by Filter: 42714
   Buffers: shared hit=541


Between those two thresholds, a third execution plan appeared: an Index Scan, chosen when the number of rows is too large for a Bitmap Scan to win on cost, but not large enough to justify a Sequential Scan.

Is that better? It's unlikely that having three possible execution plans instead of two improves plan stability. In a test where everything is in cache, the elapsed time doesn't matter much. However, the number of buffers read reveals a problem: the Index Scan read about 30 times more blocks than a Seq Scan (17,103 versus 541), and these are random reads that cannot be optimized with read-ahead. If, for any reason, more memory is used on your system and those buffer reads become cache misses, the response time can increase by several orders of magnitude.
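
You can verify this on your own system by enabling track_io_timing, which adds "I/O Timings" to EXPLAIN (ANALYZE, BUFFERS) when reads actually hit the storage (a sketch; in this fully cached demo it shows nothing, but on a busy system it reveals the real cost of those 17,103 buffer reads):

set track_io_timing = on;
explain (analyze, buffers, summary off)
 select * from demo where val < 17000;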

Find the inflection point

I didn't find those inflection points by trial and error. Here is the code I used:

\c
set random_page_cost=1.1;
create temporary table if not exists plans( val float, sql text, plan json );
do $$
 declare
  sql text;
  plan json;
 begin
  for i in 1..100 loop
   sql=format('explain (format json) select * from demo where val < %s',i*1000);
   execute sql into plan;
   insert into plans (val, sql, plan) values (i, sql, plan);
  end loop;
 end;
$$;
with plans as (
 select val,
        plan->0->'Plan'->>'Node Type' scan,
        (plan->0->'Plan'->>'Total Cost')::float cost,
        sql
 from plans
)
select scan,
       min(val) min_val, max(val) max_val,
       min(cost) min_cost, max(cost) max_cost
from plans
group by scan
order by min_cost;


This used the same settings and query as the last case, and here is the result:

       scan       | min_val | max_val | min_cost | max_cost
------------------+---------+---------+----------+----------
 Bitmap Heap Scan |       1 |      16 |   537.64 |   922.28
 Index Scan       |      17 |      56 |   953.18 |  1762.64
 Seq Scan         |      57 |     100 |  1791    |  1791   
(3 rows)

It can be helpful to test your queries at different selectivities and see where they can go wrong. I would judge the quality of your indexes by how far the common queries are from the inflection points.

What about your data model and indexing strategy?

This test doesn't clearly indicate the most reasonable default for random_page_cost. I therefore suggest sticking with the default value, because it offers a significant advantage: it's what most people use. By doing so, you avoid the unexpected surprises that can arise from a custom configuration.

Based on the test results, the index on the val column is not well suited for queries that read a large number of rows. When many rows are read, all the pages of the table are visited, and they cannot be expected to stay in cache for larger tables. With the non-default random_page_cost setting, the situation becomes even worse, as the planner can switch to a plan that reads 30 times more pages than necessary.

There is an inflection point where the query planner has to choose between using an index or scanning the entire table. This decision may change when a single new row is added to a large table. If the table is near this point, the query's performance will depend on several factors, such as available memory for cache, IOPS, and bandwidth limits. These factors cannot be accurately predicted because they depend on the total workload. You can modify this point using random_page_cost, but it's best to avoid this unpredictable zone altogether.

I recommend spending more time looking at execution plans and less time tuning global parameters.

Using high random_page_cost as a red flag for bad indexes

If you notice that changing random_page_cost results in different response times, it can indicate that some of your access patterns lack optimal indexes, leaving the query planner to pick the "least bad" plan. Instead of decreasing the parameter to get more indexes used, you should focus on making those indexes more appealing to the query planner. When your index is used with a high random_page_cost and provides the expected performance, you can be confident that you will avoid unpleasant surprises when your users query slightly more rows.

If you plan to lower random_page_cost, it is crucial to understand why it enhances performance, rather than basing the change on a single test. As your data grows and the workload increases, an index access that provides a quick response time now may take much longer if the memory available for caching shrinks. It is essential to evaluate the performance impact of any such change and ensure that it remains an improvement over time, rather than a short-term fix.

I believe that the decision between an Index Scan and a Sequential Scan is about scalability rather than response time. The response time of an Index Scan is proportional to the size of the result set, while the response time of a Sequential Scan is proportional to the size of the table being scanned. It is important to understand which one will grow, so you can make an informed decision and provide indexes that the query planner will choose without depending on a dubious global setting.


Note that I think the indexing strategy should not depend on the platform, and execution plans should not change when scaling up or changing storage specifications. However, there may be differing opinions on this matter: https://x.com/mmeent_pg/status/1775592093253066814?s=20

The most important thing is that you know what you do, and why.

The purpose of this article is not to dissuade you from setting the random_page_cost parameter. Rather, it aims to clarify the reasoning behind it. It is perfectly acceptable to set it to 1.1 if that magic value works for you, because your indexes are not chosen with the default value and the planner goes for unexpected sequential scans. However, if you attempt to justify it scientifically with HDD vs. SSD access times, that explanation may not account for all the layers between your database and the disks.

Top comments (4)

IAlkhasov

Buffers: shared hit=17103.
Does it mean that with bad correlation some pages are read multiple times, and that every read is tracked?

Does PostgreSQL have an option for cache eviction, so that there's a way to test cache misses and disk page I/O?

Franck Pachot

I'm replying late. Yes, the "shared hit" is the number of times the buffer is pinned to read it. With bad correlation and reading many rows, it can be revisited many times.

There's no command in PostgreSQL to flush the cache, but there's an extension: github.com/zilder/pg_dropcache

IAlkhasov

Hi Franck.
When querying id <= 50000, only 270 table pages are read; the other 139 are index pages.

Franck Pachot

Yes, good point. We can see the number of index pages with an Index Only Scan:

main=> explain (analyze, buffers, summary off) select id from demo where id <= 50000;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_pkey on demo  (cost=0.29..1421.41 rows=49664 width=8) (actual time=0.024..4.062 rows=50000 loops=1)
   Index Cond: (id <= 50000)
   Heap Fetches: 0
   Buffers: shared hit=140
(4 rows)