Franck Pachot for AWS Heroes

Posted on Mar 3 • Edited on Mar 22

Postgres dead tuple space reused without vacuum

#postgres #vacuum #bloat #space

There are numerous talks and articles about PostgreSQL bloat and vacuum, but few of them provide a detailed explanation of how it works and why auto-vacuum must be more aggressive. In this post, I will demonstrate a simple test case to illustrate how dead tuples (old versions of rows) may be cleaned out before any vacuum (the PostgreSQL garbage collection) runs on them.

I will demonstrate an example using a temporary table where auto-vacuum is not applicable. You can perform the same on a permanent table with autovacuum_enabled=off. The goal is to run it manually to understand how it works before and after vacuuming. Below, I will show the syntax for both cases. I add an index and insert sixteen rows:

postgres=# select version();
                                                      version
-------------------------------------------------------------------------------------------------------------------
 PostgreSQL 16.2 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-20.0.3), 64-bit
(1 row)

postgres=# create temporary table demo
           ( id bigint, value text )
           with ( autovacuum_enabled=off )
;
CREATE TABLE

postgres=# create unique index demo_pkey on demo (id)
           include (value)
;

postgres=# insert into demo
           select generate_series(0,15), lpad('x',500,'x')
;
INSERT 0 16

For the demonstration, I will use large rows and index entries to have only a few per block, which will be easier to display.

PageInspect

I will examine the rows in pages, even those that are no longer active, using the PageInspect extension:

postgres=# create extension if not exists pageinspect;
CREATE EXTENSION

I can see the first 15 rows in the first block of my heap table:

postgres=# select t_ctid, lp, lp_flags, t_xmin, t_xmax, substr(t_data,1,30), lp_len
           from heap_page_items(get_raw_page('demo', 0))
;
  t_ctid | lp | lp_len | lp_flags | t_xmin | t_xmax |                                         substr
--------+----+--------+----------+--------+--------+----------------------------------------------------------------------------------------
 (0,1)  |  1 |    536 |        1 |   3231 |      0 | \x0000000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,2)  |  2 |    536 |        1 |   3231 |      0 | \x0100000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,3)  |  3 |    536 |        1 |   3231 |      0 | \x0200000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,4)  |  4 |    536 |        1 |   3231 |      0 | \x0300000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,5)  |  5 |    536 |        1 |   3231 |      0 | \x0400000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,6)  |  6 |    536 |        1 |   3231 |      0 | \x0500000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,7)  |  7 |    536 |        1 |   3231 |      0 | \x0600000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,8)  |  8 |    536 |        1 |   3231 |      0 | \x0700000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,9)  |  9 |    536 |        1 |   3231 |      0 | \x0800000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,10) | 10 |    536 |        1 |   3231 |      0 | \x0900000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,11) | 11 |    536 |        1 |   3231 |      0 | \x0a00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,12) | 12 |    536 |        1 |   3231 |      0 | \x0b00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,13) | 13 |    536 |        1 |   3231 |      0 | \x0c00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,14) | 14 |    536 |        1 |   3231 |      0 | \x0d00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,15) | 15 |    536 |        1 |   3231 |      0 | \x0e00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
(15 rows)

The value 1 of lp_flags means "normal". We will see other values later.

I can see the first 14 index entries in the first leaf block of the B-Tree index:

postgres=# select substr(data,1,30), itemoffset, ctid, itemlen, nulls, vars, dead
           from bt_page_items(get_raw_page('demo_pkey', 1))
           order by data
;
                   substr                   | itemoffset |  htid  | itemlen | nulls | vars | dead
--------------------------------------------+------------+--------+---------+-------+------+------
 00 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          2 | (0,1)  |     520 | f     | t    | f
 01 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          3 | (0,2)  |     520 | f     | t    | f
 02 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          4 | (0,3)  |     520 | f     | t    | f
 03 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          5 | (0,4)  |     520 | f     | t    | f
 04 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          6 | (0,5)  |     520 | f     | t    | f
 05 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          7 | (0,6)  |     520 | f     | t    | f
 06 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          8 | (0,7)  |     520 | f     | t    | f
 07 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          9 | (0,8)  |     520 | f     | t    | f
 08 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         10 | (0,9)  |     520 | f     | t    | f
 09 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         11 | (0,10) |     520 | f     | t    | f
 0a 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         12 | (0,11) |     520 | f     | t    | f
 0b 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         13 | (0,12) |     520 | f     | t    | f
 0c 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         14 | (0,13) |     520 | f     | t    | f
 0d 00 00 00 00 00 00 00                    |          1 |        |      16 | f     | f    |
(14 rows)

The TID that identifies the tuple (table row version) has two numbers: the block number and the item number within the page.

Here is the size on disk for the table and index:

postgres=# select relname,pg_table_size(oid),pg_size_pretty(pg_table_size(oid))
           from pg_class where relname like 'demo%'
;
  relname  | pg_table_size | pg_size_pretty
-----------+---------------+----------------
 demo      |         49152 | 48 kB
 demo_pkey |         32768 | 32 kB
(2 rows)

Select

Without PageInspect, I can see the visible rows with their TID be we will see later that it is not a good idea to understand what happens step by step:

postgres=# select ctid, id, xmin, xmax, id, substr(value,1,30) , length(value)
           from demo
;
  ctid  | id | xmin | xmax | id |             substr             | length
--------+----+------+------+----+--------------------------------+--------
 (0,1)  |  0 | 3231 |    0 |  0 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,2)  |  1 | 3231 |    0 |  1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,3)  |  2 | 3231 |    0 |  2 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,4)  |  3 | 3231 |    0 |  3 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,5)  |  4 | 3231 |    0 |  4 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,6)  |  5 | 3231 |    0 |  5 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,7)  |  6 | 3231 |    0 |  6 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,8)  |  7 | 3231 |    0 |  7 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,9)  |  8 | 3231 |    0 |  8 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,10) |  9 | 3231 |    0 |  9 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,11) | 10 | 3231 |    0 | 10 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,12) | 11 | 3231 |    0 | 11 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,13) | 12 | 3231 |    0 | 12 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,14) | 13 | 3231 |    0 | 13 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,15) | 14 | 3231 |    0 | 14 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (1,1)  | 15 | 3231 |    0 | 15 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
(16 rows)

If you check again with PageInspect you can see that nothing has changed after my select. But do not rely too much on that, reads, in a SQL database, can change the state, as we will see later.

Update

An update or delete creates another version. I'm updating one row in the first block:

postgres=# update demo set value=upper(value) where id=2
;
UPDATE 1

The version in block 0 of the heap table is marked as deleted with its XMAX set to it's end of visibility:

postgres=# select t_ctid, lp, lp_flags, t_xmin, t_xmax, substr(t_data,1,30), lp_len
           from heap_page_items(get_raw_page('demo', 0))
;
 t_ctid | lp | lp_len | lp_flags | t_xmin | t_xmax |                                         substr
--------+----+--------+----------+--------+--------+----------------------------------------------------------------------------------------
 (0,1)  |  1 |    536 |        1 |   3231 |      0 | \x0000000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,2)  |  2 |    536 |        1 |   3231 |      0 | \x0100000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (1,2)  |  3 |    536 |        1 |   3231 |   3232 | \x0200000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,4)  |  4 |    536 |        1 |   3231 |      0 | \x0300000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,5)  |  5 |    536 |        1 |   3231 |      0 | \x0400000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,6)  |  6 |    536 |        1 |   3231 |      0 | \x0500000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,7)  |  7 |    536 |        1 |   3231 |      0 | \x0600000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,8)  |  8 |    536 |        1 |   3231 |      0 | \x0700000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,9)  |  9 |    536 |        1 |   3231 |      0 | \x0800000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,10) | 10 |    536 |        1 |   3231 |      0 | \x0900000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,11) | 11 |    536 |        1 |   3231 |      0 | \x0a00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,12) | 12 |    536 |        1 |   3231 |      0 | \x0b00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,13) | 13 |    536 |        1 |   3231 |      0 | \x0c00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,14) | 14 |    536 |        1 |   3231 |      0 | \x0d00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (0,15) | 15 |    536 |        1 |   3231 |      0 | \x0e00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
(15 rows)

The new version has been appended in block 1 with its XMIN set to its start of visibility:

postgres=# select t_ctid, lp, lp_len, lp_flags, t_xmin, t_xmax, substr(t_data,1,42)
           from heap_page_items(get_raw_page('demo', 1))
;
 t_ctid | lp | lp_len | lp_flags | t_xmin | t_xmax |                                         substr
--------+----+--------+----------+--------+--------+----------------------------------------------------------------------------------------
 (1,1)  |  1 |    536 |        1 |   3231 |      0 | \x0f00000000000000e0070000787878787878787878787878787878787878787878787878787878787878
 (1,2)  |  2 |    536 |        1 |   3232 |      0 | \x0200000000000000e0070000585858585858585858585858585858585858585858585858585858585858
(2 rows)

In the B-Tree index, there is an entry for each version. The value id=2 has an entry to (1,2) for the new version and another to (0,2) for the old version:

postgres=# select substr(data,1,42), itemoffset, htid, itemlen, nulls, vars, dead
           from bt_page_items(get_raw_page('demo_pkey', 1))
           order by data
;
                   substr                   | itemoffset |  htid  | itemlen | nulls | vars | dead
--------------------------------------------+------------+--------+---------+-------+------+------
 00 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          2 | (0,1)  |     520 | f     | t    | f
 01 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          3 | (0,2)  |     520 | f     | t    | f
 02 00 00 00 00 00 00 00 e0 07 00 00 58 58  |          5 | (1,2)  |     520 | f     | t    | f
 02 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          4 | (0,3)  |     520 | f     | t    | f
 03 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          6 | (0,4)  |     520 | f     | t    | f
 04 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          7 | (0,5)  |     520 | f     | t    | f
 05 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          8 | (0,6)  |     520 | f     | t    | f
 06 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          9 | (0,7)  |     520 | f     | t    | f
 07 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         10 | (0,8)  |     520 | f     | t    | f
 08 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         11 | (0,9)  |     520 | f     | t    | f
 09 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         12 | (0,10) |     520 | f     | t    | f
 0a 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         13 | (0,11) |     520 | f     | t    | f
 0b 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         14 | (0,12) |     520 | f     | t    | f
 0c 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         15 | (0,13) |     520 | f     | t    | f
 0d 00 00 00 00 00 00 00                    |          1 |        |      16 | f     | f    |
(15 rows)

The size of the table and index on disk has not changed because the blocks were already allocated:

postgres=# select relname,pg_table_size(oid),pg_size_pretty(pg_table_size(oid))
           from pg_class where relname like 'demo%'
;
  relname  | pg_table_size | pg_size_pretty
-----------+---------------+----------------
 demo      |         49152 | 48 kB
 demo_pkey |         32768 | 32 kB
(2 rows)

Visibility

When older versions of data are stored, it leads to space amplification. This happens because of Multi-Version Concurrency control, which ensures that reads are consistent with the transaction or statement read time. These older versions are stored in the heap tables and indexes and are known as "bloat" in PostgreSQL. Additionally, these old versions are responsible for read amplification. For instance, a Seq Scan will read old versions that are no longer needed. An Index Scan may have to read multiple versions scattered within the heap table. Even an Index Only Scan may have to read heap tuples because that's where XMIN/XMAX is stored.

Those old versions stay even when they are not needed anymore by consistent reads. In my example, there's no ongoing transaction, or replication, that need to read the old version:

postgres=# select xmin from (
 select xmin from pg_replication_slots
 union all
 select catalog_xmin from pg_replication_slots
 union all
 select backend_xmin from pg_stat_replication
 union all
 select backend_xid from pg_stat_activity
 union all
 select backend_xmin from pg_stat_activity
 union all
 select transaction from pg_prepared_xacts
) order by age(xmin) nulls last limit 1
;
 xmin
------
 3233
(1 row)

To minimize the amount of space used, it is recommended that you use VACUUM. This process will automatically remove old versions of data, known as "dead tuples," when Auto Vacuum detects that they are no longer needed. However, I disabled Auto Vacuum and will show that this space may be reused before vacuum happens.

Select (Seq Scan)

I'm running a select to read all rows, also displaying CTID to show where the current version is stored:

postgres=# select ctid, id, xmin, xmax, id, substr(value,1,30) , length(value)
           from demo
;

  ctid  | id | xmin | xmax | id |             substr             | length
--------+----+------+------+----+--------------------------------+--------
 (0,1)  |  0 | 3231 |    0 |  0 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,2)  |  1 | 3231 |    0 |  1 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,4)  |  3 | 3231 |    0 |  3 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,5)  |  4 | 3231 |    0 |  4 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,6)  |  5 | 3231 |    0 |  5 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,7)  |  6 | 3231 |    0 |  6 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,8)  |  7 | 3231 |    0 |  7 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,9)  |  8 | 3231 |    0 |  8 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,10) |  9 | 3231 |    0 |  9 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,11) | 10 | 3231 |    0 | 10 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,12) | 11 | 3231 |    0 | 11 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,13) | 12 | 3231 |    0 | 12 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,14) | 13 | 3231 |    0 | 13 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (0,15) | 14 | 3231 |    0 | 14 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (1,1)  | 15 | 3231 |    0 | 15 | xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |    500
 (1,2)  |  2 | 3232 |    0 |  2 | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX |    500
(16 rows)

The updated row comes at the end because it was appended. The table was created with the default FILLFACTOR=100 and had no place to insert the new version closer to the old one.

Remember that no VACUUM was run and Auto Vacuum was disabled. I've run only a SELECT. The tuples in block 0 have changed:

postgres=# select t_ctid, lp, lp_flags, t_xmin, t_xmax, substr(t_data,1,30), lp_len
           from heap_page_items(get_raw_page('demo', 0))
;
 t_ctid | lp | lp_flags | t_xmin | t_xmax |                             substr                             | lp_len
--------+----+----------+--------+--------+----------------------------------------------------------------+--------
 (0,1)  |  1 |        1 |   3231 |      0 | \x0000000000000000e0070000787878787878787878787878787878787878 |    536
 (0,2)  |  2 |        1 |   3231 |      0 | \x0100000000000000e0070000787878787878787878787878787878787878 |    536
        |  3 |        3 |        |        |                                                                |      0
 (0,4)  |  4 |        1 |   3231 |      0 | \x0300000000000000e0070000787878787878787878787878787878787878 |    536
 (0,5)  |  5 |        1 |   3231 |      0 | \x0400000000000000e0070000787878787878787878787878787878787878 |    536
 (0,6)  |  6 |        1 |   3231 |      0 | \x0500000000000000e0070000787878787878787878787878787878787878 |    536
 (0,7)  |  7 |        1 |   3231 |      0 | \x0600000000000000e0070000787878787878787878787878787878787878 |    536
 (0,8)  |  8 |        1 |   3231 |      0 | \x0700000000000000e0070000787878787878787878787878787878787878 |    536
 (0,9)  |  9 |        1 |   3231 |      0 | \x0800000000000000e0070000787878787878787878787878787878787878 |    536
 (0,10) | 10 |        1 |   3231 |      0 | \x0900000000000000e0070000787878787878787878787878787878787878 |    536
 (0,11) | 11 |        1 |   3231 |      0 | \x0a00000000000000e0070000787878787878787878787878787878787878 |    536
 (0,12) | 12 |        1 |   3231 |      0 | \x0b00000000000000e0070000787878787878787878787878787878787878 |    536
 (0,13) | 13 |        1 |   3231 |      0 | \x0c00000000000000e0070000787878787878787878787878787878787878 |    536
 (0,14) | 14 |        1 |   3231 |      0 | \x0d00000000000000e0070000787878787878787878787878787878787878 |    536
 (0,15) | 15 |        1 |   3231 |      0 | \x0e00000000000000e0070000787878787878787878787878787878787878 |    536
(15 rows)

The value 3 of lp_flags means "dead". The dead tuple is still there but doesn't use any space except to store the header. That's the first point I wanted to show in this blog post: some limited garbage collection can happen on heap tables even without running VACUUM to reduce the space amplification.

In the B-Tree index, nothing has changed. There are still two entries, with one pointing to the dead tuple:

postgres=# select substr(data,1,42), itemoffset, htid, itemlen, nulls, vars, dead
           from bt_page_items(get_raw_page('demo_pkey', 1))
           order by data
;
                   substr                   | itemoffset |  htid  | itemlen | nulls | vars | dead
--------------------------------------------+------------+--------+---------+-------+------+------
 00 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          2 | (0,1)  |     520 | f     | t    | f
 01 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          3 | (0,2)  |     520 | f     | t    | f
 02 00 00 00 00 00 00 00 e0 07 00 00 58 58  |          5 | (1,2)  |     520 | f     | t    | f
 02 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          4 | (0,3)  |     520 | f     | t    | f
 03 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          6 | (0,4)  |     520 | f     | t    | f
 04 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          7 | (0,5)  |     520 | f     | t    | f
 05 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          8 | (0,6)  |     520 | f     | t    | f
 06 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          9 | (0,7)  |     520 | f     | t    | f
 07 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         10 | (0,8)  |     520 | f     | t    | f
 08 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         11 | (0,9)  |     520 | f     | t    | f
 09 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         12 | (0,10) |     520 | f     | t    | f
 0a 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         13 | (0,11) |     520 | f     | t    | f
 0b 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         14 | (0,12) |     520 | f     | t    | f
 0c 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         15 | (0,13) |     520 | f     | t    | f
 0d 00 00 00 00 00 00 00                    |          1 |        |      16 | f     | f    |
(15 rows)

Select (Index Only Scan)

The read amplification is still visible when accessing though the index:

postgres=# explain (analyze, buffers)
           select * from demo where id=2
;

                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_pkey on demo  (cost=0.28..8.29 rows=1 width=40) (actual time=0.124..0.125 rows=1 loops=1)
   Index Cond: (id = 2)
   Heap Fetches: 2
   Buffers: local hit=4
 Planning Time: 0.057 ms
 Execution Time: 0.140 ms
(6 rows)

Reading from the index has to read the B-Tree root and one leaf, the block 1, that contains the entries for index condition. It shows 4 buffers reads because there were 2 additional reads from the heap table to check the XMIN/XMAX visibility of both.

If you want to look at the B-Tree root, it is in block 3 with a pointer to each leaf:

postgres=# select substr(data,1,42), itemoffset, ctid, htid, itemlen, nulls, vars, dead
           from bt_page_items(get_raw_page('demo_pkey', 3))
           order by data
;
         substr          | itemoffset | ctid  | htid | itemlen | nulls | vars | dead
-------------------------+------------+-------+------+---------+-------+------+------
                         |          1 | (1,0) |      |       8 | f     | f    |
 0d 00 00 00 00 00 00 00 |          2 | (2,1) |      |      16 | f     | f    |
(2 rows)

So, what we have seen here is that even an Index Only Scan has to go to the table, and may read many versions to find the one that is visible to the query read time.

But this read has modified the state and the next one has one less buffer to read:

postgres=# explain (analyze, buffers)
select * from demo where id=2;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Index Only Scan using demo_pkey on demo  (cost=0.28..8.29 rows=1 width=40) (actual time=0.015..0.016 rows=1 loops=1)
   Index Cond: (id = 2)
   Heap Fetches: 1
   Buffers: local hit=3
 Planning Time: 0.054 ms
 Execution Time: 0.030 ms
(6 rows)

This subtle change is visible in the dead column of the B-Tree leaf block which has been set to true by the previous read when it has found the dead tuple, so that the next reads don't have to do the same.

postgres=# select substr(data,1,42), itemoffset, htid, itemlen, nulls, vars, dead
           from bt_page_items(get_raw_page('demo_pkey', 1))
           order by data
;
                   substr                   | itemoffset |  htid  | itemlen | nulls | vars | dead
--------------------------------------------+------------+--------+---------+-------+------+------
 00 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          2 | (0,1)  |     520 | f     | t    | f
 01 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          3 | (0,2)  |     520 | f     | t    | f
 02 00 00 00 00 00 00 00 e0 07 00 00 58 58  |          5 | (1,2)  |     520 | f     | t    | f
 02 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          4 | (0,3)  |     520 | f     | t    | t
 03 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          6 | (0,4)  |     520 | f     | t    | f
 04 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          7 | (0,5)  |     520 | f     | t    | f
 05 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          8 | (0,6)  |     520 | f     | t    | f
 06 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          9 | (0,7)  |     520 | f     | t    | f
 07 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         10 | (0,8)  |     520 | f     | t    | f
 08 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         11 | (0,9)  |     520 | f     | t    | f
 09 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         12 | (0,10) |     520 | f     | t    | f
 0a 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         13 | (0,11) |     520 | f     | t    | f
 0b 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         14 | (0,12) |     520 | f     | t    | f
 0c 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         15 | (0,13) |     520 | f     | t    | f
 0d 00 00 00 00 00 00 00                    |          1 |        |      16 | f     | f    |
(15 rows)

Without running VACUUM, the read amplification has been reduced for the next reads. What about space amplification?

Insert

I insert additional rows:

postgres=# insert into demo
           select generate_series(16,1000), lpad('y',50,'y')
;
INSERT 0 985

None of them were able to reuse the space in block 0:

 t_ctid | lp | lp_flags | t_xmin | t_xmax |                             substr
--------+----+----------+--------+--------+----------------------------------------------------------------
 (0,1)  |  1 |        1 |   3231 |      0 | \x0000000000000000e0070000787878787878787878787878787878787878
 (0,2)  |  2 |        1 |   3231 |      0 | \x0100000000000000e0070000787878787878787878787878787878787878
        |  3 |        3 |        |        |
 (0,4)  |  4 |        1 |   3231 |      0 | \x0300000000000000e0070000787878787878787878787878787878787878
 (0,5)  |  5 |        1 |   3231 |      0 | \x0400000000000000e0070000787878787878787878787878787878787878
 (0,6)  |  6 |        1 |   3231 |      0 | \x0500000000000000e0070000787878787878787878787878787878787878
 (0,7)  |  7 |        1 |   3231 |      0 | \x0600000000000000e0070000787878787878787878787878787878787878
 (0,8)  |  8 |        1 |   3231 |      0 | \x0700000000000000e0070000787878787878787878787878787878787878
 (0,9)  |  9 |        1 |   3231 |      0 | \x0800000000000000e0070000787878787878787878787878787878787878
 (0,10) | 10 |        1 |   3231 |      0 | \x0900000000000000e0070000787878787878787878787878787878787878
 (0,11) | 11 |        1 |   3231 |      0 | \x0a00000000000000e0070000787878787878787878787878787878787878
 (0,12) | 12 |        1 |   3231 |      0 | \x0b00000000000000e0070000787878787878787878787878787878787878
 (0,13) | 13 |        1 |   3231 |      0 | \x0c00000000000000e0070000787878787878787878787878787878787878
 (0,14) | 14 |        1 |   3231 |      0 | \x0d00000000000000e0070000787878787878787878787878787878787878
 (0,15) | 15 |        1 |   3231 |      0 | \x0e00000000000000e0070000787878787878787878787878787878787878
(15 rows)

However with smaller rows, some can reuse the space:

postgres=# insert into demo
           select generate_series(3001,4000), lpad('y',30,'y')
;
INSERT 0 1000
postgres=# select t_ctid, lp, lp_flags, t_xmin, t_xmax, substr(t_data,1,30)
           from heap_page_items(get_raw_page('demo', 0))
;
 t_ctid | lp | lp_flags | t_xmin | t_xmax |                             substr
--------+----+----------+--------+--------+----------------------------------------------------------------
 (0,1)  |  1 |        1 |   3231 |      0 | \x0000000000000000e0070000787878787878787878787878787878787878
 (0,2)  |  2 |        1 |   3231 |      0 | \x0100000000000000e0070000787878787878787878787878787878787878
        |  3 |        3 |        |        |
 (0,4)  |  4 |        1 |   3231 |      0 | \x0300000000000000e0070000787878787878787878787878787878787878
 (0,5)  |  5 |        1 |   3231 |      0 | \x0400000000000000e0070000787878787878787878787878787878787878
 (0,6)  |  6 |        1 |   3231 |      0 | \x0500000000000000e0070000787878787878787878787878787878787878
 (0,7)  |  7 |        1 |   3231 |      0 | \x0600000000000000e0070000787878787878787878787878787878787878
 (0,8)  |  8 |        1 |   3231 |      0 | \x0700000000000000e0070000787878787878787878787878787878787878
 (0,9)  |  9 |        1 |   3231 |      0 | \x0800000000000000e0070000787878787878787878787878787878787878
 (0,10) | 10 |        1 |   3231 |      0 | \x0900000000000000e0070000787878787878787878787878787878787878
 (0,11) | 11 |        1 |   3231 |      0 | \x0a00000000000000e0070000787878787878787878787878787878787878
 (0,12) | 12 |        1 |   3231 |      0 | \x0b00000000000000e0070000787878787878787878787878787878787878
 (0,13) | 13 |        1 |   3231 |      0 | \x0c00000000000000e0070000787878787878787878787878787878787878
 (0,14) | 14 |        1 |   3231 |      0 | \x0d00000000000000e0070000787878787878787878787878787878787878
 (0,15) | 15 |        1 |   3231 |      0 | \x0e00000000000000e0070000787878787878787878787878787878787878
 (0,16) | 16 |        1 |   3236 |      0 | \xe40b0000000000003f797979797979797979797979797979797979797979
 (0,17) | 17 |        1 |   3236 |      0 | \xe50b0000000000003f797979797979797979797979797979797979797979
 (0,18) | 18 |        1 |   3236 |      0 | \xe60b0000000000003f797979797979797979797979797979797979797979
 (0,19) | 19 |        1 |   3236 |      0 | \xe70b0000000000003f797979797979797979797979797979797979797979
 (0,20) | 20 |        1 |   3236 |      0 | \xe80b0000000000003f797979797979797979797979797979797979797979
 (0,21) | 21 |        1 |   3236 |      0 | \xe90b0000000000003f797979797979797979797979797979797979797979
 (0,22) | 22 |        1 |   3236 |      0 | \xea0b0000000000003f797979797979797979797979797979797979797979
 (0,23) | 23 |        1 |   3236 |      0 | \xeb0b0000000000003f797979797979797979797979797979797979797979
(23 rows)

Even before VACUUM, the space released in the heap table when dead tuples are converted to metadata stubs, limiting the consequence of space amplification.

This is possible in heap tables because a row has no specific place. It can be inserted wherever there is enough free space.

The table size is now:

postgres=# select relname,pg_table_size(oid),pg_size_pretty(pg_table_size(oid))
           from pg_class where relname like 'demo%'
;
  relname  | pg_table_size | pg_size_pretty
-----------+---------------+----------------
 demo      |        204800 | 200 kB
 demo_pkey |        172032 | 168 kB
(2 rows)

Delete large ranges

In B-Tree, each value has a place so the space can be reused only if we insert values in the same range. However, we have seen that the old versions were marked as dead, reducing read amplification, but not removed, so space amplification cannot be reused.

Let's delete a large range of rows that concern all entries in the block:

postgres=# delete from demo
           where id <= 13 -- #0d
;

They have been marked as old version in the heap table:

postgres=# select t_ctid, lp, lp_flags, t_xmin, t_xmax, substr(t_data,1,30)
           from heap_page_items(get_raw_page('demo', 0))
;
 t_ctid | lp | lp_flags | t_xmin | t_xmax |                             substr
--------+----+----------+--------+--------+----------------------------------------------------------------
 (0,1)  |  1 |        1 |   3231 |   3237 | \x0000000000000000e0070000787878787878787878787878787878787878
 (0,2)  |  2 |        1 |   3231 |   3237 | \x0100000000000000e0070000787878787878787878787878787878787878
        |  3 |        3 |        |        |
 (0,4)  |  4 |        1 |   3231 |   3237 | \x0300000000000000e0070000787878787878787878787878787878787878
 (0,5)  |  5 |        1 |   3231 |   3237 | \x0400000000000000e0070000787878787878787878787878787878787878
 (0,6)  |  6 |        1 |   3231 |   3237 | \x0500000000000000e0070000787878787878787878787878787878787878
 (0,7)  |  7 |        1 |   3231 |   3237 | \x0600000000000000e0070000787878787878787878787878787878787878
 (0,8)  |  8 |        1 |   3231 |   3237 | \x0700000000000000e0070000787878787878787878787878787878787878
 (0,9)  |  9 |        1 |   3231 |   3237 | \x0800000000000000e0070000787878787878787878787878787878787878
 (0,10) | 10 |        1 |   3231 |   3237 | \x0900000000000000e0070000787878787878787878787878787878787878
 (0,11) | 11 |        1 |   3231 |   3237 | \x0a00000000000000e0070000787878787878787878787878787878787878
 (0,12) | 12 |        1 |   3231 |   3237 | \x0b00000000000000e0070000787878787878787878787878787878787878
 (0,13) | 13 |        1 |   3231 |   3237 | \x0c00000000000000e0070000787878787878787878787878787878787878
 (0,14) | 14 |        1 |   3231 |   3237 | \x0d00000000000000e0070000787878787878787878787878787878787878
 (0,15) | 15 |        1 |   3231 |      0 | \x0e00000000000000e0070000787878787878787878787878787878787878
 (0,16) | 16 |        1 |   3236 |      0 | \xe40b0000000000003f797979797979797979797979797979797979797979
 (0,17) | 17 |        1 |   3236 |      0 | \xe50b0000000000003f797979797979797979797979797979797979797979
 (0,18) | 18 |        1 |   3236 |      0 | \xe60b0000000000003f797979797979797979797979797979797979797979
 (0,19) | 19 |        1 |   3236 |      0 | \xe70b0000000000003f797979797979797979797979797979797979797979
 (0,20) | 20 |        1 |   3236 |      0 | \xe80b0000000000003f797979797979797979797979797979797979797979
 (0,21) | 21 |        1 |   3236 |      0 | \xe90b0000000000003f797979797979797979797979797979797979797979
 (0,22) | 22 |        1 |   3236 |      0 | \xea0b0000000000003f797979797979797979797979797979797979797979
 (0,23) | 23 |        1 |   3236 |      0 | \xeb0b0000000000003f797979797979797979797979797979797979797979
(23 rows)

Nothing changed in the B-Tree leaf:

postgres=# select substr(data,1,42), itemoffset, htid, itemlen, nulls, vars, dead
           from bt_page_items(get_raw_page('demo_pkey', 1))
           order by data
;
                   substr                   | itemoffset |  ctid  |  htid  | itemlen | nulls | vars | dead
--------------------------------------------+------------+--------+--------+---------+-------+------+------
 00 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          2 | (0,1)  | (0,1)  |     520 | f     | t    | f
 01 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          3 | (0,2)  | (0,2)  |     520 | f     | t    | f
 02 00 00 00 00 00 00 00 e0 07 00 00 58 58  |          5 | (1,2)  | (1,2)  |     520 | f     | t    | f
 02 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          4 | (0,3)  | (0,3)  |     520 | f     | t    | t
 03 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          6 | (0,4)  | (0,4)  |     520 | f     | t    | f
 04 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          7 | (0,5)  | (0,5)  |     520 | f     | t    | f
 05 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          8 | (0,6)  | (0,6)  |     520 | f     | t    | f
 06 00 00 00 00 00 00 00 e0 07 00 00 78 78  |          9 | (0,7)  | (0,7)  |     520 | f     | t    | f
 07 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         10 | (0,8)  | (0,8)  |     520 | f     | t    | f
 08 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         11 | (0,9)  | (0,9)  |     520 | f     | t    | f
 09 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         12 | (0,10) | (0,10) |     520 | f     | t    | f
 0a 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         13 | (0,11) | (0,11) |     520 | f     | t    | f
 0b 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         14 | (0,12) | (0,12) |     520 | f     | t    | f
 0c 00 00 00 00 00 00 00 e0 07 00 00 78 78  |         15 | (0,13) | (0,13) |     520 | f     | t    | f
 0d 00 00 00 00 00 00 00                    |          1 | (0,1)  |        |      16 | f     | f    |
(15 rows)

I query this range of rows:

postgres=# explain (analyze, buffers)
           select from demo
           where id <= 13 
;
                                            QUERY PLAN
--------------------------------------------------------------------------------------------------
 Seq Scan on demo  (cost=0.00..52.50 rows=840 width=0) (actual time=0.161..0.161 rows=0 loops=1)
   Filter: (id <= 13)
   Rows Removed by Filter: 1987
   Buffers: local hit=21
 Planning Time: 0.047 ms
 Execution Time: 0.172 ms
(6 rows)

This has removed data for the old versions in the heap table:

postgres=# select t_ctid, lp, lp_flags, t_xmin, t_xmax, substr(t_data,1,30)
           from heap_page_items(get_raw_page('demo', 0))
;
 t_ctid | lp | lp_flags | t_xmin | t_xmax |                             substr
--------+----+----------+--------+--------+----------------------------------------------------------------
        |  1 |        3 |        |        |
        |  2 |        3 |        |        |
        |  3 |        3 |        |        |
        |  4 |        3 |        |        |
        |  5 |        3 |        |        |
        |  6 |        3 |        |        |
        |  7 |        3 |        |        |
        |  8 |        3 |        |        |
        |  9 |        3 |        |        |
        | 10 |        3 |        |        |
        | 11 |        3 |        |        |
        | 12 |        3 |        |        |
        | 13 |        3 |        |        |
        | 14 |        3 |        |        |
 (0,15) | 15 |        1 |   3231 |      0 | \x0e00000000000000e0070000787878787878787878787878787878787878
 (0,16) | 16 |        1 |   3236 |      0 | \xe40b0000000000003f797979797979797979797979797979797979797979
 (0,17) | 17 |        1 |   3236 |      0 | \xe50b0000000000003f797979797979797979797979797979797979797979
 (0,18) | 18 |        1 |   3236 |      0 | \xe60b0000000000003f797979797979797979797979797979797979797979
 (0,19) | 19 |        1 |   3236 |      0 | \xe70b0000000000003f797979797979797979797979797979797979797979
 (0,20) | 20 |        1 |   3236 |      0 | \xe80b0000000000003f797979797979797979797979797979797979797979
 (0,21) | 21 |        1 |   3236 |      0 | \xe90b0000000000003f797979797979797979797979797979797979797979
 (0,22) | 22 |        1 |   3236 |      0 | \xea0b0000000000003f797979797979797979797979797979797979797979
 (0,23) | 23 |        1 |   3236 |      0 | \xeb0b0000000000003f797979797979797979797979797979797979797979
(23 rows)

However, as it was a Seq Scan, nothing changed in the index.

To mark the index entries as dead, I need an Index Scan (not a Bitmap Index Scan) but this will not release the space.

The size on disk didn't change:

postgres=# select relname,pg_table_size(oid),pg_size_pretty(pg_table_size(oid))
           from pg_class where relname like 'demo%'
;
  relname  | pg_table_size | pg_size_pretty
-----------+---------------+----------------
 demo      |        204800 | 200 kB
 demo_pkey |        172032 | 168 kB
(2 rows)

Vacuum

To release the space in the B-Tree, we need to run VACUUM:

postgres=# vacuum demo;
VACUUM

This completely removes the dead tuples (data and header):

postgres=# select t_ctid, lp, lp_flags, t_xmin, t_xmax, substr(t_data,1,30)
           from heap_page_items(get_raw_page('demo', 0))
;
 t_ctid | lp | lp_flags | t_xmin | t_xmax |                             substr
--------+----+----------+--------+--------+----------------------------------------------------------------
        |  1 |        0 |        |        |
        |  2 |        0 |        |        |
        |  3 |        0 |        |        |
        |  4 |        0 |        |        |
        |  5 |        0 |        |        |
        |  6 |        0 |        |        |
        |  7 |        0 |        |        |
        |  8 |        0 |        |        |
        |  9 |        0 |        |        |
        | 10 |        0 |        |        |
        | 11 |        0 |        |        |
        | 12 |        0 |        |        |
        | 13 |        0 |        |        |
        | 14 |        0 |        |        |
 (0,15) | 15 |        1 |   3231 |      0 | \x0e00000000000000e0070000787878787878787878787878787878787878
 (0,16) | 16 |        1 |   3236 |      0 | \xe40b0000000000003f797979797979797979797979797979797979797979
 (0,17) | 17 |        1 |   3236 |      0 | \xe50b0000000000003f797979797979797979797979797979797979797979
 (0,18) | 18 |        1 |   3236 |      0 | \xe60b0000000000003f797979797979797979797979797979797979797979
 (0,19) | 19 |        1 |   3236 |      0 | \xe70b0000000000003f797979797979797979797979797979797979797979
 (0,20) | 20 |        1 |   3236 |      0 | \xe80b0000000000003f797979797979797979797979797979797979797979
 (0,21) | 21 |        1 |   3236 |      0 | \xe90b0000000000003f797979797979797979797979797979797979797979
 (0,22) | 22 |        1 |   3236 |      0 | \xea0b0000000000003f797979797979797979797979797979797979797979
 (0,23) | 23 |        1 |   3236 |      0 | \xeb0b0000000000003f797979797979797979797979797979797979797979
(23 rows)

The value 0 of lp_flags means "unused".

It is important to understand that running VACUUM on a table only makes some space reusable and does not reduce the size on disk, except when there are no rows above the dead tuples so the file high water mark can be lowered.

In the index, VACUUM removes dead entries, freeing space for new inserts in the same value range.

Here because I deleted all index entries in the B-Tree leaf block, the full block has been reclaimed to be reused:

postgres=# select substr(data,1,42), itemoffset, ctid, htid, itemlen, nulls, vars, dead
           from bt_page_items(get_raw_page('demo_pkey', 1))
           order by data
;
NOTICE:  page is deleted
NOTICE:  page from block is deleted
 substr | itemoffset | ctid | htid | itemlen | nulls | vars | dead
--------+------------+------+------+---------+-------+------+------
(0 rows)

On disk, the space used didn't reduce and the heap table is even larger:

postgres=# select relname,pg_table_size(oid),pg_size_pretty(pg_table_size(oid))
           from pg_class where relname like 'demo%'
;
  relname  | pg_table_size | pg_size_pretty
-----------+---------------+----------------
 demo      |        212992 | 208 kB
 demo_pkey |        172032 | 168 kB
(2 rows)

Insert again

I insert the same amount of rows that have been deleted and vacuumed:

postgres=# insert into demo
           select generate_series(5000,5013), lpad('x',500,'x')
;
INSERT 0 14

The table size did not change because the rows were inserted into block 0 where space has been released, but the index is larger by one block:

  relname  | pg_table_size | pg_size_pretty
-----------+---------------+----------------
 demo      |        212992 | 208 kB
 demo_pkey |        180224 | 176 kB
(2 rows)

This blog is already too long to explain the block 1 was not reused. PostgreSQL MVCC is full of suprises.

Reindex

Indexes can be reorganized without blocking the application:

postgres=# reindex index concurrently demo_pkey;
;
REINDEX

The block that was emptied and deleted is finally reclaimed:

postgres=# select relname,pg_table_size(oid),pg_size_pretty(pg_table_size(oid))
           from pg_class where relname like 'demo%'
;
  relname  | pg_table_size | pg_size_pretty
-----------+---------------+----------------
 demo      |        212992 | 208 kB
 demo_pkey |        172032 | 168 kB
(2 rows)

Before rebuilding indexes, ensure they have enough free space to run efficiently. Packing too much can cause block splitting and fragmentation.

Vacuum full

If you know that you won't be inserting any new data into a table, you can reclaim disk space by reorganizing the table and its index. PostgreSQL offers a tool called VACUUM FULL, but it locks the table during the process and can take a long time. Alternatively, you can use external solutions like pg_repack or extensions like pg_squeeze. This is also a good opportunity to consider the FILLFACTOR: by leaving some free space in the blocks, you can reduce the read amplification that follows updates and allow new versions to be stored in the same block as the old version.

A bit more about the terminology

We have seen multiple states of "dead tuples" here: before being dead, when an old version can still be read, when being expired but still taking full space, when space is released except the header, and when it is definitely gone. Here is more explanation about the right definition of "dead tuple", by Matthias van de Meent. In this article I used the same definition: the dead tuples have passed the longest transaction XMIN horizon and that's why they were cleaned up. Long queries a blocking this process.

Both this article and this explanation has been greatly summarized by Hussein Nasser in a short video.

DEV Community