Every time you delete or update a row in your database, the old records are secretly still hiding in the background and taking up space on your hard drive.
VACUUM process is like emptying the recycling bin on your laptop. It clears up space, reduces indexing time, and keeps your database squeaky clean.
But it's important to understand how
VACUUM works so you can avoid the equivalent of vacuuming your house in the middle of a dinner party.
By understanding how and when Postgres and other databases clean themselves, you will be able to tune your database for low response times and your database server for the right amount of storage.
In this post we will cover:
- What a VACUUM process is and how it works
- Full vs Auto VACUUM
- How to modify and inspect this process
During the life of a database, we might make thousands of changes to a database like above. But deleting a record does not actually free up the disk space.
Postgres uses multi-version concurrency control (also known as MVCC) to ensure concurrent data access and consistency.
Whenever we execute a SQL statement, it uses a snapshot of data instead of every row. This prevents users from viewing inconsistent data generated by concurrent transactions. It also minimizes lock contentions for different sessions trying to read or write data.
Each transaction gets a transaction ID – XID for a block of
COMMIT statement. When a user inserts a row, but the transaction is not committed, other users cannot see the newly inserted row.
For example, in the below image, User A inserts two records into an empty table. If User B were to run a
SELECT statement, they would get zero rows before the record committed. Similarly, another user still gets table data for a delete statement if the delete transaction statement is not committed.
When we execute a
UPDATE statement, Postgres does the following:
Delete command : Postgres does not remove the tuples physically; it marks the existing tuple as
Update command: The update statement is similar to a
INSERTstatement. Therefore, it marks the existing tuple as
DEADand inserts a new tuple.
If you have frequent DML( INSERT, UPDATE, DELETE) statements, these
DEAD tuples can take unnecessary space. To free up space, we have to run the following maintenance processes:
- Remove the dead tuples
- Remove index tuples pointing to the dead tuples
- Update the statistics
With this knowledge of how a SQL statement is executed and the maintenance requirements, we can discuss the
Postgres uses the
VACUUM maintenance process for removing
DEAD tuples. It reclaims space occupied by dead tuples for reuse in the further data insertion.
VACUUM process can run concurrently with other DML transactions because it does not put an exclusive lock on the table. It carries out following operations for removing dead tuples:
- Postgres scans all pages of a target table and builds a list of dead tuples. It freezes the old tuples if required.
- It removes the index tuples pointing to the dead tuples by referencing to the dead tuple list.
- It updates the statistics as well as the system catalog for the target table after the
VACCUMprocessing. It also updates the FSM (Free Space Map) and VM(Visibility Map).
I'll demonstrate the impact of
VACUUM by creating an example table, deleting some values, and then running a
VACUUM command. In this example I also use the
pg_freespacecamp extension to monitor improvements in space utilization.
create table SampleTable(id1 int, id2 int); insert into SampleTable values ( generate_series(1,100000), generate_series(1,100000) );
create extension pg_freespacemap; SELECT count(*) as npages, round(100 * avg(avail) / 8192, 2) as avg_fs_ratio FROM pg_freespace( 'SampleTable');
The table we created uses up 443 pages of space on the hard drive, and because all of the data was added sequentially, it has a perfect free space ratio of 0.
Now I'll delete every record with a value greater that 100. But if we re-run the command above, the number of pages and free space ratio don't change.
delete from SampleTable where id1 > 10;
To remove the unused pages, we have to run a VACUUM command.
Now when we track the space usage we can see the number of used pages has gone down from 443 to 1! But the ratio of free to used space on this page has also increased from 0 to 49%. We can return this space to the operating system with a
VACUUM FULL .
VACUUM process removes DEAD tuples for future usage, but it does not return the space to the operating system.
Therefore, if you perform a bulk data deletion or updates, you might be using too much storage due to space occupied by these DEAD tuples. The
VACUUM FULL process returns the space to the operating system, as seen in the picture below. It does the following tasks.
VACUUM FULLprocess obtains an exclusive lock on the table.
- It creates a new empty storage table file.
- Copy the live tuples to the new table storage.
- Removes the old table file and frees the storage.
- It rebuilds all associated table indexes, updates the system catalogs and statistics.
Let's see how running the
VACUUM FULL command impacts our previous sample table:
vacuum full SampleTable; SELECT count(*) as npages, round(100 * avg(avail) / 8192, 2) as avg_fs_ratio FROM pg_freespace( 'SampleTable');
The free space ratio is now down from almost 50% to 0.
VACUUM cleaning is costly because it needs to scan all pages of a target table. If you have a large table with million rows, it can be harmful to your database resources. To preserve resources, Postgres uses the Visibility Map. Each table in Postgres has a VM that determines whether the page in the table has dead tuples. If the page does not have a dead tuple, the vacuum processing skips the page.
For example, in the below image, we have a table with four pages. Two pages have of DEAD tuples. The visibility map uses a bitmap that defines dead tuples on a specific page.
- Bit 0: No dead tuples on the page, therefore skip
- Bit 1: Page consists of dead tuples; therefore,
VACCUMthat specific page.
Postgres automates the VACCUM processing using auto vacuum daemon. By default, it runs every 1 minute. When the VACCUM wakes up, it invokes three workers. These workers do the VACCUM processing on the target tables.
You can query pg_settings to check various configurations for the autovacuum process in Postgres:
select name, setting, category, short_desc from pg_settings where name like '%autovacuum%'
Automatically cleaning and reindexing the database every minute might not be optimal if you have millions of rows. Therefore, we can configure table level. If you specify a table-level configuration, it bypasses the global setting.
For example, in the below query, we set autovacuum for
SampleTable2 if it has more than 100 DEAD tuples.
ALTER TABLE SampleTable2 SET (autovacuum_vacuum_scale_factor = 0, autovacuum_vacuum_threshold = 100)
As we know, full vacuum process reclaims space to the operating system. However, FULL VACUUM requires an exclusive lock on the table for its processing, and it blocks all other transactions. In the below table, we can summaries the difference between these processes.
In this article, we covered:
- How Postgres implements
VACUUMto remove DEAD tuples
VACUUM FULLto return space back to the OS
- Configuring auto vacuum for individual tables
These concepts are useful for reducing the server costs of databases while maintaining high availability. With the queries shown in the above article you can inspect the internals of your database and its memory consumption.
If you want a SQL editor with an intuitive interface and easy data visualizations, try Arctype today.