Aadil Bashir
Vacuum Processing - A Beginner's Guide

Welcome to the sixth article of my blog series. In this blog, I will be describing Vacuum Processing in PostgreSQL.

What is Vacuum Processing?

Vacuum processing is an essential maintenance task in PostgreSQL that keeps the database operating reliably by removing dead tuples and freezing old transaction IDs. Dead tuples are row versions that are no longer visible to any transaction and can be safely removed, while freezing transaction IDs prevents transaction ID wraparound, which would otherwise corrupt visibility information. Vacuum processing has two modes: Concurrent VACUUM and Full VACUUM. Concurrent VACUUM removes dead tuples page by page within the table file, and other transactions can still read the table while it runs. In contrast, Full VACUUM removes dead tuples and defragments live tuples across the whole file, and other transactions cannot access the table while it is running.
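As a rough illustration (not PostgreSQL's actual visibility logic), a heap tuple can be modeled as dead once the transaction that deleted it has committed. The field names below mirror PostgreSQL's `xmin`/`xmax` tuple headers, but the rules are heavily simplified:

```python
# Toy model of heap tuples: a tuple is "dead" once a committed
# transaction has set its xmax (the deleting transaction's ID).
# This is a simplification of PostgreSQL's real visibility rules.

from dataclasses import dataclass
from typing import Optional

@dataclass
class HeapTuple:
    value: str
    xmin: int                   # txid that inserted the tuple
    xmax: Optional[int] = None  # txid that deleted it, if any

def is_dead(t: HeapTuple, committed: set) -> bool:
    """A tuple is dead if a deleter exists and has committed."""
    return t.xmax is not None and t.xmax in committed

committed = {100, 101}
page = [HeapTuple("a", 100),
        HeapTuple("b", 100, xmax=101),   # deleted by committed txid 101
        HeapTuple("c", 102)]
live = [t.value for t in page if not is_dead(t, committed)]
print(live)  # ['a', 'c']
```

Concurrent VACUUM's job, in these terms, is to find and reclaim the tuples for which `is_dead` is true.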

Despite being crucial to PostgreSQL's performance, vacuum processing has historically improved more slowly than other parts of the system. Until version 8.0 it had to be executed manually; automation arrived in 2005 with the implementation of the autovacuum daemon. Because vacuum processing scans whole tables, it can be a costly operation. PostgreSQL 8.4 therefore introduced the Visibility Map (VM), which tracks pages containing only live tuples so that the vacuum process can skip scanning them. Version 9.6 enhanced the VM further to speed up the freeze process. The PostgreSQL documentation covers each aspect in detail: concurrent VACUUM, the VM, freeze processing, removal of unnecessary clog files, the autovacuum daemon, and Full VACUUM.

Tasks of Vacuum Processing

  1. Removing dead tuples: It removes dead tuples from tables and defragments live tuples on a page-by-page basis. It also removes index tuples that point to dead tuples.
  2. Freezing old txids: It freezes old transaction IDs (txids) of tuples if necessary. This involves updating frozen txid-related system catalogs and ensuring data integrity.
  3. Removing unnecessary parts of the clog: It removes unnecessary portions of the commit log (clog) if possible, freeing up space.

Furthermore, the vacuum process runs as a loop over the target tables, with three blocks executed for each table, followed by post-processing steps:

  1. The first block scans the target table, identifying dead tuples and freezing old tuples. It builds a list of dead tuples stored in memory.
  2. The second block removes dead tuples, defragments live tuples, and updates the free space map (FSM) and visibility map (VM) on a page-by-page basis.
  3. The third block performs cleanup after removing index tuples and updates statistics and system catalogs related to vacuum processing for each table. If the last page of a table is empty, it is truncated from the table file.

Finally, post-processing involves updating statistics, system catalogs, and removing unnecessary parts of the commit log.
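The control flow above can be sketched in miniature. This is an invented model, not PostgreSQL's implementation: real vacuum works on 8 KB heap pages, index files, and shared memory, none of which are represented here:

```python
# Illustrative sketch of the concurrent-vacuum loop described above.
# Tables are modeled as lists of pages; each tuple is a dict with a
# "dead" flag. FSM/VM updates and index cleanup are only noted in
# comments, not modeled.

def vacuum(tables):
    for table in tables:
        # Block 1: scan the table, building an in-memory list of dead
        # tuples (and freezing old tuples, not modeled here).
        dead = [(p, i) for p, page in enumerate(table["pages"])
                       for i, t in enumerate(page) if t["dead"]]
        # Block 2: remove dead tuples and defragment live tuples
        # page by page, then update the FSM and VM.
        for p, page in enumerate(table["pages"]):
            table["pages"][p] = [t for t in page if not t["dead"]]
        # Block 3: clean up indexes, update statistics and system
        # catalogs, and truncate trailing empty pages from the file.
        while table["pages"] and not table["pages"][-1]:
            table["pages"].pop()
    # Post-processing: update stats and trim unnecessary clog segments.

t = {"pages": [[{"dead": False}, {"dead": True}], [{"dead": True}]]}
vacuum([t])
print(t["pages"])  # [[{'dead': False}]] -- second page was truncated
```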

Overall, vacuum processing in PostgreSQL aims to optimize database performance by removing dead tuples, freezing transaction IDs, and maintaining data integrity while minimizing resource usage.

Visibility Map

The introduction of the Visibility Map (VM) in PostgreSQL 8.4 aimed to reduce the cost of vacuum processing. The VM is a per-table map that records, for each page, whether the page contains only tuples visible to all transactions (i.e., no dead tuples). Using the VM, the vacuum process can skip pages that hold no dead tuples, reducing scan work and resource usage.

The VM is stored as one or more 8 KB pages in a separate file. It complements the Free Space Map (FSM) and provides valuable information about page visibility. This allows the vacuum process to efficiently target and clean up pages that require attention, improving the overall efficiency of vacuum processing in PostgreSQL.
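Conceptually, the VM behaves like one bit per heap page. A minimal sketch (the function name and list representation are invented for illustration):

```python
# Minimal sketch of how a visibility map lets vacuum skip pages.
# One flag per heap page: True means "all tuples on this page are
# visible to every transaction", so vacuum need not visit it.

def pages_to_scan(visibility_map):
    """Return indexes of pages that vacuum must actually visit."""
    return [p for p, all_visible in enumerate(visibility_map)
            if not all_visible]

vm = [True, False, True, True, False]  # pages 1 and 4 may hold dead tuples
print(pages_to_scan(vm))  # [1, 4]
```

With this map, a table where most pages are all-visible costs vacuum only a fraction of a full scan.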

Freeze Processing

PostgreSQL offers two modes for freeze processing: lazy mode and eager mode. In lazy mode, the freeze process scans only the pages containing dead tuples, utilizing the visibility map (VM) of the target tables. This approach efficiently skips pages without dead tuples. Conversely, eager mode scans all pages, regardless of the presence of dead tuples. It updates relevant system catalogs, eliminates unnecessary files and pages from the commit log, and performs a thorough freeze process. The selection of the mode depends on specific conditions, with lazy mode being the default and eager mode triggered under certain criteria.

PostgreSQL version 9.6 introduced enhancements to improve the efficiency of eager mode. The visibility map was enhanced to include information indicating whether all tuples on a page are frozen. This enhancement enables the eager mode freeze process to bypass pages that exclusively contain frozen tuples. By leveraging this information from the visibility map, the need for scanning pages with fully frozen tuples is eliminated, resulting in faster and more efficient freeze processing.

In short, lazy mode (the default) scans only pages that may contain dead tuples, while eager mode scans all pages and performs a comprehensive freeze. The VM enhancement introduced in version 9.6 lets eager mode skip pages whose tuples are all frozen, optimizing the freeze process while maintaining data integrity.
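The difference between the two modes, and the 9.6 all-frozen optimization, can be sketched as follows. The tuple of flags per page is a simplification of the VM's actual on-disk format:

```python
# Sketch of freeze-mode page selection. Each VM entry is modeled as a
# pair of flags per page: (all_visible, all_frozen).

def lazy_freeze_pages(vm):
    """Lazy mode visits only pages that are not all-visible."""
    return [p for p, (all_visible, _) in enumerate(vm) if not all_visible]

def eager_freeze_pages(vm):
    """Eager mode (9.6+) visits every page except those whose tuples
    are already all frozen."""
    return [p for p, (_, all_frozen) in enumerate(vm) if not all_frozen]

vm = [(True, True),    # page 0: all visible and all frozen
      (True, False),   # page 1: all visible, but not yet frozen
      (False, False)]  # page 2: may contain dead tuples
print(lazy_freeze_pages(vm))   # [2]
print(eager_freeze_pages(vm))  # [1, 2] -- page 0 is skipped
```

Before 9.6, eager mode had no all-frozen information and would have scanned all three pages.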

Removing Unnecessary Clog Files

In PostgreSQL, the commit log (clog) is responsible for storing transaction states. When the minimum frozen transaction ID, indicated by pg_database.datfrozenxid, is updated, PostgreSQL takes steps to eliminate redundant clog files. By identifying the clog file that holds the minimum frozen transaction ID, PostgreSQL can safely remove older clog files and their associated pages, thereby reclaiming disk space. This process ensures that the necessary transaction information is retained for maintaining data integrity in the database while optimizing disk usage.
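A rough model of that truncation: the clog is a sequence of fixed-size segment files, and any segment that lies entirely before the one holding the minimum frozen txid can be removed. The segment size below is invented for illustration (real clog segments hold far more transaction states):

```python
# Rough model of clog truncation. Segments older than the one holding
# the cluster-wide minimum datfrozenxid are safe to remove.

XIDS_PER_SEGMENT = 32_768  # illustrative value, not PostgreSQL's

def removable_segments(segments, min_datfrozenxid):
    """segments: sorted segment numbers currently on disk.
    Returns the segments that no longer hold needed txid states."""
    needed_from = min_datfrozenxid // XIDS_PER_SEGMENT
    return [s for s in segments if s < needed_from]

# datfrozenxid falls in segment 3, so segments 0-2 can go.
print(removable_segments([0, 1, 2, 3], 3 * 32_768 + 10))  # [0, 1, 2]
```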

Autovacuum Daemon

The introduction of the autovacuum daemon in PostgreSQL has automated vacuum processing, making database management considerably more convenient. The autovacuum daemon periodically invokes autovacuum worker processes: by default it wakes up every minute (autovacuum_naptime) and can run up to three workers concurrently (autovacuum_max_workers). These workers perform vacuum processing on different tables in parallel, ensuring a gradual and efficient cleanup of dead tuples while minimizing disruption to ongoing database activity. This automation simplifies vacuum management and improves the performance and stability of the PostgreSQL database.
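Whether a given table is picked up by a worker follows a documented rule: autovacuum vacuums a table once its dead-tuple count exceeds `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples` (defaults 50 and 0.2). A small sketch of that decision:

```python
# PostgreSQL's documented autovacuum trigger condition: vacuum a table
# when dead tuples exceed threshold + scale_factor * reltuples
# (defaults: autovacuum_vacuum_threshold = 50,
#            autovacuum_vacuum_scale_factor = 0.2).

def needs_autovacuum(n_dead_tup, reltuples,
                     threshold=50, scale_factor=0.2):
    return n_dead_tup > threshold + scale_factor * reltuples

print(needs_autovacuum(300, 1000))  # True  (300 > 50 + 0.2 * 1000)
print(needs_autovacuum(200, 1000))  # False (200 <= 250)
```

Both parameters can be tuned per table with `ALTER TABLE ... SET (...)`, which is useful for very large or very hot tables.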


Full VACUUM

While concurrent VACUUM in PostgreSQL is crucial for maintaining database performance, it has a limitation: it cannot reduce table size. Even when a significant number of dead tuples is removed, the physical size of the table file remains unchanged. This is illustrated by the example in Figure 6.8 of the book, where a table consisting of three pages, each containing six tuples, undergoes dead tuple removal through a DELETE command followed by a VACUUM command. Despite the removal of the dead tuples, the table file does not shrink.

This inefficiency wastes disk space and can hurt database performance: accessing even a small subset of tuples still requires loading entire pages from disk, increasing disk I/O and slowing query execution. To address this, additional operations such as Full VACUUM or CLUSTER are necessary. These physically reorganize the table and compact its contents, allowing unused space to be reclaimed, which reduces the table size, optimizes disk usage, and improves performance.
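The contrast can be sketched numerically. Plain vacuum frees space inside pages without changing the page count, while a full vacuum rewrites the live tuples into the fewest pages needed (the page layout below is invented to match the three-page example):

```python
# Sketch of why plain VACUUM leaves the file size unchanged while a
# full vacuum shrinks it. None marks a dead tuple slot.

TUPLES_PER_PAGE = 6  # matches the three-page, six-tuple example above

def plain_vacuum(pages):
    """Remove dead tuples in place; the page count stays the same."""
    return [[t for t in page if t is not None] for page in pages]

def full_vacuum(pages):
    """Repack all live tuples densely into as few pages as possible."""
    live = [t for page in pages for t in page if t is not None]
    return [live[i:i + TUPLES_PER_PAGE]
            for i in range(0, len(live), TUPLES_PER_PAGE)]

pages = [[1, None, 2, None, None, None],
         [None, None, 3, None, None, None],
         [4, None, None, 5, None, None]]
print(len(plain_vacuum(pages)))  # 3 pages: file size unchanged
print(len(full_vacuum(pages)))   # 1 page: table shrinks
```

The trade-off, as noted above, is that a full vacuum takes an exclusive lock on the table while it rewrites it.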


If you want to read it in more detail, click the link below:
Chapter-6 The Internals of PostgreSQL
