Autovacuum Launcher Crash: Assert in pgstat_count_io_op (IOOP_EXTEND on pg_database's VM)
Technical Problem
This thread reports a crash in the autovacuum launcher triggered by an assertion failure in pgstat_count_io_op when performing an IOOP_EXTEND operation on the visibility map (VM) of pg_database. The crash was discovered during stress testing on master (commit e2b35735b00) with assertions enabled, using a workload involving rapid creation and dropping of databases in a tight loop.
Root Cause Analysis
The core issue lies at the intersection of several PostgreSQL subsystems:
- Visibility Map Extension: When autovacuum processes pg_database (a shared catalog), it may need to extend the visibility map. The VM is a fork of the relation that tracks which pages are known to contain only tuples visible to all transactions.
- I/O Statistics Tracking: The pgstat_count_io_op function tracks I/O operations for the statistics subsystem. It contains assertions that validate the combination of I/O operation type, backend type, and I/O object being operated on.
- The Assertion Failure: The autovacuum launcher (as distinct from autovacuum workers) is apparently performing or triggering an I/O operation (IOOP_EXTEND on a relation fork) that the statistics subsystem does not expect from that particular backend type. The pgstat I/O statistics infrastructure has a matrix of valid combinations of (BackendType, IOObject, IOContext, IOOp), and the autovacuum launcher extending a VM file likely falls outside the expected valid combinations.
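The validity matrix described above can be pictured with a small, self-contained C model. This is only a sketch under stated assumptions: every Sk/sk_ name is hypothetical, the enums are cut down to a handful of values, and the two rules are guesses shaped by the report rather than PostgreSQL's actual rule set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for BackendType, IOObject, IOContext, IOOp. */
typedef enum { SK_BACKEND, SK_AUTOVAC_LAUNCHER, SK_AUTOVAC_WORKER, SK_NUM_BACKENDS } SkBackendType;
typedef enum { SK_OBJ_RELATION, SK_OBJ_TEMP_RELATION, SK_NUM_OBJECTS } SkIOObject;
typedef enum { SK_CTX_NORMAL, SK_CTX_VACUUM, SK_NUM_CONTEXTS } SkIOContext;
typedef enum { SK_OP_READ, SK_OP_WRITE, SK_OP_EXTEND, SK_NUM_OPS } SkIOOp;

/* Returns true when the combination is one the stats code expects to see. */
bool
sk_tracks_io_op(SkBackendType bt, SkIOObject obj, SkIOContext ctx, SkIOOp op)
{
    (void) ctx;                 /* the context is not used by these two rules */

    /* Assumed rule: autovacuum processes never touch temporary relations. */
    if ((bt == SK_AUTOVAC_LAUNCHER || bt == SK_AUTOVAC_WORKER) &&
        obj == SK_OBJ_TEMP_RELATION)
        return false;

    /* Assumed rule the report trips over: the launcher never extends a fork. */
    if (bt == SK_AUTOVAC_LAUNCHER && op == SK_OP_EXTEND)
        return false;

    return true;
}

/* One counter per (backend type, object, context, op) cell. */
unsigned long sk_counts[SK_NUM_BACKENDS][SK_NUM_OBJECTS][SK_NUM_CONTEXTS][SK_NUM_OPS];

/* Counts one operation; in assert-enabled builds an unexpected cell aborts. */
void
sk_count_io_op(SkBackendType bt, SkIOObject obj, SkIOContext ctx, SkIOOp op)
{
    assert(sk_tracks_io_op(bt, obj, ctx, op));
    sk_counts[bt][obj][ctx][op]++;
}
```

In this model, a worker extending a relation fork in the vacuum context is counted normally, while sk_count_io_op(SK_AUTOVAC_LAUNCHER, SK_OBJ_RELATION, SK_CTX_NORMAL, SK_OP_EXTEND) aborts in an assert-enabled build, which is the shape of the failure reported here.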
Architectural Significance
This bug exposes a subtle design tension:
- Autovacuum launcher vs. worker distinction: The launcher is primarily responsible for scheduling and spawning workers, but on this analysis it apparently also performs lightweight maintenance of shared catalogs (like pg_database) directly. If so, the launcher can perform I/O operations that the statistics infrastructure has not accounted for.
- Shared catalog vacuuming: pg_database is a shared catalog, and handling it in the launcher process would avoid spawning a full worker for a typically small table. However, if the table's VM needs extension (e.g., after many databases are created and dropped, causing pg_database to grow), this triggers a code path that was not covered by the I/O stats assertion matrix.
- Statistics subsystem completeness: The pgstat I/O tracking introduced in PostgreSQL 16 (the pg_stat_io view) aims to comprehensively track all I/O operations by backend type. This bug suggests the matrix of valid (backend_type, io_object, io_context, io_op) tuples is incomplete for the autovacuum launcher's catalog maintenance path.
Proposed Solutions
Given this is a single-message thread (initial bug report), no formal patches have been proposed yet. However, the likely fixes would include:
- Expand the IO stats valid combination matrix: Add IOOP_EXTEND as a valid operation for the autovacuum launcher backend type when operating on relation forks (specifically VM/FSM forks of shared catalogs).
- Separate the launcher's vacuum path: Potentially refactor so that shared catalog vacuuming is delegated to a worker rather than performed in the launcher, though this would be a larger architectural change and may not be desirable for performance reasons.
- Relax the assertion: Make the assertion less strict for edge cases, though this is the least desirable fix as it reduces the diagnostic value of the assertion framework.
Key Technical Context
- The visibility map extension during vacuum is triggered when a heap page is about to be marked all-visible (the VM page is pinned with visibilitymap_pin() ahead of visibilitymap_set()) but the VM doesn't yet have enough pages to cover the heap's current size.
- The stress test of create/drop database in a tight loop causes pg_database to grow (new tuples for each database), and after drops, vacuum marks pages all-visible, potentially needing VM extension.
- This is an assertion-only failure: production builds without assertions would not crash, but the extend would be counted under a combination the statistics code treats as invalid, so the reported I/O statistics could be incomplete or misleading.
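The "not enough VM pages" condition in the first point comes down to simple arithmetic, sketched below in C. The constants are assumptions for a default build (8192-byte blocks, a 24-byte page header, two VM bits per heap page for all-visible and all-frozen), and every SK_/sk_ name is hypothetical rather than taken from the PostgreSQL source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_BLCKSZ               8192u   /* assumed default block size */
#define SK_PAGE_HEADER_SIZE     24u     /* assumed aligned page header size */
#define SK_BITS_PER_HEAPBLOCK   2u      /* all-visible bit + all-frozen bit */

/* Bytes of bitmap available in one VM page. */
#define SK_MAPSIZE              (SK_BLCKSZ - SK_PAGE_HEADER_SIZE)
/* Heap pages described by one bitmap byte. */
#define SK_HEAPBLOCKS_PER_BYTE  (8u / SK_BITS_PER_HEAPBLOCK)
/* Heap pages described by one whole VM page. */
#define SK_HEAPBLOCKS_PER_PAGE  (SK_MAPSIZE * SK_HEAPBLOCKS_PER_BYTE)

/* Which VM page holds the bits for a given heap block. */
uint32_t
sk_heapblk_to_mapblock(uint32_t heap_blk)
{
    return heap_blk / SK_HEAPBLOCKS_PER_PAGE;
}

/* True when the VM fork (vm_nblocks pages long) does not yet cover heap_blk. */
bool
sk_vm_needs_extend(uint32_t heap_blk, uint32_t vm_nblocks)
{
    return sk_heapblk_to_mapblock(heap_blk) >= vm_nblocks;
}
```

Under these assumptions one VM page covers 32672 heap pages, so for a small catalog such as pg_database the case that matters is most likely the very first one: sk_vm_needs_extend(0, 0) is true when the VM fork has no pages yet, and a single extension then covers the whole table.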