Why clearing the VM doesn't require registering vm buffer in WAL record
The Core Bug
This thread uncovers a long-standing but only-recently-consequential correctness bug in how PostgreSQL WAL-logs visibility map (VM) modifications during heap_insert, heap_update, and heap_delete. When these operations need to clear VM bits (because a heap page is being modified and is no longer all-visible/all-frozen), they call visibilitymap_clear() on the VM buffer but do not register the VM buffer in the WAL record produced by the heap operation.
This violates the standard PostgreSQL WAL contract, which is roughly: any buffer modified within a critical section must be registered with the WAL record so that (a) an FPI can be emitted when needed for torn-page protection, and (b) WAL-consuming tools (summarizer, incremental backup, pg_rewind) know the block changed.
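The contract can be illustrated with a toy model (plain Python, not PostgreSQL code; all names here are illustrative): a block-level consumer such as the WAL summarizer can only ever see blocks a record explicitly registers, so a side-effect modification that skips registration is invisible to it.

```python
# Toy model of the WAL block-registration contract (not PostgreSQL code).
# A record "registers" the blocks it modifies; block-level consumers such
# as a WAL summarizer see only registered blocks.

class WALRecord:
    def __init__(self, rmgr_info):
        self.rmgr_info = rmgr_info
        self.block_refs = []            # (fork, blkno) pairs

    def register_buffer(self, fork, blkno):
        self.block_refs.append((fork, blkno))

def summarize(wal):
    """Collect every block any record claims to have modified."""
    changed = set()
    for rec in wal:
        changed.update(rec.block_refs)
    return changed

# The buggy pattern: heap_insert clears a VM bit as a side effect but
# registers only the heap block, so the VM block never reaches the summary.
rec = WALRecord("XLOG_HEAP_INSERT")
rec.register_buffer("main", 7)          # heap page: registered
# visibilitymap_clear() also dirtied VM block 0 -- but nothing registers it.

changed = summarize([rec])
assert ("main", 7) in changed
assert ("vm", 0) not in changed         # the summarizer never learns of it
```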
The consequences identified during the thread are a layered set of failures, each introduced by a subsequent feature:
- Checksums (9.3, commit 96ef3b8ff1c, 2013) — Without an FPI, a crash during a torn write of the VM page leaves an unrecoverable checksum failure, because the bit-clearing change is never replayed (it piggybacks on the heap record, which doesn't reference the VM block).
- Revamped WAL format (9.5, commit 2c03216d831, 2014) — The new block-reference scheme made the omission more structurally visible; it should have been fixed then.
- Incremental backup / WAL summarizer (17) — This is where the bug becomes a silent data corruption vector, because the walsummarizer only tracks blocks explicitly referenced in WAL. After a clear, the incremental backup will contain the old VM page with stale all-visible bits set, while the heap page correctly shows !PD_ALL_VISIBLE. Andres produced a minimal reproducer (INSERT+VACUUM, full backup, UPDATE, incremental, combine → inconsistent VM). verify_heapam() did not detect it; only the next VACUUM would notice via "page is not marked all-visible but visibility map bit is set".
pg_rewind is saved only by accident — it doesn't parse WAL for non-main forks and copies VM wholesale. Third-party tools (pg_probackup, WAL-G) have known about this for years and worked around it by copying VM wholesale, as Yura Sokolov confirmed.
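The corruption path can be sketched end to end with a minimal Python model of the reproducer's data flow (block contents and structures here are illustrative, not on-disk formats): the full backup captures the pre-UPDATE state, the WAL summary is missing the VM block, so reconstruction keeps the stale VM page.

```python
# Minimal model of how the missing VM block reference corrupts an
# incremental backup (mirrors the shape of Andres' reproducer; the block
# representations are illustrative, not real page formats).

# State after INSERT + VACUUM, captured by the full backup:
full = {("heap", 0): {"PD_ALL_VISIBLE": True},
        ("vm", 0):   {"all_visible_bit": True}}

# UPDATE clears both, but the heap WAL record registers only the heap block:
current = {("heap", 0): {"PD_ALL_VISIBLE": False},
           ("vm", 0):   {"all_visible_bit": False}}
wal_summary = {("heap", 0)}             # VM block missing from the summary

# The incremental backup ships only blocks the summary says changed:
incremental = {blk: current[blk] for blk in wal_summary}

# pg_combinebackup-style reconstruction: incremental blocks win, else full:
combined = {**full, **incremental}

assert combined[("heap", 0)]["PD_ALL_VISIBLE"] is False   # correct
assert combined[("vm", 0)]["all_visible_bit"] is True     # stale: forbidden
```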
The Original Design Rationale (and why it failed)
Robert Haas, who wrote the crash-safe VM in 2011 (commit 503c7305a1e), reconstructed his reasoning: clearing a VM bit is idempotent and doesn't depend on prior page state, so replaying the heap record's "clear VM bit" side-effect without an FPI seemed adequate. Setting a bit required a dedicated WAL record (XLOG_HEAP2_VISIBLE) because there was nothing else to piggyback on. This reasoning was never documented in a comment — a significant lesson about capturing non-obvious invariants.
Robert's reasoning was correct for its time but brittle against future features that depend on block-level WAL granularity:
- Checksums require torn-page protection, hence FPIs for any modification.
- Incremental backup requires the WAL to faithfully enumerate every modified block.
His recommendation: do not make the fix conditional on wal_level, checksums, or other settings. Always register the VM buffer. This avoids a matrix of behavioral variants that would make the code fragile.
A Second, Related Bug: visibilitymap_prepare_truncate
While Andres was building a fix, he added assertions disallowing non-FSM reads by the startup process and discovered that visibilitymap_prepare_truncate() violates WAL-logging discipline in a different way. It:
- WAL-logs its changes separately from the later XLOG_SMGR_TRUNCATE (different critical sections, different records).
- Only logs conditionally on XLogHintBitIsNeeded().
- Does not set page LSN.
- Runs during recovery, re-dirtying the buffer.
Andres posited a torn-page scenario where a standby replays an FPI, then later re-runs prepare_truncate during XLOG_SMGR_TRUNCATE replay, dirties the VM page, crashes mid-write, and finds a torn VM page with a stale checksum on restart. vm_readbuf()'s ZERO_ON_ERROR provides a grubby safety net, but the design is clearly wrong.
Robert's response is categorical and worth quoting in spirit: functions named prepare_* must only do things that can truly be done ahead of the operation; the actual state change must happen in a single critical section with all buffer locks held together with the WAL insertion. The current code hopes that two separate critical sections' changes will be atomic with respect to crashes — which can never work. He identifies a concrete corruption path: if the truncate WAL record fails to replay but the VM changes persist, relation re-extension can produce pages where !PD_ALL_VISIBLE but the VM bit is set — the forbidden state. He attributes the bug to over-abstraction: too much was hidden behind the visibilitymap_* API, where callers couldn't see the WAL implications.
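Why two separate critical sections can never be crash-atomic is easy to demonstrate mechanically. The sketch below (plain Python, illustrative state only) enumerates every possible crash point and checks the invariant "VM bit set implies heap page all-visible" at each one: the split design has a crash point that leaves the forbidden state, the single-critical-section design does not.

```python
# Sketch of Robert's point: changes split across two atomic units admit a
# crash point that violates the invariant; one combined unit does not.
# State and step functions are illustrative, not PostgreSQL structures.

def forbidden(state):
    # The forbidden state: VM bit set while the heap page is not all-visible.
    return state["vm_bit"] and not state["heap_all_visible"]

def crash_points(steps):
    """Replay the atomic steps, checking the invariant after every
    possible crash point (including before the first and after the last)."""
    bad = []
    for crash_after in range(len(steps) + 1):
        state = {"heap_all_visible": True, "vm_bit": True}  # consistent start
        for step in steps[:crash_after]:
            step(state)
        if forbidden(state):
            bad.append(crash_after)
    return bad

# Split design (prepare_truncate-style): heap change and VM clear are
# logged and applied as two separate units.
split = [lambda s: s.update(heap_all_visible=False),   # unit 1
         lambda s: s.update(vm_bit=False)]             # unit 2

# Single-critical-section design: both changes in one atomic unit.
combined = [lambda s: s.update(heap_all_visible=False, vm_bit=False)]

assert crash_points(split) == [1]     # crash between the units -> forbidden
assert crash_points(combined) == []   # no crash point violates the invariant
```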
The Patch (Melanie's v1, Apr 30 2026)
Melanie produced two patch sets: one for master/19 and one for 18 (backpatched to 17, the oldest branch affected because that's when incremental backup landed). Key elements:
- Register VM buffers in heap_{insert,update,delete,multi_insert} WAL records. For heap_update, new block-reference constants (HEAP_UPDATE_BLKREF_VM_OLD, HEAP_UPDATE_BLKREF_VM_NEW) distinguish the old-page VM buffer from the new-page VM buffer when they differ. Andres suggested extending these symbolic constants to all the heap operations for readability.
- Backpatch compatibility: The PG18 redo routines can read both the old and new WAL record formats, so minor-version upgrades don't break WAL replay. Master does not need this.
- Set page LSN on the VM page and perform visibilitymap_clear_locked() with the VM buffer registered.
- Defensive additions on master: a verify_heapam check that flags PD_ALL_VISIBLE clear with the VM bit set.
- Stop masking PD_ALL_VISIBLE in wal_consistency_checking (so it actually compares).
- A validator (from Andres' branch) that ensures every block registered in a WAL record was actually read during replay — catching future bugs where redo routines forget to call XLogReadBufferForRedoExtended() on a registered block.
- TAP test using the incremental-backup reproducer across several scenarios.
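The validator idea reduces to a simple cross-check, sketched here in Python with illustrative structures (the real implementation works against xlogreader state and redo routines): record which blocks a redo routine reads, then diff against what the record registered.

```python
# Toy version of the replay-time validator: every block a record registers
# must be read by the redo routine (via XLogReadBufferForRedoExtended in
# the real code). All names and structures here are illustrative.

def validate_redo(registered_blocks, redo_routine):
    """Run redo, then report registered blocks the routine never read."""
    read = set()
    redo_routine(lambda blk: read.add(blk))   # redo reads via this callback
    return registered_blocks - read           # unread registrations = bug

# A correct redo routine reads both the heap and the VM block it registered:
good_missing = validate_redo({"heap:7", "vm:0"},
                             lambda read: (read("heap:7"), read("vm:0")))

# A buggy redo routine forgets the VM block:
bad_missing = validate_redo({"heap:7", "vm:0"},
                            lambda read: read("heap:7"))

assert good_missing == set()
assert bad_missing == {"vm:0"}        # validator flags the forgotten block
```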
The Deadlock Andres Spotted
Andres found a latent undetected deadlock in the updated heap_update path. Consider two backends updating tuples that cross-migrate between heap pages covered by different VM pages:
B1: locks HA1, HB1 (ascending heap order, fine)
B2: locks HA2, HB2 (ascending heap order, fine)
B1: locks VM1 (covers HA1), then wants VM2 (for HB1) → waits
B2: locks VM2 (covers HB2), then wants VM1 (for HA2) → deadlock
The heap lock ordering is preserved but the derived VM lock ordering isn't, because multiple heap pages map to each VM page and the ordering can invert. The fix is to acquire VM buffer content locks in ascending VM block-number order, decoupled from the heap lock acquisition order. Only the lock acquisition needs reordering; the actual bit-setting/clearing can remain tied to each heap page.
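The inversion and the fix can be shown concretely (a Python sketch with an illustrative heap-to-VM mapping constant; the real ratio is far larger, and the cycle check below is a simple two-backend test, not a general deadlock detector):

```python
# Sketch of the VM lock-ordering fix: acquire VM page content locks in
# ascending VM block-number order, decoupled from heap acquisition order.

HEAPBLOCKS_PER_VM_PAGE = 4        # illustrative; far larger in PostgreSQL

def vm_block(heap_blk):
    # Many heap pages map onto each VM page, so ordering can invert.
    return heap_blk // HEAPBLOCKS_PER_VM_PAGE

def can_deadlock(order_a, order_b):
    """True if two 2-lock acquisition orders form a cycle: each backend
    holds the lock the other wants next."""
    (a1, a2), (b1, b2) = order_a, order_b
    return a1 == b2 and a2 == b1 and a1 != a2

# Shape of the thread's scenario: B1's code path derives VM order (0, 1)
# from its heap pages, while B2's path derives (1, 0).
b1_vm = (vm_block(2), vm_block(5))     # -> (0, 1)
b2_vm = (vm_block(6), vm_block(3))     # -> (1, 0)
assert can_deadlock(b1_vm, b2_vm)      # inverted derived order: deadlock

# The fix: sort by VM block number before acquiring content locks.
assert not can_deadlock(tuple(sorted(b1_vm)), tuple(sorted(b2_vm)))
```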
Andres also noted an orthogonal mini-bug: when newtupsize > pagefree triggers VM-clearing, the code clears "all frozen" even though the page is about to be updated and all visible will be cleared anyway — "inexplicably... WHYYY!?!"
Broader Infrastructure Gaps
The thread turns into a minor design discussion about detecting these classes of bugs:
- WAL consistency checking masks too aggressively (including PD_ALL_VISIBLE) and runs immediately after the redo routine, so it cannot catch bugs spanning multiple WAL records (like the prepare_truncate split-brain, or checkpoint-straddling invariants).
- No detection for "page dirtied without WAL logging": Andres proposed a buffer-descriptor bit to track whether non-hint-bit dirty was done, checked at unlock time.
- No post-recovery primary/replica comparator using the masking infrastructure — a clear gap.
- FSM is not WAL-logged at all, making it impossible to assert "every checksum failure is a bug." Tomas suggested this increasingly looks like a bad design choice; Andres floated treating the FSM like hint bits: WAL-log the first modification per checkpoint when checksums/hint-logging is on.
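The "dirtied without WAL logging" check in the list above is mechanically simple; a hedged Python sketch (the real proposal stores a bit in the buffer descriptor and checks it when the content lock is released; everything here is illustrative):

```python
# Sketch of the proposed check: track whether a buffer was dirtied beyond
# hint bits, and verify at unlock time that some WAL record registered it.
# Illustrative model only, not PostgreSQL's BufferDesc.

class BufferDesc:
    def __init__(self, blkno):
        self.blkno = blkno
        self.dirtied_non_hint = False
        self.wal_registered = False

    def mark_dirty(self, hint_bit_only=False):
        if not hint_bit_only:
            self.dirtied_non_hint = True

    def register_in_wal(self):
        self.wal_registered = True

    def unlock_check(self):
        """At unlock: a non-hint-bit dirtying must have been registered."""
        return not (self.dirtied_non_hint and not self.wal_registered)

vm = BufferDesc(0)
vm.mark_dirty()               # visibilitymap_clear() dirtied the page...
assert not vm.unlock_check()  # ...but nothing registered it: check fires

heap = BufferDesc(7)
heap.mark_dirty()
heap.register_in_wal()
assert heap.unlock_check()    # registered modification passes
```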
Tomas' suggestion of an offline WAL-reading verification tool runs into the fundamental issue Andres raises: such a tool can't detect missing block references without baking in knowledge of every WAL record's intended semantics (e.g., "XLH_INSERT_ALL_VISIBLE_CLEARED implies a VM block ref must exist"). The bug is the discrepancy between intended semantics and encoded block refs.
Tradeoffs Explicitly Considered
- Always emit FPI for VM vs. conditional: Robert's "always" position won. VM pages are few relative to heap, so the FPI cost is negligible outside contrived workloads.
- Leaving older branches (16 and earlier) unfixed: Andres explicitly flagged this as an arguable corruption case but defended the call — the changes are non-trivial, inter-branch deltas are large, and without incremental backup the realistic exposure is limited (pg_rewind copies VM entirely).
- Standard page format for VM/FSM: Matthias noted that because VM doesn't use the standard page format, every FPI is a full 8KB even for nearly-empty VM pages — a minor cost but relevant.
- Test log verbosity: pg_combinebackup --debug produces ~5000 lines. Consensus: route debug output to a side file (like server logs) rather than inline in regress_log.
Architectural Lessons
- WAL-logging rules are not negotiable abstractions: every buffer modified in a critical section must be registered. Piggybacking on another record's effects is an optimization that breaks whenever a downstream consumer (checksums, summarizer) needs block-level granularity.
- Invariants must be documented: Robert's 2011 reasoning was sound for 2011 but silent in the code, so no one revisited it when the premises changed.
- Abstraction hazard: hiding WAL semantics behind visibilitymap_* helpers made the callers non-obviously incorrect. prepare_* names encouraged a split-critical-section design.
- Feature interactions compound silently: the checksum hazard was latent for a decade; incremental backup is what made the bug loud. Features that depend on WAL faithfully enumerating block changes surface WAL-logging imperfections the hard way.