Why clearing the VM doesn't require registering VM buffer in WAL record

First seen: 2026-03-05 19:56:13+00:00 · Messages: 22 · Participants: 6

Latest Update

2026-05-07 · opus 4.7

New in this round (Melanie's v2, May 6 2026)

Melanie posted v2 patch sets (for PG18 and PG19/master) responding to Andres' review. Most of the message is "done, done, done" on minor review points, but several items have genuine technical substance:

1. A newly-identified pre-existing bug in heap_lock_updated_tuple_rec()

LLM-assisted self-review flagged that cleared_all_frozen is never reset across iterations in this function. If the loop iterates over multiple tuple versions, or re-enters via the l4 goto (retry for the same tuple version), a stale true value can leak forward. Post-patch consequence: the VM buffer gets unnecessarily registered (and all_frozen unnecessarily cleared on the standby) for a page where nothing actually changed. She couldn't construct a reproducer and didn't include a fix, soliciting confirmation. She also asks whether ZERO_AND_LOCK on the VM read masks this, whether other code paths rely purely on registered buffers, and whether pre-17 branches are even reachable.

2. A second under-fixed call site

A separate location was emitting two WAL records (one for the heap page, one for the VM update) instead of a single record covering both. v2 consolidates these into one WAL record with both buffers registered. This is the same class of bug as the original — splitting a logical change across two WAL records — and an important catch: it means the v1 fix was incomplete in the same way the original code was wrong.

3. Proposed restructuring: take the VM buffer content lock before the critical section

Melanie is considering moving LockBuffer(vmbuffer, BUFFER_LOCK_EXCLUSIVE) out of the critical section in heap_insert (and analogous paths). The rewrite sketch: decide clear_all_visible and take the VM lock first, then START_CRIT_SECTION(), then perform the heap insert and VM clear. This would naturally fit the "acquire both locks before modifying either page" discipline that Andres alluded to (and which Melanie notes isn't documented anywhere she could find). Not yet committed to — she's thinking out loud.

4. Deliberate rejection of Andres' suggested replay structure

Andres apparently suggested a different organization of the redo routine (ordering the single-VM-page vs two-VM-page branches differently). Melanie pushed back: she prefers the current structure because (a) each case is self-contained and the reader can see what's registered without cross-referencing if branches, and (b) it mirrors the normal-operation structure, where the single-vs-two-VM-page cases diverge more sharply due to the VM-block-order locking rule. This is a minor but explicit design-choice disagreement.

5. PG19 API simplification

On master/19, instead of adding a visibilitymap_clear_locked() sibling to visibilitymap_clear(), she changed visibilitymap_clear()'s signature to take a RelFileLocator instead of a Relation (matching visibilitymap_set() on 19). On 18, visibilitymap_clear() is preserved for backwards compatibility, with visibilitymap_clear_locked() added alongside.

6. Open question: where to set the VM page LSN on 19

Two options: have visibilitymap_clear() accept an LSN parameter, or require callers to call PageSetLSN() themselves. Andres' earlier sketch used a separate visibilitymap_clear_redo() that took an LSN. Melanie finds a wrapper-that-just-sets-LSN "silly" and is leaning toward callers doing it. Awaiting input.

7. Open question on 18: reuse visibilitymap_clear() for corruption-repair in vacuumlazy?

In 18, visibilitymap_clear() is documented as kept "for backwards compatibility," but Melanie is using it in vacuumlazy.c's path that clears a corrupted VM bit without WAL logging. She asks whether the "backwards compatibility only" label is strict or advisory.

8. Minor confirmations

Nothing in this round changes the overall direction or the architectural conclusions; it's one review-response iteration, but with three non-trivial new items (the heap_lock_updated_tuple_rec bug, the second broken call site, and the lock-before-CRIT restructuring proposal).

History (1 prior analysis)
2026-05-06 · opus 4.7

Why clearing the VM doesn't require registering vm buffer in WAL record

The Core Bug

This thread uncovers a long-standing but only-recently-consequential correctness bug in how PostgreSQL WAL-logs visibility map (VM) modifications during heap_insert, heap_update, and heap_delete. When these operations need to clear VM bits (because a heap page is being modified and is no longer all-visible/all-frozen), they call visibilitymap_clear() on the VM buffer but do not register the VM buffer in the WAL record produced by the heap operation.

This violates the standard PostgreSQL WAL contract, which is roughly: any buffer modified within a critical section must be registered with the WAL record so that (a) an FPI can be emitted when needed for torn-page protection, and (b) WAL-consuming tools (summarizer, incremental backup, pg_rewind) know the block changed.

The consequences identified during the thread are a layered set of failures, each introduced by a subsequent feature:

  1. Checksums (9.3, commit 96ef3b8ff1c, 2013) — Without an FPI, a crash during a torn write of the VM page leaves an unrecoverable checksum failure, because the bit-clearing change is never replayed (it piggybacks on the heap record, which doesn't reference the VM block).
  2. Revamped WAL format (9.5, commit 2c03216d831, 2014) — The new block-reference scheme made the omission more structurally visible; it should have been fixed then.
  3. Incremental backup / WAL summarizer (17) — This is where the bug becomes a silent data corruption vector, because the WAL summarizer only tracks blocks explicitly referenced in WAL. After a clear, an incremental backup will contain the old VM page with stale all-visible bits set, while the heap page correctly shows !PD_ALL_VISIBLE. Andres produced a minimal reproducer (INSERT+VACUUM, full backup, UPDATE, incremental, combine → inconsistent VM). verify_heapam() did not detect it; only the next VACUUM would notice via "page is not marked all-visible but visibility map bit is set".

pg_rewind is saved only by accident — it doesn't parse WAL for non-main forks and copies VM wholesale. Third-party tools (pg_probackup, WAL-G) have known about this for years and worked around it by copying VM wholesale, as Yura Sokolov confirmed.

The Original Design Rationale (and why it failed)

Robert Haas, who wrote the crash-safe VM in 2011 (commit 503c7305a1e), reconstructed his reasoning: clearing a VM bit is idempotent and doesn't depend on prior page state, so replaying the heap record's "clear VM bit" side-effect without an FPI seemed adequate. Setting a bit required a dedicated WAL record (XLOG_HEAP2_VISIBLE) because there was nothing else to piggyback on. This reasoning was never documented in a comment — a significant lesson about capturing non-obvious invariants.

Robert's reasoning was correct for its time but brittle against future features that depend on block-level WAL granularity:

  • Checksums require torn-page protection, hence FPIs for any modification.
  • Incremental backup requires the WAL to faithfully enumerate every modified block.

His recommendation: do not make the fix conditional on wal_level, checksums, or other settings. Always register the VM buffer. This avoids a matrix of behavioral variants that would make the code fragile.

A Second, Related Bug: visibilitymap_prepare_truncate

While Andres was building a fix, he added assertions disallowing non-FSM reads by the startup process and discovered that visibilitymap_prepare_truncate() violates WAL-logging discipline in a different way. It:

  • WAL-logs its changes separately from the later XLOG_SMGR_TRUNCATE (different critical sections, different records).
  • Only logs conditionally on XLogHintBitIsNeeded().
  • Does not set page LSN.
  • Runs during recovery, re-dirtying the buffer.

Andres posited a torn-page scenario where a standby replays an FPI, then later re-runs prepare_truncate during XLOG_SMGR_TRUNCATE replay, dirties the VM page, crashes mid-write, and finds a torn VM page with a stale checksum on restart. vm_readbuf()'s ZERO_ON_ERROR provides a grubby safety net, but the design is clearly wrong.

Robert's response is categorical and worth quoting in spirit: functions named prepare_* must only do things that can truly be done ahead of the operation; the actual state change must happen in a single critical section with all buffer locks held together with the WAL insertion. The current code hopes that two separate critical sections' changes will be atomic with respect to crashes — which can never work. He identifies a concrete corruption path: if the truncate WAL record fails to replay but the VM changes persist, relation re-extension can produce pages where !PD_ALL_VISIBLE but the VM bit is set — the forbidden state. He attributes the bug to over-abstraction: too much was hidden behind the visibilitymap_* API, where callers couldn't see the WAL implications.

The Patch (Melanie's v1, Apr 30 2026)

Melanie produced two patch sets: one for master/19 and one for 18 (backpatched to 17, the oldest branch affected because that's when incremental backup landed). Key elements:

  • Register VM buffers in heap_{insert,update,delete,multi_insert} WAL records. For heap_update, new block-reference constants (HEAP_UPDATE_BLKREF_VM_OLD, HEAP_UPDATE_BLKREF_VM_NEW) distinguish the old-page VM buffer from the new-page VM buffer when they differ. Andres suggested extending these symbolic constants to all the heap operations for readability.
  • Backpatch compatibility: The PG18 redo routines can read both the old and new WAL record formats, so minor-version upgrades don't break WAL replay. Master does not need this.
  • Set page LSN on the VM page and perform visibilitymap_clear_locked() with the VM buffer registered.
  • Defensive additions on master:
    • verify_heapam check that PD_ALL_VISIBLE clear + VM bit set is flagged.
    • Stop masking PD_ALL_VISIBLE in wal_consistency_checking (so it actually compares).
    • A validator (from Andres' branch) that ensures every block registered in a WAL record was actually read during replay — catching future bugs where redo routines forget to call XLogReadBufferForRedoExtended() on a registered block.
  • TAP test using the incremental-backup reproducer across several scenarios.

The Deadlock Andres Spotted

Andres found a latent undetected deadlock in the updated heap_update path. Consider two backends updating tuples that cross-migrate between heap pages covered by different VM pages:

B1: locks HA1, HB1 (ascending heap order, fine)
B2: locks HA2, HB2 (ascending heap order, fine)
B1: locks VM1 (covers HA1), then wants VM2 (for HB1) → waits
B2: locks VM2 (covers HB2), then wants VM1 (for HA2) → deadlock

The heap lock ordering is preserved but the derived VM lock ordering isn't, because multiple heap pages map to each VM page and the ordering can invert. The fix is to acquire VM buffer content locks in ascending VM block-number order, decoupled from the heap lock acquisition order. Only the lock acquisition needs reordering; the actual bit-setting/clearing can remain tied to each heap page.

Andres also noted an orthogonal mini-bug: when newtupsize > pagefree triggers VM-clearing, the code clears "all frozen" even though the page is about to be updated and all visible will be cleared anyway — "inexplicably... WHYYY!?!"

Broader Infrastructure Gaps

The thread turns into a minor design discussion about detecting these classes of bugs:

  • WAL consistency checking masks too aggressively (including PD_ALL_VISIBLE) and runs immediately after the redo routine, so it cannot catch bugs spanning multiple WAL records (like the prepare_truncate split-brain, or checkpoint-straddling invariants).
  • No detection for "page dirtied without WAL logging": Andres proposed a buffer-descriptor bit to track whether non-hint-bit dirty was done, checked at unlock time.
  • No post-recovery primary/replica comparator using the masking infrastructure — a clear gap.
  • FSM is not WAL-logged at all, making it impossible to assert "every checksum failure is a bug." Tomas suggested this increasingly looks like a bad design choice; Andres floated treating the FSM like hint bits: WAL-log the first modification per checkpoint when checksums/hint-logging is on.

Tomas' suggestion of an offline WAL-reading verification tool runs into the fundamental issue Andres raises: such a tool can't detect missing block references without baking in knowledge of every WAL record's intended semantics (e.g., "XLH_INSERT_ALL_VISIBLE_CLEARED implies a VM block ref must exist"). The bug is the discrepancy between intended semantics and encoded block refs.

Tradeoffs Explicitly Considered

  • Always emit FPI for VM vs. conditional: Robert's "always" position won. VM pages are few relative to heap, so the FPI cost is negligible outside contrived workloads.
  • Leaving older branches (16 and earlier) unfixed: Andres explicitly flagged this as an arguable corruption case but defended the call — the changes are non-trivial, inter-branch deltas are large, and without incremental backup the realistic exposure is limited (pg_rewind copies VM entirely).
  • Standard page format for VM/FSM: Matthias noted that because VM doesn't use the standard page format, every FPI is a full 8KB even for nearly-empty VM pages — a minor cost but relevant.
  • Test log verbosity: pg_combinebackup --debug produces ~5000 lines. Consensus: route debug output to a side file (like server logs) rather than inline in regress_log.

Architectural Lessons

  1. WAL-logging rules are not negotiable abstractions: every buffer modified in a critical section must be registered. Piggybacking on another record's effects is an optimization that breaks whenever a downstream consumer (checksums, summarizer) needs block-level granularity.
  2. Invariants must be documented: Robert's 2011 reasoning was sound for 2011 but silent in the code, so no one revisited it when the premises changed.
  3. Abstraction hazard: hiding WAL semantics behind visibilitymap_* helpers made the callers non-obviously incorrect. prepare_* names encouraged a split-critical-section design.
  4. Feature interactions compound silently: the checksum hazard was latent for a decade; incremental backup is what made the bug loud. Features that depend on WAL faithfully enumerating block changes surface WAL-logging imperfections the hard way.