[PATCH] pg_surgery: Fix WAL corruption from concurrent heap_force_kill

First seen: 2026-04-20 20:20:39+00:00 · Messages: 5 · Participants: 3

Latest Update

2026-05-07 · opus 4.7

New Development: Scope of the Bug Expands Significantly

Two new messages, and the second one materially changes the shape of this thread.

Fabrízio posts v2 (brief)

Fabrízio acknowledges Zsolt's review, confirms the if-split was a leftover, and posts v2 folding the injection point inside the RelationNeedsWAL branch and adopting the pg_stat_activity poll-loop for test synchronization. He also hedges on whether the test will ultimately be committed ("TBH, I'm not sure if the test will be pushed with the fix"), suggesting he anticipates committer pushback on the xloginsert.c injection point — matching the concern flagged in the previous analysis.

Melanie Plageman: the fix is insufficient

Melanie raises a substantially more serious objection that reframes the whole patch. Her argument has three layers:

  1. Agrees with the immediate fix — the VM buffer lock must be held across WAL emission, not just across visibilitymap_clear. This confirms the v2 patch's direction.

  2. But the real bug is deeper: atomicity across heap + VM. heap_force_common() currently emits two separate WAL records — one for the heap page (which clears PD_ALL_VISIBLE) and one FPI for the VM page (which clears the VM bit). On crash recovery, a replay that stops between these two records leaves the database in a forbidden state: PD_ALL_VISIBLE cleared on the heap page while the VM bit is still set. The on-disk invariant is "VM bit set ⇒ PD_ALL_VISIBLE set"; the current code violates it during replay.

    The correct fix is to log both page modifications in a single WAL record while holding locks on both buffers simultaneously — making the change atomic from recovery's perspective.

  3. Cross-reference to a larger in-flight fix. Melanie points to her own thread [CAAKRu_Z_KAAZAHtuorifR2MRc_MkEcHf-C_t4b9HZaHpa3nriw] where she is systematically fixing VM-clear WAL logging across the code base, and where she already proposes the single-WAL-record pattern. The implication: pg_surgery's bug is an instance of a broader pattern, and the cleanest fix may be to adopt the helper/pattern from that thread rather than solve it locally here.
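The single-record pattern Melanie describes can be sketched with the standard xloginsert.c registration API. This is a hedged illustration only, not code from her patch: the record type, flag choices, and variable names here are assumptions.

```c
/*
 * Sketch only (not from Melanie's patch): log the heap-page change and the
 * VM-page change as ONE WAL record, with exclusive content locks held on
 * both buffers across the whole XLogInsert() window.  Recovery then either
 * replays both modifications or neither, preserving the invariant
 * "VM bit set => PD_ALL_VISIBLE set".
 */
START_CRIT_SECTION();

PageClearAllVisible(BufferGetPage(heap_buf));
MarkBufferDirty(heap_buf);
/* ... clear the bit on the VM page while it is still locked ... */

XLogBeginInsert();
XLogRegisterBuffer(0, heap_buf, REGBUF_STANDARD);
XLogRegisterBuffer(1, vm_buf, REGBUF_FORCE_IMAGE);   /* FPI of the VM page */
recptr = XLogInsert(RM_HEAP2_ID, XLOG_HEAP2_VISIBLE /* illustrative */);

PageSetLSN(BufferGetPage(heap_buf), recptr);
PageSetLSN(BufferGetPage(vm_buf), recptr);

END_CRIT_SECTION();
/* Only now release the two content locks. */
```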

Why this matters

This is a scope escalation, not just a review nit.

If Melanie is right (and her analysis is consistent with the well-established PD_ALL_VISIBLE/VM invariant), then Fabrízio's v2 turns a visible, noisy corruption (a bad CRC detected by pg_walinspect) into a subtler one: torn heap/VM state after crash recovery, potentially causing index-only scans to return wrong results for heap tuples whose visibility the VM misrepresents. That is arguably worse than the original bug, because it is silent.

Likely path forward

Fabrízio now has three options:

  1. Wait for / coordinate with Melanie's thread. Her patch introduces infrastructure for combined heap+VM WAL records; pg_surgery can be a consumer.
  2. Restructure v3 locally to hold both locks and emit one combined WAL record, duplicating some of Melanie's work.
  3. Land the narrow CRC fix now (as a strict improvement) and file the atomicity bug as a follow-up dependent on Melanie's work.

Option 1 is the cleanest but introduces a cross-thread dependency. Option 3 is pragmatic for back-patching, since Melanie's broader fix is unlikely to be back-patchable to all supported branches, while a minimal lock-holding fix could be.

The thread has moved from "small targeted bug fix in a contrib module" to "instance of a class of bugs being addressed thread-wide" — a significant change in character.

History (1 prior analysis)
2026-05-06 · opus 4.7

pg_surgery WAL Corruption via Concurrent heap_force_kill on Shared VM Pages

The Core Problem: A TOCTOU Race Inside XLogInsert

This thread addresses a subtle but genuinely dangerous bug in pg_surgery's heap_force_kill() that can produce WAL records whose CRC does not match the payload actually written to WAL. The consequence is downstream corruption: pg_walinspect reports "incorrect resource manager data checksum", and standbys / PITR replay will reject or misinterpret the record.

The root cause is an interaction between three independently-correct pieces of code:

  1. visibilitymap_clear() follows a documented contract of releasing the exclusive content lock on the VM buffer before returning. Callers that want to keep operating on the VM page must re-acquire the lock themselves.
  2. heap_force_kill() in contrib/pg_surgery/heap_surgery.c clears VM bits via visibilitymap_clear() and then, later in the same critical section, issues log_newpage_buffer(vmbuf, false) to log a full-page image of the VM page. Between those two calls it holds no content lock on vmbuf — only a pin.
  3. XLogInsert() in xloginsert.c reads registered buffer pages twice via the XLogRecData chain: once in XLogRecordAssemble() to compute the CRC, and a second time in CopyXLogRecordToWAL() to copy the bytes into WAL buffers. The standard convention is that the caller holds an exclusive content lock on any buffer registered for FPI, guaranteeing the two reads see identical bytes.

When two concurrent heap_force_kill sessions operate on different heap pages that happen to share the same VM page (VM pages cover ~32k heap pages each, so this is easy to hit), session A can compute its CRC against VM page bytes that session B then mutates (via its own visibilitymap_clear) before session A's CopyXLogRecordToWAL runs. The FPI on disk is self-inconsistent: its CRC covers a pre-modification image, its bytes are post-modification.

This is a textbook TOCTOU, but notable because the "time of check" and "time of use" are both inside a single XLogInsert call. The WAL machinery's correctness depends on an invariant (buffer content stable across assemble→copy) that pg_surgery silently violates.

Why It Matters Architecturally

  • VM pages are uniquely vulnerable to this class of bug because they are densely shared: one VM page describes thousands of heap pages, so independent heap-level operations routinely contend on the same VM buffer.
  • The bug is silent at insert time. Nothing in the inserting backend fails; the corruption is only observed on replay or by tools that verify CRCs. That means it could have been shipping undetected for a long time in environments that heavily use pg_surgery (typically disaster-recovery scenarios — exactly where WAL integrity matters most).
  • It reinforces a general rule that has caught other callers before: any buffer registered with XLogRegisterBuffer() (or passed to log_newpage_buffer) must be held under exclusive content lock for the full duration of XLogInsert, not merely pinned. The visibilitymap_clear + log_newpage_buffer idiom naturally tempts callers to forget this because the clearing routine helpfully drops the lock for them.

The Fix

The patch re-acquires BUFFER_LOCK_EXCLUSIVE on vmbuf immediately after visibilitymap_clear() returns, and releases it after log_newpage_buffer(). The lock is acquired only when RelationNeedsWAL(rel) is true, since unlogged relations skip the FPI entirely and there is nothing to protect.

Structurally this restores the standard "locked for the whole WAL-emission window" invariant. It is a minimal, local change to heap_surgery.c; it does not alter the visibilitymap_clear API (which would affect many more callers and is not at fault).
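Condensed into one spot, the fix described above has roughly this shape (a sketch assuming the variable names from the patch excerpt below; the actual patch re-acquires the lock immediately after visibilitymap_clear() returns):

```c
/* Sketch of the fix's locking pattern, not the verbatim patch. */
if (did_modify_vm && RelationNeedsWAL(rel))
{
    /*
     * visibilitymap_clear() dropped the exclusive content lock on vmbuf
     * per its documented contract.  Re-acquire it so the page cannot
     * change between CRC computation and payload copy inside
     * XLogInsert(); unlogged relations skip the FPI, so no lock is
     * needed there.
     */
    LockBuffer(vmbuf, BUFFER_LOCK_EXCLUSIVE);
    log_newpage_buffer(vmbuf, false);
    LockBuffer(vmbuf, BUFFER_LOCK_UNLOCK);
}
```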

Test Infrastructure: Three Injection Points

Reproducing the race deterministically requires pausing session 1 inside XLogInsert, between CRC computation and payload copy — a window normally measured in microseconds. The patch introduces three injection points:

  • heap-force-kill-vm-pin — outside the critical section, used to set up DSM-backed wait state before the code path where palloc etc. are forbidden.
  • heap-force-kill-before-vm-wal — inside the critical section between the heap FPI and VM FPI writes, acts as a rendezvous.
  • wal-insert-after-crc — in xloginsert.c itself, between XLogRecordAssemble and XLogInsertRecord, the precise point needed to expose the bug.

The xloginsert.c injection point uses INJECTION_POINT_CACHED, meaning the point is resolved once by the caller outside the critical section and only fires when explicitly pre-loaded. This is important: a naive INJECTION_POINT call inside XLogInsert would risk lookups and allocations on the hottest path in the server. INJECTION_POINT_CACHED makes the addition essentially free when not armed, which is the bar any change to XLogInsert must clear.
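The load/fire split can be sketched as follows (hedged: the placement is inferred from the description above, and only the point names come from the patch):

```c
/* Sketch of the CACHED idiom as described; not the verbatim patch. */

/* In heap_surgery.c, before entering the critical section: resolve the
 * injection-point callback while palloc is still permitted.  A no-op
 * unless a test has attached to this point. */
INJECTION_POINT_LOAD("wal-insert-after-crc");

/* In xloginsert.c, between record assembly and insertion: fires only if
 * the point was pre-loaded above; otherwise it is a cheap check of a
 * backend-local cache, with no lookup or allocation on the hot path. */
INJECTION_POINT_CACHED("wal-insert-after-crc", NULL);
```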

Review Feedback (Zsolt Parragi, Percona)

Zsolt reproduced both the bug and the fix, and raised two substantive points:

  1. Redundant if split. The patch writes:

    if (did_modify_vm)
        INJECTION_POINT_CACHED("heap-force-kill-before-vm-wal", NULL);
    
    if (did_modify_vm && RelationNeedsWAL(rel))
    {
        log_newpage_buffer(vmbuf, false);
        LockBuffer(vmbuf, BUFFER_LOCK_UNLOCK);
    }
    

    The first if fires the injection point even on unlogged tables, where there is no subsequent WAL action to synchronize against. Zsolt suggests folding the injection point inside the RelationNeedsWAL branch. This is a legitimate cleanliness/correctness concern: firing the rendezvous without a corresponding WAL write could mislead tests or desynchronize the barrier.

  2. Test stability. The original test uses a fixed usleep(500_000) to give session 2 time to reach the VM buffer lock. Zsolt correctly flags this as CI-unstable and proposes a poll loop against pg_stat_activity, waiting until the session is either observed in Buffer/BufferExclusive wait (fixed build, blocked on the lock) or idle (broken build, finished without blocking). This is both faster in the common case (~1.55s vs ~2s) and robust against slow CI runners. This pattern — poll pg_stat_activity.wait_event with a deadline — is the established idiom in PostgreSQL's TAP tests for synchronizing on lock acquisition, so the suggestion aligns with project conventions.

Design Tradeoffs and Open Questions

  • Scope of the fix. The patch fixes only heap_force_kill. heap_force_freeze in the same file has similar structure and likely needs the same treatment; whether this patch should cover both or be split is not yet discussed.
  • Should visibilitymap_clear change its contract? Keeping the lock on return would prevent this entire class of bug but would require auditing every caller (vacuum, heap_update, etc.) that currently relies on the drop. The patch wisely does not attempt this.
  • Injection point in xloginsert.c. Adding any instrumentation to XLogInsert invites scrutiny. The INJECTION_POINT_CACHED variant is the right tool, but committers may still push back on whether the test is worth a permanent injection point in a hot core path, or whether the bug fix should land without the deterministic test (relying on code review + a probabilistic test).

Assessment

The bug is real, the diagnosis is precise, and the fix is minimal and correct. The review points from Zsolt are both valid and easy to address in a v3. The most likely committer-level concerns going forward will be (a) whether heap_force_freeze needs the same fix, (b) whether the xloginsert.c injection point is acceptable in the tree, and (c) back-patching scope — this is a data-integrity bug in a contrib module present since pg_surgery was introduced, so back-patching to all supported branches is warranted.