[PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort

2026-05-06 · opus 4.7

Core Problem: Stale Wait State After Subtransaction Abort in WAIT FOR LSN

Background: WAIT FOR LSN Infrastructure

WAIT FOR LSN is a relatively recent SQL-level feature (implemented in xlogwait.c) that allows a session to block until WAL replay/flush has reached a specified LSN. It's particularly useful in replication and causality scenarios where a client needs to ensure a particular change is durable or visible on a replica before proceeding.

The implementation maintains a per-backend wait registration in shared memory, organized as a pairing heap (the "wait heap") keyed by target LSN. Each PGPROC-associated WaitLSNProcInfo entry carries an inHeap flag indicating whether this backend currently has an outstanding registration. The heap is consulted by the walreceiver/startup/flush paths to wake waiters whose target LSN has been reached.

The critical invariant is: a backend must never attempt to insert itself into the wait heap while inHeap == true. Violating this corrupts the heap (double-insertion into a pairing heap destroys its structure), hence the defensive Assert(!procInfo->inHeap) that fired in the reproducer.

The Bug: Asymmetric Cleanup Between Transaction and Subtransaction Abort

Cleanup of the wait-heap entry is handled by WaitLSNCleanup(), which was wired into:

Normal completion of the wait (the waiter removes itself).
AbortTransaction() — top-level abort path, after LWLockReleaseAll().
Backend exit (via on_shmem_exit or similar).

What was missing is the subtransaction abort path: AbortSubTransaction(). When an interrupt (e.g., statement_timeout, query cancel) fires while the backend is sleeping inside WaitForLSN() within a savepoint, control unwinds via ProcessInterrupts → ereport(ERROR, ...). If the error is caught by a subtransaction's error-recovery (i.e., the subtransaction is aborted but the top-level transaction survives, as happens with ROLLBACK TO SAVEPOINT), only AbortSubTransaction() runs — not AbortTransaction().

Consequence: the WaitLSNProcInfo.inHeap flag stays true, and the backend's node remains linked in the shared pairing heap even though the backend is no longer waiting. A subsequent WAIT FOR LSN in the same backend trips the assertion (debug builds) or corrupts the heap (release builds — potentially causing arbitrary misbehavior in the flush/replay wakeup path).

Reproducer Mechanics

The test case is carefully constructed:

Enter a subtransaction (SAVEPOINT s).
Set a short statement_timeout so the wait is interrupted by an error rather than completing normally.
WAIT FOR LSN '<future-lsn>' with a target far beyond current WAL — guarantees the backend actually enters the wait loop and registers in the heap before being interrupted.
ROLLBACK TO s — aborts the subtransaction; the backend does not exit and top-level AbortTransaction() is not called.
A second WAIT FOR LSN — trips Assert(!procInfo->inHeap).

The WITH (MODE 'primary_flush') option is relevant: it directs the wait to the flush-LSN path (WaitForWalFlush wait event) rather than the replay path, ensuring deterministic behavior on a primary.

The Fix

The patch is architecturally minimal and correct: add a WaitLSNCleanup() call inside AbortSubTransaction(), placed after LWLockReleaseAll(). Ordering matters because WaitLSNCleanup() itself takes the WaitLSNLock LWLock to unlink from the shared heap; if the backend were still holding some unrelated LWLock from the aborted operation, lock-ordering rules could be violated. This mirrors precisely the ordering already used in AbortTransaction().

This is the canonical "symmetric cleanup" bug pattern in PostgreSQL: any resource that has a per-backend registration in shared memory must be cleaned in both abort paths, because subtransaction abort is a strict subset of state reset that nevertheless must release resources scoped to a single statement. Similar patterns exist for LWLocks, buffer pins, catalog snapshots, etc.

Review Feedback (Xuneng Zhou)

Three substantive review points:

Comment accuracy: The existing header comment for WaitLSNCleanup() described it as cleanup "for exiting process." Post-patch this is false — the process is not exiting, it's merely transitioning out of a wait scope. Suggested rewording: "Clean up any LSN wait state for the current process." Alexander Korotkov agreed, noting "existing process" is confusing even in the current AbortTransaction() caller.
Deterministic test synchronization: The original test relied on statement_timeout = '100ms' firing while the waiter was registered. This is a race: the backend might be interrupted before it enters the heap, or it might not yet have reached the sleep. The standard PostgreSQL TAP-test idiom is to launch the waiter in a background psql session (BackgroundPsql) and poll pg_stat_activity until wait_event = 'WaitForWalFlush' appears for that PID. Only then is the backend guaranteed to be in the heap, so cancellation deterministically exercises the stale-registration path.
Test validation LSN: The patch originally validated recovery by issuing WAIT FOR LSN '0/0' (which returns immediately because LSN 0/0 is trivially reached). Zhou argued for re-using an unreachable LSN with a short TIMEOUT, so that the second wait actually performs heap insertion (the buggy code path). A fast-success 0/0 call might bypass heap registration entirely depending on the early-exit logic in WaitForLSN, thereby failing to prove the fix.

Point (3) is the most important review comment technically: it distinguishes between testing that the code doesn't crash versus testing that the specific code path containing the bug is exercised.

Decision Path

Alexander Korotkov, the original author/committer of the WAIT FOR LSN feature, concurred quickly and indicated he would push. Given that the fix is a one-liner plus test, follows existing symmetry (AbortTransaction → AbortSubTransaction), and the reviewer's refinements improve robustness without changing semantics, there was no design-level disagreement. The thread converged in under three hours.

Broader Implications

Audit opportunity: Any feature maintaining per-backend shared-memory state (similar to WaitLSNProcInfo) needs review for AbortSubTransaction coverage. Features added quickly often ship with AbortTransaction cleanup and miss the subtransaction path because developer testing rarely wraps the feature in a savepoint.
Assertion value: The Assert(!procInfo->inHeap) caught this. Without it, release builds would silently corrupt a shared pairing heap — a much nastier bug to diagnose, potentially manifesting as missed wake-ups or walreceiver hangs.
Test-harness lesson: Reliance on statement_timeout for "interrupt mid-operation" tests is fragile; the pg_stat_activity-polling pattern is now the accepted standard.