Core Problem: Stale Wait State After Subtransaction Abort in WAIT FOR LSN
Background: WAIT FOR LSN Infrastructure
WAIT FOR LSN is a relatively recent SQL-level feature (implemented in xlogwait.c) that allows a session to block until WAL replay/flush has reached a specified LSN. It's particularly useful in replication and causality scenarios where a client needs to ensure a particular change is durable or visible on a replica before proceeding.
The implementation maintains a per-backend wait registration in shared memory, organized as a pairing heap (the "wait heap") keyed by target LSN. Each PGPROC-associated WaitLSNProcInfo entry carries an inHeap flag indicating whether this backend currently has an outstanding registration. The heap is consulted by the walreceiver/startup/flush paths to wake waiters whose target LSN has been reached.
The critical invariant is: a backend must never attempt to insert itself into the wait heap while inHeap == true. Violating this corrupts the heap (double-insertion into a pairing heap destroys its structure), hence the defensive Assert(!procInfo->inHeap) that fired in the reproducer.
The Bug: Asymmetric Cleanup Between Transaction and Subtransaction Abort
Cleanup of the wait-heap entry is handled by WaitLSNCleanup(), which was wired into:
- Normal completion of the wait (the waiter removes itself).
AbortTransaction()— top-level abort path, afterLWLockReleaseAll().- Backend exit (via
on_shmem_exitor similar).
What was missing is the subtransaction abort path: AbortSubTransaction(). When an interrupt (e.g., statement_timeout, query cancel) fires while the backend is sleeping inside WaitForLSN() within a savepoint, control unwinds via ProcessInterrupts → ereport(ERROR, ...). If the error is caught by a subtransaction's error-recovery (i.e., the subtransaction is aborted but the top-level transaction survives, as happens with ROLLBACK TO SAVEPOINT), only AbortSubTransaction() runs — not AbortTransaction().
Consequence: the WaitLSNProcInfo.inHeap flag stays true, and the backend's node remains linked in the shared pairing heap even though the backend is no longer waiting. A subsequent WAIT FOR LSN in the same backend trips the assertion (debug builds) or corrupts the heap (release builds — potentially causing arbitrary misbehavior in the flush/replay wakeup path).
Reproducer Mechanics
The test case is carefully constructed:
- Enter a subtransaction (
SAVEPOINT s). - Set a short
statement_timeoutso the wait is interrupted by an error rather than completing normally. WAIT FOR LSN '<future-lsn>'with a target far beyond current WAL — guarantees the backend actually enters the wait loop and registers in the heap before being interrupted.ROLLBACK TO s— aborts the subtransaction; the backend does not exit and top-levelAbortTransaction()is not called.- A second
WAIT FOR LSN— tripsAssert(!procInfo->inHeap).
The WITH (MODE 'primary_flush') option is relevant: it directs the wait to the flush-LSN path (WaitForWalFlush wait event) rather than the replay path, ensuring deterministic behavior on a primary.
The Fix
The patch is architecturally minimal and correct: add a WaitLSNCleanup() call inside AbortSubTransaction(), placed after LWLockReleaseAll(). Ordering matters because WaitLSNCleanup() itself takes the WaitLSNLock LWLock to unlink from the shared heap; if the backend were still holding some unrelated LWLock from the aborted operation, lock-ordering rules could be violated. This mirrors precisely the ordering already used in AbortTransaction().
This is the canonical "symmetric cleanup" bug pattern in PostgreSQL: any resource that has a per-backend registration in shared memory must be cleaned in both abort paths, because subtransaction abort is a strict subset of state reset that nevertheless must release resources scoped to a single statement. Similar patterns exist for LWLocks, buffer pins, catalog snapshots, etc.
Review Feedback (Xuneng Zhou)
Three substantive review points:
-
Comment accuracy: The existing header comment for
WaitLSNCleanup()described it as cleanup "for exiting process." Post-patch this is false — the process is not exiting, it's merely transitioning out of a wait scope. Suggested rewording: "Clean up any LSN wait state for the current process." Alexander Korotkov agreed, noting "existing process" is confusing even in the currentAbortTransaction()caller. -
Deterministic test synchronization: The original test relied on
statement_timeout = '100ms'firing while the waiter was registered. This is a race: the backend might be interrupted before it enters the heap, or it might not yet have reached the sleep. The standard PostgreSQL TAP-test idiom is to launch the waiter in a backgroundpsqlsession (BackgroundPsql) and pollpg_stat_activityuntilwait_event = 'WaitForWalFlush'appears for that PID. Only then is the backend guaranteed to be in the heap, so cancellation deterministically exercises the stale-registration path. -
Test validation LSN: The patch originally validated recovery by issuing
WAIT FOR LSN '0/0'(which returns immediately because LSN 0/0 is trivially reached). Zhou argued for re-using an unreachable LSN with a shortTIMEOUT, so that the second wait actually performs heap insertion (the buggy code path). A fast-success0/0call might bypass heap registration entirely depending on the early-exit logic inWaitForLSN, thereby failing to prove the fix.
Point (3) is the most important review comment technically: it distinguishes between testing that the code doesn't crash versus testing that the specific code path containing the bug is exercised.
Decision Path
Alexander Korotkov, the original author/committer of the WAIT FOR LSN feature, concurred quickly and indicated he would push. Given that the fix is a one-liner plus test, follows existing symmetry (AbortTransaction → AbortSubTransaction), and the reviewer's refinements improve robustness without changing semantics, there was no design-level disagreement. The thread converged in under three hours.
Broader Implications
- Audit opportunity: Any feature maintaining per-backend shared-memory state (similar to
WaitLSNProcInfo) needs review forAbortSubTransactioncoverage. Features added quickly often ship withAbortTransactioncleanup and miss the subtransaction path because developer testing rarely wraps the feature in a savepoint. - Assertion value: The
Assert(!procInfo->inHeap)caught this. Without it, release builds would silently corrupt a shared pairing heap — a much nastier bug to diagnose, potentially manifesting as missed wake-ups or walreceiver hangs. - Test-harness lesson: Reliance on
statement_timeoutfor "interrupt mid-operation" tests is fragile; thepg_stat_activity-polling pattern is now the accepted standard.