[PATCH] Fix WAIT FOR LSN cleanup on subtransaction abort

First seen: 2026-05-06 07:18:31+00:00 · Messages: 4 · Participants: 3

Latest Update

2026-05-06 · opus 4.7

Core Problem: Stale Wait State After Subtransaction Abort in WAIT FOR LSN

Background: WAIT FOR LSN Infrastructure

WAIT FOR LSN is a relatively recent SQL-level feature (implemented in xlogwait.c) that allows a session to block until WAL replay/flush has reached a specified LSN. It's particularly useful in replication and causality scenarios where a client needs to ensure a particular change is durable or visible on a replica before proceeding.

The implementation maintains a per-backend wait registration in shared memory, organized as a pairing heap (the "wait heap") keyed by target LSN. Each PGPROC-associated WaitLSNProcInfo entry carries an inHeap flag indicating whether this backend currently has an outstanding registration. The heap is consulted by the walreceiver/startup/flush paths to wake waiters whose target LSN has been reached.

The critical invariant is: a backend must never attempt to insert itself into the wait heap while inHeap == true. Violating this corrupts the heap (double-insertion into a pairing heap destroys its structure), hence the defensive Assert(!procInfo->inHeap) that fired in the reproducer.

The Bug: Asymmetric Cleanup Between Transaction and Subtransaction Abort

Cleanup of the wait-heap entry is handled by WaitLSNCleanup(), which was wired into:

What was missing is the subtransaction abort path: AbortSubTransaction(). When an interrupt (e.g., statement_timeout, query cancel) fires while the backend is sleeping inside WaitForLSN() within a savepoint, control unwinds via ProcessInterruptsereport(ERROR, ...). If the error is caught by a subtransaction's error-recovery (i.e., the subtransaction is aborted but the top-level transaction survives, as happens with ROLLBACK TO SAVEPOINT), only AbortSubTransaction() runs — not AbortTransaction().

Consequence: the WaitLSNProcInfo.inHeap flag stays true, and the backend's node remains linked in the shared pairing heap even though the backend is no longer waiting. A subsequent WAIT FOR LSN in the same backend trips the assertion (debug builds) or corrupts the heap (release builds — potentially causing arbitrary misbehavior in the flush/replay wakeup path).

Reproducer Mechanics

The test case is carefully constructed:

  1. Enter a subtransaction (SAVEPOINT s).
  2. Set a short statement_timeout so the wait is interrupted by an error rather than completing normally.
  3. WAIT FOR LSN '<future-lsn>' with a target far beyond current WAL — guarantees the backend actually enters the wait loop and registers in the heap before being interrupted.
  4. ROLLBACK TO s — aborts the subtransaction; the backend does not exit and top-level AbortTransaction() is not called.
  5. A second WAIT FOR LSN — trips Assert(!procInfo->inHeap).

The WITH (MODE 'primary_flush') option is relevant: it directs the wait to the flush-LSN path (WaitForWalFlush wait event) rather than the replay path, ensuring deterministic behavior on a primary.

The Fix

The patch is architecturally minimal and correct: add a WaitLSNCleanup() call inside AbortSubTransaction(), placed after LWLockReleaseAll(). Ordering matters because WaitLSNCleanup() itself takes the WaitLSNLock LWLock to unlink from the shared heap; if the backend were still holding some unrelated LWLock from the aborted operation, lock-ordering rules could be violated. This mirrors precisely the ordering already used in AbortTransaction().

This is the canonical "symmetric cleanup" bug pattern in PostgreSQL: any resource that has a per-backend registration in shared memory must be cleaned in both abort paths, because subtransaction abort is a strict subset of state reset that nevertheless must release resources scoped to a single statement. Similar patterns exist for LWLocks, buffer pins, catalog snapshots, etc.

Review Feedback (Xuneng Zhou)

Three substantive review points:

  1. Comment accuracy: The existing header comment for WaitLSNCleanup() described it as cleanup "for exiting process." Post-patch this is false — the process is not exiting, it's merely transitioning out of a wait scope. Suggested rewording: "Clean up any LSN wait state for the current process." Alexander Korotkov agreed, noting "existing process" is confusing even in the current AbortTransaction() caller.

  2. Deterministic test synchronization: The original test relied on statement_timeout = '100ms' firing while the waiter was registered. This is a race: the backend might be interrupted before it enters the heap, or it might not yet have reached the sleep. The standard PostgreSQL TAP-test idiom is to launch the waiter in a background psql session (BackgroundPsql) and poll pg_stat_activity until wait_event = 'WaitForWalFlush' appears for that PID. Only then is the backend guaranteed to be in the heap, so cancellation deterministically exercises the stale-registration path.

  3. Test validation LSN: The patch originally validated recovery by issuing WAIT FOR LSN '0/0' (which returns immediately because LSN 0/0 is trivially reached). Zhou argued for re-using an unreachable LSN with a short TIMEOUT, so that the second wait actually performs heap insertion (the buggy code path). A fast-success 0/0 call might bypass heap registration entirely depending on the early-exit logic in WaitForLSN, thereby failing to prove the fix.

Point (3) is the most important review comment technically: it distinguishes between testing that the code doesn't crash versus testing that the specific code path containing the bug is exercised.

Decision Path

Alexander Korotkov, the original author/committer of the WAIT FOR LSN feature, concurred quickly and indicated he would push. Given that the fix is a one-liner plus test, follows existing symmetry (AbortTransactionAbortSubTransaction), and the reviewer's refinements improve robustness without changing semantics, there was no design-level disagreement. The thread converged in under three hours.

Broader Implications