WAIT FOR LSN: Built-in Read-Your-Writes Consistency for Standbys
The Core Problem
PostgreSQL's streaming replication provides eventual consistency on standbys, but applications needing read-your-writes guarantees across a primary/replica split must currently poll pg_last_wal_replay_lsn() in a loop. This is wasteful (CPU, network round-trips) and clumsy. The feature adds a first-class server-side wait primitive so a client that committed a transaction on the primary and received its commit LSN can, on a standby, block until replay catches up — without burning cycles.
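The polling pattern this feature replaces can be sketched as follows. This is a minimal client-side model, not PostgreSQL code: `get_replay_lsn` is a hypothetical callable standing in for a round-trip that runs `SELECT pg_last_wal_replay_lsn()` on the standby, and the fake standby at the bottom exists only to make the example self-contained.

```python
import time
from typing import Callable, Tuple


def parse_lsn(text: str) -> int:
    """Convert pg_lsn text such as '0/12345' (hex high/low halves) to a 64-bit int."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def poll_until_replayed(
    get_replay_lsn: Callable[[], str],  # hypothetical: one query round-trip per call
    target: str,
    timeout_s: float = 5.0,
    interval_s: float = 0.01,
) -> Tuple[bool, int]:
    """Poll until replay reaches target; return (reached, round_trips)."""
    deadline = time.monotonic() + timeout_s
    target_pos = parse_lsn(target)
    round_trips = 0
    while True:
        round_trips += 1
        if parse_lsn(get_replay_lsn()) >= target_pos:
            return True, round_trips
        if time.monotonic() >= deadline:
            return False, round_trips
        time.sleep(interval_s)


# Fake standby whose replay position advances a little on every query.
positions = iter(["0/12000", "0/12300", "0/12345", "0/12400"])
reached, trips = poll_until_replayed(lambda: next(positions), "0/12345")
print(reached, trips)  # True 3
```

Every loop iteration costs a query round-trip and a sleep, and the sleep interval trades latency against load. A server-side wait removes both.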
The architectural tension that has blocked this feature for years (dating back to Heikki's earlier review of a stored-procedure version, commit-then-revert cycles) is a subtle snapshot-versus-replay deadlock:
- Any regular SQL-callable function (or procedure with output params) executes inside a transaction and therefore holds at minimum a catalog snapshot, often an active snapshot.
- A snapshot on a standby pins xmin, which can block WAL replay via recovery conflict avoidance / hot-standby feedback mechanics.
- If replay is blocked, the LSN the caller is waiting for will never arrive → self-deadlock.
Previous attempts solved this with a stored procedure (commit from an earlier cycle, referenced as E1sZwuz-002NPQ), but procedures have two fatal limitations Korotkov enumerates upfront:
- CALL requires a catalog lookup to resolve the procedure, which forces a transaction snapshot to be taken, and that is impossible inside a REPEATABLE READ transaction that hasn't yet acquired one (default_transaction_isolation = 'repeatable read' makes this bite on every implicit transaction).
- OUT parameters in procedures hold an additional snapshot that is unsafe to release mid-call.
The Chosen Design: A Utility Statement
Korotkov's solution is a new top-level utility command parsed in gram.y:
WAIT FOR LSN '0/12345' [ WITH ( TIMEOUT 'n[ms|s|min]', NO_THROW, MODE '...' ) ]
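A client might assemble this statement along the following lines. The sketch relies only on the synopsis above and the mode names listed later in this write-up; the helper name `build_wait_for_lsn` and its validation rules are assumptions for illustration, not part of any driver or of PostgreSQL.

```python
import re
from typing import Optional

# Mode names as they appear in this write-up; assumed, not checked against a server.
KNOWN_MODES = {"standby_replay", "standby_write", "standby_flush", "primary_flush"}
LSN_RE = re.compile(r"[0-9A-Fa-f]{1,8}/[0-9A-Fa-f]{1,8}")
TIMEOUT_RE = re.compile(r"\d+(ms|s|min)?")  # 'n[ms|s|min]' from the synopsis


def build_wait_for_lsn(
    lsn: str,
    timeout: Optional[str] = None,
    no_throw: bool = False,
    mode: Optional[str] = None,
) -> str:
    """Build the statement text; inputs are validated because they are inlined."""
    if not LSN_RE.fullmatch(lsn):
        raise ValueError(f"not an LSN: {lsn!r}")
    options = []
    if timeout is not None:
        if not TIMEOUT_RE.fullmatch(timeout):
            raise ValueError(f"bad timeout: {timeout!r}")
        options.append(f"TIMEOUT '{timeout}'")
    if no_throw:
        options.append("NO_THROW")
    if mode is not None:
        if mode not in KNOWN_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        options.append(f"MODE '{mode}'")
    stmt = f"WAIT FOR LSN '{lsn}'"
    if options:
        # WITH is mandatory whenever an option list is present.
        stmt += " WITH (" + ", ".join(options) + ")"
    return stmt


print(build_wait_for_lsn("0/12345"))
print(build_wait_for_lsn("0/12345", timeout="500ms", no_throw=True, mode="standby_flush"))
```

The first call prints the bare statement; the second prints the same statement followed by `WITH (TIMEOUT '500ms', NO_THROW, MODE 'standby_flush')`.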
Key design decisions:
- Utility statement, not function/procedure. Parsing does not require catalog lookups, so no snapshot needs to be taken before execution begins. Inside ExecWaitStmt(), both PopActiveSnapshot() and InvalidateCatalogSnapshot() are called before entering the wait loop, and PlannedStmtRequiresSnapshot() was revised so the statement can return a result row without re-establishing one.
- Minimal grammar footprint. Initial versions used generic_option_list, later reworked to utility_option_list (matching VACUUM/EXPLAIN/CREATE PUBLICATION) after Álvaro Herrera's pushback. Álvaro also forced the grammar into the canonical [ WITH ( ... ) ] shape (mandatory-WITH-if-present, no bare parenthesized variant), pointing to the REPACK grammar as a cautionary tale. The MODE option was initially proposed as a first-class keyword (WAIT FOR LSN '...' MODE FLUSH) but consolidated into the WITH clause at Korotkov's request to avoid four new unreserved keywords.
- Pairing-heap-based waiter registry. xlogwait.c maintains, per WaitLSNType, a shared-memory pairing heap keyed by target LSN plus a minWaitedLSN atomic per type used as a fast-path skip. Waiters register themselves (addLSNWaiter), sleep on their latch, and are woken when the replay/write/flush frontier advances past their target. WaitLSNWakeup processes waiters in fixed-size batches (array of 16) to avoid palloc inside the wakeup path, a deliberate Yura Sokolov suggestion to keep the hot path allocation-free.
- NO_THROW mode. Rather than forcing applications to parse localized error strings (for timeout / promotion), NO_THROW returns a status row (success, timeout, not in recovery). This was Korotkov's answer to Tomas Vondra's question about the feature's motivation.
- Multiple LSN modes. The initial commit supported only standby_replay. A follow-on cycle added standby_write, standby_flush, and primary_flush. Naming was debated: Xuneng proposed a unified flush mode (letting the server auto-select primary vs. standby semantics), but Korotkov correctly identified a race condition: a promotion occurring between query submission and execution would silently change the mode's meaning. The final design exposes primary_flush and standby_flush as distinct user-visible modes, and for symmetry renamed the others to standby_write/standby_replay.
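The waiter registry described in the list above can be modeled in a few lines. This is a single-process sketch under loud assumptions: Python's `heapq` (a binary heap) is swapped in for the pairing heap, there is no shared memory, latch, or locking, and the class and sentinel names are hypothetical. It only illustrates the ordering by target LSN, the minimum-target fast path, and the batches of 16.

```python
import heapq
from typing import List, Tuple

WAKEUP_BATCH = 16            # batch size mentioned in the text
NO_WAITERS = (1 << 64) - 1   # sentinel for "heap is empty" (assumption of this model)


class WaiterRegistry:
    """Toy model of one per-type waiter heap; not the xlogwait.c data structure."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int]] = []  # (target_lsn, waiter_id)
        self.min_waited_lsn = NO_WAITERS

    def add_waiter(self, waiter_id: int, target_lsn: int) -> None:
        heapq.heappush(self._heap, (target_lsn, waiter_id))
        self.min_waited_lsn = self._heap[0][0]

    def wakeup(self, current_lsn: int) -> List[List[int]]:
        """Return ids of woken waiters, grouped in batches of at most WAKEUP_BATCH."""
        if self.min_waited_lsn > current_lsn:
            return []  # fast path: nobody is waiting for an LSN this low
        batches: List[List[int]] = []
        while self._heap and self._heap[0][0] <= current_lsn:
            batch: List[int] = []
            while (self._heap and self._heap[0][0] <= current_lsn
                   and len(batch) < WAKEUP_BATCH):
                batch.append(heapq.heappop(self._heap)[1])
            batches.append(batch)  # the real code would set latches here
        self.min_waited_lsn = self._heap[0][0] if self._heap else NO_WAITERS
        return batches


registry = WaiterRegistry()
for i in range(40):
    registry.add_waiter(i, 100 + i)          # waiter i waits for LSN 100 + i

print(registry.wakeup(99))                               # [] (fast path)
print([len(b) for b in registry.wakeup(134)])            # [16, 16, 3]
print(registry.min_waited_lsn)                           # 135
```

Python lists allocate freely, so the allocation-free property of the fixed 16-element array is not reproduced here; only the bounded batch size is.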
Technical Pitfalls Discovered Post-Commit
This thread is notable for the volume of post-commit breakage, with each incident revealing a subtle concurrency or initialization issue:
1. Promotion wakeup bug (fast-path mis-applied to InvalidXLogRecPtr)
WaitLSNWakeup() had a fast-path: if minWaitedLSN > currentLSN, skip. But during promotion, xlog.c calls WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr) meaning "wake everybody regardless" — and InvalidXLogRecPtr == 0 compared against any real minWaitedLSN trivially triggered the skip. Waiters blocked until timeout (observed as a 60-second delay in the TAP test). Fix: guard the fast-path with XLogRecPtrIsValid(currentLSN).
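The comparison at fault is small enough to reproduce directly. The two predicates below are a simplified model of the fast-path check before and after the fix; the function names are hypothetical, and the only facts taken from the text are that InvalidXLogRecPtr is 0 and that the fix adds a validity guard.

```python
INVALID_XLOG_REC_PTR = 0  # InvalidXLogRecPtr, used here to mean "wake everybody"


def skips_wakeup_buggy(min_waited_lsn: int, current_lsn: int) -> bool:
    """Original fast path: skip whenever the smallest target is above current_lsn."""
    return min_waited_lsn > current_lsn


def skips_wakeup_fixed(min_waited_lsn: int, current_lsn: int) -> bool:
    """Fixed fast path: only apply the comparison to a valid current_lsn."""
    lsn_is_valid = current_lsn != INVALID_XLOG_REC_PTR
    return lsn_is_valid and min_waited_lsn > current_lsn


min_waited = 0x3000060  # hypothetical smallest target among registered waiters

# Promotion passes InvalidXLogRecPtr: the buggy check skips, the fixed one does not.
print(skips_wakeup_buggy(min_waited, INVALID_XLOG_REC_PTR))  # True
print(skips_wakeup_fixed(min_waited, INVALID_XLOG_REC_PTR))  # False

# With an ordinary LSN below every target, skipping is still correct.
print(skips_wakeup_fixed(min_waited, 0x3000000))             # True
```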
2. Recovery-conflict termination of the waiter itself
ResolveRecoveryConflictWithTablespace() calls GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid) — a nuclear cancel that targets all backends, irrespective of xmin. A WAIT FOR LSN session running on a standby during replay of DROP TABLESPACE gets its statement canceled even though it touched no tablespaces. On CI (macOS particularly) this surfaced as flapping 031_recovery_conflict. Korotkov initially hypothesized it was an xmin-holding issue solvable by moving snapshot releases earlier; Xuneng correctly diagnosed that xmin is irrelevant for this conflict class. Fix: wrap WAIT FOR LSN in a retry loop in wait_for_catchup() that falls back to polling on "conflict with recovery" error strings. Andres Freund forced a revert of wait_for_catchup() conversion until the fix landed, since CI spurious-failure rates were unacceptable.
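One reading of that fallback, sketched as control flow only: the real wait_for_catchup() is a Perl helper in the TAP test framework, so everything here (the `QueryError` class, the two injected callables, the return values) is a hypothetical Python stand-in. The sketch tries the server-side wait once and drops to polling only when the error text contains the recovery-conflict marker.

```python
from typing import Callable

CONFLICT_MARKER = "conflict with recovery"  # substring named in the text


class QueryError(Exception):
    """Hypothetical stand-in for an error reported by the standby."""


def wait_for_catchup(
    run_wait_for_lsn: Callable[[], None],      # hypothetical: issues WAIT FOR LSN
    poll_until_caught_up: Callable[[], None],  # hypothetical: old polling path
) -> str:
    """Return which path completed the wait: 'wait' or 'poll'."""
    try:
        run_wait_for_lsn()
        return "wait"
    except QueryError as exc:
        if CONFLICT_MARKER not in str(exc):
            raise  # unrelated failures must still surface
    poll_until_caught_up()
    return "poll"


def cancelled_by_conflict() -> None:
    raise QueryError("canceling statement due to conflict with recovery")


print(wait_for_catchup(lambda: None, lambda: None))           # wait
print(wait_for_catchup(cancelled_by_conflict, lambda: None))  # poll
```

Matching on an error string is fragile in general, but it is a contained choice inside a test helper, where a spurious cancellation should not fail the test.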
3. Stale writtenUpto / flushedUpto on walreceiver startup (Tom Lane's slowdown)
Tom's buildfarm animals showed 003_extrafiles.pl and 033_replay_tsp_drops.pl go from 3–4s to 45s after wait_for_catchup() was converted. Root cause: WalRcv->writtenUpto is zero-initialized in shared memory and only advances when XLogWalRcvWrite() processes incoming data. WalRcv->flushedUpto is initialized to the segment-aligned streaming start by RequestXLogStreaming(), and only advances when LogstreamResult.Flush < LogstreamResult.Write — which is never true at startup because both are seeded from GetXLogReplayRecPtr(). On an idle primary (as in these tests), the shared positions stay stale forever, and a WAIT FOR LSN targeting a position already on disk waits for timeout. Fix: seed both writtenUpto and flushedUpto from the replay position at walreceiver start and wake any pre-registered waiters. Andres committed a minimal fix immediately to stop the buildfarm bleeding.
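A worked example makes the stale-position problem concrete. The LSN values are invented for illustration, and the sketch assumes the default 16 MB WAL segment size; it compares a target that has already been replayed against flushedUpto seeded the old way (segment-aligned start) and the fixed way (replay position).

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default segment size; assumed for this example


def segment_start(lsn: int, segment_size: int = WAL_SEGMENT_SIZE) -> int:
    """Round an LSN down to the start of its WAL segment."""
    return lsn - (lsn % segment_size)


def fmt_lsn(lsn: int) -> str:
    """Render a 64-bit LSN in the usual high/low hex form."""
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:X}"


replay_lsn = 0x3000148   # hypothetical replay position at walreceiver start
target_lsn = 0x3000100   # hypothetical target, already replayed and on disk

flushed_upto_old = segment_start(replay_lsn)  # segment-aligned streaming start
flushed_upto_fixed = replay_lsn               # seeded from the replay position

# A flush-mode wait is satisfied once target_lsn <= flushedUpto.
print(fmt_lsn(flushed_upto_old), target_lsn <= flushed_upto_old)      # 0/3000000 False
print(fmt_lsn(flushed_upto_fixed), target_lsn <= flushed_upto_fixed)  # 0/3000148 True
```

With the old seeding and an idle primary, nothing ever moves flushedUpto past 0/3000000, so the wait for 0/3000100 can only end by timeout.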
4. Missing memory barriers (Andres's review)
Andres flagged that XLogWalRcvWrite writes writtenUpto and then reads minWaitedLSN with no barrier; a reader in GetCurrentLSNForWaitType → GetWalRcvWriteRecPtr had no barrier either. The implicit ordering via LWLocks in addLSNWaiter/deleteLSNWaiter covers the first iteration of the wait loop but provides no guarantee after WaitLatch() returns. Resolution: switch to pg_atomic_read_membarrier_u64 / pg_atomic_write_membarrier_u64 on both the producer (writtenUpto write) and consumer (GetCurrentLSNForWaitType read) sides, and also ResetLatch() unconditionally in the wait loop so a timeout iteration re-reads fresh values before bailing out.
5. Archive recovery without walreceiver
Xuneng identified a separate bug: during pure archive recovery there is no walreceiver, so standby_write waiters sleep forever, because the startup process replays past their target but wakes only STANDBY_REPLAY waiters. Fixed by broadening the wakeup logic.
6. Code duplication in wakeup sites
Andres asked pointedly: why are there five copies of the minWaitedLSN >= ... ? WaitLSNWakeup(...) precheck when WaitLSNWakeup itself does the check, and why does caller code test waitLSNState for non-NULL? Cleanup patch encapsulated this properly.
7. Translatability of error messages (Álvaro)
Xuneng introduced a WaitLSNTypeDesc struct with ->verb and ->noun fields to build error messages like "Recovery ended before target LSN %X was %s; last %s LSN %X". Álvaro vetoed: sentence fragments assembled via %s are untranslatable because word order and agreement vary by language. Replaced with an explicit switch emitting fully spelled-out messages per mode — verbose but correct. Álvaro also enforced the canonical errmsg("unrecognized value for %s option \"%s\": \"%s\"", "WAIT", ...) form (commit 502e256f2262) so translations are shared across commands.
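The difference between the two approaches can be shown side by side. The fragment template is the one quoted above; the verb/noun values and the three spelled-out sentences are hypothetical wordings derived from it, not the messages in the committed code.

```python
# Rejected approach: one template with slots filled from per-mode fragments.
MODE_FRAGMENTS = {  # hypothetical (verb, noun) pairs
    "standby_replay": ("replayed", "replay"),
    "standby_write": ("written", "write"),
    "standby_flush": ("flushed", "flush"),
}


def message_from_fragments(mode: str, target: str, last: str) -> str:
    verb, noun = MODE_FRAGMENTS[mode]
    template = "Recovery ended before target LSN %s was %s; last %s LSN %s"
    return template % (target, verb, noun, last)


# Accepted approach: one complete sentence per mode, each a separate translatable string.
FULL_MESSAGES = {  # hypothetical wordings
    "standby_replay": "Recovery ended before target LSN %s was replayed; last replay LSN %s",
    "standby_write": "Recovery ended before target LSN %s was written; last write LSN %s",
    "standby_flush": "Recovery ended before target LSN %s was flushed; last flush LSN %s",
}


def message_per_mode(mode: str, target: str, last: str) -> str:
    if mode not in FULL_MESSAGES:
        raise ValueError(f"unrecognized mode: {mode!r}")
    return FULL_MESSAGES[mode] % (target, last)


print(message_per_mode("standby_flush", "0/12345", "0/12000"))
```

In English both functions produce identical text. The point is what a translator receives: the first approach hands over a sentence with two word-sized holes whose grammar cannot be adjusted per language, while the second hands over three whole sentences.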
Noteworthy Architectural Observations
- Timeline-blind semantics. The command intentionally does not reason about timelines. An LSN on TLI 2 post-promotion is a different logical position than the same numeric LSN on TLI 1, so the command aborts waits on promotion rather than silently continuing. This is documented explicitly as a limitation.
- Per-process shmem slot. Initial implementation allocated a slot per backend; Xuneng extended this to all processes so callers like a re-worked read_local_xlog_page_guts() (proposed in a separate mailing-list thread) could replace their check-sleep-repeat loop with a condition-variable-style wait.
- WaitLSNProcInfo packing. Korotkov noticed that inHeap and heapNode arrays grew per-process with each new WaitLSNType. Since a single process can only wait on one LSN at a time, these were collapsed back to scalars, a nice space win.
- Follow-on use cases. Andres called out that wait_for_catchup() polling should use this; the read_local_xlog_page_guts polling in logical decoding is another candidate; Korotkov mentioned waiting on a specific XID replay as a plausible future extension using the same infrastructure.
Whose Opinions Drove the Design
- Alexander Korotkov (committer, author): the throughline — designed the feature, pushed it across multiple commitfests, did all the final commits.
- Álvaro Herrera (committer): decisive on grammar (mandatory WITH, utility_option_list over generic_option_list, no bare parens), and on translatability (killed the verb/noun struct).
- Andres Freund (committer): post-commit gatekeeper. Forced revert when CI broke, demanded memory-barrier correctness, flagged the code-duplication smell, did the minimal fast-path fix for Tom's slowdown.
- Tom Lane (committer): buildfarm canary — caught the walreceiver initialization regression that nobody else saw.
- Tomas Vondra (committer): extensive early review on docs, syntax ergonomics, the throw flag's motivation.
- Heikki Linnakangas (committer): referenced for the original recovery-conflict / polling concern and spotted the misplaced WaitLSNWakeup call in PerformWalRecovery that risked missed wakeups on pause/stop/promote.
- Yura Sokolov: shaped the wakeup hot path (static 16-element array instead of palloc, struct packing for inHeap).
- Xuneng Zhou: de-facto co-author through the later cycles: rebases, MODE extension, every post-commit bug fix, test harnessing improvements.