2026-05-06 · opus 4.7

WAIT FOR LSN: Built-in Read-Your-Writes Consistency for Standbys

The Core Problem

PostgreSQL's streaming replication provides eventual consistency on standbys, but applications needing read-your-writes guarantees across a primary/replica split must currently poll pg_last_wal_replay_lsn() in a loop, which wastes CPU and network round-trips and is clumsy to write. The feature adds a first-class server-side wait primitive: a client that committed a transaction on the primary and received its commit LSN can block on a standby until replay catches up, without polling.
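To make the cost of the polling approach concrete, here is a minimal client-side sketch. It assumes only the textual LSN format (two hexadecimal numbers separated by a slash, high 32 bits first); `get_replay_lsn` is a hypothetical stand-in for running `SELECT pg_last_wal_replay_lsn()` on the standby, and the function names are illustrative, not part of any driver.

```python
import time
from typing import Callable


def parse_lsn(text: str) -> int:
    """Convert a textual LSN ('hi/lo', both hexadecimal) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def poll_until_replayed(
    get_replay_lsn: Callable[[], str],  # hypothetical: one query round-trip per call
    target: str,
    interval_s: float = 0.01,
    max_polls: int = 1000,
) -> int:
    """Poll until the standby's replay LSN reaches target.

    Returns the number of round-trips that were needed.
    """
    target_int = parse_lsn(target)
    for polls in range(1, max_polls + 1):
        if parse_lsn(get_replay_lsn()) >= target_int:
            return polls
        time.sleep(interval_s)
    raise TimeoutError(f"replay did not reach {target} after {max_polls} polls")
```

Every iteration costs one query round-trip, and LSNs have to be compared numerically rather than as strings ('10/0' is later than '9/0'). A server-side wait replaces the whole loop with a single statement.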

The architectural tension that has blocked this feature for years (through several commit-then-revert cycles, going back to Heikki's review of an earlier stored-procedure version) is a subtle snapshot-versus-replay deadlock:

Any regular SQL-callable function (or procedure with output params) executes inside a transaction and therefore holds at minimum a catalog snapshot, often an active snapshot.
A snapshot on a standby pins xmin, and replay of WAL records that conflict with that snapshot (for example, cleanup of rows it can still see) has to wait for the snapshot to go away or cancel its holder.
If replay is blocked, the LSN the caller is waiting for will never arrive → self-deadlock.

Previous attempts solved this with a stored procedure (commit from an earlier cycle, referenced as E1sZwuz-002NPQ), but procedures have two fatal limitations Korotkov enumerates upfront:

CALL requires a catalog lookup to resolve the procedure, which forces a snapshot to be taken. Under REPEATABLE READ that first snapshot becomes the transaction snapshot and cannot be released before the wait, so the procedure cannot be used there at all (and default_transaction_isolation = 'repeatable read' makes this bite on every implicit transaction).
OUT parameters in procedures hold an additional snapshot that is unsafe to release mid-call.

The Chosen Design: A Utility Statement

Korotkov's solution is a new top-level utility command parsed in gram.y:

WAIT FOR LSN '0/12345' [ WITH ( TIMEOUT 'n[ms|s|min]', NO_THROW, MODE '...' ) ]
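As a rough illustration of the grammar shape, the following sketch assembles a statement string from validated options. The helper name and its validation rules are assumptions made for this example; the four mode names are the ones the thread settled on, and the exact set of accepted option spellings should be checked against the server documentation.

```python
import re
from typing import Optional

# Mode names as discussed in the thread.
KNOWN_MODES = ("standby_replay", "standby_write", "standby_flush", "primary_flush")
_LSN_RE = re.compile(r"[0-9A-Fa-f]{1,8}/[0-9A-Fa-f]{1,8}")
_TIMEOUT_RE = re.compile(r"\d+(ms|s|min)?")


def build_wait_for_lsn(
    lsn: str,
    timeout: Optional[str] = None,
    no_throw: bool = False,
    mode: Optional[str] = None,
) -> str:
    """Hypothetical helper: build a WAIT FOR LSN statement in the
    [ WITH ( ... ) ] shape, rejecting values that would need quoting."""
    if not _LSN_RE.fullmatch(lsn):
        raise ValueError(f"malformed LSN: {lsn!r}")
    options = []
    if timeout is not None:
        if not _TIMEOUT_RE.fullmatch(timeout):
            raise ValueError(f"malformed timeout: {timeout!r}")
        options.append(f"TIMEOUT '{timeout}'")
    if no_throw:
        options.append("NO_THROW")
    if mode is not None:
        if mode not in KNOWN_MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        options.append(f"MODE '{mode}'")
    statement = f"WAIT FOR LSN '{lsn}'"
    if options:
        # WITH is mandatory whenever an option list is present.
        statement += " WITH (" + ", ".join(options) + ")"
    return statement
```

Calling `build_wait_for_lsn("0/12345", timeout="100ms", no_throw=True, mode="standby_flush")` yields a statement with all three options inside a single WITH clause.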

Key design decisions:

Utility statement, not function/procedure. Parsing does not require catalog lookups, so no snapshot needs to be taken before execution begins. Inside ExecWaitStmt(), both PopActiveSnapshot() and InvalidateCatalogSnapshot() are called before entering the wait loop, and PlannedStmtRequiresSnapshot() was revised so the statement can return a result row without re-establishing one.
Minimal grammar footprint. Initial versions used generic_option_list, later reworked to utility_option_list (matching VACUUM/EXPLAIN/CREATE PUBLICATION) after Álvaro Herrera's pushback. Álvaro also forced the grammar into the canonical [ WITH ( ... ) ] shape (mandatory-WITH-if-present, no bare parenthesized variant), pointing to the REPACK grammar as a cautionary tale. The MODE option was initially proposed as a first-class keyword (WAIT FOR LSN '...' MODE FLUSH) but consolidated into the WITH clause at Korotkov's request to avoid four new unreserved keywords.
Pairing-heap-based waiter registry. xlogwait.c maintains, per WaitLSNType, a shared-memory pairing heap keyed by target LSN plus a minWaitedLSN atomic per type used as a fast-path skip. Waiters register themselves (addLSNWaiter), sleep on their latch, and are woken when the replay/write/flush frontier advances past their target. WaitLSNWakeup processes waiters in fixed-size batches (array of 16) to avoid palloc inside the wakeup path — a deliberate Yura Sokolov suggestion to keep the hot path allocation-free.
NO_THROW mode. Rather than forcing applications to parse localized error strings (for timeout / promotion), NO_THROW returns a status row (success, timeout, not in recovery). This was Korotkov's answer to Tomas Vondra's question about the feature's motivation.
Multiple LSN modes. The initial commit supported only standby_replay. A follow-on cycle added standby_write, standby_flush, and primary_flush. Naming was debated: Xuneng proposed a unified flush mode (letting the server auto-select primary vs. standby semantics), but Korotkov correctly identified a race condition — a promotion occurring between query submission and execution would silently change the mode's meaning. The final design exposes primary_flush and standby_flush as distinct user-visible modes, and for symmetry renamed the others to standby_write / standby_replay.
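The waiter registry described above can be modelled in a few lines. This is a toy, single-process sketch, not the xlogwait.c code: it uses Python's binary heap (`heapq`) in place of a pairing heap, a plain list in place of latches, and an all-ones sentinel for "no waiters", which is an assumption of this example.

```python
import heapq
from typing import List, Tuple

BATCH_SIZE = 16          # fixed-size wakeup batch, as in the design above
NO_WAITERS = 2**64 - 1   # assumed sentinel: larger than any real LSN


class WaiterRegistry:
    """Toy model of one per-type registry: a min-heap keyed by target LSN."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int]] = []  # (target_lsn, waiter_id)
        self.min_waited_lsn = NO_WAITERS        # fast-path skip value
        self.woken: List[int] = []              # stand-in for setting latches

    def add_waiter(self, waiter_id: int, target_lsn: int) -> None:
        heapq.heappush(self._heap, (target_lsn, waiter_id))
        self.min_waited_lsn = self._heap[0][0]

    def wakeup(self, current_lsn: int) -> int:
        """Wake all waiters whose target is <= current_lsn.

        Returns the number of batches processed.
        """
        if self.min_waited_lsn > current_lsn:
            return 0  # fast path: nobody is waiting for an LSN this low
        batches = 0
        while self._heap and self._heap[0][0] <= current_lsn:
            batch: List[int] = []
            while (self._heap and self._heap[0][0] <= current_lsn
                   and len(batch) < BATCH_SIZE):
                batch.append(heapq.heappop(self._heap)[1])
            self.woken.extend(batch)
            batches += 1
        self.min_waited_lsn = self._heap[0][0] if self._heap else NO_WAITERS
        return batches
```

With 40 waiters at or below the current LSN, the model wakes them in three batches (16, 16, 8); the bounded batch is what lets the real wakeup path use a fixed array instead of allocating.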

Technical Pitfalls Discovered Post-Commit

This thread is notable for the volume of post-commit breakage, each revealing a subtle concurrency or initialization issue:

1. Promotion wakeup bug (fast-path mis-applied to `InvalidXLogRecPtr`)

WaitLSNWakeup() had a fast-path: if minWaitedLSN > currentLSN, skip. But during promotion, xlog.c calls WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr) meaning "wake everybody regardless" — and InvalidXLogRecPtr == 0 compared against any real minWaitedLSN trivially triggered the skip. Waiters blocked until timeout (observed as a 60-second delay in the TAP test). Fix: guard the fast-path with XLogRecPtrIsValid(currentLSN).
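The bug and its fix reduce to a single predicate. The sketch below models it with plain integers; the function names are illustrative, and `INVALID_LSN = 0` mirrors InvalidXLogRecPtr.

```python
INVALID_LSN = 0  # models InvalidXLogRecPtr, used as "wake everybody"


def skips_wakeup_buggy(min_waited_lsn: int, current_lsn: int) -> bool:
    """Original fast path: skip whenever the lowest waited LSN is ahead."""
    return min_waited_lsn > current_lsn


def skips_wakeup_fixed(min_waited_lsn: int, current_lsn: int) -> bool:
    """Fixed fast path: only a valid current LSN may trigger the skip."""
    return current_lsn != INVALID_LSN and min_waited_lsn > current_lsn
```

With a waiter at 0x5000, the buggy predicate skips the promotion wakeup (0x5000 > 0), while the fixed one lets it through and still skips an ordinary wakeup at 0x4000.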

2. Recovery-conflict termination of the waiter itself

ResolveRecoveryConflictWithTablespace() calls GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid), a nuclear cancel that targets all backends irrespective of xmin. A WAIT FOR LSN session running on a standby during replay of DROP TABLESPACE gets its statement canceled even though it touched no tablespaces. On CI (macOS particularly) this surfaced as flapping 031_recovery_conflict. Korotkov initially hypothesized an xmin-holding issue, solvable by moving snapshot releases earlier; Xuneng correctly diagnosed that xmin is irrelevant for this conflict class. Fix: wrap WAIT FOR LSN in a retry loop in wait_for_catchup() that falls back to polling on "conflict with recovery" error strings. Andres Freund forced a revert of the wait_for_catchup() conversion until the fix landed, since the spurious CI failure rate was unacceptable.
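The fallback logic can be sketched as follows. The real wait_for_catchup() lives in the Perl test framework; this Python version only illustrates the control flow, and `QueryError`, `wait_for_lsn`, and `poll_once` are hypothetical names introduced for the example.

```python
import time
from typing import Callable


class QueryError(Exception):
    """Hypothetical error raised when a statement fails on the standby."""


def wait_for_catchup(
    wait_for_lsn: Callable[[], None],  # hypothetical: runs WAIT FOR LSN once
    poll_once: Callable[[], bool],     # hypothetical: True once caught up
    poll_interval_s: float = 0.1,
    max_polls: int = 100,
) -> str:
    """Prefer the server-side wait; fall back to polling on a recovery conflict."""
    try:
        wait_for_lsn()
        return "waited"
    except QueryError as exc:
        # Only the recovery-conflict cancellation is expected here;
        # anything else is a real failure and must propagate.
        if "conflict with recovery" not in str(exc):
            raise
    for _ in range(max_polls):
        if poll_once():
            return "polled"
        time.sleep(poll_interval_s)
    raise TimeoutError("standby did not catch up")
```

Matching on the error text is brittle, but here both the caller and the message are under the project's control in the test suite.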

3. Stale `writtenUpto` / `flushedUpto` on walreceiver startup (Tom Lane's slowdown)

Tom's buildfarm animals showed 003_extrafiles.pl and 033_replay_tsp_drops.pl go from 3–4s to 45s after wait_for_catchup() was converted. Root cause: WalRcv->writtenUpto is zero-initialized in shared memory and only advances when XLogWalRcvWrite() processes incoming data. WalRcv->flushedUpto is initialized to the segment-aligned streaming start by RequestXLogStreaming(), and only advances when LogstreamResult.Flush < LogstreamResult.Write — which is never true at startup because both are seeded from GetXLogReplayRecPtr(). On an idle primary (as in these tests), the shared positions stay stale forever, and a WAIT FOR LSN targeting a position already on disk waits for timeout. Fix: seed both writtenUpto and flushedUpto from the replay position at walreceiver start and wake any pre-registered waiters. Andres committed a minimal fix immediately to stop the buildfarm bleeding.
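A toy model makes the stale-position problem easier to see. It tracks only the four values named above and assumes the default 16 MB WAL segment size; the class and its methods are illustrative and do not mirror the walreceiver code structure.

```python
SEGMENT_SIZE = 16 * 1024 * 1024  # default WAL segment size


class WalReceiverModel:
    """Toy model of walreceiver startup on an idle primary."""

    def __init__(self, replay_lsn: int, seed_from_replay: bool) -> None:
        # Process-local positions both start at the replay position.
        self.local_write = replay_lsn
        self.local_flush = replay_lsn
        # Shared positions: zero-initialized / segment-aligned start.
        self.written_upto = 0
        self.flushed_upto = replay_lsn - replay_lsn % SEGMENT_SIZE
        if seed_from_replay:
            # The fix: publish what is already known to be on disk.
            self.written_upto = replay_lsn
            self.flushed_upto = replay_lsn

    def receive(self, nbytes: int) -> None:
        """Incoming WAL advances the write position."""
        self.local_write += nbytes
        self.written_upto = self.local_write

    def flush(self) -> None:
        """The shared flush position only moves when there is something to flush."""
        if self.local_flush < self.local_write:
            self.local_flush = self.local_write
            self.flushed_upto = self.local_flush
```

For a replay position a few kilobytes into a segment and a target just below it, the unseeded model never reports the target as written or flushed until new WAL arrives, which on an idle primary is never.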

4. Missing memory barriers (Andres's review)

Andres flagged that XLogWalRcvWrite writes writtenUpto and then reads minWaitedLSN with no barrier; a reader in GetCurrentLSNForWaitType → GetWalRcvWriteRecPtr had no barrier either. The implicit ordering via LWLocks in addLSNWaiter/deleteLSNWaiter covers the first iteration of the wait loop but provides no guarantee after WaitLatch() returns. Resolution: switch to pg_atomic_read_membarrier_u64 / pg_atomic_write_membarrier_u64 on both the producer (writtenUpto write) and consumer (GetCurrentLSNForWaitType read) sides, and also ResetLatch() unconditionally in the wait loop so a timeout iteration re-reads fresh values before bailing out.

5. Archive recovery without walreceiver

A separate bug Xuneng identified: during pure archive recovery there is no walreceiver, so standby_write waiters sleep forever. The startup process replays past their target but wakes only standby_replay waiters. Fixed by broadening the wakeup logic.

6. Code duplication in wakeup sites

Andres asked pointedly: why are there five copies of the minWaitedLSN >= ... ? WaitLSNWakeup(...) precheck when WaitLSNWakeup itself does the check, and why does caller code test waitLSNState for non-NULL? Cleanup patch encapsulated this properly.

7. Translatability of error messages (Álvaro)

Xuneng introduced a WaitLSNTypeDesc struct with ->verb and ->noun fields to build error messages like "Recovery ended before target LSN %X was %s; last %s LSN %X". Álvaro vetoed: sentence fragments assembled via %s are untranslatable because word order and agreement vary by language. Replaced with an explicit switch emitting fully spelled-out messages per mode — verbose but correct. Álvaro also enforced the canonical errmsg("unrecognized value for %s option \"%s\": \"%s\"", "WAIT", ...) form (commit 502e256f2262) so translations are shared across commands.
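The "one complete message per mode" approach looks roughly like this. The message wording below is hypothetical, modelled on the template quoted above rather than copied from the committed code, and the function name is illustrative.

```python
# One complete, independently translatable sentence per mode.
# Wording is hypothetical, derived from the template discussed in the thread.
RECOVERY_ENDED_MESSAGES = {
    "standby_replay":
        "recovery ended before target LSN {target} was replayed; "
        "last replay LSN {current}",
    "standby_write":
        "recovery ended before target LSN {target} was written; "
        "last write LSN {current}",
    "standby_flush":
        "recovery ended before target LSN {target} was flushed; "
        "last flush LSN {current}",
}


def recovery_ended_message(mode: str, target: str, current: str) -> str:
    """Pick the full sentence for a mode, then fill in only the LSN values."""
    try:
        template = RECOVERY_ENDED_MESSAGES[mode]
    except KeyError:
        raise ValueError(f"unrecognized mode: {mode!r}") from None
    return template.format(target=target, current=current)
```

Only values (the LSNs) are substituted, never words, so a translator sees each sentence whole and can reorder it freely.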

Noteworthy Architectural Observations

Timeline-blind semantics. The command intentionally does not reason about timelines. An LSN on TLI 2 post-promotion is a different logical position than the same numeric LSN on TLI 1, so the command aborts waits on promotion rather than silently continuing. This is documented explicitly as a limitation.
Per-process shmem slot. The initial implementation allocated a slot per backend; Xuneng extended this to all processes so callers like a reworked read_local_xlog_page_guts() (discussed in a separate mailing-list thread) could replace their check-sleep-repeat loop with a condition-variable-style wait.
WaitLSNProcInfo packing. Korotkov noticed that inHeap and heapNode arrays grew per-process with each new WaitLSNType. Since a single process can only wait on one LSN at a time, these were collapsed back to scalars — a nice space win.
Follow-on use cases. Andres called out that wait_for_catchup() polling should use this; the read_local_xlog_page_guts polling in logical decoding is another candidate; Korotkov mentioned waiting on a specific XID replay as a plausible future extension using the same infrastructure.

Whose Opinions Drove the Design

Alexander Korotkov (committer, author): the throughline — designed the feature, pushed it across multiple commitfests, did all the final commits.
Álvaro Herrera (committer): decisive on grammar (mandatory WITH, utility_option_list over generic_option_list, no bare parens), and on translatability (killed the verb/noun struct).
Andres Freund (committer): post-commit gatekeeper. Forced revert when CI broke, demanded memory-barrier correctness, flagged the code-duplication smell, did the minimal fast-path fix for Tom's slowdown.
Tom Lane (committer): buildfarm canary — caught the walreceiver initialization regression that nobody else saw.
Tomas Vondra (committer): extensive early review on docs, syntax ergonomics, the throw flag's motivation.
Heikki Linnakangas (committer): referenced for the original recovery-conflict / polling concern and spotted the misplaced WaitLSNWakeup call in PerformWalRecovery that risked missed wakeups on pause/stop/promote.
Yura Sokolov: shaped the wakeup hot path (static 16-element array instead of palloc, struct packing for inHeap).
Xuneng Zhou: de-facto co-author through the later cycles — rebases, MODE extension, every post-commit bug fix, test harnessing improvements.

Implement waiting for wal lsn replay: reloaded
