Implement waiting for wal lsn replay: reloaded

First seen: 2024-11-27 04:08:51+00:00 · Messages: 139 · Participants: 12

Latest Update

2026-05-06 · opus 4.7

WAIT FOR LSN: Built-in Read-Your-Writes Consistency for Standbys

The Core Problem

PostgreSQL's streaming replication provides eventual consistency on standbys, but applications needing read-your-writes guarantees across a primary/replica split must currently poll pg_last_wal_replay_lsn() in a loop. This is wasteful (CPU, network round-trips) and clumsy. The feature adds a first-class server-side wait primitive so a client that committed a transaction on the primary and received its commit LSN can, on a standby, block until replay catches up — without burning cycles.

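The architectural tension that has blocked this feature for years, going back to Heikki's review of an earlier stored-procedure version and through several commit-then-revert cycles, is a subtle snapshot-versus-replay deadlock: a session that holds a snapshot while it waits can itself hold back replay on the standby, so the waiter ends up blocked on progress that its own snapshot prevents.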

Previous attempts solved this with a stored procedure (commit from an earlier cycle, referenced as E1sZwuz-002NPQ), but procedures have two fatal limitations Korotkov enumerates upfront:

  1. CALL requires a catalog lookup to resolve the procedure, which forces a transaction snapshot to be taken — impossible inside a REPEATABLE READ transaction that hasn't yet acquired one (and default_transaction_isolation = 'repeatable read' makes this bite on every implicit transaction).
  2. OUT parameters in procedures hold an additional snapshot that is unsafe to release mid-call.

The Chosen Design: A Utility Statement

Korotkov's solution is a new top-level utility command parsed in gram.y:

WAIT FOR LSN '0/12345' [ WITH ( TIMEOUT 'n[ms|s|min]', NO_THROW, MODE '...' ) ]

Key design decisions:

  1. Utility statement, not function/procedure. Parsing does not require catalog lookups, so no snapshot needs to be taken before execution begins. Inside ExecWaitStmt(), both PopActiveSnapshot() and InvalidateCatalogSnapshot() are called before entering the wait loop, and PlannedStmtRequiresSnapshot() was revised so the statement can return a result row without re-establishing one.

  2. Minimal grammar footprint. Initial versions used generic_option_list, later reworked to utility_option_list (matching VACUUM/EXPLAIN/CREATE PUBLICATION) after Álvaro Herrera's pushback. Álvaro also forced the grammar into the canonical [ WITH ( ... ) ] shape (mandatory-WITH-if-present, no bare parenthesized variant), pointing to the REPACK grammar as a cautionary tale. The MODE option was initially proposed as a first-class keyword (WAIT FOR LSN '...' MODE FLUSH) but consolidated into the WITH clause at Korotkov's request to avoid four new unreserved keywords.

  3. Pairing-heap-based waiter registry. xlogwait.c maintains, per WaitLSNType, a shared-memory pairing heap keyed by target LSN plus a minWaitedLSN atomic per type used as a fast-path skip. Waiters register themselves (addLSNWaiter), sleep on their latch, and are woken when the replay/write/flush frontier advances past their target. WaitLSNWakeup processes waiters in fixed-size batches (array of 16) to avoid palloc inside the wakeup path — a deliberate Yura Sokolov suggestion to keep the hot path allocation-free.

  4. NO_THROW mode. Rather than forcing applications to parse localized error strings (for timeout / promotion), NO_THROW returns a status row (success, timeout, not in recovery). This was Korotkov's answer to Tomas Vondra's question about the feature's motivation.

  5. Multiple LSN modes. The initial commit supported only standby_replay. A follow-on cycle added standby_write, standby_flush, and primary_flush. Naming was debated: Xuneng proposed a unified flush mode (letting the server auto-select primary vs. standby semantics), but Korotkov correctly identified a race condition: a promotion occurring between query submission and execution would silently change the mode's meaning. The final design exposes primary_flush and standby_flush as distinct user-visible modes and, for symmetry, names the others standby_write and standby_replay.
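
The waiter registry from item 3 can be pictured with a small stand-alone model. This is only a sketch under stated assumptions, not the code in xlogwait.c: a fixed-size binary min-heap stands in for the pairing heap, a boolean flag stands in for setting a latch, locking is omitted, and the names Registry, Waiter, registry_add, registry_pop_min and registry_wakeup are hypothetical. What it does show is the batching idea: waiters are drained into a stack array of 16 entries, so the wakeup path never allocates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;                 /* stand-in for XLogRecPtr */

#define MAX_WAITERS   64              /* hypothetical capacity */
#define WAKEUP_BATCH  16              /* batch size mentioned in the thread */

typedef struct Waiter
{
    Lsn   target;                     /* LSN this waiter is blocked on */
    bool  woken;                      /* stand-in for setting the latch */
} Waiter;

typedef struct Registry
{
    Waiter *heap[MAX_WAITERS];        /* binary min-heap ordered by target */
    size_t  count;
} Registry;

static void
swap_slots(Registry *r, size_t a, size_t b)
{
    Waiter *tmp = r->heap[a];

    r->heap[a] = r->heap[b];
    r->heap[b] = tmp;
}

/* Register a waiter: append, then sift up to restore heap order. */
static void
registry_add(Registry *r, Waiter *w)
{
    size_t i;

    assert(r->count < MAX_WAITERS);
    i = r->count++;
    r->heap[i] = w;
    while (i > 0 && r->heap[(i - 1) / 2]->target > r->heap[i]->target)
    {
        swap_slots(r, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

/* Remove and return the waiter with the smallest target LSN. */
static Waiter *
registry_pop_min(Registry *r)
{
    Waiter *min;
    size_t  i = 0;

    assert(r->count > 0);
    min = r->heap[0];
    r->heap[0] = r->heap[--r->count];
    for (;;)
    {
        size_t left = 2 * i + 1, right = left + 1, smallest = i;

        if (left < r->count && r->heap[left]->target < r->heap[smallest]->target)
            smallest = left;
        if (right < r->count && r->heap[right]->target < r->heap[smallest]->target)
            smallest = right;
        if (smallest == i)
            break;
        swap_slots(r, i, smallest);
        i = smallest;
    }
    return min;
}

/* Wake every waiter whose target is <= current, at most 16 per batch. */
static size_t
registry_wakeup(Registry *r, Lsn current)
{
    size_t total = 0;

    for (;;)
    {
        Waiter *batch[WAKEUP_BATCH];  /* lives on the stack: no allocation */
        size_t  n = 0;

        while (n < WAKEUP_BATCH && r->count > 0 && r->heap[0]->target <= current)
            batch[n++] = registry_pop_min(r);
        for (size_t k = 0; k < n; k++)
            batch[k]->woken = true;
        total += n;
        if (n < WAKEUP_BATCH)
            return total;             /* heap drained or next target is ahead */
    }
}
```

Because the heap keeps the smallest target at the root, the wakeup loop can stop at the first waiter whose target is still ahead of the current position; it never has to scan the remaining waiters.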

Technical Pitfalls Discovered Post-Commit

This thread is notable for the volume of post-commit breakage, each revealing a subtle concurrency or initialization issue:

1. Promotion wakeup bug (fast-path mis-applied to InvalidXLogRecPtr)

WaitLSNWakeup() had a fast-path: if minWaitedLSN > currentLSN, skip. But during promotion, xlog.c calls WaitLSNWakeup(WAIT_LSN_TYPE_REPLAY, InvalidXLogRecPtr) meaning "wake everybody regardless" — and InvalidXLogRecPtr == 0 compared against any real minWaitedLSN trivially triggered the skip. Waiters blocked until timeout (observed as a 60-second delay in the TAP test). Fix: guard the fast-path with XLogRecPtrIsValid(currentLSN).
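
The bug and its fix reduce to a single comparison, which the following stand-alone sketch isolates. The local typedef and macros merely mirror the PostgreSQL names so the snippet compiles on its own, and skip_wakeup_buggy, skip_wakeup_fixed and demo_promotion_case are hypothetical helper names rather than functions in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins so the example is self-contained. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr     ((XLogRecPtr) 0)
#define XLogRecPtrIsValid(p)  ((p) != InvalidXLogRecPtr)

/* Original fast path: skip the wakeup scan if nobody waits this early. */
static bool
skip_wakeup_buggy(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
    return minWaitedLSN > currentLSN;
}

/* Guarded fast path: an invalid currentLSN means "wake everyone". */
static bool
skip_wakeup_fixed(XLogRecPtr minWaitedLSN, XLogRecPtr currentLSN)
{
    return XLogRecPtrIsValid(currentLSN) && minWaitedLSN > currentLSN;
}

/* Promotion passes InvalidXLogRecPtr while a waiter targets 0x5000. */
static void
demo_promotion_case(void)
{
    assert(skip_wakeup_buggy(0x5000, InvalidXLogRecPtr));   /* waiter is missed */
    assert(!skip_wakeup_fixed(0x5000, InvalidXLogRecPtr));  /* waiter is woken */
}
```

For ordinary, valid positions the two predicates agree, so the guard changes behavior only in the wake-everybody case.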

2. Recovery-conflict termination of the waiter itself

ResolveRecoveryConflictWithTablespace() calls GetConflictingVirtualXIDs(InvalidTransactionId, InvalidOid) — a nuclear cancel that targets all backends, irrespective of xmin. A WAIT FOR LSN session running on a standby during replay of DROP TABLESPACE gets its statement canceled even though it touched no tablespaces. On CI (macOS particularly) this surfaced as flapping 031_recovery_conflict. Korotkov initially hypothesized it was an xmin-holding issue solvable by moving snapshot releases earlier; Xuneng correctly diagnosed that xmin is irrelevant for this conflict class. Fix: wrap WAIT FOR LSN in a retry loop in wait_for_catchup() that falls back to polling on "conflict with recovery" error strings. Andres Freund forced a revert of wait_for_catchup() conversion until the fix landed, since CI spurious-failure rates were unacceptable.

3. Stale writtenUpto / flushedUpto on walreceiver startup (Tom Lane's slowdown)

Tom's buildfarm animals showed 003_extrafiles.pl and 033_replay_tsp_drops.pl go from 3–4s to 45s after wait_for_catchup() was converted. Root cause: WalRcv->writtenUpto is zero-initialized in shared memory and only advances when XLogWalRcvWrite() processes incoming data. WalRcv->flushedUpto is initialized to the segment-aligned streaming start by RequestXLogStreaming(), and only advances when LogstreamResult.Flush < LogstreamResult.Write — which is never true at startup because both are seeded from GetXLogReplayRecPtr(). On an idle primary (as in these tests), the shared positions stay stale forever, and a WAIT FOR LSN targeting a position already on disk waits for timeout. Fix: seed both writtenUpto and flushedUpto from the replay position at walreceiver start and wake any pre-registered waiters. Andres committed a minimal fix immediately to stop the buildfarm bleeding.
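
A toy model makes the stale-position problem concrete. It is a sketch under assumptions: the struct and function names (WalRcvPositions, start_streaming_unseeded, seed_from_replay, target_reached) are hypothetical, the segment size is the default 16 MB, and the seeding is shown as a plain assignment, which may differ in detail from the committed fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;          /* local stand-in */

#define WAL_SEGMENT_SIZE ((XLogRecPtr) 16 * 1024 * 1024)  /* default 16 MB */

typedef struct WalRcvPositions
{
    XLogRecPtr writtenUpto;
    XLogRecPtr flushedUpto;
} WalRcvPositions;

/* Pre-fix startup: write position is zero, flush position is segment-aligned. */
static void
start_streaming_unseeded(WalRcvPositions *p, XLogRecPtr replay)
{
    p->writtenUpto = 0;
    p->flushedUpto = replay - (replay % WAL_SEGMENT_SIZE);
}

/* The fix as described above: seed both positions from the replay position. */
static void
seed_from_replay(WalRcvPositions *p, XLogRecPtr replay)
{
    assert(replay != 0);
    p->writtenUpto = replay;
    p->flushedUpto = replay;
}

/* A waiter is satisfied once the tracked position reaches its target. */
static bool
target_reached(XLogRecPtr position, XLogRecPtr target)
{
    return position >= target;
}
```

With a replay position a few kilobytes into a segment and a target just behind it, neither position satisfies the waiter before seeding, even though the data is already on disk; after seeding both do.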

4. Missing memory barriers (Andres's review)

Andres flagged that XLogWalRcvWrite writes writtenUpto and then reads minWaitedLSN with no barrier; a reader in GetCurrentLSNForWaitType -> GetWalRcvWriteRecPtr had no barrier either. The implicit ordering via LWLocks in addLSNWaiter/deleteLSNWaiter covers the first iteration of the wait loop but provides no guarantee after WaitLatch() returns. Resolution: switch to pg_atomic_read_membarrier_u64 / pg_atomic_write_membarrier_u64 on both the producer (writtenUpto write) and consumer (GetCurrentLSNForWaitType read) sides, and also ResetLatch() unconditionally in the wait loop so a timeout iteration re-reads fresh values before bailing out.
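
The ordering requirement is the classic store-then-load pattern on two variables: each side writes its own variable and then reads the other's. The sketch below expresses it with C11 sequentially consistent atomics instead of the pg_atomic_*_membarrier_u64 calls, so it illustrates the idea rather than the PostgreSQL API. It also assumes a single waiter and uses a hypothetical NO_WAITERS marker; SharedState, shared_init, producer_publish and waiter_must_sleep are made-up names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NO_WAITERS UINT64_MAX         /* hypothetical "nobody is waiting" marker */

typedef struct SharedState
{
    _Atomic uint64_t writtenUpto;     /* advanced by the producer */
    _Atomic uint64_t minWaitedLSN;    /* published by the waiter */
} SharedState;

static void
shared_init(SharedState *s)
{
    atomic_init(&s->writtenUpto, 0);
    atomic_init(&s->minWaitedLSN, NO_WAITERS);
}

/*
 * Producer: publish the new position, then check for waiters.
 * Returns true if a wakeup is needed.  Both operations are seq_cst, so
 * the load cannot be reordered ahead of the store.
 */
static bool
producer_publish(SharedState *s, uint64_t newpos)
{
    atomic_store(&s->writtenUpto, newpos);
    return atomic_load(&s->minWaitedLSN) <= newpos;
}

/*
 * Waiter: publish the target, then re-read the position.
 * Returns true if the waiter still has to sleep.
 */
static bool
waiter_must_sleep(SharedState *s, uint64_t target)
{
    assert(target != 0 && target != NO_WAITERS);
    atomic_store(&s->minWaitedLSN, target);
    return atomic_load(&s->writtenUpto) < target;
}
```

With full ordering on both sides, at least one party observes the other's store: either the producer sees the registered target and issues a wakeup, or the waiter sees the advanced position and does not sleep. Without it, both loads may return stale values and the wakeup is lost.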

5. Archive recovery without walreceiver

Xuneng identified a separate bug: during pure archive recovery there is no walreceiver, so standby_write waiters sleep forever, because the startup process replays past their target but wakes only STANDBY_REPLAY waiters. Fixed by broadening the wakeup logic.

6. Code duplication in wakeup sites

Andres asked pointedly why there are five copies of the minWaitedLSN >= ... ? WaitLSNWakeup(...) precheck when WaitLSNWakeup itself does the check, and why caller code tests waitLSNState for non-NULL. A cleanup patch encapsulated these checks properly.

7. Translatability of error messages (Álvaro)

Xuneng introduced a WaitLSNTypeDesc struct with ->verb and ->noun fields to build error messages like "Recovery ended before target LSN %X was %s; last %s LSN %X". Álvaro vetoed: sentence fragments assembled via %s are untranslatable because word order and agreement vary by language. Replaced with an explicit switch emitting fully spelled-out messages per mode — verbose but correct. Álvaro also enforced the canonical errmsg("unrecognized value for %s option \"%s\": \"%s\"", "WAIT", ...) form (commit 502e256f2262) so translations are shared across commands.
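
The difference between the two approaches is easiest to see in code. In the sketch below every mode gets its own complete sentence, so a translator sees whole strings instead of fragments glued together with %s. The message wording, the WaitMode enum and the helper names are illustrative assumptions, not the strings or identifiers in the tree, and plain snprintf stands in for the ereport machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mode enum for the three standby modes. */
typedef enum WaitMode
{
    MODE_STANDBY_REPLAY,
    MODE_STANDBY_WRITE,
    MODE_STANDBY_FLUSH
} WaitMode;

/* An LSN prints as two hex halves: high 32 bits, slash, low 32 bits. */
#define LSN_HI(lsn) ((unsigned int) ((lsn) >> 32))
#define LSN_LO(lsn) ((unsigned int) ((lsn) & 0xFFFFFFFFu))

/* One full, independently translatable sentence per mode. */
static int
format_recovery_ended(char *buf, size_t len, WaitMode mode,
                      uint64_t target, uint64_t last)
{
    switch (mode)
    {
        case MODE_STANDBY_REPLAY:
            return snprintf(buf, len,
                            "Recovery ended before target LSN %X/%X was replayed; last replayed LSN %X/%X",
                            LSN_HI(target), LSN_LO(target), LSN_HI(last), LSN_LO(last));
        case MODE_STANDBY_WRITE:
            return snprintf(buf, len,
                            "Recovery ended before target LSN %X/%X was written; last written LSN %X/%X",
                            LSN_HI(target), LSN_LO(target), LSN_HI(last), LSN_LO(last));
        case MODE_STANDBY_FLUSH:
            return snprintf(buf, len,
                            "Recovery ended before target LSN %X/%X was flushed; last flushed LSN %X/%X",
                            LSN_HI(target), LSN_LO(target), LSN_HI(last), LSN_LO(last));
    }
    return -1;                        /* unknown mode */
}

/* Usage example for the write mode. */
static void
demo_write_message(void)
{
    char buf[128];

    format_recovery_ended(buf, sizeof(buf), MODE_STANDBY_WRITE, 0x12345, 0x100);
    assert(strcmp(buf, "Recovery ended before target LSN 0/12345 was written; "
                       "last written LSN 0/100") == 0);
}
```

The repetition is the point: each string can be reordered and inflected independently in every target language, which the verb/noun substitution scheme could not offer.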

Noteworthy Architectural Observations

Whose Opinions Drove the Design