BUG: Cascading standby fails to reconnect after falling back to archive recovery

First seen: 2026-01-28 17:03:24+00:00 · Messages: 32 · Participants: 3

Latest Update

2026-05-06 · opus 4.7

Cascading Standby Fails to Reconnect After Archive-Recovery Fallback

The Core Problem

This thread addresses a long-standing (since PostgreSQL 9.3) bug in cascading streaming replication that manifests in a very specific topology: a cascading standby whose upstream is itself a standby recovering solely from a shared WAL archive, with no walreceiver running on that upstream.

When the cascade's streaming connection breaks (upstream restart, cascade restart, transient failure), the startup process falls back to archive recovery. Both nodes then replay WAL segments independently from the shared archive. When the cascade later tries to reconnect to the upstream, the walsender rejects the request with:

ERROR: requested starting point 0/A000000 is ahead of the WAL flush position of this server 0/9000000

and the cascade is unable to resume streaming — often permanently, because each retry reproduces the same condition.
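As a sanity check on the numbers in that message, here is a standalone illustration (not from the thread; it assumes the default 16 MB wal_segment_size) that the two LSNs differ by exactly one segment:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t startpoint = 0x0A000000;       /* requested start: 0/A000000 */
    uint64_t flushptr   = 0x09000000;       /* reported flush:  0/9000000 */
    uint64_t seg_size   = 16 * 1024 * 1024; /* default wal_segment_size */

    /* 0x0A000000 - 0x09000000 = 0x01000000 bytes = one 16 MB segment */
    printf("gap = %llu segment(s)\n",
           (unsigned long long) ((startpoint - flushptr) / seg_size));
    return 0;
}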

Why the Gap Exists (the Key Architectural Insight)

Initial analyses (including Marco's and Xuneng's early messages) incorrectly attributed the off-by-one gap to the segment-boundary round-down in RequestXLogStreaming() (walreceiverfuncs.c ~line 276):

/* if recptr is mid-segment, round it down to the start of that segment */
if (XLogSegmentOffset(recptr, wal_segment_size) != 0)
    recptr -= XLogSegmentOffset(recptr, wal_segment_size);
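The following standalone demo (the XLogSegmentOffset definition is copied from its xlog_internal.h form to keep the example self-contained) shows the clamp's effect on a mid-segment position:

#include <stdio.h>
#include <stdint.h>

/* mirrors the macro in xlog_internal.h; valid for power-of-two segment sizes */
#define XLogSegmentOffset(xlogptr, wal_segsz_bytes) \
    ((xlogptr) & ((wal_segsz_bytes) - 1))

int main(void)
{
    uint64_t wal_segment_size = 0x1000000;  /* 16 MB */
    uint64_t recptr = 0x09ABCDEF;           /* mid-segment start request */

    /* subtracting the offset can only move recptr backward */
    if (XLogSegmentOffset(recptr, wal_segment_size) != 0)
        recptr -= XLogSegmentOffset(recptr, wal_segment_size);

    printf("clamped start: 0/%llX\n", (unsigned long long) recptr); /* 0/9000000 */
    return 0;
}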

Xuneng eventually pointed out this rounds down, not up, so it can't be the root cause. Marco's corrected diagnosis is the important one:

  1. Archive recovery consumes WAL at whole-segment granularity (restore_command fetches entire files).
  2. After both nodes replay the same archived segment N, the cascade's RecPtr (next page to read) naturally lands at the start of segment N+1.
  3. The upstream, however, is an archive-only standby with no walreceiver. Its GetStandbyFlushRecPtr() returns only replayPtr, which is the end of the last replayed record — still inside segment N (where the last record's padding/continuation ended).
  4. The walsender's sanity check in StartReplication() (added by Heikki's commit abfd192b1b5, PG 9.3) rejects any start point beyond the flush position.

So the cascade is legitimately "one segment ahead" of the upstream's reported flush, even though both have replayed identical data. The condition is self-perpetuating: the next archived segment arrives, both nodes restore it, both advance by one full segment, and the mismatch reappears.

Critically, a normal standby upstream doesn't exhibit this because it runs a walreceiver: GetStandbyFlushRecPtr() then also considers the walreceiver's flushedUpto, which tracks WAL that has been received and flushed to disk but not necessarily replayed, and therefore can run ahead of replayPtr. Archive-only standbys have no such buffer.
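A minimal sketch of that selection logic, paraphrased from the behavior described above rather than copied from walsender.c:

/* Sketch only: simplified rendering of GetStandbyFlushRecPtr(), not the
 * verbatim PostgreSQL source. */
XLogRecPtr
GetStandbyFlushRecPtr_sketch(TimeLineID *tli)
{
    XLogRecPtr  replayPtr, receivePtr, result;
    TimeLineID  replayTLI, receiveTLI;

    receivePtr = GetWalRcvFlushRecPtr(NULL, &receiveTLI); /* invalid if no walreceiver ever ran */
    replayPtr  = GetXLogReplayRecPtr(&replayTLI);

    /* Archive-only standby: nothing was ever received, so the reported
     * flush is just the end of the last replayed record, inside segment N. */
    result = replayPtr;

    /* Normal standby: the walreceiver's flushedUpto can run ahead of
     * replay, so a cascade asking for segment N+1 is accepted. */
    if (receiveTLI == replayTLI && receivePtr > replayPtr)
        result = receivePtr;

    *tli = replayTLI;
    return result;
}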

Why the Check Exists and Why Fixing the Server Side is Wrong

The rejection in StartReplication() was introduced alongside timeline-switch support (abfd192b1b5). It's semantically correct: a walsender should not claim to serve WAL it has neither received nor generated. The fix must therefore live on the requester side — the cascade must not ask for data the upstream provably doesn't have.
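For reference, the shape of that walsender-side check, rendered as a simplified sketch rather than an exact copy of StartReplication():

/* inside StartReplication(), after computing this server's FlushPtr:
 * refuse to stream WAL we have neither received nor generated */
if (FlushPtr < cmd->startpoint)
    ereport(ERROR,
            (errmsg("requested starting point %X/%X is ahead of "
                    "the WAL flush position of this server %X/%X",
                    LSN_FORMAT_ARGS(cmd->startpoint),
                    LSN_FORMAT_ARGS(FlushPtr))));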

Evolution of the Proposed Fix

v1 (Xuneng): Track lastStreamedFlush in the walreceiver's process-local state and clamp the streaming start position to it.

Fujii-san's killer counterexample: A simple restart of the cascade reproduces the bug, and process-local state doesn't survive a restart. This invalidated v1's approach entirely.

v2 (Marco, "handshake clamp"): Use the IDENTIFY_SYSTEM reply, which already contains the upstream's current WAL position. Before issuing START_REPLICATION, compare startpoint with the upstream's flush LSN; if ahead on the same timeline, wait wal_retrieve_retry_interval and retry. Because this queries live state on every connection, it is restart-safe and requires no new persistent state.
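In outline, the v2 walreceiver-side loop looks roughly like this sketch (walrcv_identify_system_ext and the variable names are illustrative stand-ins, not the patch's actual identifiers):

/* handshake clamp: ask the upstream where it is before issuing
 * START_REPLICATION, and wait out a start point it cannot yet serve */
for (;;)
{
    TimeLineID  upstreamTLI;
    XLogRecPtr  upstreamFlush;

    /* the IDENTIFY_SYSTEM reply already carries the server's WAL position */
    walrcv_identify_system_ext(wrconn, &upstreamTLI, &upstreamFlush);

    if (upstreamTLI != startpointTLI || startpoint <= upstreamFlush)
        break;          /* upstream can serve us (or timelines differ): proceed */

    /* same timeline but our start point is ahead: re-poll after a delay */
    pg_usleep(wal_retrieve_retry_interval * 1000L);
}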

Backpatch variant (v2+): The master patch extends walrcv_identify_system_fn (a function pointer in WalReceiverFunctionsType), which is an ABI break on stable branches. The backpatch variant stashes the flush LSN in a module-global WalRcvIdentifySystemLsn set inside libpqrcv_identify_system() and read by the walreceiver — ugly but ABI-safe. This dual-patch strategy (clean API for master, global for backbranches) is the standard Postgres convention.
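Sketched, the backpatch approach looks something like this (only WalRcvIdentifySystemLsn and libpqrcv_identify_system are names from the thread; the parsing details are illustrative):

/* module-global in libpqwalreceiver.c, set as a side effect of the
 * existing IDENTIFY_SYSTEM round-trip, so WalReceiverFunctionsType and
 * its function-pointer signatures stay untouched on stable branches */
static XLogRecPtr WalRcvIdentifySystemLsn = InvalidXLogRecPtr;

/* inside libpqrcv_identify_system(), after fetching the result row:
 * column 2 of the IDENTIFY_SYSTEM reply is the server's WAL position */
uint32      hi, lo;
if (sscanf(PQgetvalue(res, 0, 2), "%X/%X", &hi, &lo) == 2)
    WalRcvIdentifySystemLsn = ((uint64) hi << 32) | lo;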

v4–v6 refinements (driven by Xuneng's pushback) shaped the details debated below: the wal_segment_size bound on when to wait, the wait event reported during the retry sleep, and the TAP test.

Remaining Disagreement

Xuneng remained uncomfortable with interval polling even at v7:

  1. Latency on quick catchup: up to one wal_retrieve_retry_interval of added reconnect delay.
  2. The wal_segment_size bound is debatable. Xuneng argued a multi-segment gap is a legitimate operational scenario: if the upstream is down for maintenance while the primary keeps archiving, the cascade legitimately advances many segments ahead via archive, and when the upstream returns the gap may be large. Marco's position is that handling this is a feature, not a bug fix — this patch's scope is strictly the sub-segment gap inherent to archive-granularity replay. Larger gaps fall through to normal archive fallback (see the bound sketched after this list).
  3. Test non-determinism: The TAP test asserts outcomes (wait event hit, no error in log window) but never verifies the actual gap size. On systems with small wal_segment_size or heavy background WAL, the test could accidentally trigger the multi-segment path and pass for the wrong reason.
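The bound at issue in point 2, sketched with hypothetical identifiers (this is the decision, not patch code):

/* only sub-segment gaps are waited out: they are the artifact of
 * archive-granularity replay this patch targets; anything larger is
 * treated as ordinary catchup */
if (startpoint <= upstreamFlush)
    start_streaming();              /* upstream can already serve us */
else if (startpoint - upstreamFlush < wal_segment_size)
    wait_before_retry();            /* sub-segment gap: poll again later */
else
    fall_back_to_archive();         /* multi-segment gap: out of scope */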

The polling-vs-event-driven concern is real: ideally the walreceiver would receive a notification when the upstream's flush advances past startpoint. But the replication protocol has no such primitive short of actually starting replication, so polling IDENTIFY_SYSTEM is the pragmatic compromise.

Architectural Implications

This thread is a nice case study in replication protocol asymmetries: archive recovery advances at whole-segment granularity while flush positions are record-granular, and an upstream's reported flush position means different things depending on whether a walreceiver is running.

The fix essentially teaches the walreceiver that its own read position can legitimately run ahead of what its upstream is able to serve, and makes it negotiate that gap explicitly via IDENTIFY_SYSTEM before sending START_REPLICATION.