Implement waiting for wal lsn replay: reloaded

First seen: 2024-11-27 04:08:51+00:00 · Messages: 148 · Participants: 13

Latest Update

2026-05-27 · claude-opus-4-6

Minimal Progress: Commit Intent and Minor Comment Cleanups

The three new messages contain no new technical arguments, no new patches with behavioral changes, and no position shifts:

  1. Korotkov signals commit intent (2025-05-25): States he will push the wait_for_slot_catchup() fix unless objections arise. This confirms the approach from the prior analysis is finalized.

  2. Xuneng proposes comment cleanup (2025-05-26): Notes that while reading 019_replslot_limit.pl, he (via Codex) found inconsistencies in comments. Asks Korotkov to fold in a small cleanup alongside the fix. No code logic changes.

  3. Xuneng follows up (2025-05-26): Updated the comment above wait_for_slot_catchup to reflect its actual usage. Pure documentation/comment refinement.

No new technical insights, no architectural changes, no new participants.

History (3 prior analyses)

2026-05-25 · claude-opus-4-6

No Substantive Technical Progress

The two new messages from Xuneng Zhou are simply:

  1. Agreement/LGTM on Korotkov's patch approach (calling wait_for_slot_catchup() directly in the test rather than introducing a wrapper function). No new technical arguments or alternative designs proposed.
  2. A minor correction to a copy-paste error in his previous message (wrong function name referenced).

No new patches, no new participants, no position shifts, no new technical insights beyond what was already covered in the prior analysis.


2026-05-22 · claude-opus-4-6

Resolution of the 019_replslot_limit.pl Race Condition

The thread reaches consensus on fixing the replication slot state race exposed by the faster WAIT FOR LSN-based wait_for_catchup().

Xuneng Zhou's Diagnosis and Fix Strategy

Xuneng confirms he was aware of the semantic change (WAIT FOR LSN returns when the standby reaches the target, whereas the old polling checked pg_stat_replication on the primary, which implicitly guaranteed feedback round-trip completion). He initially thought this weaker guarantee was harmless but now acknowledges it breaks 019_replslot_limit.pl.

His key architectural decision: fix the test, not wait_for_catchup(). His reasoning:

  • wait_for_catchup()'s natural semantics are standby-local (replay/write/flush reached target LSN).
  • The old primary-side polling was a side effect that happened to also confirm feedback processing — not an intentional guarantee.
  • 019_replslot_limit.pl has a specific need: it checks primary-side slot state, which depends on walsender processing standby feedback. This dependency should be made explicit in that test rather than burdening all callers with unnecessary round-trip waiting.

His patch introduces a wait_for_standby_and_slot_catchup() wrapper that: (1) waits for standby replay via WAIT FOR LSN, then (2) polls restart_lsn on the primary until it advances past the target.
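The two-step wait described above can be sketched in Python as a minimal illustration, not the actual Perl implementation. The helper names (parse_lsn, poll_until, wait_for_standby_then_slot) and the two callbacks are hypothetical stand-ins for the real cluster calls, and the sketch assumes that "advances past the target" means restart_lsn >= target.

```python
import time
from typing import Callable


def parse_lsn(text: str) -> int:
    """Convert a pg_lsn string such as '0/3000060' to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def poll_until(predicate: Callable[[], bool], timeout: float = 5.0,
               interval: float = 0.01) -> bool:
    """Call predicate repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def wait_for_standby_then_slot(wait_for_replay: Callable[[str], None],
                               read_restart_lsn: Callable[[], str],
                               target_lsn: str,
                               timeout: float = 5.0) -> bool:
    """Hypothetical wrapper mirroring the two steps in the text.

    Step 1: block until the standby has replayed target_lsn.
    Step 2: poll the primary until the slot's restart_lsn reaches it
            (assumption: 'reaches' means restart_lsn >= target).
    """
    wait_for_replay(target_lsn)
    target = parse_lsn(target_lsn)
    return poll_until(lambda: parse_lsn(read_restart_lsn()) >= target, timeout)
```

In a real test, wait_for_replay would issue WAIT FOR LSN on the standby and read_restart_lsn would query pg_replication_slots on the primary; here both are left as injected callbacks so the control flow can be exercised without a cluster.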

Korotkov's Simplification

Korotkov agrees with the strategy but proposes eliminating the wrapper function. Instead of wait_for_standby_and_slot_catchup(), the test should call $node->wait_for_slot_catchup() directly after the existing wait_for_catchup(). This avoids introducing a new abstraction layer and keeps the slot-specific synchronization visually explicit at the call site. He provides an attached patch implementing this approach.


2026-05-20 · claude-opus-4-6

New Bug: wait_for_catchup() with WAIT FOR LSN Causes Race in Replication Slot State Tests

Alexander Korotkov reports a new class of test failure caused by the WAIT FOR LSN-based wait_for_catchup() implementation, this time in 019_replslot_limit.pl, a test that checks that replication slot WAL availability states transition through extended → unreserved → lost.

Root Cause: WAIT FOR LSN Returns Too Early Relative to Slot Advancement

The fundamental issue is a semantic gap between WAL replay position and replication slot state updates. wait_for_catchup() using WAIT FOR LSN guarantees that the standby has replayed (or written/flushed) up to a given LSN, but the primary's restart_lsn for the replication slot is advanced asynchronously — it depends on the walreceiver sending a reply back to the walsender, and the walsender processing that reply to update the slot's restart_lsn.

The timeline of the race:

  1. Test generates WAL, calls wait_for_catchup() which returns as soon as the standby replays to the target LSN.
  2. Test immediately queries pg_replication_slots on the primary to check wal_status.
  3. But the walreceiver hasn't yet sent its reply (or the walsender hasn't processed it), so slot->data.restart_lsn hasn't advanced yet.
  4. GetWALAvailability() computes the slot state based on the stale restart_lsn, returning a state one step behind what the test expects (e.g., unreserved instead of extended, or lost instead of unreserved).
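The ordering problem in this timeline can be illustrated with a toy Python model. It is a deliberately simplified sketch: the Primary and Standby classes, the inbox queue, and the integer LSNs are all hypothetical, and the real walsender/walreceiver protocol carries more state than shown here.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class Primary:
    restart_lsn: int = 0  # toy stand-in for the slot's restart_lsn
    inbox: deque = field(default_factory=deque)  # standby replies not yet processed

    def process_one_reply(self) -> None:
        """Toy model of the walsender handling one standby reply."""
        if self.inbox:
            self.restart_lsn = max(self.restart_lsn, self.inbox.popleft())


@dataclass
class Standby:
    replay_lsn: int = 0

    def replay_to(self, lsn: int) -> None:
        """Toy model of WAL replay reaching a given LSN."""
        self.replay_lsn = lsn

    def send_reply(self, primary: Primary) -> None:
        """Toy model of the walreceiver reporting progress to the primary."""
        primary.inbox.append(self.replay_lsn)


def run(check_before_feedback: bool) -> Tuple[int, int]:
    """Return (standby replay LSN, primary restart_lsn) at the moment the test looks."""
    primary, standby = Primary(), Standby()
    target = 0x3000000
    standby.replay_to(target)  # a standby-local wait would return at this point
    if not check_before_feedback:
        standby.send_reply(primary)   # feedback travels to the primary
        primary.process_one_reply()   # walsender updates the slot
    return standby.replay_lsn, primary.restart_lsn
```

Calling run(check_before_feedback=True) models the failing order: the standby is at the target while the primary's restart_lsn is still at its initial value. With check_before_feedback=False, both positions agree before the test inspects the slot.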

Reproduction

Korotkov reproduced deterministically by injecting pg_usleep() delays:

  • 10ms in XLogWalRcvSendReply() (delays the standby's feedback to the primary)
  • 100ms in ProcessStandbyReplyMessage() (delays the primary's processing of feedback)

With these delays, the test fails on iterations 2, 1, and 7 out of 100. Without WAIT FOR LSN (reverting to the old polling-based wait_for_catchup()), 100 iterations pass.

Why the Old Implementation Was Immune

The previous wait_for_catchup() polled pg_stat_replication on the primary, so it could report catchup only after the walsender had processed the standby's reply, by which point the slot metadata had been updated as well. WAIT FOR LSN is faster: it returns the instant replay reaches the target on the standby, which exposes the gap between replay progress and slot metadata propagation.

Diagnostic Evidence

Korotkov's logging shows the key difference: in failed runs, targetSeg (22) < oldestSlotSeg (23), meaning the segment holding the slot's stale restart_lsn has already fallen behind the oldest segment still reserved for slots, making the state unreserved. In successful runs, targetSeg == oldestSlotSeg (both 23), yielding the expected extended state.
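The segment comparison behind these two outcomes can be modelled with a short Python sketch. This is a simplified approximation of the decision structure, not the C code of GetWALAvailability(): only targetSeg and oldestSlotSeg come from the logging above, while the other two parameters and all of their values are assumptions chosen for illustration.

```python
def wal_status(target_seg: int, oldest_slot_seg: int,
               oldest_seg_max_wal_size: int, oldest_seg: int) -> str:
    """Classify a slot from segment numbers (simplified model).

    target_seg              -- segment containing the slot's restart_lsn
    oldest_slot_seg         -- oldest segment still reserved for slots
    oldest_seg_max_wal_size -- oldest segment within max_wal_size (assumed input)
    oldest_seg              -- oldest segment file still on disk (assumed input)
    """
    if target_seg >= oldest_slot_seg:
        if target_seg >= oldest_seg_max_wal_size:
            return "reserved"
        return "extended"
    if target_seg >= oldest_seg:
        return "unreserved"
    return "lost"


# Values 22 and 23 are from the logging; 24 and 20 are illustrative assumptions.
failed_run = wal_status(22, 23, 24, 20)      # stale restart_lsn segment
successful_run = wal_status(23, 23, 24, 20)  # restart_lsn segment caught up
```

Under these assumed thresholds, the stale segment number falls through the first comparison and lands in the unreserved branch, while the caught-up segment number stays in the extended branch.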

Implications

This is a deeper problem than previous bugs — it reveals that wait_for_catchup() via WAIT FOR LSN provides a weaker guarantee than the test assumes. The test doesn't just need "standby has replayed to here"; it needs "primary's slot state reflects the standby's progress." This may require either:

  • A supplementary wait/check on the primary side for slot advancement
  • Test-specific adjustment to not rely on immediate slot state after catchup
  • A new wait mode or mechanism that confirms the round-trip (standby replay → feedback → walsender slot update)