2026-05-25 · claude-opus-4-6
No Substantive Technical Progress
The two new messages from Xuneng Zhou are simply:
- Agreement/LGTM on Korotkov's patch approach (calling wait_for_slot_catchup() directly in the test rather than introducing a wrapper function). No new technical arguments or alternative designs proposed.
- A minor correction to a copy-paste error in his previous message (wrong function name referenced).
No new patches, no new participants, no position shifts, no new technical insights beyond what was already covered in the prior analysis.
2026-05-22 · claude-opus-4-6
Resolution of the 019_replslot_limit.pl Race Condition
The thread reaches consensus on fixing the replication slot state race exposed by the faster WAIT FOR LSN-based wait_for_catchup().
Xuneng Zhou's Diagnosis and Fix Strategy
Xuneng confirms he was aware of the semantic change (WAIT FOR LSN returns when the standby reaches the target, whereas the old polling checked pg_stat_replication on the primary, which implicitly guaranteed feedback round-trip completion). He initially thought this weaker guarantee was harmless but now acknowledges it breaks 019_replslot_limit.pl.
His key architectural decision: fix the test, not wait_for_catchup(). His reasoning:
- wait_for_catchup()'s natural semantics are standby-local (replay/write/flush reached target LSN).
- The old primary-side polling was a side effect that happened to also confirm feedback processing, not an intentional guarantee.
- 019_replslot_limit.pl has a specific need: it checks primary-side slot state, which depends on walsender processing standby feedback. This dependency should be made explicit in that test rather than burdening all callers with unnecessary round-trip waiting.
His patch introduces a wait_for_standby_and_slot_catchup() wrapper that: (1) waits for standby replay via WAIT FOR LSN, then (2) polls restart_lsn on the primary until it advances past the target.
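The two-step wait described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: poll_until() and wait_for_standby_and_slot() are hypothetical names, the two getter callbacks stand in for real queries against the standby and the primary, and step 1 is modeled as polling even though the patch uses WAIT FOR LSN for it. The ">=" comparison is also an assumption about what "advances past the target" means.

```python
import time


def poll_until(predicate, timeout=5.0, interval=0.01):
    """Call predicate() repeatedly until it is true or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    # One last check so a condition met right at the deadline still counts.
    return predicate()


def wait_for_standby_and_slot(get_replay_lsn, get_restart_lsn, target_lsn,
                              timeout=5.0):
    """Hypothetical model of the wrapper: standby first, then the slot."""
    # Step 1: the standby has replayed up to the target (standby-local).
    if not poll_until(lambda: get_replay_lsn() >= target_lsn, timeout):
        return False
    # Step 2: the primary's slot restart_lsn has caught up as well, which
    # only happens after the standby's feedback has been processed.
    return poll_until(lambda: get_restart_lsn() >= target_lsn, timeout)


# Fake cluster state: replay is already done, but the slot only advances
# on the third time the primary is asked (feedback arrives late).
state = {"replay": 100, "restart": 0, "polls": 0}


def get_restart():
    state["polls"] += 1
    if state["polls"] >= 3:
        state["restart"] = 100
    return state["restart"]


ok = wait_for_standby_and_slot(lambda: state["replay"], get_restart, 100)
print(ok, state["polls"])  # True 3
```

In this model, returning after step 1 alone would leave the caller looking at restart = 0, which is the situation the test needs to rule out.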
Korotkov's Simplification
Korotkov agrees with the strategy but proposes eliminating the wrapper function. Instead of wait_for_standby_and_slot_catchup(), the test should call $node->wait_for_slot_catchup() directly after the existing wait_for_catchup(). This avoids introducing a new abstraction layer and keeps the slot-specific synchronization visually explicit at the call site. He provides an attached patch implementing this approach.
2026-05-20 · claude-opus-4-6
New Bug: wait_for_catchup() with WAIT FOR LSN Causes Race in Replication Slot State Tests
Alexander Korotkov reports a new class of test failure caused by the WAIT FOR LSN-based wait_for_catchup() implementation — this time in 019_replslot_limit.pl, a test that checks replication slot WAL availability states transition through extended → unreserved → lost.
Root Cause: WAIT FOR LSN Returns Too Early Relative to Slot Advancement
The fundamental issue is a semantic gap between WAL replay position and replication slot state updates. wait_for_catchup() using WAIT FOR LSN guarantees that the standby has replayed (or written/flushed) up to a given LSN, but the primary's restart_lsn for the replication slot is advanced asynchronously — it depends on the walreceiver sending a reply back to the walsender, and the walsender processing that reply to update the slot's restart_lsn.
The timeline of the race:
- Test generates WAL, calls wait_for_catchup(), which returns as soon as the standby replays to the target LSN.
- Test immediately queries pg_replication_slots on the primary to check wal_status.
- But the walreceiver hasn't yet sent its reply (or the walsender hasn't processed it), so slot->data.restart_lsn hasn't advanced yet.
- GetWALAvailability() computes the slot state based on the stale restart_lsn, returning a state one step behind what the test expects (e.g., unreserved instead of extended, or lost instead of unreserved).
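The ordering problem in this timeline can be replayed with a deliberately simplified, single-threaded model. Everything here is an assumption made for illustration: the Primary and Standby classes, the inbox queue standing in for the replication connection, and the idea that restart_lsn simply takes the value reported by the standby. It is not how the walsender or walreceiver are implemented.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Primary:
    restart_lsn: int = 0  # slot position as visible on the primary
    inbox: deque = field(default_factory=deque)  # replies not yet processed

    def process_replies(self):
        # Stands in for the walsender handling standby feedback.
        while self.inbox:
            self.restart_lsn = max(self.restart_lsn, self.inbox.popleft())


@dataclass
class Standby:
    replay_lsn: int = 0

    def replay_to(self, lsn):
        self.replay_lsn = max(self.replay_lsn, lsn)

    def send_reply(self, primary):
        # Stands in for the walreceiver reporting progress to the primary.
        primary.inbox.append(self.replay_lsn)


primary, standby = Primary(), Standby()
target = 100

# Step 1: replay reaches the target; a replay-only wait returns here.
standby.replay_to(target)
caught_up = standby.replay_lsn >= target

# Step 2: the test inspects the primary immediately and sees a stale slot.
stale = primary.restart_lsn

# Step 3: only now does the feedback arrive and get processed.
standby.send_reply(primary)
primary.process_replies()
fresh = primary.restart_lsn

print(caught_up, stale, fresh)  # True 0 100
```

The standby is fully caught up at step 2, yet the primary-side value is still the old one, so any state derived from it lags by one round trip.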
Reproduction
Korotkov reproduced deterministically by injecting pg_usleep() delays:
- 10ms in XLogWalRcvSendReply() (delays the standby's feedback to the primary)
- 100ms in ProcessStandbyReplyMessage() (delays the primary's processing of feedback)
With these delays, the test fails on iterations 2, 1, and 7 out of 100. Without WAIT FOR LSN (reverting to the old polling-based wait_for_catchup()), 100 iterations pass.
Why the Old Implementation Was Immune
The previous polling-based wait_for_catchup() likely had enough latency in its poll loop that by the time it confirmed catchup, the walreceiver reply had already propagated. WAIT FOR LSN is faster — it returns the instant replay reaches the target — which exposes this pre-existing race between replay progress and slot metadata propagation.
Diagnostic Evidence
Korotkov's logging shows the key difference: in failed runs, targetSeg (22) < oldestSlotSeg (23), meaning the slot's stale restart_lsn still points into a segment older than the oldest one the primary keeps reserved for slots, which makes the state unreserved. In successful runs, targetSeg == oldestSlotSeg (both 23), yielding the expected extended state.
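The segment comparison behind these two outcomes can be sketched as a small classifier. This is a simplified model of the decision order, not the GetWALAvailability() source: the function name wal_status() is hypothetical, and the values 25 (the max_wal_size boundary) and 20 (the oldest segment still on disk) are assumptions chosen only so that the logged values 22 and 23 fall into the states reported in the thread.

```python
def wal_status(target_seg, oldest_slot_seg, oldest_seg_max_wal_size,
               oldest_seg):
    """Classify a slot by the segment its restart_lsn points into.

    target_seg              segment containing the slot's restart_lsn
    oldest_slot_seg         oldest segment still reserved for slots
    oldest_seg_max_wal_size oldest segment within max_wal_size (assumed)
    oldest_seg              oldest segment still present on disk (assumed)
    """
    if target_seg >= oldest_slot_seg:
        # Still reserved: either within max_wal_size or held beyond it.
        if target_seg >= oldest_seg_max_wal_size:
            return "reserved"
        return "extended"
    if target_seg >= oldest_seg:
        # No longer reserved, but the file has not been removed yet.
        return "unreserved"
    return "lost"


# Failed run from the log: targetSeg 22 < oldestSlotSeg 23.
print(wal_status(22, 23, 25, 20))  # unreserved

# Successful run: targetSeg == oldestSlotSeg == 23.
print(wal_status(23, 23, 25, 20))  # extended
```

A one-segment difference in target_seg is enough to flip the result, which is why a restart_lsn that lags by a single feedback round trip changes what the test observes.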
Implications
This is a deeper problem than previous bugs — it reveals that wait_for_catchup() via WAIT FOR LSN provides a weaker guarantee than the test assumes. The test doesn't just need "standby has replayed to here"; it needs "primary's slot state reflects the standby's progress." This may require either:
- A supplementary wait/check on the primary side for slot advancement
- Test-specific adjustment to not rely on immediate slot state after catchup
- A new wait mode or mechanism that confirms the round-trip (standby replay → feedback → walsender slot update)