Exit walsender before confirming remote flush in logical replication

First seen: 2022-12-22 05:46:11+00:00 · Messages: 126 · Participants: 20

Latest Update

2026-06-04 · claude-opus-4-6

Refined Root Cause: Authentication-Phase Shutdown, Not `ready_to_display` Race

The prior analysis hypothesized the race was about ready_to_display being set before the walreceiver entered its main loop. The new messages refine this to a more specific failure mode: the walsender is killed during authentication, before it ever begins replication.

Kuroda's Hypothesis (Confirmed by Korotkov)

Kuroda identified that the publisher log shows:

FATAL: canceling authentication due to timeout

This means the walsender process was still in PerformAuthentication() when the shutdown signal arrived. The test assumes that by the time the walreceiver PID is visible, the walsender has completed authentication and entered replication streaming. That assumption is violated under load or with any delay in the authentication path.

Korotkov confirmed this by inserting a 500ms pg_usleep() inside ClientAuthentication() (in auth.c) specifically guarded by am_walsender. This reliably reproduces the test failure with the exact same log signature: the walsender gets terminated during authentication, the connection never reaches the streaming state, and the expected "stalled replication" condition is never established.
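The timing of this failure can be illustrated with a toy model. The sketch below is not PostgreSQL code: it uses a Python thread as a stand-in for the walsender, a timed wait as a stand-in for the delayed authentication, and an event as a stand-in for the shutdown signal. The function name run_scenario, its parameters, and the outcome strings are all hypothetical and chosen only for illustration.

```python
import threading


def run_scenario(auth_delay_s: float, wait_for_streaming: bool) -> str:
    """Toy model of the race; returns what the simulated walsender was doing at shutdown.

    auth_delay_s stands in for the artificial delay in the authentication path.
    wait_for_streaming models whether the test synchronizes before shutting down.
    """
    streaming = threading.Event()  # set once the simulated walsender is streaming
    shutdown = threading.Event()   # set when the simulated publisher is stopped
    outcome = []

    def walsender() -> None:
        # Simulated authentication: a timed wait that a shutdown can interrupt.
        if shutdown.wait(timeout=auth_delay_s):
            outcome.append("terminated during authentication")
            return
        # Authentication finished; replication is now considered started.
        streaming.set()
        shutdown.wait()
        outcome.append("streaming at shutdown")

    worker = threading.Thread(target=walsender)
    worker.start()

    # Without synchronization the test stops the publisher right away,
    # which models seeing the walreceiver PID and proceeding immediately.
    if wait_for_streaming:
        streaming.wait()
    shutdown.set()
    worker.join()
    return outcome[0]


if __name__ == "__main__":
    print(run_scenario(0.5, wait_for_streaming=False))
    print(run_scenario(0.5, wait_for_streaming=True))
```

In this model, stopping the publisher without synchronization always catches the simulated walsender in its authentication wait, so the stalled-replication state the test needs is never reached. Waiting for the streaming event first removes that window.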

Korotkov's Correction

Korotkov acknowledged that his earlier investigation had pointed to the wrong log file. The subscriber-side synchronized_standby_slots error was a red herring. The actual failure is entirely publisher-side: the walsender is shut down before completing authentication.

Silitskiy's Proposed Fix

Silitskiy proposes a test-level fix: add a check that ensures the walreceiver on the standby is fully initialized and replication has actually started before the test proceeds to shut down the publisher. This is a synchronization fix in the test harness, not a code change to the feature itself. Silitskiy reports that with the fix applied, the reproducer no longer triggers the failure.
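The thread summary does not give the exact check Silitskiy added, so the following is only a sketch of the general shape such a synchronization point could take. It assumes the condition is "pg_stat_replication on the publisher reports a walsender in state streaming", which is a guess at a suitable condition rather than the actual patch. The helpers poll_until and replication_is_streaming, and the injected run_query callable, are hypothetical; the real test is a Perl TAP script, where PostgreSQL::Test::Cluster's poll_query_until plays a similar role.

```python
import time
from typing import Callable

# Assumed condition: at least one walsender has reached the streaming state.
STREAMING_QUERY = (
    "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming'"
)


def replication_is_streaming(run_query: Callable[[str], str]) -> bool:
    """Return True if the publisher reports at least one streaming walsender.

    run_query is a hypothetical callable that executes SQL on the publisher
    and returns the single result value as text.
    """
    return int(run_query(STREAMING_QUERY)) > 0


def poll_until(
    condition: Callable[[], bool],
    timeout_s: float = 180.0,
    interval_s: float = 0.1,
) -> bool:
    """Re-evaluate condition until it holds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


if __name__ == "__main__":
    # Fake publisher: reports no streaming walsender twice, then one.
    answers = iter(["0", "0", "1"])
    ready = poll_until(
        lambda: replication_is_streaming(lambda sql: next(answers)),
        timeout_s=5.0,
        interval_s=0.01,
    )
    print("safe to stop publisher:", ready)
```

Only once the poll succeeds would the test go on to stop the publisher. If it times out, the test should fail at that point instead of proceeding into the race.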

This confirms the prior analysis's conclusion that the bug is in test infrastructure, not in wal_sender_shutdown_timeout logic itself.

History (1 prior analysis)

2026-06-01 · claude-opus-4-6

New Buildfarm Failure: Test Race Condition in 038_walsnd_shutdown_timeout

Alexander Korotkov reports a new buildfarm failure on tamandua (FreeBSD) in the test 038_walsnd_shutdown_timeout.pl, distinct from the previous post-commit bugs (sleep-time miscalculation and blocking pq_flush).

Root Cause: `ready_to_display` Race

The test checks that walsender exits due to wal_sender_shutdown_timeout when replication is stalled. However, it relies on the walreceiver PID being available (test step 3 passes: "have walreceiver pid"), then immediately stops the publisher expecting the timeout to fire.

The problem: walrcv->ready_to_display is set to true before WalReceiverMain() actually establishes the streaming connection and enters its main loop. This means the test can observe the walreceiver PID and proceed to shut down the publisher before the walreceiver has connected. The walsender never gets into a stalled-replication state because the connection was never fully established.

Korotkov confirmed this with a minimal reproducer: inserting a 500ms pg_usleep() between ready_to_display = true and the actual connection establishment reliably triggers the failure.

Log Evidence

The subscriber log shows the apply worker received a synchronized_standby_slots error and exited, then the restarted apply worker couldn't connect because the publisher was already shutting down. The expected stall condition (walsender blocked waiting for flush confirmation) was never reached.

Status

Korotkov asks Fujii Masao to investigate and fix the test. This is a test-infrastructure bug (race in what ready_to_display actually signals) rather than a logic bug in the wal_sender_shutdown_timeout feature itself. A proper fix would likely either:

Wait for the walreceiver to actually be streaming (not just have a PID), or
Use a more robust synchronization point in the test before triggering shutdown.