Refined Root Cause: Authentication-Phase Shutdown, Not ready_to_display Race
The prior analysis hypothesized the race was about ready_to_display being set before the walreceiver entered its main loop. The new messages refine this to a more specific failure mode: the walsender is killed during authentication, before it ever begins replication.
Kuroda's Hypothesis (Confirmed by Korotkov)
Kuroda identified that the publisher log shows:
FATAL: canceling authentication due to timeout
This means the walsender process was still in PerformAuthentication() when the shutdown signal arrived. The test assumes that by the time the walreceiver PID is visible, the walsender has completed authentication and entered replication streaming. That assumption is violated under load or with any delay in the authentication path.
Korotkov confirmed this by inserting a 500 ms pg_usleep() inside ClientAuthentication() (in auth.c), guarded by am_walsender so that only walsender processes are delayed. This reliably reproduces the test failure with the exact same log signature: the walsender is terminated during authentication, the connection never reaches the streaming state, and the expected "stalled replication" condition is never established.
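The timing argument can be made concrete with a toy model. The sketch below is not PostgreSQL code: the function walsender_phase_at_shutdown and all durations other than the 500 ms delay are hypothetical, chosen only to illustrate how an authentication delay moves the shutdown into the authentication phase.

```python
def walsender_phase_at_shutdown(auth_ms: int, shutdown_at_ms: int) -> str:
    """Return the phase a modelled walsender is in when shutdown arrives.

    Timeline, measured from the moment the connection is accepted:
      [0, auth_ms)   -> authenticating
      [auth_ms, ...) -> streaming
    """
    if shutdown_at_ms < auth_ms:
        return "authenticating"
    return "streaming"


# Hypothetical numbers: the test shuts the publisher down 100 ms after
# the connection is accepted, and authentication normally takes 5 ms.
SHUTDOWN_AT_MS = 100
NORMAL_AUTH_MS = 5
INJECTED_DELAY_MS = 500  # the delay used in the reproducer

# Fast authentication: the walsender is already streaming at shutdown.
fast = walsender_phase_at_shutdown(NORMAL_AUTH_MS, SHUTDOWN_AT_MS)

# Delayed authentication: the shutdown now arrives mid-authentication.
slow = walsender_phase_at_shutdown(NORMAL_AUTH_MS + INJECTED_DELAY_MS, SHUTDOWN_AT_MS)

print(fast, slow)
```

In this model the test only behaves as intended when the shutdown lands in the streaming interval, which is exactly the ordering the test does not enforce.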
Korotkov's Correction
Korotkov acknowledged he had initially pointed to the wrong log file in his earlier investigation. The subscriber-side synchronized_standby_slots error was a red herring. The actual failure is entirely publisher-side: the walsender is shut down before completing authentication.
Silitskiy's Proposed Fix
Silitskiy proposes a test-level fix: add a check that ensures the walreceiver on the standby is fully initialized and replication has actually started before the test proceeds to shut down the publisher. This is a synchronization fix in the test harness, not a code change to the feature itself. Silitskiy reports that with this fix the test no longer fails under the reproducer.
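The shape of such a synchronization check can be sketched as a poll-until-ready loop. This is an illustration under assumptions, not Silitskiy's patch, which is not reproduced here: the helper wait_for_streaming and its get_state callback are hypothetical, and the choice of condition (the publisher's pg_stat_replication.state reaching 'streaming') is one plausible way to detect that replication has started.

```python
import time
from typing import Callable, Optional


def wait_for_streaming(
    get_state: Callable[[], Optional[str]],
    timeout_s: float = 30.0,
    interval_s: float = 0.1,
) -> bool:
    """Poll get_state() until it returns 'streaming' or the timeout expires.

    get_state is a hypothetical callback. In a real test it might query
    the state column of pg_stat_replication on the publisher and return
    the value, or None while no walsender row exists yet.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if get_state() == "streaming":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Usage with a stand-in for the database query: the walsender shows up
# late and passes through earlier states before it starts streaming.
states = iter([None, "startup", "catchup", "streaming"])
ready = wait_for_streaming(lambda: next(states), timeout_s=5.0, interval_s=0.01)
print(ready)
```

Only after such a check returns true would the test go on to shut down the publisher. Gating on an observed state rather than on the mere presence of a walreceiver PID is what closes the window in which the walsender is still authenticating.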
This confirms the prior analysis's conclusion that the bug is in test infrastructure, not in wal_sender_shutdown_timeout logic itself.