Exit walsender before confirming remote flush in logical replication
The Core Problem
PostgreSQL's walsender has historically refused to shut down until it both (a) drained all pending WAL from its output side and (b) received confirmation that the remote peer flushed all previously sent WAL (sentPtr == replicatedPtr && !pq_is_send_pending()). This behavior was installed by commit 985bd7d (2012) to enable clean switchover in physical replication: if the old primary has fully drained to the standby before stopping, promoting the standby and re-attaching the old primary as a new standby is guaranteed to succeed.
This architectural assumption makes sense for physical replication, but is actively harmful for logical replication for several reasons:
- No equivalent switchover semantics. Logical replication does not support role reversal publisher↔subscriber; a subscriber cannot request WAL starting from an arbitrary LSN, so the "drain-before-shutdown" invariant buys nothing operationally.
- No independent walreceiver. In physical replication, a dedicated walreceiver process flushes WAL as fast as the network allows, so send-buffer fullness is rare. In logical replication, the apply worker both receives and applies changes in the same process. If the apply worker blocks on a heavyweight lock held by a local backend, TCP backpressure propagates into the walsender, pq_is_send_pending() stays true, and the publisher cannot shut down.
- Time-delayed logical replication amplifies this. The companion thread (CF 3581) proposes min_apply_delay. With a large delay, the apply worker is intentionally asleep holding pending WAL, so the publisher becomes effectively impossible to shut down.
Kuroda demonstrated the hang with a minimal reproducer: concurrent INSERT ... generate_series(1,5000) on both nodes causes pg_ctl stop to never complete. Two distinct stall traces were identified:
- Type (i): walsender reaches WalSndDone() but sentPtr != replicatedPtr; all data has been sent, just not confirmed flushed.
- Type (ii): walsender stuck earlier in ProcessPendingWrites() because the send buffer backed up while still decoding a large transaction; it never reaches WalSndDone() at all.
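The legacy exit condition and the two stall types can be sketched as a toy model. This is not walsender code: the struct, enum, and function names below are hypothetical, and only the condition sentPtr == replicatedPtr && !pq_is_send_pending() comes from the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's WAL position type (assumption for this sketch). */
typedef uint64_t XLogRecPtr;

/* Hypothetical snapshot of the state the legacy shutdown check looks at. */
typedef struct
{
    XLogRecPtr sent_ptr;        /* models sentPtr */
    XLogRecPtr replicated_ptr;  /* models replicatedPtr */
    bool       send_pending;    /* models pq_is_send_pending() */
} SenderState;

typedef enum
{
    STALL_NONE,               /* legacy condition holds, walsender may exit */
    STALL_UNCONFIRMED_FLUSH,  /* type (i): sent, but flush not confirmed */
    STALL_SEND_BACKLOG        /* type (ii): output buffer still backed up */
} StallKind;

/* Legacy rule: exit only when output is drained AND the peer confirmed flush. */
bool
legacy_can_exit(const SenderState *s)
{
    return s->sent_ptr == s->replicated_ptr && !s->send_pending;
}

/* Classify why a shutdown would hang under the legacy rule. */
StallKind
classify_stall(const SenderState *s)
{
    if (s->send_pending)
        return STALL_SEND_BACKLOG;
    if (s->sent_ptr != s->replicated_ptr)
        return STALL_UNCONFIRMED_FLUSH;
    return STALL_NONE;
}
```

In this model, a state with send_pending set is always reported as type (ii), mirroring the observation that such a walsender never gets as far as the flush-confirmation check.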
Evolution of the Design
The design went through four distinct phases driven by committer pushback:
Phase 1: Unconditional relaxation (rejected)
Kuroda's initial patch simply dropped condition (b) for logical walsenders. Horiguchi worried this altered long-standing behavior users might rely on. Amit Kapila asked whether condition (a) (pending send) also needed to be dropped — Kuroda showed it did, because of the Type (ii) stall.
Phase 2: Shutdown-mode awareness (rejected by Andres)
Kuroda proposed teaching walsenders to distinguish smart vs fast shutdown, exiting immediately only in fast mode. Andres Freund's intervention was decisive: "Smart shutdown is practically unusable. I don't think it makes sense to tie behavior of walsender to it in any way." He also raised the deeper objection that unconditional relaxation would make it impossible to reliably decommission a logical primary — autovacuum, bgwriter, and checkpointer keep generating WAL, so without the drain-guarantee, operators can never know they've captured the tail.
Phase 3: START_REPLICATION option (partially accepted, then abandoned)
Andres proposed a per-connection option in the START_REPLICATION command: the subscriber (which knows whether it is time-delayed) opts into the new behavior. This keeps the default unchanged and matches the "subscriber knows its own semantics" model. Kuroda produced multiple patch versions extending the replication grammar with SHUTDOWN_MODE { 'wait_flush' | 'immediate' }. The thread went dormant in Feb 2023.
Phase 4 (2025–2026): Publisher-side GUC with timeout
Silitskiy revived the thread with a fundamentally different architectural choice: put the decision on the publisher, via a GUC, not the subscriber. His reasoning, echoing synchronous_commit/synchronous_standby_names: the primary decides who it waits for. Additionally, a subscriber-side option is operationally dangerous because a single mis-configured subscriber can block the whole publisher.
Fujii Masao refined this with the insight that PGC_USERSET allows per-role/per-connection override via primary_conninfo or the options=-c ... mechanism in CREATE SUBSCRIPTION CONNECTION, giving per-connection granularity without protocol changes.
Ronan Dunklau then made the pivotal suggestion that a boolean mode is too coarse — operators want a bounded wait, not "infinite or zero." This led to the final design: wal_sender_shutdown_timeout (integer, milliseconds), default -1 (legacy behavior), 0 for immediate, positive for bounded wait. This is the form Alexander Korotkov committed.
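The three value ranges of wal_sender_shutdown_timeout can be illustrated with a small decision function. This is a minimal sketch of the semantics as described above, not the committed code; the function name and parameters are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a walsender may exit during shutdown.
 *
 * timeout_ms    models wal_sender_shutdown_timeout (-1, 0, or positive)
 * elapsed_ms    time since the shutdown request was first observed
 * fully_drained true when all WAL is sent and confirmed flushed
 */
bool
shutdown_may_exit(int timeout_ms, int64_t elapsed_ms, bool fully_drained)
{
    if (fully_drained)
        return true;            /* nothing left to wait for */
    if (timeout_ms < 0)
        return false;           /* -1: legacy behavior, wait indefinitely */
    return elapsed_ms >= timeout_ms;    /* 0: at once; >0: bounded wait */
}
```

With timeout_ms set to 0 the function returns true on the first check, and with -1 it only returns true once the peer has caught up, which matches the legacy default.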
The Commit and Its Aftermath
Fujii Masao committed the feature (a8f45dee917) just before feature freeze. Two follow-up bugs surfaced almost immediately:
Bug 1: Sleep-time miscalculation
Fujii found the timeout was ignored if wal_sender_timeout was large: WalSndComputeSleeptime() was called before shutdown_request_timestamp was set, so the walsender would sleep for up to wal_sender_timeout/2. Vitaly Davydov diagnosed it correctly. Fix: ensure WalSndCheckShutdownTimeout() runs first to stamp shutdown_request_timestamp, and add a missing call inside WalSndWaitForWal() (found by Korotkov).
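The ordering problem can be reproduced in a simplified model. The sketch below is an assumption-laden illustration, not the PostgreSQL source: the struct and function names are hypothetical, the base sleep is taken to be wal_sender_timeout/2 as described above, and details such as a disabled wal_sender_timeout are ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the walsender fields involved in the bug. */
typedef struct
{
    bool    shutdown_stamped;       /* has the request time been recorded? */
    int64_t shutdown_request_ms;    /* models shutdown_request_timestamp */
    int     wal_sender_timeout_ms;  /* models wal_sender_timeout */
    int     shutdown_timeout_ms;    /* models wal_sender_shutdown_timeout */
} SenderModel;

/* Record the shutdown request time once; later calls are no-ops. */
void
stamp_shutdown_request(SenderModel *s, int64_t now_ms)
{
    if (!s->shutdown_stamped)
    {
        s->shutdown_stamped = true;
        s->shutdown_request_ms = now_ms;
    }
}

/* Sleep time: the base interval, capped by the remaining shutdown timeout. */
int64_t
compute_sleep_ms(const SenderModel *s, int64_t now_ms)
{
    int64_t sleep_ms = s->wal_sender_timeout_ms / 2;

    if (s->shutdown_timeout_ms >= 0 && s->shutdown_stamped)
    {
        int64_t remaining =
            s->shutdown_request_ms + s->shutdown_timeout_ms - now_ms;

        if (remaining < 0)
            remaining = 0;
        if (remaining < sleep_ms)
            sleep_ms = remaining;
    }
    return sleep_ms;
}
```

Calling compute_sleep_ms() before stamp_shutdown_request() yields the uncapped base interval, which is the shape of the reported bug; stamping first lets the cap take effect.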
Bug 2: Blocking pq_flush() in WalSndDone()
CI failures on FreeBSD revealed that WalSndDone() calls pq_flush() (blocking) even after timeout logic is in place. On FreeBSD, Unix-domain sockets write directly into the peer's receive buffer (per the man page), so the "send buffer can't be full here" assumption breaks. Evan Li pinpointed this. Fujii's fix introduced EndCommandExtended() (non-blocking CommandComplete queue via pq_putmessage_noblock) and replaced pq_flush() with a local loop that calls pq_flush_if_writable() while continuing to honor wal_sender_shutdown_timeout and wal_sender_timeout. Evan initially warned that using the generic ProcessPendingWrites() would mix reply processing in at the wrong time; Fujii inlined a narrower loop in response.
Key Technical Insights
- The 985bd7d invariant is physical-only. The thread establishes that clean-switchover drain semantics are meaningless in logical replication (no LSN-based resume from subscriber) and are the root cause of operational pain. Dilip Bapat and Horiguchi confirm the c6c3334 two-phase walsender shutdown (2017) is orthogonal: it prevents PANIC from WAL-generating walsenders during the shutdown checkpoint, not switchover correctness.
- Logical replication conflates transport and apply. Unlike physical replication's separate walreceiver, the logical apply worker applies synchronously. Any apply-side lock wait becomes publisher-side TCP backpressure. Amit Kapila correctly noted this is architectural: "Maybe we have assumed that the decoded WALs are consumed in as short time."
- Publisher-side control is better than subscriber-side. Silitskiy's argument (that a single mis-configured subscriber can block the whole publisher) flipped the original Andres design. Per-role/per-conninfo GUC overrides recover per-connection granularity without replication-protocol changes, avoiding version-compatibility footguns Paquier warned about.
- Timeout > boolean. Dunklau's suggestion reframes the problem: operators want bounded SLA, not a binary choice. This also neatly handles the pg_upgrade case where one wants to give replication a chance to drain before aborting.
- sentPtr semantics are subtle. Kuroda's early attempt (craft a feedback with begin_data.final_lsn) failed because sentPtr is "the next WAL location to send," not what has been sent, and COMMIT, not BEGIN, carries the end-LSN. This dead end surfaced important invariants.
- Blocking primitives in the shutdown path are a systemic hazard. The FreeBSD post-commit bug exposed that EndCommand() uses blocking pq_putmessage, and pq_flush() blocks on Unix-domain sockets even when the TCP-send-buffer heuristic suggests safety. This motivated the EndCommandExtended() API addition.
Participant Weight
- Andres Freund (committer, replication expert): shaped the design twice — killed the smart-shutdown-coupling idea, and pushed for a per-connection opt-in rather than global behavior change. His post-commit CI failure report triggered the final hardening.
- Amit Kapila (committer, logical replication maintainer): drove the original design review, summarized options, enforced rigor on "why do we need drain at all" questions.
- Fujii Masao (committer): ultimately committed the feature, refined the GUC semantics (PGC_USERSET, per-conninfo), and authored the post-commit blocking-I/O fixes.
- Alexander Korotkov (committer): shepherded the 2026 revival, committed the timeout-based design.
- Michael Paquier (committer): raised version-compatibility concerns about protocol-level changes, which the GUC-based design sidesteps.
- Kyotaro Horiguchi: domain-expert review, especially around the historical intent of 985bd7d and interaction with synchronous_commit.
- Hayato Kuroda: original author; carried the patch through phases 1–3.
- Andrey Silitskiy / Vitaly Davydov (Postgres Pro): revived and redesigned the feature into its final publisher-GUC form.
- Evan Li: diagnosed the post-commit FreeBSD/Unix-socket blocking issue.
- Ronan Dunklau: proposed the decisive timeout-vs-boolean reframing.
- Greg Sabino Mullane: operational motivation (Patroni failover SLA).
Implementation Summary (as committed)
- New GUC wal_sender_shutdown_timeout (PGC_USERSET, integer ms, default -1).
- Applies to both physical and logical walsenders; can be overridden per connection via primary_conninfo / CONNECTION 'options=-c ...' or per role via ALTER ROLE.
- WalSndCheckShutdownTimeout() stamps shutdown_request_timestamp on first post-shutdown entry; walsender sleep-time is bounded by the remaining timeout.
- WalSndDoneImmediate() path skips the sentPtr == replicatedPtr && !pq_is_send_pending() check and exits after a best-effort pq_flush_if_writable().
- Post-commit fix: EndCommandExtended() allows non-blocking CommandComplete queueing; WalSndDone() now has a local drain loop that respects both wal_sender_shutdown_timeout and wal_sender_timeout instead of calling blocking pq_flush().