Fix pg_stat_wal_receiver to show CONNECTING status

First seen: 2026-05-19 05:55:14+00:00 · Messages: 3 · Participants: 2

Latest Update

2026-05-20 · claude-opus-4-6

Core Problem

Commit a36164e74 introduced a WALRCV_CONNECTING state to the WAL receiver subsystem, explicitly motivated by monitoring use cases, particularly in high-latency environments where connection establishment takes significant time. However, the implementation contradicts that goal: the pg_stat_wal_receiver view is gated by WalRcv->ready_to_display, which is set to true only after the connection is established. As a result, the very state that was added for observability (connecting) is never observable through the monitoring view.

The bug is straightforward: when ready_to_display is false and the WAL receiver process exists (pid != 0), the SQL function pg_stat_get_wal_receiver() returns NULL, producing an empty result set. This defeats the entire purpose of the CONNECTING state.
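
The effect of the combined gate can be illustrated with a minimal C model. This is a sketch, not the PostgreSQL source: DemoState, DemoWalRcv, and demo_row_visible_original() are hypothetical stand-ins, and only the condition itself (pid == 0 || !ready_to_display) is taken from the thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the receiver states; not the real enum. */
typedef enum
{
    DEMO_STOPPED,
    DEMO_CONNECTING,
    DEMO_STREAMING
} DemoState;

/* Hypothetical, heavily reduced model of the shared WalRcv data. */
typedef struct
{
    int       pid;              /* 0 means no WAL receiver process */
    bool      ready_to_display; /* set only once the connection is up */
    DemoState state;
} DemoWalRcv;

/*
 * Models the original gate: returns true where the view would produce
 * a row, false where the SQL function would return NULL.
 */
bool
demo_row_visible_original(const DemoWalRcv *rcv)
{
    if (rcv->pid == 0 || !rcv->ready_to_display)
        return false;
    return true;
}
```

With pid = 1234, state = DEMO_CONNECTING, and ready_to_display = false, the function returns false, so in this model a receiver that is still connecting never produces a row.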

Architectural Context

The WAL receiver (walreceiver.c) manages replication connections from standby to primary. Its shared memory structure (WalRcv) is protected by a spinlock and contains connection metadata (host, port, conninfo), streaming state (LSNs, timelines), and timing information (last message timestamps). The ready_to_display flag was introduced as a gate to prevent exposing partially-initialized or stale data in the monitoring view.
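
Readers of a spinlock-protected structure typically copy what they need while holding the lock and do all further work on the private copy. The following is a minimal sketch of that pattern, assuming a C11 atomic_flag as a stand-in for the spinlock; every name and field here (DemoSharedRcv, DemoSnapshot, demo_take_snapshot, and so on) is hypothetical and chosen only to mirror the categories of data listed above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_HOST_LEN 64

/* Hypothetical reduced model of the shared structure. */
typedef struct
{
    atomic_flag mutex;                      /* stand-in for the spinlock */
    int         pid;
    bool        ready_to_display;
    char        sender_host[DEMO_HOST_LEN]; /* connection metadata */
    int         sender_port;
    long long   last_msg_time_us;           /* timing information */
} DemoSharedRcv;

/* Private copy taken by a reader; carries no lock. */
typedef struct
{
    int       pid;
    bool      ready_to_display;
    char      sender_host[DEMO_HOST_LEN];
    int       sender_port;
    long long last_msg_time_us;
} DemoSnapshot;

void
demo_shared_init(DemoSharedRcv *s)
{
    atomic_flag_clear(&s->mutex);
    s->pid = 0;
    s->ready_to_display = false;
    memset(s->sender_host, 0, sizeof(s->sender_host));
    s->sender_port = 0;
    s->last_msg_time_us = 0;
}

void
demo_lock(DemoSharedRcv *s)
{
    while (atomic_flag_test_and_set(&s->mutex))
    {
        /* spin until the holder clears the flag */
    }
}

void
demo_unlock(DemoSharedRcv *s)
{
    atomic_flag_clear(&s->mutex);
}

/* Copy every field under the lock, then release it before returning. */
DemoSnapshot
demo_take_snapshot(DemoSharedRcv *s)
{
    DemoSnapshot snap;

    demo_lock(s);
    snap.pid = s->pid;
    snap.ready_to_display = s->ready_to_display;
    memcpy(snap.sender_host, s->sender_host, sizeof(snap.sender_host));
    snap.sender_port = s->sender_port;
    snap.last_msg_time_us = s->last_msg_time_us;
    demo_unlock(s);

    return snap;
}
```

Because the snapshot is consistent at the moment it is taken, any decision about what to display (including the ready_to_display check) can be made on the copy without holding the lock.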

The tension here is between two valid concerns:

  1. Data integrity in the view: Don't show stale/misleading connection metadata from a previous connection attempt when the receiver is in an intermediate state.
  2. Observability: The CONNECTING state exists specifically to be monitored.

Proposed Solutions

v1 (Evan Li's approach): Relax the gate in the SQL function

The patch modifies pg_stat_get_wal_receiver() to only return NULL when pid == 0 (no WAL receiver process exists). When ready_to_display is false but a process exists, it returns a tuple with only PID and status populated, leaving all other columns NULL.

-	if (pid == 0 || !ready_to_display)
+	if (pid == 0)
 		PG_RETURN_NULL();

Tradeoff: Simple change, minimal code surface. All connection-related fields are NULL during CONNECTING, which accurately represents the state: no connection exists yet, so no connection metadata is meaningful.
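
The behavior described for v1 can be sketched as a small row builder. This is an illustrative model under stated assumptions, not the patch itself: DemoV1Snapshot, DemoV1Row, and demo_build_row_v1() are hypothetical, a NULL pointer or an isnull flag stands in for SQL NULL, and only two connection columns are modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical private copy of the shared state, taken under the lock. */
typedef struct
{
    int         pid;
    bool        ready_to_display;
    const char *status;
    const char *sender_host;    /* may hold stale data while connecting */
    int         sender_port;
} DemoV1Snapshot;

/* Hypothetical result row; produced == false models PG_RETURN_NULL(). */
typedef struct
{
    bool        produced;
    int         pid;
    const char *status;
    const char *sender_host;        /* NULL pointer models SQL NULL */
    bool        sender_port_isnull;
    int         sender_port;
} DemoV1Row;

DemoV1Row
demo_build_row_v1(const DemoV1Snapshot *snap)
{
    DemoV1Row row = {0};

    row.sender_port_isnull = true;

    /* v1 gate: only the absence of a process suppresses the row. */
    if (snap->pid == 0)
        return row;

    /* PID and status describe the process, so they are always shown. */
    row.produced = true;
    row.pid = snap->pid;
    row.status = snap->status;

    /* Connection columns stay NULL until the flag says they are valid. */
    if (snap->ready_to_display)
    {
        row.sender_host = snap->sender_host;
        row.sender_port = snap->sender_port;
        row.sender_port_isnull = false;
    }

    return row;
}
```

In this model the flag is still consulted, but only for the connection columns, so leftover values in the snapshot are not exposed while the status reads connecting.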

v2 (Michael Paquier's approach): Split WAL receiver initialization into two spinlock acquisitions

The approach restructures the WAL receiver startup sequence:

  1. Before walrcv_connect(): Acquire spinlock, reset all connection-related fields, set ready_to_display = true, release spinlock. This makes the CONNECTING state visible immediately.
  2. After walrcv_connect(): Acquire spinlock again, fill in connection-related fields (host, port, conninfo).

Tradeoff: Keeps ready_to_display as the single authoritative gate in the SQL function (preserving its role as an early-exit condition), but requires an extra spinlock acquisition. The latency of walrcv_connect() dwarfs the spinlock cost, so performance is not a concern.
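
The two-step sequence can be sketched as follows. This is a simplified model rather than the v2 patch: DemoV2Shared and the demo_v2_* functions are hypothetical, a C11 atomic_flag stands in for the spinlock, and the connection attempt itself is assumed to happen between the two calls.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_V2_HOST_LEN 64

/* Hypothetical reduced model of the shared state. */
typedef struct
{
    atomic_flag mutex;                          /* stand-in for the spinlock */
    bool        ready_to_display;
    char        sender_host[DEMO_V2_HOST_LEN];
    int         sender_port;
} DemoV2Shared;

void
demo_v2_init(DemoV2Shared *s)
{
    atomic_flag_clear(&s->mutex);
    s->ready_to_display = false;
    memset(s->sender_host, 0, sizeof(s->sender_host));
    s->sender_port = 0;
}

static void
demo_v2_lock(DemoV2Shared *s)
{
    while (atomic_flag_test_and_set(&s->mutex))
    {
        /* spin */
    }
}

static void
demo_v2_unlock(DemoV2Shared *s)
{
    atomic_flag_clear(&s->mutex);
}

/* Step 1, before connecting: clear connection fields, then expose the row. */
void
demo_v2_before_connect(DemoV2Shared *s)
{
    demo_v2_lock(s);
    memset(s->sender_host, 0, sizeof(s->sender_host));
    s->sender_port = 0;
    s->ready_to_display = true;
    demo_v2_unlock(s);
}

/* Step 2, after connecting: fill in the now-known connection fields. */
void
demo_v2_after_connect(DemoV2Shared *s, const char *host, int port)
{
    demo_v2_lock(s);
    strncpy(s->sender_host, host, sizeof(s->sender_host) - 1);
    s->sender_host[sizeof(s->sender_host) - 1] = '\0';
    s->sender_port = port;
    demo_v2_unlock(s);
}
```

Between the two calls a reader sees ready_to_display = true with empty connection fields, which is how this design makes the connecting state visible without changing the gate in the SQL function.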

Critical Flaw in v2

Evan Li identified that v2 has a data correctness issue: because ready_to_display is set to true before field cleanup is complete, stale timestamp fields from the previous WAL receiver process lifecycle leak through.

Specifically, the affected timestamp fields are populated with the standby server start time rather than actual message timestamps, because the WalRcv shared memory retains values from initialization or a prior session. Each standby restart moves these timestamps to the new start time, even though no WAL messages have been exchanged with a (fake) primary.

This is a subtle but important correctness issue: showing timestamps implies communication occurred when none did. It could mislead monitoring tools and DBAs into believing the receiver had recent contact with a primary.
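
The leak can be reproduced in a toy model. The sketch below is hypothetical throughout (DemoTimes, the single last_msg_time_us field, and both publish functions are illustrative names, and locking is omitted for brevity); it contrasts exposing the row with timestamps left untouched against a variant that clears them first, which is one conceivable remedy rather than anything agreed in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced model: one timestamp, 0 meaning "no message yet". */
typedef struct
{
    bool    ready_to_display;
    int64_t last_msg_time_us;
} DemoTimes;

/* Models the flawed ordering: the row is exposed, timestamps untouched. */
void
demo_publish_without_time_reset(DemoTimes *t)
{
    t->ready_to_display = true;
}

/* Variant that clears the timestamp before exposing the row. */
void
demo_publish_with_time_reset(DemoTimes *t)
{
    t->last_msg_time_us = 0;
    t->ready_to_display = true;
}

/* True if a reader would see a timestamp that implies prior contact. */
bool
demo_reader_sees_timestamp(const DemoTimes *t)
{
    return t->ready_to_display && t->last_msg_time_us != 0;
}
```

Starting from a structure whose timestamp was set at initialization, the first function leaves a reader seeing a non-zero time although no message was exchanged, while the second does not.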

Design Decision: Semantic Meaning of ready_to_display

The fundamental disagreement is about what ready_to_display gates: the entire view row, with the flag as the single authoritative gate in the SQL function (the role v2 preserves), or only the connection-related columns (the reading behind v1).

Evan's argument is architecturally cleaner in one sense: PID and status are properties of the WAL receiver process, not properties of the connection. On this reading, the ready_to_display flag means "connection data is ready to display." This reframing suggests v1 is the more principled fix, with a documentation/comment improvement to clarify the flag's scope.

Implications for Release

Michael's initial response acknowledges this "stands for improvement before the release," indicating this is being treated as a bug fix for the current release cycle (likely PostgreSQL 19, as PostgreSQL 18 shipped in September 2025, before this thread began). The feature was committed with the explicit goal of monitoring visibility, so shipping it with the key state invisible would undercut the feature's stated purpose.

Outstanding Questions

  1. Should v2 be enhanced to also reset the timestamp fields before setting ready_to_display = true? This would address the stale data issue while preserving Michael's preferred architecture.
  2. Is there a race condition concern with v1 where a reader could see status = connecting with stale data from other fields if ready_to_display is not checked?
  3. Will the thread settle on v1 or a hybrid? It appears to be converging toward v1's approach, but no final consensus or commit has occurred.