Fix pg_stat_wal_receiver to show CONNECTING status

First seen: 2026-05-19 05:55:14+00:00 · Messages: 15 · Participants: 3

Latest Update

2026-05-25 · claude-opus-4-6

Incremental Update: Patches Committed + Post-Commit Review

Resolution: Both Patches Applied

Michael Paquier committed the v3 patch set on 2026-05-22.

This confirms the final architectural choice: ready_to_display remains the single authoritative gate in the SQL function, and the shared memory state is properly initialized before the flag is set. The v1 approach (relaxing the gate to show partial tuples) was not adopted.

Post-Commit Review by Xuneng Zhou

Xuneng Zhou provided a post-commit review agreeing with the design decisions:

  1. Agrees with Michael that showing a partial tuple (only PID and status) while ready_to_display is still false would have been problematic: it implies something is broken and creates a documentation burden.
  2. Acknowledges the timestamp semantics tradeoff: The timeline/LSN and timestamp values visible during CONNECTING reveal when the WAL receiver entered that code path, which has operational utility. However, these values weren't designed for that interpretation, so there's "some risk for misinterpretation" — but this pre-dates these changes and isn't a new problem.

No actionable concerns were raised; this appears to be a confirmatory review with no further changes expected.

History (2 prior analyses)
2026-05-22 · claude-opus-4-6

Incremental Update: Resolution on v2 Approach + New Bug Discovery

Convergence on v2

The debate about whether to show timestamp fields during CONNECTING state has resolved. Michael Paquier defended showing the initialized timestamp values (standby start time) during CONNECTING, arguing:

  1. The timestamps represent the actual initialized state of the WAL receiver code, which can be useful for knowing when a connection attempt began.
  2. These same initial values would also be visible briefly after walrcv_connect() completes (before actual messages arrive), so exposing them slightly earlier during CONNECTING is not fundamentally different.
  3. The original purpose of ready_to_display was specifically to prevent inconsistent connection information (conninfo, host, port) across multiple calls—not to gate timestamp fields.

Evan Li accepted this reasoning after Michael pointed to the original discussion that motivated ready_to_display, and proposed only a documentation improvement to clarify the timestamp semantics.

New Bug: conninfo Leak in WAL Receiver Reuse Path

Evan Li discovered a separate, pre-existing bug that affects stable branches (not just HEAD). The issue is in the WAL receiver reuse path:

  1. WalRcvWaitForStartPosition() sets state to WALRCV_WAITING
  2. RequestXLogStreaming() copies raw conninfo into walrcv->conninfo and sets state to WALRCV_RESTARTING
  3. WalRcvWaitForStartPosition() transitions to WALRCV_CONNECTING, but does not clear walrcv->conninfo

This means that even though v2 clears conninfo before setting ready_to_display = true for the CONNECTING state, the reuse path can still expose the raw conninfo: RequestXLogStreaming() re-populates the field after the clearing has happened, and nothing clears it again before the state becomes visible.

Proposed Fix for conninfo Leak

Evan's solution: only copy conninfo into shared memory when RequestXLogStreaming() is switching to WALRCV_STARTING (initial launch). In the WALRCV_WAITING → WALRCV_RESTARTING reuse path, the WAL receiver already has an active connection (wrconn), so it doesn't need raw conninfo copied again.

Michael confirmed this is a legitimate bug, agreed with the fix approach, and noted it should be backpatched to all stable branches as it's independent of the CONNECTING visibility feature. He also cautioned against clobbering conninfo in the reuse path since that would lose the user-displayable string returned by walrcv_get_conninfo().
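
The difference between the leaking behavior and the proposed fix can be sketched with a small standalone model. This is not the PostgreSQL source: ReuseModel, its state names, request_streaming() and the copy_only_on_start switch are hypothetical stand-ins, and the conninfo strings are made-up examples of a raw string versus a user-displayable one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define REUSE_CONNINFO_LEN 128

/* Hypothetical subset of the WAL receiver states discussed in the thread. */
typedef enum ReuseState
{
    REUSE_STOPPED,
    REUSE_STARTING,
    REUSE_WAITING,
    REUSE_RESTARTING
} ReuseState;

/* Hypothetical cut-down model of the shared-memory fields involved. */
typedef struct ReuseModel
{
    ReuseState  state;
    char        conninfo[REUSE_CONNINFO_LEN];
} ReuseModel;

/*
 * Model of the streaming request. With copy_only_on_start set (the proposed
 * fix), the raw string is written only on initial launch; the reuse path
 * keeps whatever displayable string was already published. With it unset,
 * the raw string is always copied, which is the leaking behavior.
 */
void
request_streaming(ReuseModel *w, const char *raw_conninfo,
                  bool copy_only_on_start)
{
    bool        launching = (w->state == REUSE_STOPPED);

    /* Only these two entry states are modeled. */
    assert(launching || w->state == REUSE_WAITING);

    if (launching || !copy_only_on_start)
        snprintf(w->conninfo, sizeof(w->conninfo), "%s", raw_conninfo);

    w->state = launching ? REUSE_STARTING : REUSE_RESTARTING;
}
```

In this model the reuse path neither overwrites nor clears conninfo, which matches Michael's caution that clobbering the field there would lose the displayable string.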

Michael asked Evan to send a formal patch for this backpatchable fix.


2026-05-20 · claude-opus-4-6

Fix pg_stat_wal_receiver to Show CONNECTING Status

Core Problem

Commit a36164e74 introduced a WALRCV_CONNECTING state to the WAL receiver subsystem, explicitly motivated by monitoring use cases—particularly in high-latency environments where connection establishment takes significant time. However, the implementation contains a logical contradiction: the pg_stat_wal_receiver view is gated by WalRcv->ready_to_display, which is only set to true after the connection is established. This means the very state that was added for observability (connecting) is never actually observable through the monitoring view.

The bug is straightforward: when ready_to_display is false and the WAL receiver process exists (pid != 0), the SQL function pg_stat_get_wal_receiver() returns NULL, producing an empty result set. This defeats the entire purpose of the CONNECTING state.

Architectural Context

The WAL receiver (walreceiver.c) manages replication connections from standby to primary. Its shared memory structure (WalRcv) is protected by a spinlock and contains connection metadata (host, port, conninfo), streaming state (LSNs, timelines), and timing information (last message timestamps). The ready_to_display flag was introduced as a gate to prevent exposing partially-initialized or stale data in the monitoring view.

The tension here is between two valid concerns:

  1. Data integrity in the view: Don't show stale/misleading connection metadata from a previous connection attempt when the receiver is in an intermediate state.
  2. Observability: The CONNECTING state exists specifically to be monitored.

Proposed Solutions

v1 (Evan Li's approach): Relax the gate in the SQL function

The patch modifies pg_stat_get_wal_receiver() to only return NULL when pid == 0 (no WAL receiver process exists). When ready_to_display is false but a process exists, it returns a tuple with only PID and status populated, leaving all other columns NULL.

-	if (pid == 0 || !ready_to_display)
+	if (pid == 0)
 		PG_RETURN_NULL();

Tradeoff: Simple change, minimal code surface. All connection-related fields are NULL during CONNECTING, which accurately represents the state—no connection exists yet, so no connection metadata is meaningful.
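
To make the effect of the gate change concrete, the following standalone sketch classifies what a reader of the view would get back under the original check and under v1. It is a simplified model, not the actual function body: GateSnapshot, RowKind and both row_kind_* helpers are hypothetical names introduced here for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the values the SQL function reads. */
typedef struct GateSnapshot
{
    int         pid;                /* 0 means no WAL receiver process */
    bool        ready_to_display;
    const char *status;             /* e.g. "connecting", "streaming" */
} GateSnapshot;

/* What a reader of the view would see. */
typedef enum RowKind
{
    ROW_NONE,       /* function returns NULL: empty result set */
    ROW_PARTIAL,    /* only pid and status are non-NULL */
    ROW_FULL        /* all columns come from shared memory */
} RowKind;

/* Original gate: a receiver that is not ready_to_display stays hidden. */
RowKind
row_kind_original(const GateSnapshot *s)
{
    assert(s != NULL);
    if (s->pid == 0 || !s->ready_to_display)
        return ROW_NONE;
    return ROW_FULL;
}

/* v1 gate: hide only when no process exists; otherwise show pid + status. */
RowKind
row_kind_v1(const GateSnapshot *s)
{
    assert(s != NULL);
    if (s->pid == 0)
        return ROW_NONE;
    return s->ready_to_display ? ROW_FULL : ROW_PARTIAL;
}
```

For a receiver with a nonzero pid that is still connecting and has ready_to_display unset, the original gate yields no row, while v1 yields a row with only PID and status populated.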

v2 (Michael Paquier's approach): Split WAL receiver initialization into two spinlock acquisitions

The approach restructures the WAL receiver startup sequence:

  1. Before walrcv_connect(): Acquire spinlock, reset all connection-related fields, set ready_to_display = true, release spinlock. This makes the CONNECTING state visible immediately.
  2. After walrcv_connect(): Acquire spinlock again, fill in connection-related fields (host, port, conninfo).

Tradeoff: Keeps ready_to_display as the single authoritative gate in the SQL function (preserving its role as an early-exit condition), but requires an extra spinlock acquisition. The latency of walrcv_connect() dwarfs the spinlock cost, so performance is not a concern.
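
The two-phase sequence can be sketched as a small standalone model. Everything here is an assumption made for illustration: ShmemModel is a cut-down stand-in for the shared-memory struct, a C11 atomic_flag plays the role of the spinlock, and publish_before_connect() and publish_after_connect() are hypothetical helper names rather than functions from walreceiver.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SHMEM_STR_LEN 64

/* Hypothetical cut-down model of the WAL receiver shared-memory struct. */
typedef struct ShmemModel
{
    atomic_flag mutex;              /* stand-in for the spinlock */
    bool        ready_to_display;
    char        conninfo[SHMEM_STR_LEN];
    char        sender_host[SHMEM_STR_LEN];
    int         sender_port;
} ShmemModel;

void
model_init(ShmemModel *w)
{
    memset(w, 0, sizeof(*w));
    atomic_flag_clear(&w->mutex);
}

void
model_lock(ShmemModel *w)
{
    while (atomic_flag_test_and_set(&w->mutex))
    {
        /* spin until the holder releases the flag */
    }
}

void
model_unlock(ShmemModel *w)
{
    atomic_flag_clear(&w->mutex);
}

/* Phase 1, before connecting: wipe connection fields, then publish. */
void
publish_before_connect(ShmemModel *w)
{
    model_lock(w);
    memset(w->conninfo, 0, sizeof(w->conninfo));
    memset(w->sender_host, 0, sizeof(w->sender_host));
    w->sender_port = 0;
    w->ready_to_display = true;     /* the connecting state is visible now */
    model_unlock(w);
}

/* Phase 2, after connecting: fill in the displayable connection fields. */
void
publish_after_connect(ShmemModel *w, const char *conninfo,
                      const char *host, int port)
{
    assert(conninfo != NULL && host != NULL);
    model_lock(w);
    snprintf(w->conninfo, sizeof(w->conninfo), "%s", conninfo);
    snprintf(w->sender_host, sizeof(w->sender_host), "%s", host);
    w->sender_port = port;
    model_unlock(w);
}
```

The point of the sketch is the ordering inside phase 1: the reset and the flag update happen under the same lock acquisition, so a reader that observes ready_to_display as true never sees connection fields left over from an earlier attempt.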

Critical Flaw in v2

Evan Li identified that v2 has a data correctness issue: because ready_to_display is set to true before field cleanup is complete, stale timestamp fields from the previous WAL receiver process lifecycle leak through. Specifically:

  • last_msg_send_time
  • last_msg_receipt_time
  • latest_end_time

These get populated with the standby server start time rather than actual message timestamps, because the WalRcv shared memory retains values from initialization or a prior session. Each standby restart shows these timestamps updating to the new start time, even though no WAL messages have been exchanged with a (fake) primary.

This is a subtle but important correctness issue: showing timestamps implies communication occurred when none did. It could mislead monitoring tools and DBAs into believing the receiver had recent contact with a primary.

Design Decision: Semantic Meaning of ready_to_display

The fundamental disagreement is about what ready_to_display gates:

  • Michael's interpretation: It gates the entire tuple. The SQL function should use it as a single early-exit check. The fix should ensure the shared memory state is correct when the flag is set.
  • Evan's interpretation: It gates connection-related metadata specifically. PID and status are process-level information that should always be visible when the process exists, independent of whether connection data is ready.

Evan's argument is architecturally cleaner in one sense: PID and status are properties of the WAL receiver process, not properties of the connection. The ready_to_display flag semantically means "connection data is ready to display." This reframing suggests v1 is the more principled fix, with a documentation/comment improvement to clarify the flag's scope.

Implications for Release

Michael's initial response acknowledges this "stands for improvement before the release," indicating this is being treated as a bug fix for the current release cycle (likely PostgreSQL 19, given the thread's 2026 dates). The feature was committed with the explicit goal of monitoring visibility, so shipping it in a state where the key state is invisible would defeat the feature's stated purpose.

Outstanding Questions

  1. Should v2 be enhanced to also reset the timestamp fields before setting ready_to_display = true? This would address the stale data issue while preserving Michael's preferred architecture.
  2. Is there a race condition concern with v1 where a reader could see status = connecting with stale data from other fields if ready_to_display is not checked?
  3. The thread appears to be converging toward v1's approach (or a hybrid) but no final consensus or commit has occurred.