2026-05-22 · claude-opus-4-6
Incremental Update: Resolution on v2 Approach + New Bug Discovery
Convergence on v2
The debate about whether to show timestamp fields during CONNECTING state has resolved. Michael Paquier defended showing the initialized timestamp values (standby start time) during CONNECTING, arguing:
- The timestamps represent the actual initialized state of the WAL receiver code, which can be useful for knowing when a connection attempt began.
- These same initial values would also be visible briefly after walrcv_connect() completes (before actual messages arrive), so exposing them slightly earlier during CONNECTING is not fundamentally different.
- The original purpose of ready_to_display was specifically to prevent inconsistent connection information (conninfo, host, port) across multiple calls, not to gate timestamp fields.
Evan Li accepted this reasoning after Michael pointed to the original discussion that motivated ready_to_display, and proposed only a documentation improvement to clarify the timestamp semantics.
New Bug: conninfo Leak in WAL Receiver Reuse Path
Evan Li discovered a separate, pre-existing bug that affects stable branches (not just HEAD). The issue is in the WAL receiver reuse path:
- WalRcvWaitForStartPosition() sets state to WALRCV_WAITING
- RequestXLogStreaming() copies raw conninfo into walrcv->conninfo and sets state to WALRCV_RESTARTING
- WalRcvWaitForStartPosition() transitions to WALRCV_CONNECTING, but does not clear walrcv->conninfo
As a result, even though v2 clears conninfo before setting ready_to_display = true during CONNECTING, the reuse path can still leak stale conninfo: the clearing runs before RequestXLogStreaming() re-populates the field, and nothing clears it again afterwards.
Proposed Fix for conninfo Leak
Evan's solution: only copy conninfo into shared memory when RequestXLogStreaming() is switching to WALRCV_STARTING (initial launch). In the WALRCV_WAITING → WALRCV_RESTARTING reuse path, the WAL receiver already has an active connection (wrconn), so it doesn't need raw conninfo copied again.
Michael confirmed this is a legitimate bug, agreed with the fix approach, and noted it should be backpatched to all stable branches as it's independent of the CONNECTING visibility feature. He also cautioned against clobbering conninfo in the reuse path since that would lose the user-displayable string returned by walrcv_get_conninfo().
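The proposed behavior can be sketched as a small standalone model. This is not the PostgreSQL source: the ReuseWalRcv struct, the MODEL_* state names, and model_request_streaming() are hypothetical stand-ins for WalRcv, the WALRCV_* states, and RequestXLogStreaming(), and the conninfo strings in the usage note are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical subset of the WAL receiver states discussed above. */
typedef enum ModelState
{
    MODEL_STOPPED,
    MODEL_STARTING,
    MODEL_WAITING,
    MODEL_RESTARTING
} ModelState;

/* Hypothetical stand-in for the shared-memory fields involved. */
typedef struct ReuseWalRcv
{
    ModelState state;
    char       conninfo[128];   /* what the monitoring view would read */
} ReuseWalRcv;

/*
 * Model of the fix: copy the raw conninfo only on initial launch.
 * In the WAITING -> RESTARTING reuse path the field is left untouched,
 * so the displayable string already stored there is neither replaced
 * by the raw one nor clobbered.
 */
static void
model_request_streaming(ReuseWalRcv *walrcv, const char *raw_conninfo)
{
    if (walrcv->state == MODEL_WAITING)
    {
        /* Reuse path: the receiver already holds a connection. */
        walrcv->state = MODEL_RESTARTING;
        return;
    }

    /* Initial launch: the receiver needs the raw string to connect. */
    snprintf(walrcv->conninfo, sizeof(walrcv->conninfo), "%s", raw_conninfo);
    walrcv->state = MODEL_STARTING;
}
```

Calling model_request_streaming() on a struct in MODEL_WAITING whose conninfo holds a displayable string leaves that string in place and only moves the state to MODEL_RESTARTING; calling it on a struct in MODEL_STOPPED copies the raw string and moves to MODEL_STARTING.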
Michael asked Evan to send a formal patch for this backpatchable fix.
2026-05-20 · claude-opus-4-6
Fix pg_stat_wal_receiver to Show CONNECTING Status
Core Problem
Commit a36164e74 introduced a WALRCV_CONNECTING state to the WAL receiver subsystem, explicitly motivated by monitoring use cases—particularly in high-latency environments where connection establishment takes significant time. However, the implementation contains a logical contradiction: the pg_stat_wal_receiver view is gated by WalRcv->ready_to_display, which is only set to true after the connection is established. This means the very state that was added for observability (connecting) is never actually observable through the monitoring view.
The bug is straightforward: when ready_to_display is false and the WAL receiver process exists (pid != 0), the SQL function pg_stat_get_wal_receiver() returns NULL, producing an empty result set. This defeats the entire purpose of the CONNECTING state.
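A minimal model of that gate, written as standalone C rather than as PostgreSQL code, makes the contradiction concrete. GateWalRcv and gate_view_has_row() are hypothetical names; only the condition itself is taken from the description above.

```c
#include <stdbool.h>

/* Hypothetical subset of the shared WalRcv fields read by the gate. */
typedef struct GateWalRcv
{
    int  pid;               /* 0 means no WAL receiver process exists */
    bool ready_to_display;  /* set only after the connection is established */
} GateWalRcv;

/*
 * Returns true when the view would produce a row. A false result
 * corresponds to the PG_RETURN_NULL() early exit.
 */
static bool
gate_view_has_row(const GateWalRcv *walrcv)
{
    if (walrcv->pid == 0 || !walrcv->ready_to_display)
        return false;
    return true;
}
```

A receiver that is still connecting has a nonzero pid but ready_to_display == false, so this check hides it in exactly the window the CONNECTING state was meant to expose.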
Architectural Context
The WAL receiver (walreceiver.c) manages replication connections from standby to primary. Its shared memory structure (WalRcv) is protected by a spinlock and contains connection metadata (host, port, conninfo), streaming state (LSNs, timelines), and timing information (last message timestamps). The ready_to_display flag was introduced as a gate to prevent exposing partially-initialized or stale data in the monitoring view.
The tension here is between two valid concerns:
- Data integrity in the view: Don't show stale/misleading connection metadata from a previous connection attempt when the receiver is in an intermediate state.
- Observability: The CONNECTING state exists specifically to be monitored.
Proposed Solutions
v1 (Evan Li's approach): Relax the gate in the SQL function
The patch modifies pg_stat_get_wal_receiver() to only return NULL when pid == 0 (no WAL receiver process exists). When ready_to_display is false but a process exists, it returns a tuple with only PID and status populated, leaving all other columns NULL.
    -    if (pid == 0 || !ready_to_display)
    +    if (pid == 0)
             PG_RETURN_NULL();
Tradeoff: Simple change, minimal code surface. All connection-related fields are NULL during CONNECTING, which accurately represents the state—no connection exists yet, so no connection metadata is meaningful.
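The resulting row shape can be illustrated with a simplified model. V1WalRcv, V1Row, and v1_build_row() are hypothetical, and a single sender_host column stands in for all connection-related columns; a NULL pointer plays the role of SQL NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subset of the shared-memory state. */
typedef struct V1WalRcv
{
    int         pid;
    const char *status;
    bool        ready_to_display;
    const char *sender_host;    /* may be stale while not ready */
} V1WalRcv;

/* Hypothetical model of one row of the monitoring view. */
typedef struct V1Row
{
    bool        present;        /* false: empty result set */
    int         pid;
    const char *status;
    const char *sender_host;    /* NULL models SQL NULL */
} V1Row;

static V1Row
v1_build_row(const V1WalRcv *walrcv)
{
    V1Row row = { false, 0, NULL, NULL };

    /* The only early exit left under v1: no process at all. */
    if (walrcv->pid == 0)
        return row;

    /* Process-level fields are always shown once a process exists. */
    row.present = true;
    row.pid = walrcv->pid;
    row.status = walrcv->status;

    /* Connection-level fields stay NULL until they are ready. */
    if (walrcv->ready_to_display)
        row.sender_host = walrcv->sender_host;

    return row;
}
```

With pid set, status "connecting", and ready_to_display false, the model yields a row containing the PID and status while sender_host stays NULL, even if the shared struct still holds an old host name.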
v2 (Michael Paquier's approach): Split WAL receiver initialization into two spinlock acquisitions
The approach restructures the WAL receiver startup sequence:
- Before walrcv_connect(): Acquire spinlock, reset all connection-related fields, set ready_to_display = true, release spinlock. This makes the CONNECTING state visible immediately.
- After walrcv_connect(): Acquire spinlock again, fill in connection-related fields (host, port, conninfo).
Tradeoff: Keeps ready_to_display as the single authoritative gate in the SQL function (preserving its role as an early-exit condition), but requires an extra spinlock acquisition. The latency of walrcv_connect() dwarfs the spinlock cost, so performance is not a concern.
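The two-phase sequence might look roughly like the following standalone sketch. It assumes a C11 atomic_flag as a stand-in for PostgreSQL's spinlock, and V2WalRcv, v2_before_connect(), and v2_after_connect() are hypothetical names rather than functions from walreceiver.c.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of the shared-memory state. */
typedef struct V2WalRcv
{
    atomic_flag mutex;          /* stand-in for the real spinlock */
    bool        ready_to_display;
    char        conninfo[128];
    char        sender_host[64];
    int         sender_port;
} V2WalRcv;

static void
v2_lock(V2WalRcv *walrcv)
{
    /* Spin until the flag was previously clear. */
    while (atomic_flag_test_and_set(&walrcv->mutex))
    {
    }
}

static void
v2_unlock(V2WalRcv *walrcv)
{
    atomic_flag_clear(&walrcv->mutex);
}

/* Phase 1: runs before the (slow) connection attempt. */
static void
v2_before_connect(V2WalRcv *walrcv)
{
    v2_lock(walrcv);
    memset(walrcv->conninfo, 0, sizeof(walrcv->conninfo));
    memset(walrcv->sender_host, 0, sizeof(walrcv->sender_host));
    walrcv->sender_port = 0;
    walrcv->ready_to_display = true;    /* row becomes visible now */
    v2_unlock(walrcv);
}

/* Phase 2: runs once the connection is established. */
static void
v2_after_connect(V2WalRcv *walrcv, const char *conninfo,
                 const char *host, int port)
{
    v2_lock(walrcv);
    snprintf(walrcv->conninfo, sizeof(walrcv->conninfo), "%s", conninfo);
    snprintf(walrcv->sender_host, sizeof(walrcv->sender_host), "%s", host);
    walrcv->sender_port = port;
    v2_unlock(walrcv);
}
```

Between the two calls a reader holding the lock sees ready_to_display == true together with empty connection fields, which is the intended appearance of the CONNECTING state under this approach. Fields that phase 1 does not reset, such as timestamps, keep whatever values they already held.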
Critical Flaw in v2
Evan Li identified that v2 has a data correctness issue: because ready_to_display is set to true before field cleanup is complete, stale timestamp fields from the previous WAL receiver process lifecycle leak through. Specifically:
- last_msg_send_time
- last_msg_receipt_time
- latest_end_time
These get populated with the standby server start time rather than actual message timestamps, because the WalRcv shared memory retains values from initialization or a prior session. Each standby restart shows these timestamps updating to the new start time, even though no WAL messages have been exchanged with a (fake) primary.
This is a subtle but important correctness issue: showing timestamps implies communication occurred when none did. It could mislead monitoring tools and DBAs into believing the receiver had recent contact with a primary.
Design Decision: Semantic Meaning of ready_to_display
The fundamental disagreement is about what ready_to_display gates:
- Michael's interpretation: It gates the entire tuple. The SQL function should use it as a single early-exit check. The fix should ensure the shared memory state is correct when the flag is set.
- Evan's interpretation: It gates connection-related metadata specifically. PID and status are process-level information that should always be visible when the process exists, independent of whether connection data is ready.
Evan's argument is architecturally cleaner in one sense: PID and status are properties of the WAL receiver process, not properties of the connection. The ready_to_display flag semantically means "connection data is ready to display." This reframing suggests v1 is the more principled fix, with a documentation/comment improvement to clarify the flag's scope.
Implications for Release
Michael's initial response acknowledges this "stands for improvement before the release," indicating this is being treated as a bug fix for the current release cycle (likely PostgreSQL 19, given the thread's May 2026 date). The feature was committed with the explicit goal of monitoring visibility, so shipping it in a state where the key state is invisible would defeat the feature's stated purpose.
Outstanding Questions
- Should v2 be enhanced to also reset the timestamp fields before setting ready_to_display = true? This would address the stale data issue while preserving Michael's preferred architecture.
- Is there a race condition concern with v1 where a reader could see status = connecting with stale data from other fields if ready_to_display is not checked?
- The thread appears to be converging toward v1's approach (or a hybrid) but no final consensus or commit has occurred.