Deep Technical Analysis: FATAL Message Lost Inside libpq During Standby Logical Decoding
Core Problem
The test 035_standby_logical_decoding intermittently fails on the buildfarm because a FATAL error message sent by the server (specifically, "terminating connection due to conflict with recovery" with SQLSTATE 57P04) is received by libpq at the socket level but is not properly surfaced to the client application (pg_recvlogical). Instead, the client reports a generic "server closed the connection unexpectedly" error, which doesn't match the test's expected regex pattern (?^:conflict with recovery).
Architectural Context
The Protocol Sequence
In a successful scenario, the protocol exchange during logical replication on a standby proceeds as:
- Connection establishment: Authentication, parameter status messages, ReadyForQuery
- IDENTIFY_SYSTEM: Replication protocol handshake
- START_REPLICATION: Initiates logical decoding
- CopyBothResponse (W message): Server enters streaming mode
- CopyData (d messages): Keepalive/data streaming
- ErrorResponse (E message with FATAL): Server terminates due to recovery conflict
When the FATAL arrives after CopyBothResponse, libpq is in "copy both" state and properly processes the error message, surfacing it to pg_recvlogical which then outputs the expected error text.
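To make the message sequence concrete, the following is a minimal sketch of how a byte stream is split into backend messages. It relies only on the v3 wire framing (one type byte, then a big-endian int32 length that counts itself, then the payload). The helper names frame and split_messages are hypothetical, the payloads are synthetic, and none of this is libpq code.

```python
import struct

# Backend message type bytes from the PostgreSQL v3 wire protocol.
BACKEND_TYPES = {
    b"S": "ParameterStatus",
    b"Z": "ReadyForQuery",
    b"T": "RowDescription",
    b"D": "DataRow",
    b"C": "CommandComplete",
    b"W": "CopyBothResponse",
    b"d": "CopyData",
    b"E": "ErrorResponse",
}


def frame(msg_type: bytes, payload: bytes) -> bytes:
    """Build one message: type byte, int32 length (counts itself), payload."""
    return msg_type + struct.pack("!I", len(payload) + 4) + payload


def split_messages(buf: bytes):
    """Yield (type_name, payload) for each complete message in buf."""
    pos = 0
    while pos + 5 <= len(buf):
        msg_type = buf[pos:pos + 1]
        (length,) = struct.unpack("!I", buf[pos + 1:pos + 5])
        if length < 4:
            raise ValueError("invalid message length")
        end = pos + 1 + length
        if end > len(buf):
            break  # incomplete trailing message; a real client would read more
        name = BACKEND_TYPES.get(msg_type, msg_type.decode("latin-1"))
        yield name, buf[pos + 5:end]
        pos = end


# Synthetic stream for the successful case: W, then d, then a FATAL E.
stream = (
    frame(b"W", b"\x00\x00\x00")      # format byte + int16 column count
    + frame(b"d", b"k")               # placeholder CopyData payload
    + frame(b"E", b"SFATAL\x00\x00")  # abbreviated ErrorResponse body
)
print([name for name, _ in split_messages(stream)])
# → ['CopyBothResponse', 'CopyData', 'ErrorResponse']
```

In the successful ordering, the ErrorResponse is seen only after a CopyBothResponse has switched the client into copy-both mode.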
The Race Condition
The problem occurs when the recovery conflict (database drop replayed from primary) kills the walsender connection before it has entered the CopyBoth streaming phase — specifically, during or just after ReplicationSlotAcquire() in StartLogicalReplication() but before CopyBothResponse is sent.
In this timing window:
- The server has already sent initial protocol messages (ParameterStatus, IDENTIFY_SYSTEM response, ReadyForQuery)
- The server sends the FATAL ErrorResponse directly (not inside a COPY stream)
- libpq receives the FATAL in the same pqsecure_raw_read() buffer or in a subsequent read
- However, because libpq is in a state where it's waiting for a command response to START_REPLICATION, the FATAL message gets lost or misattributed
The hex dump evidence confirms this: in the failing case, the 4th read contains the FATAL directly after the IDENTIFY_SYSTEM response, with no intervening W (CopyBothResponse, 0x57) message. The subsequent reads return n=0 (EOF), and libpq reports "server closed the connection unexpectedly" — discarding the already-received FATAL.
Why libpq Loses the Message
The critical issue is in how libpq handles asynchronous error messages during command processing. When PQgetResult() is processing the response to the replication command:
- libpq reads data from the socket into its input buffer
- It parses messages sequentially
- If the connection is closed (n=0 read) before the result is fully assembled, libpq may prioritize the "connection lost" error over a FATAL that was already buffered but not yet processed as the command's result
This is a libpq state machine bug — the FATAL ErrorResponse is valid protocol and should be treated as the command's result, but the subsequent EOF triggers connection-lost handling that overwrites or ignores the buffered error.
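The ordering problem described here can be illustrated with a toy model. This is a sketch under stated assumptions, not libpq's actual logic: result_after_eof and its drain_before_reporting_eof flag are hypothetical names, and the model only captures the question of whether already-received messages are examined before the EOF is turned into a connection-lost error.

```python
# Text libpq reports when the socket closes without a usable result.
CONNECTION_LOST = "server closed the connection unexpectedly"


def result_after_eof(unparsed_messages, drain_before_reporting_eof):
    """Return the error text a toy client reports once read() returns 0.

    unparsed_messages: list of (type_char, text) pairs already received
    from the socket but not yet turned into a command result.
    """
    if drain_before_reporting_eof:
        for msg_type, text in unparsed_messages:
            if msg_type == "E":
                return text  # the buffered FATAL becomes the result
    # Otherwise the EOF wins and the buffered message is never surfaced.
    return CONNECTION_LOST


fatal = [("E", "terminating connection due to conflict with recovery")]
print(result_after_eof(fatal, drain_before_reporting_eof=False))
# → server closed the connection unexpectedly
print(result_after_eof(fatal, drain_before_reporting_eof=True))
# → terminating connection due to conflict with recovery
```

The first call corresponds to the observed failure, the second to the behavior the test expects.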
Evidence and Reproduction
The author demonstrates the race with a simple pg_usleep(1000000) inserted after ReplicationSlotAcquire() in walsender.c. This widens the timing window so that the recovery conflict signal always arrives before the walsender progresses to send CopyBothResponse, making the failure deterministic.
The hex dump analysis is definitive:
- 45000000b4 = ErrorResponse message: type byte 0x45 ('E') followed by a length field of 0xb4 = 180 bytes (the length counts its own 4 bytes, so 176 bytes of field data follow)
- Contains: SFATAL, VFATAL, C57P04, Mterminating connection due to conflict with recovery
- This data was read from the socket (confirmed by the pqsecure_raw_read trace)
- Yet it does not appear in pg_recvlogical's error output
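The field layout in the dump can be checked with a short decoder. This is a minimal sketch: an ErrorResponse body is a series of fields, each a one-byte code followed by a NUL-terminated string, with a final NUL terminating the list. The function names are hypothetical, and the message below is synthetic. It carries only the four fields listed above, so it is shorter than the 180-byte message in the trace, whose remaining bytes are not reproduced here.

```python
import struct


def build_error_response(fields: dict) -> bytes:
    """Encode {field_code: value} as an ErrorResponse ('E') message."""
    body = b"".join(
        code.encode("ascii") + value.encode("utf-8") + b"\x00"
        for code, value in fields.items()
    ) + b"\x00"  # terminating NUL after the last field
    return b"E" + struct.pack("!I", len(body) + 4) + body


def parse_error_response(message: bytes) -> dict:
    """Decode one ErrorResponse into {field_code: value}."""
    if message[:1] != b"E":
        raise ValueError("not an ErrorResponse")
    (length,) = struct.unpack("!I", message[1:5])
    body = message[5:1 + length]
    fields = {}
    for item in body.split(b"\x00"):
        if item:  # skip the empty pieces left by the trailing NULs
            fields[chr(item[0])] = item[1:].decode("utf-8")
    return fields


# Synthetic message with the fields named in the analysis.
msg = build_error_response({
    "S": "FATAL",   # severity (possibly localized)
    "V": "FATAL",   # severity (never localized)
    "C": "57P04",   # SQLSTATE
    "M": "terminating connection due to conflict with recovery",
})
decoded = parse_error_response(msg)
print(decoded["C"], "-", decoded["M"])
# → 57P04 - terminating connection due to conflict with recovery
```

A decoder like this, run over the bytes captured in a read trace, is enough to confirm that the server's message reached the client side intact.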
Implications
- Test reliability: This affects REL_16_STABLE through master, meaning the test has been flaky since its introduction
- libpq correctness: This is not merely a test issue. Any libpq client issuing START_REPLICATION that receives a FATAL before entering copy mode will lose the error message. This affects monitoring, error reporting, and automated failover tools that depend on parsing error messages.
- Broader pattern: Any command where the server sends FATAL during processing (after accepting the command but before sending the expected response type) could potentially trigger this libpq behavior.
Potential Solutions
- Fix in libpq: Ensure that a buffered ErrorResponse is always preserved as the command result, even if a subsequent read returns EOF. The EOF should be treated as confirmation of the FATAL, not as a superseding error.
- Fix in walsender: Send CopyBothResponse earlier (before slot acquisition and conflict checks), so that the FATAL always arrives within the copy stream where libpq handles it correctly. However, this changes protocol semantics and may have other implications.
- Fix in the test: Make the regex also accept "server closed the connection unexpectedly" as a valid outcome. This is a workaround, not a fix, and masks the underlying libpq issue.
- Fix in pg_recvlogical: Add retry logic or secondary error checking that queries the slot state after a connection failure, to determine if invalidation occurred regardless of the error message received.
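The test-side workaround can be sketched as follows. The real test is written in Perl, where (?^:conflict with recovery) is simply how a compiled qr// pattern is printed. Python is used here only for illustration, and the WIDENED pattern is a hypothetical version of the relaxed check, not a proposed patch.

```python
import re

# Equivalent of the test's current pattern.
EXPECTED = re.compile(r"conflict with recovery")

# Hypothetical widened pattern that also accepts the generic libpq error.
WIDENED = re.compile(
    r"conflict with recovery|server closed the connection unexpectedly"
)

good = "terminating connection due to conflict with recovery"
lost = "server closed the connection unexpectedly"

for text in (good, lost):
    print(bool(EXPECTED.search(text)), bool(WIDENED.search(text)))
# → True True
# → False True
```

The widened pattern passes in both cases, which is exactly why it hides the problem: the test can no longer tell whether the recovery conflict was reported or the message was dropped.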