035_standby_logical_decoding might fail due to FATAL message lost inside libpq

First seen: 2026-05-17 15:00:00+00:00 · Messages: 1 · Participants: 1

Latest Update

2026-05-18 · claude-opus-4-6

Deep Technical Analysis: FATAL Message Lost Inside libpq During Standby Logical Decoding

Core Problem

The test 035_standby_logical_decoding intermittently fails on the buildfarm because a FATAL error message sent by the server (specifically, "terminating connection due to conflict with recovery" with SQLSTATE 57P04) is received by libpq at the socket level but is not properly surfaced to the client application (pg_recvlogical). Instead, the client reports a generic "server closed the connection unexpectedly" error, which doesn't match the test's expected regex pattern (?^:conflict with recovery).
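To make the mismatch concrete, here is a minimal sketch that encodes and decodes an ErrorResponse using the wire layout from the frontend/backend protocol (type byte 'E', int32 length, then field-code/C-string pairs ending in a zero byte), and then checks both client-visible messages against the test's pattern. The helper names are hypothetical, and the "FATAL:  message" rendering is a simplification of what a client prints, not pg_recvlogical's exact output.

```python
import re
import struct


def build_error_response(fields):
    """Encode an ErrorResponse: 'E', int32 length, (code, C string) pairs, zero byte."""
    body = b"".join(
        code.encode("ascii") + value.encode("utf-8") + b"\x00"
        for code, value in fields.items()
    )
    body += b"\x00"  # terminator after the last field
    # The length word counts itself (4 bytes) but not the type byte.
    return b"E" + struct.pack("!I", len(body) + 4) + body


def parse_error_response(msg):
    """Decode the fields of a single, complete ErrorResponse message."""
    if msg[0:1] != b"E":
        raise ValueError("not an ErrorResponse")
    (length,) = struct.unpack("!I", msg[1:5])
    body = msg[5:1 + length]
    fields = {}
    for part in body.split(b"\x00"):
        if part:  # skip the empty pieces produced by the terminators
            fields[chr(part[0])] = part[1:].decode("utf-8")
    return fields


# Pattern the test expects, as quoted in the report.
EXPECTED = re.compile(r"conflict with recovery")

fatal = build_error_response({
    "S": "FATAL",
    "V": "FATAL",
    "C": "57P04",
    "M": "terminating connection due to conflict with recovery",
})
fields = parse_error_response(fatal)

# Simplified rendering of what the client reports in each case.
surfaced = f"{fields['S']}:  {fields['M']}"
lost = "server closed the connection unexpectedly"

print(bool(EXPECTED.search(surfaced)))
print(bool(EXPECTED.search(lost)))
```

Only the first string satisfies the pattern, which is why the test fails whenever the FATAL text is replaced by the generic connection-loss message.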

Architectural Context

The Protocol Sequence

In a successful scenario, the protocol exchange during logical replication on a standby proceeds as:

  1. Connection establishment — Authentication, parameter status messages, ReadyForQuery
  2. IDENTIFY_SYSTEM — Replication protocol handshake
  3. START_REPLICATION — Initiates logical decoding
  4. CopyBothResponse (W message) — Server enters streaming mode
  5. CopyData (d messages) — Keepalive/data streaming
  6. ErrorResponse (E message with FATAL) — Server terminates due to recovery conflict

When the FATAL arrives after CopyBothResponse, libpq is in "copy both" state and properly processes the error message, surfacing it to pg_recvlogical which then outputs the expected error text.

The Race Condition

The problem occurs when the recovery conflict (database drop replayed from primary) kills the walsender connection before it has entered the CopyBoth streaming phase — specifically, during or just after ReplicationSlotAcquire() in StartLogicalReplication() but before CopyBothResponse is sent.

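In this timing window, the client has already sent START_REPLICATION and is waiting for CopyBothResponse, but the next message the server sends is the FATAL ErrorResponse, after which it closes the connection.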

The hex dump evidence confirms this: in the failing case, the 4th read contains the FATAL directly after the IDENTIFY_SYSTEM response, with no intervening W (CopyBothResponse, 0x57) message. The subsequent reads return n=0 (EOF), and libpq reports "server closed the connection unexpectedly" — discarding the already-received FATAL.
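The same check can be done mechanically on captured bytes. The sketch below splits a buffer into protocol messages (one type byte followed by an int32 length that includes itself) and reports whether an ErrorResponse shows up before any CopyBothResponse. The message bodies are empty or dummy placeholders rather than real server output, and the function names are hypothetical.

```python
import struct


def frame(msg_type, body=b""):
    """Wrap a body in protocol framing: type byte plus self-inclusive int32 length."""
    return msg_type + struct.pack("!I", len(body) + 4) + body


def message_types(buf):
    """Return the type bytes of all complete messages in buf, in order."""
    types = []
    pos = 0
    while pos + 5 <= len(buf):
        msg_type = buf[pos:pos + 1].decode("ascii")
        (length,) = struct.unpack("!I", buf[pos + 1:pos + 5])
        if pos + 1 + length > len(buf):
            break  # trailing partial message: stop, as a reader would wait for more
        types.append(msg_type)
        pos += 1 + length
    return "".join(types)


def fatal_before_copy_both(types):
    """True if an ErrorResponse ('E') appears with no CopyBothResponse ('W') before it."""
    e = types.find("E")
    w = types.find("W")
    return e != -1 and (w == -1 or e < w)


# IDENTIFY_SYSTEM response: RowDescription, DataRow, CommandComplete, ReadyForQuery.
identify = frame(b"T") + frame(b"D") + frame(b"C") + frame(b"Z", b"I")

# Passing case: CopyBothResponse and a CopyData message precede the error.
ok_stream = identify + frame(b"W") + frame(b"d", b"k") + frame(b"E", b"\x00")
# Failing case: the error directly follows the IDENTIFY_SYSTEM response.
failing_stream = identify + frame(b"E", b"\x00")

for name, stream in (("ok", ok_stream), ("failing", failing_stream)):
    types = message_types(stream)
    print(name, types, fatal_before_copy_both(types))
```

In the failing stream the sequence ends in E with no W in between, matching the pattern described for the fourth read.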

Why libpq Loses the Message

The critical issue is in how libpq handles asynchronous error messages during command processing. When PQgetResult() is processing the response to the replication command:

  1. libpq reads data from the socket into its input buffer
  2. It parses messages sequentially
  3. If the connection is closed (n=0 read) before the result is fully assembled, libpq may prioritize the "connection lost" error over a FATAL that was already buffered but not yet processed as the command's result

This appears to be a libpq state machine bug: the FATAL ErrorResponse is valid protocol and should be treated as the command's result, but the subsequent EOF triggers connection-lost handling that overwrites or ignores the buffered error.
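The difference between the two outcomes comes down to which condition wins when both a buffered error and an EOF are present. The following is a toy model of that precedence only; it is not libpq's code, and the function and parameter names are invented for illustration.

```python
CONNECTION_LOST = "server closed the connection unexpectedly"


def command_result(buffered_messages, saw_eof, prefer_buffered_error):
    """Return the error text a toy client reports for one command, or None.

    buffered_messages: (type, text) tuples already read from the socket
        but not yet consumed as the command's result.
    saw_eof: whether a later read returned 0 bytes.
    prefer_buffered_error: policy switch. False models the reported
        behaviour (EOF wins); True models the proposed behaviour.
    """
    buffered_error = next(
        (text for msg_type, text in buffered_messages if msg_type == "E"),
        None,
    )
    if saw_eof and not prefer_buffered_error:
        # EOF handling runs first and hides whatever was buffered.
        return CONNECTION_LOST
    if buffered_error is not None:
        # The server's own explanation is reported as the result.
        return buffered_error
    if saw_eof:
        return CONNECTION_LOST
    return None


fatal_text = "FATAL:  terminating connection due to conflict with recovery"
buffered = [("E", fatal_text)]

print(command_result(buffered, saw_eof=True, prefer_buffered_error=False))
print(command_result(buffered, saw_eof=True, prefer_buffered_error=True))
```

With the policy switch off, the model reproduces the symptom: the FATAL is in the buffer, yet the caller only sees the generic connection-loss text. A connection that closes with nothing buffered yields that generic text under either policy, so preferring the buffered error does not hide genuine connection losses.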

Evidence and Reproduction

The reporter demonstrates the race by inserting pg_usleep(1000000), a one-second sleep, after ReplicationSlotAcquire() in walsender.c. This widens the timing window so that the recovery conflict signal always arrives before the walsender sends CopyBothResponse, making the failure deterministic.

The hex dump of the failing connection, summarized under "The Race Condition" above, supports this reading.

Implications

  1. Test reliability: The failure affects REL_16_STABLE through master, so the test has been exposed to this race since its introduction.
  2. libpq correctness: This is not merely a test issue — any libpq client issuing START_REPLICATION that receives a FATAL before entering copy mode will lose the error message. This affects monitoring, error reporting, and automated failover tools that depend on parsing error messages.
  3. Broader pattern: Any command where the server sends FATAL during processing (after accepting the command but before sending the expected response type) could trigger the same libpq behavior.

Potential Solutions

  1. Fix in libpq: Ensure that a buffered ErrorResponse is always preserved as the command result, even if a subsequent read returns EOF. The EOF should be treated as confirmation of the FATAL, not as a superseding error.

  2. Fix in walsender: Send CopyBothResponse earlier (before slot acquisition and conflict checks), so that the FATAL always arrives within the copy stream where libpq handles it correctly. However, this changes protocol semantics, since the client would be told that streaming has started before the server knows the slot can be acquired, and it may have other implications.

  3. Fix in the test: Make the regex also accept "server closed the connection unexpectedly" as a valid outcome. This is a workaround, not a fix, and masks the underlying libpq issue.

  4. Fix in pg_recvlogical: Add retry logic or secondary error checking that queries the slot state after a connection failure, to determine if invalidation occurred regardless of the error message received.
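As a rough illustration of option 4, the decision logic could look like the sketch below. It is not taken from pg_recvlogical. The helper name is hypothetical, the slot row is passed in as a plain dict so that no database connection is needed, and the use of a boolean "conflicting" column from pg_replication_slots is an assumption that should be checked against the server version in use.

```python
import re

CONFLICT_RE = re.compile(r"conflict with recovery")
CONNECTION_LOST_RE = re.compile(r"server closed the connection unexpectedly")


def classify_failure(error_text, slot_row):
    """Classify a replication failure from the error text and slot state.

    error_text: the message the client reported.
    slot_row: dict with the slot's state as fetched after reconnecting,
        or None if it could not be fetched. The "conflicting" key is an
        assumed column name.
    """
    if CONFLICT_RE.search(error_text):
        return "conflict-reported"  # the FATAL text reached the client
    if (
        CONNECTION_LOST_RE.search(error_text)
        and slot_row is not None
        and slot_row.get("conflicting")
    ):
        return "conflict-inferred"  # message lost, but the slot state shows it
    return "unknown"


print(classify_failure(
    "server closed the connection unexpectedly",
    {"slot_name": "example_slot", "conflicting": True},
))
```

This recovers the cause only indirectly: a lost FATAL for any other reason would still be reported as unknown, which is why the option is a mitigation rather than a fix for the libpq behavior.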