BackgroundPsql swallowing errors on windows

First seen: 2025-02-13 17:39:04+00:00 · Messages: 42 · Participants: 9

Latest Update

2026-06-04 · claude-opus-4-6

BackgroundPsql Swallowing Errors on Windows: Deep Technical Analysis

Core Problem

The PostgreSQL TAP testing infrastructure relies on BackgroundPsql.pm to maintain long-running psql sessions for integration tests. This module works by injecting queries into psql's stdin, then detecting completion by echoing a "banner" string and using pump_until() to wait for that banner to appear in stdout. This design has a fundamental flaw on Windows: there is no guarantee that stderr output arrives at the IPC::Run consumer before or simultaneously with stdout output, even when the stderr write happened first in the program under test.

Root Cause: IPC::Run's Windows Proxy Architecture

On Windows, IPC::Run cannot use Unix-style pipe I/O directly. Instead, it spawns a proxy process that communicates with the Perl parent over TCP sockets—one socket per pipe (stdin, stdout, stderr). This introduces a critical ordering problem:

  1. psql writes an error to stderr (fd 2)
  2. psql writes the banner to stdout (fd 1)
  3. The proxy reads from both pipes and forwards over separate TCP connections
  4. Due to TCP buffering, Nagle's algorithm, and independent socket delivery, the stdout data (banner) can arrive at the Perl process before the stderr data (error message)

When pump_until() detects the banner on stdout, it returns—but stderr may not yet contain the error output. The test then checks stderr, finds it empty, and the error is effectively "swallowed."
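The race can be modelled without any sockets. The following is a minimal Python sketch, not the actual Perl code: the function name pump_until_stdout_banner, the chunk list, and the sample error text are illustrative assumptions. It simply consumes chunks in arrival order and stops as soon as the banner shows up on stdout, which is how the error ends up unread.

```python
from collections import deque

# Banner text as it appears in the thread; everything else here is hypothetical.
BANNER = "background_psql: QUERY_SEPARATOR"


def pump_until_stdout_banner(deliveries):
    """Consume (stream, data) chunks in arrival order and stop as soon as
    the stdout buffer contains the banner. Returns both buffers plus the
    chunks that had not been consumed yet."""
    stdout, stderr = "", ""
    pending = deque(deliveries)
    while pending:
        stream, data = pending.popleft()
        if stream == "stdout":
            stdout += data
        else:
            stderr += data
        if BANNER in stdout:
            break
    return stdout, stderr, list(pending)


# psql wrote the error first, but the proxy happened to deliver stdout first.
arrival_order = [
    ("stdout", BANNER + "\n"),
    ("stderr", "ERROR:  division by zero\n"),
]
stdout, stderr, undelivered = pump_until_stdout_banner(arrival_order)
print(repr(stderr))   # empty: the error has not been read yet
print(undelivered)    # the error chunk is still waiting
```

In this model the caller sees an empty stderr buffer even though the error was written before the banner, which matches the "swallowed" behaviour described above.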

Secondary Issues Discovered

  1. Banner matching was not anchored: The \echo command itself would appear in stdout (as psql echoes input in interactive mode), causing premature matches on the command text rather than its output.

  2. Missing newline handling: Not waiting for the trailing newline after the banner caused race conditions where partial reads would split the banner from its newline, corrupting subsequent query detection.

  3. \r\n vs \n inconsistency: When IO::Pty is available (interactive psql), line endings are \r\n, which broke regex matches expecting only \n.

  4. Non-unique banners: Using the same banner string for every query made debugging impossible—you couldn't tell if a matched banner was from the current query or a previous one.

  5. Banner removal consuming line separators: The regex that strips the banner from stderr (s/$banner_match//) also removes the \n before the banner. If a prior \warn output didn't include a trailing newline, the next banner is concatenated with the previous output, so the (^|\n) anchor cannot match and pump_until() loops forever. This issue surfaced only after the anchored matching was introduced; see Remaining Bug below.
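Issues 1 and 3 can be illustrated with a small Python model of the matching logic. This is a sketch under assumptions: the sample PTY output (echoed command, a result row of 42, \r\n line endings) is made up for illustration, and Python's re module stands in for the Perl regex.

```python
import re

banner = "background_psql: QUERY_SEPARATOR 1:"

# Assumed PTY transcript: psql echoes the typed command first, then prints
# the query result and finally the banner, all with \r\n line endings.
pty_stdout_partial = "\\echo " + banner + "\r\n"
pty_stdout_full = pty_stdout_partial + "42\r\n" + banner + "\r\n"

# Unanchored: matches the banner text anywhere, including inside the echoed command.
unanchored = re.compile(re.escape(banner))

# Anchored: the banner must start a line and be followed by \n or \r\n.
anchored = re.compile(r"(^|\n)" + re.escape(banner) + r"\r?\n")

premature = unanchored.search(pty_stdout_partial) is not None
still_waiting = anchored.search(pty_stdout_partial) is None
done = anchored.search(pty_stdout_full) is not None

print(premature, still_waiting, done)
```

The unanchored pattern fires on the echoed command alone, while the anchored pattern waits until the banner appears at the start of its own line and tolerates both line-ending styles.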

The Fix (Committed)

Andres Freund's patch addresses the Windows ordering problem and several secondary bugs:

  1. Dual-channel banner: Emit the banner on both stdout (via \echo) and stderr (via \warn), then wait for both before returning from query(). Because each stream is delivered in order over its own connection, seeing the stderr banner implies that all earlier stderr output from the query has already arrived.

  2. Anchored matching: Banner regex requires (^|\n) prefix, preventing false matches on the echoed command itself.

  3. Newline-inclusive matching: The regex includes \r?\n to handle both Unix and PTY-style line endings.

  4. Query counter: Each banner includes an incrementing counter (QUERY_SEPARATOR $N:) making each banner unique and preventing cross-query confusion.

  5. Debug output: Added note() calls showing stdout/stderr after each query for easier debugging.
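The combined effect of items 1 through 4 can be sketched as follows. This is a Python model under stated assumptions, not the committed Perl code: banner_regex and pump_until_both are hypothetical names, and the delivery list is invented to show stdout arriving ahead of stderr.

```python
import re


def banner_regex(counter):
    """Anchored, newline-inclusive pattern for the banner of query number `counter`."""
    text = f"background_psql: QUERY_SEPARATOR {counter}:"
    return re.compile(r"(^|\n)" + re.escape(text) + r"\r?\n")


def pump_until_both(deliveries, counter):
    """Consume (stream, data) chunks in arrival order and return only once
    the banner has been seen on stdout AND on stderr."""
    pattern = banner_regex(counter)
    buffers = {"stdout": "", "stderr": ""}
    for stream, data in deliveries:
        buffers[stream] += data
        if pattern.search(buffers["stdout"]) and pattern.search(buffers["stderr"]):
            break
    else:
        # Ran out of input without seeing both banners.
        raise TimeoutError("banner not seen on both streams")
    return buffers["stdout"], buffers["stderr"]


# stdout overtakes stderr, but stderr keeps its own internal order.
deliveries = [
    ("stdout", "background_psql: QUERY_SEPARATOR 3:\n"),
    ("stderr", "ERROR:  division by zero\n"),
    ("stderr", "background_psql: QUERY_SEPARATOR 3:\n"),
]
stdout, stderr = pump_until_both(deliveries, 3)
print(repr(stderr))
```

The sketch relies on the same property as the fix: ordering is only guaranteed within a single stream, so the stderr banner acts as a marker that everything written to stderr before it has been received.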

Remaining Bug (Reported by Arseniy Mukhin)

A regression was identified: if a \warn command outputs text without a trailing newline, the subsequent banner on stderr gets concatenated with that text on the same line. Since the banner removal regex also strips the preceding \n, there's no separator left for the next query's banner match. This causes pump_until() to loop forever. Example:

\warn AAAAA

leaves stderr containing "AAAAAbackground_psql: QUERY_SEPARATOR 2:". The banner no longer starts a line, so the (^|\n) anchor cannot match.
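The failure mode is easy to reproduce against the regex alone. The snippet below is a Python stand-in for the Perl pattern described in the fix; the variable names and sample strings are illustrative assumptions.

```python
import re

banner = "background_psql: QUERY_SEPARATOR 2:"
anchored = re.compile(r"(^|\n)" + re.escape(banner) + r"\r?\n")

# Banner glued onto earlier output that had no trailing newline.
concatenated = "AAAAA" + banner + "\n"
# Same output, but with a newline separating it from the banner.
separated = "AAAAA\n" + banner + "\n"

never_matches = anchored.search(concatenated) is None
matches = anchored.search(separated) is not None

# Stripping the match also consumes the newline that preceded the banner.
stripped = anchored.sub("", separated)

print(never_matches, matches, repr(stripped))
```

Because the match includes the leading \n, stripping the banner leaves "AAAAA" with no line terminator, which is how later output can end up on the same line as earlier text.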

The Larger Architectural Discussion

IPC::Run's Fundamental Limitations

Noah Misch and Tom Lane identified that the ordering problem is inherent to any multi-pipe proxy architecture. Even a single-socket multiplexing approach (prefixing data with pipe identifiers) cannot fully solve it because:

Noah authored and pushed an upstream IPC::Run fix (commit 2128df3) specifically for a macOS stderr-lost-after-exit scenario. However, no IPC::Run release was made until August 2025, months after the PostgreSQL fix was committed.

Thomas Munro proposed two potential IPC::Run improvements:

  1. Pipes with IOCP/WaitForMultipleObjects: Use native Windows async I/O instead of select/poll
  2. Sockets as direct stdio: Pass sockets directly to subprocesses, eliminating the proxy entirely

PostgreSQL::Test::Session — The Long-term Replacement

Andrew Dunstan has been developing PostgreSQL::Test::Session, a libpq-based replacement for BackgroundPsql that eliminates IPC::Run for database communication entirely:

Performance characteristics:

By v13 (June 2026), all TAP tests pass on both Linux and Windows with the new framework.

Impact Assessment

The original bug affected:

The fix was backpatched to all supported branches (13-17) given that: