BackgroundPsql Swallowing Errors on Windows: Deep Technical Analysis
Core Problem
The PostgreSQL TAP testing infrastructure relies on BackgroundPsql.pm to maintain long-running psql sessions for integration tests. This module works by injecting queries into psql's stdin, then detecting completion by echoing a "banner" string and using pump_until() to wait for that banner to appear in stdout. This design has a fundamental flaw on Windows: there is no guarantee that stderr output arrives at the IPC::Run consumer before or simultaneously with stdout output, even when the stderr write happened first in the program under test.
Root Cause: IPC::Run's Windows Proxy Architecture
On Windows, IPC::Run cannot use Unix-style pipe I/O directly. Instead, it spawns a proxy process that communicates with the Perl parent over TCP sockets—one socket per pipe (stdin, stdout, stderr). This introduces a critical ordering problem:
- psql writes an error to stderr (fd 2)
- psql writes the banner to stdout (fd 1)
- The proxy reads from both pipes and forwards over separate TCP connections
- Due to TCP buffering, Nagle's algorithm, and independent socket delivery, the stdout data (banner) can arrive at the Perl process before the stderr data (error message)
When pump_until() detects the banner on stdout, it returns—but stderr may not yet contain the error output. The test then checks stderr, finds it empty, and the error is effectively "swallowed."
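The race can be illustrated with a small deterministic simulation. This is a sketch only, written in Python rather than the Perl of BackgroundPsql.pm. The `Channel` class, the `pump_until_banner` helper, the banner text, and the tick-based delays are all hypothetical stand-ins for the two proxy sockets, not IPC::Run code.

```python
BANNER = "background_psql: ready"  # hypothetical banner text


class Channel:
    """One proxy socket: written data becomes readable only `delay` ticks later."""

    def __init__(self, delay):
        self.delay = delay
        self.in_flight = []   # list of (deliver_at_tick, data)
        self.received = ""    # what the consumer has read so far

    def write(self, tick, data):
        self.in_flight.append((tick + self.delay, data))

    def deliver(self, tick):
        still_pending = []
        for due, data in self.in_flight:
            if due <= tick:
                self.received += data
            else:
                still_pending.append((due, data))
        self.in_flight = still_pending


def pump_until_banner(stdout, stderr, max_ticks=10):
    """Advance time until the banner shows up on stdout; stderr is not consulted."""
    for tick in range(max_ticks):
        stdout.deliver(tick)
        stderr.deliver(tick)
        if BANNER in stdout.received:
            return tick
    raise TimeoutError("banner never arrived")


stdout = Channel(delay=1)   # the stdout socket happens to be fast
stderr = Channel(delay=3)   # the stderr socket happens to be slow

# The program writes the error first and the banner second.
stderr.write(0, "ERROR:  division by zero\n")
stdout.write(0, BANNER + "\n")

tick = pump_until_banner(stdout, stderr)
stderr_seen_at_return = stderr.received  # still empty: the error is "swallowed"
```

Although the error was written first, the wait returns at tick 1 with an empty stderr buffer, which is the behavior the tests observed.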
Secondary Issues Discovered
- Banner matching was not anchored: The \echo command itself would appear in stdout (as psql echoes input in interactive mode), causing premature matches on the command text rather than its output.
- Missing newline handling: Not waiting for the trailing newline after the banner caused race conditions where partial reads would split the banner from its newline, corrupting subsequent query detection.
- \r\n vs \n inconsistency: When IO::Pty is available (interactive psql), line endings are \r\n, which broke regex matches expecting only \n.
- Non-unique banners: Using the same banner string for every query made debugging impossible, because you couldn't tell if a matched banner was from the current query or a previous one.
- Banner removal consuming line separators: The regex that strips the banner from stderr (s/$banner_match//) also removes the \n before the banner. If a prior \warn output didn't include a trailing newline, this caused the next banner to be concatenated with the previous output, preventing the (^|\n) anchor from matching and causing an infinite loop in pump_until().
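The anchoring and line-ending issues can be shown with a short regex comparison. This is a sketch in Python, not the module's Perl. The sample stdout text, including the prompt and the echoed command, is an assumed illustration of what a pty-backed psql session might produce.

```python
import re

banner = "background_psql: QUERY_SEPARATOR"

# Assumed sample: with a pty, the typed command is echoed before its output,
# and lines end in \r\n.
stdout = (
    "postgres=# \\echo background_psql: QUERY_SEPARATOR\r\n"
    "background_psql: QUERY_SEPARATOR\r\n"
)

# Unanchored: matches the banner text wherever it appears.
unanchored = re.compile(re.escape(banner))

# Anchored and newline-inclusive: the banner must start a line and be
# followed by \n or \r\n.
anchored = re.compile(r"(^|\n)" + re.escape(banner) + r"\r?\n")

first_unanchored = unanchored.search(stdout).start()
first_anchored = anchored.search(stdout)
```

The unanchored pattern hits the echoed command on the first line, while the anchored pattern skips it and matches only the banner line itself, including its \r\n terminator.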
The Fix (Committed)
Andres Freund's patch addresses the Windows ordering problem and several secondary bugs:
- Dual-channel banner: Emit the banner on both stdout (via \echo) and stderr (via \warn), then wait for both before returning from query(). This ensures stderr has been fully delivered before the method returns.
- Anchored matching: Banner regex requires a (^|\n) prefix, preventing false matches on the echoed command itself.
- Newline-inclusive matching: The regex includes \r?\n to handle both Unix and PTY-style line endings.
- Query counter: Each banner includes an incrementing counter (QUERY_SEPARATOR $N:), making each banner unique and preventing cross-query confusion.
- Debug output: Added note() calls showing stdout/stderr after each query for easier debugging.
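The dual-channel wait can be sketched as follows. This is an illustration in Python, not the committed Perl. The helper names `banner_regex` and `query_finished` and the `arrivals` list are hypothetical, while the banner text follows the QUERY_SEPARATOR format described above.

```python
import re


def banner_regex(query_cnt):
    """Anchored, newline-inclusive pattern for the banner of one query."""
    banner = f"background_psql: QUERY_SEPARATOR {query_cnt}:"
    return re.compile(r"(^|\n)" + re.escape(banner) + r"\r?\n")


def query_finished(stdout, stderr, query_cnt):
    """True only once the banner has arrived on BOTH streams."""
    pattern = banner_regex(query_cnt)
    return bool(pattern.search(stdout)) and bool(pattern.search(stderr))


# Chunks in the order the consumer happens to receive them: (stream, data).
# The stdout banner arrives before the error text on stderr.
arrivals = [
    ("stdout", "background_psql: QUERY_SEPARATOR 1:\n"),
    ("stderr", "ERROR:  division by zero\n"),
    ("stderr", "background_psql: QUERY_SEPARATOR 1:\n"),
]

buffers = {"stdout": "", "stderr": ""}
finished_after = None
for i, (stream, data) in enumerate(arrivals, start=1):
    buffers[stream] += data
    if query_finished(buffers["stdout"], buffers["stderr"], query_cnt=1):
        finished_after = i
        break
```

Because the stderr banner is written after the error, seeing it on stderr implies the error text has already been delivered on that same stream. The wait therefore completes only after the third chunk, not the first.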
Remaining Bug (Reported by Arseniy Mukhin)
A regression was identified: if a \warn command outputs text without a trailing newline, the subsequent banner on stderr gets concatenated with that text on the same line. Since the banner removal regex also strips the preceding \n, there's no separator left for the next query's banner match. This causes pump_until() to loop forever. Example:
\warn AAAAA
produces stderr: AAAAAbackground_psql: QUERY_SEPARATOR 2: — the (^|\n) anchor cannot match.
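The failure can be reproduced with plain string operations. This is a sketch in Python under one stated assumption: text left in the stderr buffer after banner removal stays there when the next query's banner is appended. The `banner_regex` helper is hypothetical.

```python
import re


def banner_regex(query_cnt):
    """Anchored, newline-inclusive pattern for the banner of one query."""
    banner = f"background_psql: QUERY_SEPARATOR {query_cnt}:"
    return re.compile(r"(^|\n)" + re.escape(banner) + r"\r?\n")


# Query 1: the \warn text is followed by the banner on the next line.
stderr = "AAAAA\n" + "background_psql: QUERY_SEPARATOR 1:\n"

# Banner removal: the match includes the "\n" in front of the banner,
# so the newline that terminated "AAAAA" is deleted along with it.
stderr = banner_regex(1).sub("", stderr, count=1)
after_strip = stderr

# Assumption of this sketch: the leftover text stays in the buffer and
# query 2's banner is appended directly behind it.
stderr += "background_psql: QUERY_SEPARATOR 2:\n"

# No line start precedes the banner, so the anchored pattern never matches.
match_query_2 = banner_regex(2).search(stderr)
```

With `match_query_2` being `None`, a loop that waits for this pattern has nothing to terminate on, which corresponds to the reported hang in pump_until().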
The Larger Architectural Discussion
IPC::Run's Fundamental Limitations
Noah Misch and Tom Lane identified that the ordering problem is inherent to any multi-pipe proxy architecture. Even a single-socket multiplexing approach (prefixing data with pipe identifiers) cannot fully solve it because:
- If the proxy is slow, it may batch multiple writes from different pipes into one read, losing their relative order
- The proxy has no way to determine which of two simultaneously-ready pipes was written to first
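The second point can be demonstrated with two ordinary pipes. This is a sketch in Python using POSIX pipes and select() as a stand-in for the Windows proxy, so it illustrates the principle rather than IPC::Run's actual code path. The message contents are made up.

```python
import os
import select

out_r, out_w = os.pipe()  # stands in for the child's stdout
err_r, err_w = os.pipe()  # stands in for the child's stderr

# The child writes to stderr first, then to stdout.
os.write(err_w, b"ERROR:  something failed\n")
os.write(out_w, b"banner\n")

# A slow proxy wakes up only after both writes have happened.
ready, _, _ = select.select([out_r, err_r], [], [], 1.0)

# Both descriptors are reported readable. Nothing in the result records
# which pipe was written to first.
ready_names = ["stdout" if fd == out_r else "stderr" for fd in ready]

for fd in (out_r, out_w, err_r, err_w):
    os.close(fd)
```

Once both pipes are readable at the same wakeup, any forwarding order the proxy picks is a guess, regardless of whether it then uses one socket or several.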
Noah authored and pushed an upstream IPC::Run fix (commit 2128df3) specifically for a macOS stderr-lost-after-exit scenario. However, no IPC::Run release was made until August 2025, months after the PostgreSQL fix was committed.
Thomas Munro proposed two potential IPC::Run improvements:
- Pipes with IOCP/WaitForMultipleEvents: Use native Windows async I/O instead of select/poll
- Sockets as direct stdio: Pass sockets directly to subprocesses, eliminating the proxy entirely
PostgreSQL::Test::Session — The Long-term Replacement
Andrew Dunstan has been developing PostgreSQL::Test::Session, a libpq-based replacement for BackgroundPsql that eliminates IPC::Run for database communication entirely:
- Uses FFI::Platypus to call libpq functions directly from Perl (no subprocess needed)
- Originally had both FFI and XS variants; XS was dropped for simplicity
- Uses FFI::Platypus::Record instead of FFI::C to avoid an extra dependency (FFI::C is not in Strawberry Perl)
- Implements pipeline mode for batched commands to minimize round-trips
- Eliminates all background_psql() calls across the test suite
Performance characteristics:
- Significantly reduced resource usage (no fork/exec for each psql invocation)
- Initial versions were 10x slower due to polling (usleep(100_000)) instead of event-driven waiting
- After switching to proper PQsocketPoll()-based waiting, performance became "a few percent faster" than HEAD
- Expected to show dramatically larger gains on Windows where process creation is expensive
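The cost of fixed-interval polling compared with event-driven waiting can be sketched with a socket pair. This Python example does not call libpq. Python's select() stands in for PQsocketPoll(), and the `REPLY_DELAY` value, the helper names, and the `measure` harness are hypothetical.

```python
import select
import socket
import threading
import time

POLL_INTERVAL = 0.1   # mirrors usleep(100_000), i.e. 100 ms
REPLY_DELAY = 0.02    # hypothetical time the "server" takes to answer


def wait_by_polling(sock):
    """Check for data, and sleep a fixed interval whenever none is there."""
    while True:
        readable, _, _ = select.select([sock], [], [], 0)
        if readable:
            return
        time.sleep(POLL_INTERVAL)


def wait_by_event(sock):
    """Block until the socket becomes readable."""
    select.select([sock], [], [])


def measure(wait):
    """Return how long `wait` takes to notice a reply sent after REPLY_DELAY."""
    a, b = socket.socketpair()
    timer = threading.Timer(REPLY_DELAY, b.send, args=(b"done",))
    start = time.monotonic()
    timer.start()
    wait(a)
    elapsed = time.monotonic() - start
    timer.join()
    a.close()
    b.close()
    return elapsed


polling_latency = measure(wait_by_polling)
event_latency = measure(wait_by_event)
```

The polling variant cannot notice the reply before its sleep ends, so every short query pays up to a full interval. The event-driven variant returns as soon as data is readable.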
By v13 (June 2026), all TAP tests pass on both Linux and Windows with the new framework.
Impact Assessment
The original bug affected:
- All Windows CI: Random test failures depending on timing
- macOS: Similar symptoms observed (Tom Lane's pgbench example shows empty stderr after process exit)
- Any platform with IPC::Run proxy behavior
The fix was backpatched to all supported branches (13-17) given that:
- BackgroundPsql.pm has no known external consumers (verified via Debian search and GitHub)
- The behavioral changes are low-risk for extensions
- Debugging these failures is extremely costly (Andres spent 1.5 days on initial diagnosis)