meson: Make test output much more useful on failure (both in CI and locally)

First seen: 2026-01-26 10:13:57+00:00 · Messages: 22 · Participants: 8

Latest Update

2026-05-20 · claude-opus-4-6

Incremental Update: Minor Progress Note on Truncation Handling

The only new message since the prior analysis is a brief status update from Jelte Fennema-Nio (2026-05-18) regarding the interaction between command_ok output truncation and tests that invoke pg_regress via TAP (like 027_stream_regress.pl).

What's New

Jelte acknowledges that he has not yet addressed the truncation concern in his v6 patch. He clarifies that the issue concerns the command_ok wrapper specifically (from patches 0002/0005), not the pg_regress TAP diagnostic changes themselves. When a TAP test invokes pg_regress through command_ok, the pg_regress output becomes subject to command_ok's truncation logic, which may be inappropriate because pg_regress already controls its own output volume.

Proposed Design Direction

Jelte floats a potential solution: adding an optional parameter to command_ok that would allow callers to disable truncation on a per-call basis. This would let tests that invoke pg_regress (where output is already bounded/meaningful) bypass the general truncation that protects against unbounded output in other contexts.

This is a design consideration rather than a concrete patch; no code has been shared for this approach yet.
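
To make the per-call opt-out concrete, here is a minimal sketch in Python. The real command_ok is Perl code in PostgreSQL::Test::Utils, and nothing below comes from the thread: the helper name format_diag, the truncate parameter, and the 30-line default are all hypothetical.

```python
def format_diag(output, truncate=True, max_lines=30):
    """Return captured command output as TAP diagnostic lines.

    Hypothetical helper: `truncate` stands in for the proposed per-call
    opt-out, and `max_lines` for whatever limit command_ok applies.
    """
    lines = output.splitlines()
    omitted = 0
    if truncate and len(lines) > max_lines:
        omitted = len(lines) - max_lines
        lines = lines[:max_lines]
    diag = ["# " + line for line in lines]
    if omitted:
        # Tell the reader that output was cut, and by how much.
        diag.append("# ... (%d more lines omitted)" % omitted)
    return diag


# A caller wrapping pg_regress would pass truncate=False to keep everything.
noisy = "\n".join("line %d" % i for i in range(100))
print(len(format_diag(noisy)))                  # 31: 30 lines plus a marker
print(len(format_diag(noisy, truncate=False)))  # 100
```

Keeping truncation on by default preserves the protection against unbounded output for every existing caller; only tests that know their output is already bounded would opt out.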

History (1 prior analysis)
2026-05-18 · claude-opus-4-6

Technical Analysis: Improving Test Failure Output in PostgreSQL's Meson Build System

Core Problem

PostgreSQL's test infrastructure, particularly when using the Meson build system, produces deeply unhelpful output when tests fail. The fundamental issue is an information accessibility gap: when a test fails, the diagnostically useful information (regression diffs, command stderr/stdout, actual error messages) is buried in log files scattered across the build directory, requiring manual navigation to discover what went wrong.

This problem manifests in two contexts:

  1. CI environments (Cirrus CI): Developers must click through a file browser UI to find regression.diffs or log files, a tedious multi-step process for every failure.
  2. Local development: Meson's TAP output shows only pass/fail status without the reason for failure. Developers must hunt through build/testrun/ directory trees for relevant log files.

The Makefile-based build system was slightly better here because it showed more information inline. Meson shows only the TAP protocol output, which makes the problem worse.

Architectural Context

The test infrastructure involves several layers:

  • pg_regress: The C program that runs SQL regression tests, compares output via diff, and produces regression.diffs
  • TAP test framework: Perl-based testing using Test::More, orchestrated through PostgreSQL::Test::Utils and PostgreSQL::Test::Cluster
  • Meson test runner: Captures TAP output from test programs and displays results
  • IPC::Run: Perl module used to spawn subprocesses for command execution in TAP tests

The key architectural constraint is that Meson can only display what the test programs emit via TAP protocol (stdout). Any diagnostic information must be encoded as TAP diag() output to be visible in the final test results.
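
As an illustration of that channel, the sketch below builds the lines for one failed test point followed by its diagnostics. It is written in Python purely for illustration (the real emitters are pg_regress in C and Test::More in Perl), and the helper name tap_failure and the sample text are hypothetical. In TAP, lines beginning with # are diagnostics rather than test results.

```python
def tap_failure(number, description, details):
    """Build TAP lines for one failed test plus its diagnostics."""
    lines = ["not ok %d - %s" % (number, description)]
    # Every diagnostic line carries the '#' prefix, including blank ones,
    # so a TAP consumer never mistakes it for a test result.
    lines.extend(("# " + d).rstrip() for d in details.splitlines())
    return lines


print("\n".join(tap_failure(3, "regression tests pass",
                            "diff in test 'foo'\n\n-expected\n+actual")))
```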

Proposed Solutions (Patch Series)

Patch 0001: Include diffs in TAP output from pg_regress

Modifies pg_regress to emit the first N lines of regression.diffs as TAP diagnostic output when tests fail. This makes the actual test differences immediately visible without file navigation.

Technical details:

  • Reads the combined diff file after test completion
  • Emits lines using TAP diagnostic format (# prefix)
  • Originally limited to 80 lines; later revised based on feedback
  • Required careful handling of line counting (a line longer than 1023 characters counts as multiple lines because fgets reads it in buffer-sized chunks)
  • On Windows, required close/reopen logic for the diffs file due to file locking semantics ("The process cannot access the file because it is being used by another process")
  • Introduced DIAG_DETAIL and DIAG_END markers (analogous to existing NOTE_DETAIL/NOTE_END)
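
The line-counting subtlety can be sketched as follows. This is a Python model of the logic, not the patch's C code: readline(n) behaves like fgets in that it stops at a newline or after n characters. The 1024-byte buffer size is inferred from the 1023-character figure above, and the function name emit_diff_head is hypothetical.

```python
import io

FGETS_BUFFER = 1024   # assumed buffer size; fgets stores at most size - 1 chars


def emit_diff_head(stream, max_lines=80):
    """Return up to max_lines TAP diagnostic lines from a diffs stream.

    readline(n) mirrors fgets: it stops at a newline or after n characters,
    so one very long line is consumed, and counted, in several chunks.
    """
    emitted = []
    while len(emitted) < max_lines:
        chunk = stream.readline(FGETS_BUFFER - 1)
        if not chunk:          # end of file
            break
        emitted.append("# " + chunk.rstrip("\n"))
    return emitted


# Three logical lines, but the 2500-character one is read in three chunks.
diffs = io.StringIO("short line\n" + "x" * 2500 + "\nlast line\n")
print(len(emit_diff_head(diffs)))   # 5
```

Counting chunks rather than logical lines keeps the total emitted volume bounded even when a diff contains extremely long lines.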

Patch 0002: Improved command_ok/command_fails output

Replaces the run_log() wrapper with direct IPC::Run::run calls that capture stdout/stderr, then display them via diag() on failure.

Technical subtlety: Cannot use simple variable capture (\$stdout) for daemon-spawning commands like pg_ctl start because the child postmaster inherits the file descriptors and outlives pg_ctl, causing IPC::Run to hang waiting for EOF. The solution pipes to tempfiles instead, mimicking the existing command_like_safe pattern.
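
The same hazard and workaround can be reproduced outside Perl. The sketch below is a Python analogue under stated assumptions, not the patch itself: a POSIX sh is available, and a backgrounded sleep stands in for the postmaster. The function name run_with_tempfile_capture is hypothetical. Capturing through a pipe and reading until end-of-file would block until the background process exits. Capturing into a temporary file means only the direct child is waited for.

```python
import subprocess
import tempfile
import time


def run_with_tempfile_capture(cmd):
    """Run cmd, capturing stdout and stderr in a temporary file.

    Only the direct child is waited for. A background process that inherits
    the file keeps it open, but nothing here waits for end-of-file on it,
    unlike a pipe reader.
    """
    with tempfile.TemporaryFile(mode="w+") as capture:
        result = subprocess.run(cmd, stdin=subprocess.DEVNULL,
                                stdout=capture, stderr=subprocess.STDOUT)
        capture.seek(0)
        return result.returncode, capture.read()


# Stand-in for "pg_ctl start": the shell exits immediately while a
# background sleep (the "daemon") keeps the inherited descriptors open.
start = time.monotonic()
status, output = run_with_tempfile_capture(
    ["sh", "-c", "sleep 3 & echo server started"])
print(status, repr(output), "elapsed %.1fs" % (time.monotonic() - start))
```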

Behavioral change noted in review: The old run_log printed a "# Running: ..." line even on success. The new version only prints command details on failure, reducing successful test log verbosity.

Patch 0003: Replace die with croak in .pm files

Changes error reporting in the test infrastructure modules (Cluster.pm, Utils.pm) to use Carp::croak instead of die. This causes error messages to report the caller's location (the actual test file) rather than the infrastructure module's line number.

Key detail: The committer (Andrew Dunstan) added @CARP_NOT in Utils.pm so that croak() would look past Cluster.pm to the actual TAP script caller, solving the cross-module attribution problem.

Patch 0004: Use done_testing()

Adds done_testing() calls to avoid the unhelpful "Tests were run but no plan was declared and done_testing() was not seen" messages on failures. This was originally suggested by Andrew Dunstan in a 2022 thread.

Patch 0005: Convert pg_upgrade and stream_regress tests to use command_ok

Migrates 002_pg_upgrade.pl and 027_stream_regress.pl to use the improved command_ok function, gaining the better failure output.

Key Design Tradeoffs and Disagreements

Output Volume vs. Usefulness (Peter Eisentraut's Objection)

Peter Eisentraut raised a significant objection post-commit: 80 lines of diff output, with lines potentially wider than terminal width, can produce 200+ wrapped lines that push the test summary off-screen, making the overall output less useful than before.

His specific concerns:

  1. Terminal overflow: With typical terminal heights of 40-60 lines, 80 diff lines swamp the screen
  2. Line wrapping: Diff lines are often wider than terminal width, multiplying the effective line count
  3. Crash cascades: When a server crashes, the first diffs shown are often from other tests that failed due to the crash, not the crashing test itself — making the truncated output misleading
  4. Stream synchronization: The diff output appears in the middle of test runs due to stdout/stderr buffering issues, as visible in buildfarm output
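
The wrapping effect in items 1 and 2 is simple arithmetic: a line of n characters occupies ceil(n / width) terminal rows. The sketch below works one example in Python. The 200-character line length and 80-column width are assumed values chosen for illustration, not measurements from the thread.

```python
import math


def terminal_rows(line_lengths, width=80):
    """Rows needed to show the lines in a terminal that wraps at `width`."""
    # An empty line still occupies one row.
    return sum(max(1, math.ceil(n / width)) for n in line_lengths)


# Assumed example: 80 diff lines of 200 characters in an 80-column terminal.
print(terminal_rows([200] * 80))   # 240
```

Under those assumptions the 80-line limit yields 240 rows, several screens' worth on a terminal 40 to 60 lines tall.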

Resolution Approach

Andrew Dunstan proposed gating the behavior on an environment variable, adding it to cirrus.tasks.yml for CI. Jelte Fennema-Nio objected that this would break the case where pg_regress is called from TAP tests (like 027_stream_regress.pl). His counter-proposal (v5 patch, May 2026) addresses the concerns differently while preserving the TAP-embedded diff functionality.

Buildfarm Redundancy

Peter noted that the buildfarm already provides good failure navigation (links to detailed output below the summary). The new feature is somewhat redundant there and may actually degrade the buildfarm's presentation by injecting diff content that disrupts timing information display (as Alexander Lakhin demonstrated with the drongo buildfarm animal).

Technical Insights

  1. Windows file locking: The Windows kernel enforces exclusive file locks more aggressively than POSIX systems. Writing to regression.diffs while pg_regress still has it open requires close/reopen cycles.

  2. Daemon process file descriptor inheritance: The pg_ctl start/restart case is uniquely problematic because the spawned postmaster inherits stdout/stderr FDs. This is a well-known Unix process management issue that makes simple pipe-to-variable capture dangerous (potential indefinite hang).

  3. TAP protocol limitations: TAP diagnostic lines (# ...) are the only mechanism to inject information visible in Meson's output. This is both the solution and the constraint — everything must be encoded in this narrow channel.

  4. croak vs die semantics: die reports the error at the point of failure in library code; croak reports it at the caller's location. For test infrastructure, the caller location is almost always what developers need. Listing a package in @CARP_NOT makes croak skip frames from that package as well, so the reported location can pass over intermediate modules and land on the test script.

  5. The "first lines" heuristic: The assumption that the first N lines of diffs contain the most useful information is contested. It works for single-test failures but fails for crash scenarios where cascade failures dominate the beginning of the diff file.