Improving tracking/processing of buildfarm test failures

First seen: 2024-05-23 11:00:00+00:00 · Messages: 33 · Participants: 6

Latest Update

2026-05-06

Improving Tracking/Processing of Buildfarm Test Failures — Technical Analysis

The Core Problem: Signal vs. Noise in the Buildfarm

PostgreSQL's buildfarm is the project's primary early-warning system for portability bugs, race conditions, and regressions across ~100 animals spanning exotic OSes (HP-UX, AIX, illumos, Hurd), architectures (POWER, s390x, ARM, RISC-V), compilers, and build options (assertions, TAP tests, ICU variants, meson/autoconf). But the buildfarm historically produces only raw per-run status; there has been no structured memory of which failures have been seen before, which are under investigation, which are fixed, and which represent genuine latent bugs.

This creates several architectural problems for the project:

  1. Lost bug signal. A race condition can flash on one animal every few months for years. Without correlation, each occurrence looks novel and is dismissed as transient. Alexander Lakhin notes at the outset that a two-year-old flaky failure may still point to a real race in production code.
  2. Duplicated investigative effort. Noah Misch describes his manual triage workflow (grep the -hackers archives for the animal name, grep the log for ~10 known failure-indicator strings such as "was terminated", search the lists and the buildfarm DB), which each committer essentially re-invents.
  3. Post-commit accountability gaps. Amit Kapila explicitly ties the proposal to deciding whether to chase a BF failure immediately after a commit lands — i.e., was this my fault, or pre-existing noise?
  4. No aggregate trend data. Without classification you cannot answer "is master getting flakier?", "which subsystem dominates failures?", or "what's the ratio of environmental to real bugs?"

Proposed Solutions and the Design Tradeoff

Lakhin proposed two approaches:

Option A — Schema change to the buildfarm database

Add two fields to the failure record: an issue_link (URL to the -hackers/-bugs discussion) and a fix_commit (commit id). This enables server-side filtering: "show me unknown failures only." Cost: requires buildfarm server code changes, a UI, and an auth/moderation model.
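
A minimal sketch of what that could look like, assuming a hypothetical per-run build_status table; the actual buildfarm schema is not shown in the thread, and only the two field names come from the proposal:

```sql
-- Hypothetical table and column names; only issue_link and fix_commit
-- come from the proposal itself.
ALTER TABLE build_status
  ADD COLUMN issue_link text,  -- URL of the -hackers/-bugs discussion
  ADD COLUMN fix_commit text;  -- commit id that resolved the failure

-- The server-side filter this enables: "show me unknown failures only".
SELECT animal, branch, snapshot_time
FROM build_status
WHERE status <> 'OK'
  AND issue_link IS NULL;
```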

Option B — A wiki page (Known_Buildfarm_Test_Failures)

Human-curated list mapping failure URLs to diagnostic strings and -hackers threads. Cost: purely social; discoverability depends on people using it.

Andrew Dunstan's gating constraint

Dunstan, as buildfarm maintainer, pushed back on the obvious "open it up" impulse: only animal owners can annotate their own animals, and general-public write access is a non-starter due to ongoing spam pressure. This effectively killed the naive form of Option A. He did, however, open two doors: (a) canned queries on the server and (b) schema extensions on request. He later explicitly endorsed moving away from HTML scraping toward a better reporting API, acknowledging that Lakhin's script-based workflow was a stopgap against a missing server feature.

Noah Misch's architectural suggestion

Misch argued that the input format matters more than the schema: accept free-form text, not just URLs/commit IDs, because submission friction is the binding constraint. More importantly, he proposed server-side auto-correlation as the real win: when a new failure's log matches the fingerprint of an already-classified failure, the server groups them automatically instead of leaving the matching to a human.

This is essentially asking the buildfarm to become a rudimentary log-clustering / anomaly-detection system, which would shift triage from manual pattern-matching to automated grouping.
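
A minimal sketch of such auto-correlation, under stated assumptions: a known_issues table of indicator substrings (the "was terminated"-style strings Misch greps for) and a failure_logs table holding each failed run's log text. Neither table exists in the real buildfarm database:

```sql
-- All names hypothetical: known_issues(indicator, issue_link) holds one
-- row per known failure signature; failure_logs(animal, branch, dt,
-- log_text) holds one row per failed run.
SELECT f.animal, f.branch, f.dt, k.issue_link
FROM failure_logs AS f
LEFT JOIN known_issues AS k
       ON f.log_text LIKE '%' || k.indicator || '%'
ORDER BY f.dt DESC;
-- Rows with a NULL issue_link match no known fingerprint: those are the
-- novel failures worth a human look.
```

Even this naive substring join would reproduce most of the manual grep loop; anything smarter would need to normalize away timestamps and PIDs before fingerprinting, which is where the log-clustering framing comes in.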

What Actually Got Built: A Social Protocol, Not a Schema

The thread resolved in favor of Option B. Lakhin created https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures and (the real contribution) committed to a monthly cadence of statistical reports posted to -hackers, backed by a private script that scrapes the buildfarm's HTML output.

The report format is a de facto schema, encoded as pseudo-SQL against a hypothetical failures(br, dt, issue_link) table; a plausible reconstruction of the recurring queries, using only those three columns, looks like this:
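
```sql
-- Plausible reconstruction; the reports expose only failures(br, dt,
-- issue_link), read here as br = branch, dt = failure timestamp,
-- issue_link = -hackers thread URL.

-- Failures per branch in the reporting month:
SELECT br, count(*)
FROM failures
WHERE dt >= '2026-03-01' AND dt < '2026-04-01'
GROUP BY br
ORDER BY count(*) DESC;

-- Most frequent known issues:
SELECT issue_link, count(*)
FROM failures
WHERE issue_link IS NOT NULL
GROUP BY issue_link
ORDER BY count(*) DESC;

-- Triage backlog: failures not yet tied to any thread.
SELECT count(*) FROM failures WHERE issue_link IS NULL;
```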

This is effectively a manually-populated OLAP cube over buildfarm output, with Lakhin as the sole ETL pipeline.

Quantitative Observations from ~24 Monthly Reports

Across mid-2024 through mid-2026, the reports reveal structural patterns:

  1. master dominates the failure count, often by an order of magnitude. E.g., March 2026: master 349 of 433; April 2026: master 144 of 180. This is expected (master gets new code) but the ratio is a useful trend indicator.
  2. Spikes correlate with specific commits. The Mar 2026 spike of 153 failures tied to a single thread (c64cbda0-...@gmail.com) shows how one broken commit dominates a month's noise. Without classification this would look like generalized flakiness.
  3. The REL_18_STABLE failure count grew from its first appearance (June 2025: 1) as the branch matured and buildfarm animals adopted it, a useful reminder that stable-branch noise tracks testing coverage, not code quality.
  4. Persistent unfixed issues re-appear across many months with the same message-id — notably a9a97e83-...@gmail.com (flagged "environmental?") and 657815a2-...@gmail.com, and the long-running Hurd ticket 2874644f-...@gmail.com. These are the genuine latent issues Lakhin's original premise predicted: failures that, absent tracking, would be perpetually re-diagnosed.
  5. "Short-lived failures" track commit churn. Some months spike into the hundreds (Apr 2025: 238; Mar 2026: 307), suggesting periods of high-risk commits that were caught and reverted quickly — which is exactly the buildfarm working as intended.
  6. Null-issue count is the triage backlog. It ranges from 6 (Nov 2024) to 74 (Oct 2024) to 73 (Nov 2025). High null counts mean the triage capacity is saturated; low counts mean Lakhin is keeping up. Together with the branch ratios in observation 1, this lends itself to a single trend query, sketched after this list.
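
A sketch of that trend query, reusing the hypothetical failures(br, dt, issue_link) table from the report-format section; hypothetical as before, it would answer both "is master getting flakier?" and the backlog question in one pass:

```sql
-- Same hypothetical failures table as above.
SELECT date_trunc('month', dt) AS month,
       br,
       count(*) AS failures,
       count(*) FILTER (WHERE issue_link IS NULL) AS untriaged
FROM failures
GROUP BY 1, 2
ORDER BY 1, 2;
-- A master row rising relative to the stable branches is the "getting
-- flakier" signal; untriaged is the null-issue backlog per month.
```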

The Isolated Technical Sub-Discussion: slot_creation_error Race

Hayato Kuroda reported a specific failure that illustrates the kind of bug this tracking system is designed to preserve. In test_decoding/isolation/slot_creation_error, the isolationtester intermittently dies on a lost connection instead of reporting the FATAL error the test expects.

This is a classic race between two valid outcomes of libpq: whether the FATAL NoticeResponse is drained from the socket buffer before the socket-closed condition is detected. The relevant code path in try_complete_step() treats !PQconsumeInput(conn) as a fatal tester error rather than a potentially recoverable condition where buffered protocol messages might still be readable. The fix direction (not resolved in this thread) would be either to drain any remaining input before exiting, or to make the test tolerant of the racing orderings. This is exactly the sort of low-frequency race the tracking infrastructure is meant to prevent from being forgotten.

Participant Weight and Stance

  1. Alexander Lakhin: proposer and de facto owner; built the wiki page and the scraping script, and produces the monthly reports single-handedly.
  2. Andrew Dunstan: buildfarm maintainer; gated public write access, offered canned queries and schema extensions on request, and endorsed a better reporting API.
  3. Noah Misch: argued for low-friction free-form input and server-side auto-correlation.
  4. Amit Kapila: tied the proposal to post-commit triage of fresh buildfarm failures.
  5. Hayato Kuroda: contributed the slot_creation_error race analysis.

Architectural Implications

The outcome is instructive: rather than extend the buildfarm's data model, the project adopted a human-curated, mailing-list-indexed classification layer on top of existing infrastructure, with one person as the bottleneck. This is fragile (bus factor = 1), but it works because:

  1. PostgreSQL's culture treats -hackers archive message-ids as stable, citable identifiers — effectively a content-addressable bug tracker.
  2. Wiki + monthly report gives searchability without requiring buildfarm schema changes.
  3. Annotations like "-- Fixed", "-- Hurd", "-- An environmental issue" are low-ceremony but sufficient for pattern recognition.

The unsolved problem remains Misch's original point: automated log-fingerprint clustering would replace Lakhin's manual labor with a deterministic pipeline. Dunstan's comment about improving the server's reporting side suggests that's the long-term direction, but as of the last report in the thread (May 2026) it has not materialized, and Lakhin's HTML-scraping script remains the production workflow.