Improving Tracking/Processing of Buildfarm Test Failures — Technical Analysis
The Core Problem: Signal vs. Noise in the Buildfarm
PostgreSQL's buildfarm is the project's primary early-warning system for portability bugs, race conditions, and regressions across ~100 animals spanning exotic OSes (HP-UX, AIX, illumos, Hurd), architectures (POWER, s390x, ARM, RISC-V), compilers, and build options (assertions, TAP tests, ICU variants, meson/autoconf). But the buildfarm historically produces only raw per-run status; there has been no structured memory of which failures have been seen before, which are under investigation, which are fixed, and which represent genuine latent bugs.
This creates several architectural problems for the project:
- Lost bug signal. A race condition can flash on one animal every few months for years. Without correlation, each occurrence looks novel and is dismissed as transient. Alexander Lakhin notes at the outset that a two-year-old flaky failure may still point to a real race in production code.
- Duplicated investigative effort. Noah Misch describes his manual triage workflow — grep -hackers for the animal name, grep the log for ~10 known failure indicator strings ("was terminated", etc.), search the lists and the buildfarm DB — which each committer essentially re-invents.
- Post-commit accountability gaps. Amit Kapila explicitly ties the proposal to deciding whether to chase a BF failure immediately after a commit lands — i.e., was this my fault, or pre-existing noise?
- No aggregate trend data. Without classification you cannot answer "is master getting flakier?", "which subsystem dominates failures?", or "what's the ratio of environmental to real bugs?"
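Misch's manual workflow amounts to scanning a failed run's log for a fixed set of indicator strings. A minimal sketch of that step, assuming hypothetical indicators beyond "was terminated" (the only string the thread actually quotes):

```python
# Known failure-indicator strings. Only "was terminated" is quoted in the
# thread; the other entries are hypothetical stand-ins for the ~10 strings
# Misch greps for.
INDICATORS = [
    "was terminated",
    "server closed the connection unexpectedly",  # hypothetical
    "PANIC",                                      # hypothetical
]

def scan_log(log_text: str) -> list[tuple[int, str]]:
    """Return (line_no, indicator) pairs for every known-indicator hit."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        for needle in INDICATORS:
            if needle in line:
                hits.append((lineno, needle))
    return hits

log = "ok 1\nserver process (PID 123) was terminated by signal 6\nok 2\n"
print(scan_log(log))  # [(2, 'was terminated')]
```

A run with no hits then falls through to the slower steps: searching the lists and the buildfarm DB by hand.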
Proposed Solutions and the Design Tradeoff
Lakhin proposed two approaches:
Option A — Schema change to the buildfarm database
Add two fields to the failure record: an issue_link (URL to the -hackers/-bugs discussion) and a fix_commit (commit id). This enables server-side filtering: "show me unknown failures only." Cost: requires buildfarm server code changes, a UI, and an auth/moderation model.
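To make the Option A proposal concrete, here is a toy version of the extended failure record using an in-memory SQLite table; the table and column names beyond issue_link and fix_commit are illustrative, as are the rows:

```python
import sqlite3

# Toy version of the Option A schema extension: issue_link and fix_commit
# are the two proposed fields; everything else is illustrative.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE failures (
        animal     TEXT,
        branch     TEXT,
        dt         TEXT,
        issue_link TEXT,   -- URL of the -hackers/-bugs discussion, or NULL
        fix_commit TEXT    -- commit id of the fix, or NULL
    )
""")
con.executemany(
    "INSERT INTO failures VALUES (?, ?, ?, ?, ?)",
    [
        ("animal-a", "master", "2026-03-01",
         "https://postgr.es/m/known-thread", "abc123"),   # classified
        ("animal-b", "master", "2026-03-02", None, None), # novel, untriaged
    ],
)

# The payoff: server-side filtering of "unknown failures only".
unknown = con.execute(
    "SELECT animal, dt FROM failures WHERE issue_link IS NULL"
).fetchall()
print(unknown)  # [('animal-b', '2026-03-02')]
```

The query is trivial once the fields exist; the cost Lakhin identified is everything around it (UI, auth, moderation), not the schema itself.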
Option B — A wiki page (Known_Buildfarm_Test_Failures)
Human-curated list mapping failure URLs to diagnostic strings and -hackers threads. Cost: purely social; discoverability depends on people using it.
Andrew Dunstan's gating constraint
Dunstan, as buildfarm maintainer, pushed back on the obvious "open it up" impulse: only animal owners can annotate their own animals, and general-public write access is a non-starter due to ongoing spam pressure. This effectively killed the naive form of Option A. He did, however, open two doors: (a) canned queries on the server and (b) schema extensions on request. He later explicitly endorsed moving away from HTML scraping toward a better reporting API, acknowledging that Lakhin's script-based workflow was a stopgap against a missing server feature.
Noah Misch's architectural suggestion
Misch argued that the input format matters more than the schema: accept free-form text, not just URLs/commit IDs, because submission friction is the binding constraint. More importantly, he proposed server-side auto-correlation as the real win:
- Detect when N members fail at the same step in a related commit range and then go green → probably a quickly-fixed defect.
- Cluster failures by log-line fingerprints highly correlated with failure.
This is essentially asking the buildfarm to become a rudimentary log-clustering / anomaly-detection system, which would shift triage from manual pattern-matching to automated grouping.
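Misch's fingerprint idea can be sketched in a few lines: normalize the volatile tokens out of a salient log line, hash the result, and group runs by the hash. This is an assumption-laden sketch (the normalization rules and sample lines are invented), not anything the buildfarm implements:

```python
import hashlib
import re
from collections import defaultdict

def fingerprint(line: str) -> str:
    """Strip volatile tokens (PIDs, ports, counts) from a log line, then
    hash, so recurring failures collapse to one cluster key."""
    line = re.sub(r"\b\d+\b", "N", line)    # numbers -> N
    line = re.sub(r'"[^"]*"', '"S"', line)  # quoted strings -> "S"
    return hashlib.sha1(line.encode()).hexdigest()[:12]

def cluster(failures: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (run_id, salient_log_line) pairs by line fingerprint."""
    groups = defaultdict(list)
    for run_id, line in failures:
        groups[fingerprint(line)].append(run_id)
    return dict(groups)

runs = [
    ("run-1", "server process (PID 4242) was terminated by signal 6"),
    ("run-2", "server process (PID 9001) was terminated by signal 6"),
    ("run-3", "timed out waiting for subscriber to catch up"),
]
clusters = cluster(runs)
print(sorted(len(v) for v in clusters.values()))  # [1, 2]
```

The two "was terminated" runs collapse into one cluster despite different PIDs; that collapse is exactly what replaces the manual "have I seen this before?" step.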
What Actually Got Built: A Social Protocol, Not a Schema
The thread resolved in favor of Option B. Lakhin created https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures and — this is the real contribution — committed to a monthly cadence of statistical reports posted to -hackers, backed by a private script that scrapes the buildfarm server's HTML pages.
The report format is a de facto schema, encoded as pseudo-SQL against a hypothetical failures(br, dt, issue_link) table:
- Failures grouped by branch (REL_NN_STABLE, master)
- Distinct-issue count
- Top-N most frequent issues (with "-- Fixed", "-- Hurd", "-- An environmental issue" annotations)
- Unsorted/unhelpful failures (issue_link IS NULL) — the backlog of genuinely novel failures needing triage
- Short-lived failures — eliminated quickly, usually by revert/follow-up commit; these are the ones that would have been lost without tracking
This is effectively a manually-populated OLAP cube over buildfarm output, with Lakhin as the sole ETL pipeline.
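The report sections map directly onto queries against that hypothetical failures(br, dt, issue_link) table. A sketch with invented rows, using SQLite to stand in for the real buildfarm database:

```python
import sqlite3

# Re-creating the report's implied failures(br, dt, issue_link) table with
# made-up rows; the query shapes mirror the monthly report sections.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE failures (br TEXT, dt TEXT, issue_link TEXT)")
con.executemany("INSERT INTO failures VALUES (?, ?, ?)", [
    ("master",        "2026-03-03", "https://postgr.es/m/thread-a"),
    ("master",        "2026-03-04", "https://postgr.es/m/thread-a"),
    ("master",        "2026-03-09", None),
    ("REL_18_STABLE", "2026-03-10", "https://postgr.es/m/thread-b"),
])

# Section 1: failures grouped by branch.
by_branch = con.execute(
    "SELECT br, COUNT(*) FROM failures GROUP BY br ORDER BY COUNT(*) DESC"
).fetchall()

# Section 2: top-N most frequent issues.
top = con.execute("""
    SELECT issue_link, COUNT(*) AS n FROM failures
    WHERE issue_link IS NOT NULL
    GROUP BY issue_link ORDER BY n DESC LIMIT 5
""").fetchall()

# Section 3: the triage backlog (issue_link IS NULL).
backlog = con.execute(
    "SELECT COUNT(*) FROM failures WHERE issue_link IS NULL"
).fetchone()[0]

print(by_branch, top[0], backlog)
```

In the real workflow the table does not exist; Lakhin's script materializes it from scraped HTML each month, which is what makes him the ETL pipeline.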
Quantitative Observations from ~24 Monthly Reports
Across mid-2024 through early-2026 the reports reveal structural patterns:
- master dominates the failure count, often by an order of magnitude. E.g., March 2026: master 349 of 433; April 2026: master 144 of 180. This is expected (master gets new code) but the ratio is a useful trend indicator.
- Spikes correlate with specific commits. The Mar 2026 spike of 153 failures tied to a single thread (c64cbda0-...@gmail.com) shows how one broken commit dominates a month's noise. Without classification this would look like generalized flakiness.
- The REL_18_STABLE failure count grows from its first appearance (June 2025: 1) as that branch matured and buildfarm animals adopted it — a useful canary that stable-branch noise grows with testing coverage, not with stability.
- Persistent unfixed issues re-appear across many months with the same message-id — notably a9a97e83-...@gmail.com (flagged "environmental?") and 657815a2-...@gmail.com, and the long-running Hurd ticket 2874644f-...@gmail.com. These are the genuine latent issues Lakhin's original premise predicted: failures that, absent tracking, would be perpetually re-diagnosed.
- "Short-lived failures" track commit churn. Some months spike into the hundreds (Apr 2025: 238; Mar 2026: 307), suggesting periods of high-risk commits that were caught and reverted quickly — which is exactly the buildfarm working as intended.
- Null-issue count is the triage backlog. It ranges from 6 (Nov 2024) to 74 (Oct 2024) to 73 (Nov 2025). High null counts mean the triage capacity is saturated; low counts mean Lakhin is keeping up.
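The master-share ratio mentioned above is easy to track month over month. Using the two data points the reports actually give (Mar 2026 and Apr 2026):

```python
# (master failures, total failures) per month, taken from the cited reports.
months = {
    "2026-03": (349, 433),
    "2026-04": (144, 180),
}

for month, (master, total) in months.items():
    print(f"{month}: master share {master / total:.0%}")
# 2026-03: master share 81%
# 2026-04: master share 80%
```

Despite the absolute counts differing by 2.4x, master's share holds near 80%, which is why the ratio (rather than the raw count) is the more stable trend indicator.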
The Isolated Technical Sub-Discussion: slot_creation_error Race
Kuroda reported a specific failure that illustrates the kind of bug this tracking system is designed to preserve. In test_decoding/isolation/slot_creation_error:
- s1 starts a transaction.
- s2 tries to create a replication slot (blocks).
- s1 calls pg_terminate_backend($s2).
- Expected: the isolation tester sees s2's FATAL "terminating connection due to administrator command" and the test passes with the cancellation in the output.
- Observed (rare): the isolation tester's PQconsumeInput() returns failure ("server closed the connection unexpectedly") before it has consumed the FATAL message from the wire. The tester prints its own error and calls exit(1), producing a diff.
This is a classic race between two valid outcomes of libpq: whether the FATAL NoticeResponse is drained from the socket buffer before the socket-closed condition is detected. The relevant code path in try_complete_step() treats !PQconsumeInput(conn) as a fatal tester error rather than a potentially recoverable condition where buffered protocol messages might still be readable. The fix direction (not resolved in this thread) would be either to drain any remaining input before exiting, or to make the test tolerant of the racing orderings. This is exactly the sort of low-frequency race the tracking infrastructure is meant to prevent from being forgotten.
Participant Weight and Stance
- Alexander Lakhin (Neon) — the driver; has become, de facto, PostgreSQL's buildfarm triage officer. His domain expertise is in reproducing and minimizing flaky tests; his reports are the primary artifact of this thread.
- Andrew Dunstan (committer, buildfarm maintainer) — gatekeeper for any server-side changes. His constraints (no public write access, spam) shaped the outcome toward client-side/wiki tooling. Supportive of eventually improving the reporting API.
- Noah Misch (committer) — contributed the most architecturally significant suggestion (auto-correlation, low-friction input). Did not implement but framed the design space.
- Amit Kapila (committer) — endorsed the effort, emphasizing the post-commit use case.
- Hayato Kuroda (Fujitsu) — example consumer, surfacing a real race via the new workflow.
Architectural Implications
The outcome is instructive: rather than extend the buildfarm's data model, the project adopted a human-curated, mailing-list-indexed classification layer on top of existing infrastructure, with one person as the bottleneck. This is fragile (bus factor = 1), but it works because:
- PostgreSQL's culture treats -hackers archive message-ids as stable, citable identifiers — effectively a content-addressable bug tracker.
- Wiki + monthly report gives searchability without requiring buildfarm schema changes.
- Annotations like "-- Fixed", "-- Hurd", "-- An environmental issue" are low-ceremony but sufficient for pattern recognition.
The unsolved problem remains Misch's original point: automated log-fingerprint clustering would replace Lakhin's manual labor with a deterministic pipeline. Dunstan's comment about improving the server's reporting side suggests that's the long-term direction, but as of the last report in the thread (May 2026) it has not materialized, and Lakhin's HTML-scraping script remains the production workflow.