occasional ECPG failures on dikkop (FreeBSD)

2026-05-07 · opus 4.7

Overview

This thread is a short buildfarm diagnostic exchange concerning intermittent (~5% rate) segmentation faults in ECPG's threaded regression tests on the dikkop FreeBSD animal. The failures are notable because: (a) they are only reproducible through the buildfarm client harness, not by direct invocation; (b) they affect every supported back-branch down to REL_14_STABLE; and (c) they began appearing roughly 30 days before the report, coincident with a host OS upgrade from FreeBSD 14.1 (or possibly 14.3) to 14.4.

The Core Problem

ECPG (Embedded SQL in C) ships a suite of threaded tests under src/interfaces/ecpg/test/thread/, including alloc, prep, thread, thread_implicit, and descriptor. These exercise ECPG's per-thread connection and statement state, which is maintained via thread-local storage (TLS) and a collection of linked lists searched by connection or statement name. The kernel log excerpt Tomas produced shows SIGSEGVs in four of these binaries, not just the one test that happened to fail in a given run:

pid ... (thread_implicit) ... signal 11
pid ... (alloc)           ... signal 11
pid ... (prep)            ... signal 11
pid ... (thread)          ... signal 11

That distribution is diagnostically important. If the bug were specific to one test's SQL logic, the crash pattern would be narrower. That four different thread-using ECPG tests occasionally crash strongly implicates shared infrastructure, almost certainly in ecpglib itself (the ECPG runtime), and specifically the code paths that manage per-thread connection and prepared-statement state under concurrent access.
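One way to quantify that distribution is to tally crashes per binary from the kernel log. The following is a minimal sketch that assumes only the elided shape shown in the excerpt: a command name in parentheses followed somewhere by "signal 11". The function name segv_command is hypothetical, and real log lines may call for stricter parsing.

```c
#include <stddef.h>
#include <string.h>

/*
 * Copy the parenthesized command name from a log line into out.
 * Returns 1 if a name was found and "signal 11" appears after it,
 * 0 otherwise. The name is truncated if it does not fit in out.
 */
int segv_command(const char *line, char *out, size_t outlen)
{
    const char *open = strchr(line, '(');
    const char *close = open ? strchr(open, ')') : NULL;
    size_t len;

    if (close == NULL || outlen == 0)
        return 0;
    if (strstr(close, "signal 11") == NULL)
        return 0;               /* some other signal, or not a crash line */

    len = (size_t) (close - open - 1);
    if (len >= outlen)
        len = outlen - 1;       /* leave room for the terminator */
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 1;
}
```

Feeding each log line through this function and counting the distinct names returned would show whether the crashes cluster in one test or spread across the thread suite.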

Why It Matters Architecturally

ECPG's runtime library has a long-standing, somewhat fragile threading model. Connection lookups (ecpg_get_connection), prepared-statement lookups (ECPGprepared_statement), and descriptor lists are guarded by a mix of pthread_mutex_t locks and TLS pointers. Bugs in this area manifest as use-after-free or NULL-dereference segfaults that are highly schedule-dependent. Because ECPG is a client-side library linked into user applications, such races can crash or silently corrupt those applications; the buildfarm tests are one of the few places where such races surface at all.
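To make the class of bug concrete, here is a minimal sketch of a mutex-guarded linked list of named statement entries. It is not ecpglib's code: struct stmt_entry, stmt_add, stmt_remove, and run_stress are hypothetical names chosen for illustration. The point it shows is that lookup and unlink happen inside one critical section; a variant that found the entry under the lock, released the lock, and only then dereferenced or freed the entry would be open to exactly the schedule-dependent use-after-free described above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_THREADS 16
#define ITERATIONS  1000

/* Hypothetical stand-in for a prepared-statement entry (not ecpglib's struct). */
struct stmt_entry {
    char name[32];
    struct stmt_entry *next;
};

static struct stmt_entry *stmt_list = NULL;
static pthread_mutex_t stmt_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Insert a named entry at the head of the list. Returns 0 on success. */
int stmt_add(const char *name)
{
    struct stmt_entry *e = malloc(sizeof(*e));

    if (e == NULL)
        return -1;
    snprintf(e->name, sizeof(e->name), "%s", name);

    pthread_mutex_lock(&stmt_list_lock);
    e->next = stmt_list;
    stmt_list = e;
    pthread_mutex_unlock(&stmt_list_lock);
    return 0;
}

/* Find and unlink in ONE critical section. Returns 1 if removed, 0 if absent. */
int stmt_remove(const char *name)
{
    struct stmt_entry **link, *victim = NULL;
    int found;

    pthread_mutex_lock(&stmt_list_lock);
    for (link = &stmt_list; *link != NULL; link = &(*link)->next) {
        if (strcmp((*link)->name, name) == 0) {
            victim = *link;
            *link = victim->next;   /* no other thread can see victim now */
            break;
        }
    }
    pthread_mutex_unlock(&stmt_list_lock);

    found = (victim != NULL);
    free(victim);                   /* safe: already unlinked under the lock */
    return found;
}

/* Each worker repeatedly adds and removes entries with names unique to it. */
static void *worker(void *arg)
{
    int id = (int) (intptr_t) arg;
    char name[32];

    for (int i = 0; i < ITERATIONS; i++) {
        snprintf(name, sizeof(name), "stmt_%d_%d", id, i);
        if (stmt_add(name) != 0 || !stmt_remove(name))
            return (void *) 1;      /* report failure to the joiner */
    }
    return NULL;
}

/* Run nthreads workers; returns entries left in the list, or -1 on failure. */
int run_stress(int nthreads)
{
    pthread_t threads[MAX_THREADS];
    int started = 0, failed = 0, remaining = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&threads[t], NULL, worker, (void *) (intptr_t) t) != 0) {
            failed = 1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++) {
        void *result = NULL;

        pthread_join(threads[t], &result);
        if (result != NULL)
            failed = 1;
    }

    pthread_mutex_lock(&stmt_list_lock);
    for (struct stmt_entry *e = stmt_list; e != NULL; e = e->next)
        remaining++;
    pthread_mutex_unlock(&stmt_list_lock);

    return failed ? -1 : remaining;
}
```

Calling run_stress(4) should leave the list empty. A correctly locked version like this one passes regardless of scheduling, whereas a racy variant can pass thousands of runs and then fail once the scheduler changes, which is the pattern the thread describes.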

The triggering question here is: why did a latent race suddenly become visible after a FreeBSD minor-version bump (14.1, or possibly 14.3, to 14.4)?

Tom Lane's Diagnostic Hypothesis

Tom immediately connects this to an ongoing thread on ECPGprepared_statement() in which Andrew (Dunstan, almost certainly, given the "Andrew just identified" phrasing and buildfarm context) had isolated a concrete bug. Tom is careful not to claim causation ("I have no idea whether the bug that Andrew just identified explains dikkop's problem"), but his closing message crystallizes the architectural read:

Given that these are threading tests, I'm suspecting some change in thread scheduling behavior in this latest FBSD release, which somehow made it easier to hit a pre-existing issue.

This is the canonical explanation for buildfarm animals that begin flapping after an OS upgrade without any PostgreSQL code change: the underlying race has existed for years (the failures reproduce all the way back to REL_14_STABLE), but the FreeBSD 14.4 scheduler widens the race window, possibly through changes in libthr timeslice behavior, mutex contention handling, or CPU affinity. Any change in how libthr interacts with the kernel scheduler (for example in the umtx or condition-variable paths) could shift timings enough to expose a bug that previously needed adversarial scheduling to hit. None of these specific mechanisms is confirmed in the thread.

Why the Buildfarm Client Specifically

Tomas notes he cannot reproduce the failure manually "even when trying to use exactly the same options." This is consistent with the race hypothesis: the buildfarm client runs the tests in a particular process tree, with particular I/O redirection and parallelism characteristics (a Meson or make check invocation through a Perl harness), which subtly alters scheduler decisions. On a small, slow machine like dikkop (a Raspberry Pi-class FreeBSD/ARM animal, based on community knowledge of buildfarm members), that environmental difference is enough to shift timings into the racy window.

Missing Diagnostic Data

A significant limitation flagged in the thread: no core files were captured. The FreeBSD kernel logged the SIGSEGVs, but the corresponding core files were not located; whether and where cores are written is governed by the kern.coredump and kern.corefile sysctls. Without backtraces, the hypothesis that Andrew's ECPGprepared_statement bug is the root cause remains unverified. Tomas commits to looking harder on the next failure. Capturing usable diagnostics would require:

Ensuring that kern.coredump=1 and that kern.corefile points to a writable location before the buildfarm run.
Optionally setting ulimit -c unlimited in the buildfarm client environment.
Correlating the resulting backtrace with the ECPGprepared_statement fix once it is committed.
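The ulimit step has a programmatic equivalent that a test wrapper could apply to itself. The sketch below uses the POSIX getrlimit/setrlimit calls to raise the soft core-size limit to the hard limit; the function name enable_core_dumps is hypothetical, and nothing in the thread says the buildfarm client does this. It cannot raise the hard limit, so if that is already zero, core dumps stay disabled.

```c
#include <assert.h>
#include <sys/resource.h>

/*
 * Raise the soft RLIMIT_CORE to the hard limit.
 * Returns  0 if core dumps are now permitted by the limit,
 *          1 if the hard limit is zero (dumps remain disabled),
 *         -1 if a system call failed.
 */
int enable_core_dumps(void)
{
    struct rlimit rl;
    struct rlimit check;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    rl.rlim_cur = rl.rlim_max;      /* soft limit up to the hard ceiling */
    if (setrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* Read the limit back to confirm the change took effect. */
    if (getrlimit(RLIMIT_CORE, &check) != 0)
        return -1;
    assert(check.rlim_cur == rl.rlim_max);

    return check.rlim_cur == 0 ? 1 : 0;
}
```

Resource limits are inherited by child processes, so a call like this would need to run in the process that launches the tests, before they are spawned.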

Resolution Strategy

No patch is proposed in this thread. The working plan is implicit: wait for Andrew's ECPGprepared_statement fix to land, observe whether dikkop stabilizes, and only dig deeper if failures persist. This is reasonable triage given the cross-reference to an active fix thread; duplicating the investigation would waste effort.

Participant Weight

Tom Lane — Core committer, historical custodian of much ECPG-adjacent and threading-related code. His diagnostic instinct (pre-existing race + scheduler change) is authoritative and directly informs the triage decision.
Tomas Vondra — Committer and buildfarm owner for dikkop. He is the one with access to the machine and thus the only person who can capture the missing core files.
Andrew (Dunstan, inferred) — Referenced as having "just identified" the ECPGprepared_statement bug in a parallel thread. Not a direct participant here but his work is the suspected fix.

Key Takeaways

The ECPG threading tests most likely expose a latent race that was masked by pre-14.4 FreeBSD scheduling behavior; this remains a hypothesis.
The crash distribution across several thread-using ECPG tests points to shared ecpglib infrastructure, not any one test.
A candidate fix is already in flight in a separate thread about ECPGprepared_statement(), though its connection to dikkop's crashes is unverified.
Investigative progress is currently blocked on capturing a core file; no PostgreSQL code change is being proposed in this thread.