New Development: Root Cause of dikkop Failures Identified — A 22-Year-Old Thread Fallback Bug
Andrew Dunstan has diagnosed the secondary buildfarm failures exposed by the original NULL-connection fix, and it is not a use-after-free in the prepared-statement list as the prior analysis speculated from the corrupted this->stmt pointer. The actual defect is far older and more architectural: when a thread's own EXEC SQL CONNECT fails, ecpg_get_connection() silently falls back to the process-global actual_connection — which may belong to an entirely different thread. That sibling thread then continues to use "its" connection while the failed thread scribbles on the same libpq PGconn concurrently, producing the corrupted statement pointers and NULL conn->result observed on dikkop.
Andrew points to a 2006 report [1] that described exactly these symptoms: a user was holding the mutex the ECPG docs told him to hold, and state still corrupted — because the corruption window was inside ecpglib itself, before the libpq calls the user was serializing. The user's mutex was at the wrong layer. The bug has been latent for ~22 years because the fallback only misbehaves under connection-failure-under-contention, which is rare in practice.
The Fix (credited to Claude)
Track which thread opened each connection; only fall back to the global actual_connection if the current thread is its owner. The patch introduces an ecpg_default_connection() helper and an actual_connection_key (TSD) that records ownership. The per-thread slot populated by EXEC SQL SET CONNECTION is not owner-checked — only the implicit global fallback is.
Deliberate Behavior Change
This is a user-visible semantic change, not a pure bugfix:
| Pattern | Before | After |
|---|---|---|
Main thread CONNECT; workers issue unqualified SQL relying on their own mutex |
Worked by accident (or corrupted state silently) | Fails with ECPG_NO_CONN (SQLCODE -220) |
EXEC SQL AT con1 ... per statement |
Works | Works |
Per-thread EXEC SQL SET CONNECTION con1 at thread start |
Works | Works (populates per-thread slot, bypasses owner check) |
The old comment being removed ("hope the user knows what they're doing, i.e. using their own mutex") is essentially an admission that the documented contract was unachievable — as the 2006 report proved.
Tom's Review
Tom endorses the behavior change over the alternative ("declare the test case buggy, require user mutexes") on safety grounds, acknowledging "we might get some complaints." One code-structure nit: move the ecpg_pthreads_init() call that ensures actual_connection_key validity into ecpg_default_connection() itself, since the new function's comment doesn't advertise that prerequisite — a pattern-matching argument that encapsulation should include initialization of the TSD key the function depends on.
Tom also flags that documentation needs updating regardless of which direction is taken.
Backpatch Decision: Deferred
Andrew and Tom agree to not rush this in. Given the bug has gone undetected for decades and minor releases are next week, they'll wait at least another week. This is a sensible risk-management call: a behavior change that turns previously-"working" (silently-corrupting) programs into loud ECPG_NO_CONN failures is exactly the kind of thing that should not land days before a stamped release.
Significance
This retroactively reframes the prior analysis's concluding observation. The dikkop failure was not a use-after-free in the prepared-statement cache — the corrupted this->stmt pointer was the symptom of two threads racing on a shared PGconn via the global fallback, where one thread's in-flight query response overwrote the other's view of libpq state. The statement cache itself is fine; the bug is one level down, in how ecpg_get_connection() resolves "no connection specified."
It also validates the architectural concern from the prior analysis that libecpg's thread-local connection bookkeeping was suspect — but the defect is in the fallback policy, not in list locking or premature free.