[BUG] CRASH: ECPGprepared_statement() and ECPGdeallocate_all() when connection is NULL

First seen: 2025-12-08 15:38:56+00:00 · Messages: 24 · Participants: 7

Latest Update

2026-05-07 · opus 4.7

New Development: Root Cause of dikkop Failures Identified — A 22-Year-Old Thread Fallback Bug

Andrew Dunstan has diagnosed the secondary buildfarm failures exposed by the original NULL-connection fix, and it is not a use-after-free in the prepared-statement list as the prior analysis speculated from the corrupted this->stmt pointer. The actual defect is far older and more architectural: when a thread's own EXEC SQL CONNECT fails, ecpg_get_connection() silently falls back to the process-global actual_connection — which may belong to an entirely different thread. That sibling thread then continues to use "its" connection while the failed thread scribbles on the same libpq PGconn concurrently, producing the corrupted statement pointers and NULL conn->result observed on dikkop.

Andrew points to a 2006 report [1] that described exactly these symptoms: a user was holding the mutex the ECPG docs told him to hold, and state still corrupted — because the corruption window was inside ecpglib itself, before the libpq calls the user was serializing. The user's mutex was at the wrong layer. The bug has been latent for ~22 years because the fallback only misbehaves under connection-failure-under-contention, which is rare in practice.

The Fix (credited to Claude)

Track which thread opened each connection; only fall back to the global actual_connection if the current thread is its owner. The patch introduces an ecpg_default_connection() helper and an actual_connection_key (TSD) that records ownership. The per-thread slot populated by EXEC SQL SET CONNECTION is not owner-checked — only the implicit global fallback is.

Deliberate Behavior Change

This is a user-visible semantic change, not a pure bugfix:

Pattern Before After
Main thread CONNECT; workers issue unqualified SQL relying on their own mutex Worked by accident (or corrupted state silently) Fails with ECPG_NO_CONN (SQLCODE -220)
EXEC SQL AT con1 ... per statement Works Works
Per-thread EXEC SQL SET CONNECTION con1 at thread start Works Works (populates per-thread slot, bypasses owner check)

The old comment being removed ("hope the user knows what they're doing, i.e. using their own mutex") is essentially an admission that the documented contract was unachievable — as the 2006 report proved.

Tom's Review

Tom endorses the behavior change over the alternative ("declare the test case buggy, require user mutexes") on safety grounds, acknowledging "we might get some complaints." One code-structure nit: move the ecpg_pthreads_init() call that ensures actual_connection_key validity into ecpg_default_connection() itself, since the new function's comment doesn't advertise that prerequisite — a pattern-matching argument that encapsulation should include initialization of the TSD key the function depends on.

Tom also flags that documentation needs updating regardless of which direction is taken.

Backpatch Decision: Deferred

Andrew and Tom agree to not rush this in. Given the bug has gone undetected for decades and minor releases are next week, they'll wait at least another week. This is a sensible risk-management call: a behavior change that turns previously-"working" (silently-corrupting) programs into loud ECPG_NO_CONN failures is exactly the kind of thing that should not land days before a stamped release.

Significance

This retroactively reframes the prior analysis's concluding observation. The dikkop failure was not a use-after-free in the prepared-statement cache — the corrupted this->stmt pointer was the symptom of two threads racing on a shared PGconn via the global fallback, where one thread's in-flight query response overwrote the other's view of libpq state. The statement cache itself is fine; the bug is one level down, in how ecpg_get_connection() resolves "no connection specified."

It also validates the architectural concern from the prior analysis that libecpg's thread-local connection bookkeeping was suspect — but the defect is in the fallback policy, not in list locking or premature free.

History (1 prior analysis)
2026-05-06 · opus 4.7

ECPG NULL-Connection Crashes: Defensive Hardening of the Client Interface

Core Problem

ECPG (Embedded SQL in C) is PostgreSQL's precompiler that translates embedded SQL into C code using libecpg. The reported bug exposes a long-standing class of defects: several entry points in libecpg dereference the struct connection * returned by ecpg_get_connection() without checking for NULL. When a program calls functions such as ECPGprepared_statement(), ECPGdeallocate_all(), or ECPGget_desc() after a failed EXEC SQL CONNECT (or without ever connecting), ecpg_get_connection() legitimately returns NULL, and the callee crashes with SIGSEGV.

The two initial backtraces make the defect concrete:

  • ecpg_find_prepared_statement() at prepare.c:277 dereferences con->prep_stmts with con == NULL.
  • ecpg_deallocate_all_conn() at prepare.c:372 does while (con->prep_stmts) with con == NULL.

This is reproducible from v13 through master, which indicates the miss is historical rather than a recent regression. Architecturally, the issue matters because libecpg is an application-facing library: unlike backend code where an invariant-violated NULL is an assertion failure, here a NULL connection is an expected user-facing state (connect failed, or user chose a nonexistent connection name). The correct behavior is to raise an ECPG SQL error (SQLCODE -220, ECPG_NO_CONN, SQLSTATE 08003 "connection does not exist") — exactly what ecpg_init() already does for most call sites.

Design Choices and Evolution of the Fix

v1: Inline NULL check with ecpg_raise()

Shruthi's first patch added explicit if (!con) { ecpg_raise(...); return; } blocks after each unchecked ecpg_get_connection(). This mirrors the pattern already used in descriptor.c.

v2: Switch to ecpg_init()

Fujii Masao pushed back: most existing call sites delegate this validation to ecpg_init(), which performs ecpg_init_sqlca() plus the connection/NULL check in one step. Using ecpg_init() is more consistent and does the SQLCA housekeeping that a bare ecpg_raise() skips. Shruthi agreed and converted the patch.

v3: Nishant's pushback — preserve local style per file

Nishant Sharma pointed out that this one-size-fits-all switch was wrong for descriptor.c: that file consistently uses the if (!conn) ecpg_raise(...) idiom, and switching to ecpg_init() there would (a) create inconsistency within the file and (b) force removing the existing ecpg_init_sqlca(sqlca) line, (c) require moving the check to the top of ECPGget_desc() even though many of that function's early-return paths don't need it. This is a nuanced point: ecpg_init() is the right tool when you want to start an ECPG operation; ecpg_raise() after an inline check is the right tool when you're validating mid-function without re-initializing SQLCA.

Additional site: ecpg_freeStmtCacheEntry()

Nishant also identified a fourth vulnerable site in prepare.c's statement-cache eviction. Crucially, the con in that function is the connection stored on the cached statement, not one freshly looked up from the caller's connection name — so caller-side validation is insufficient. This is a subtle lifetime issue: a cached statement can outlive or reference a different connection than the one the current call is operating on.

A small disagreement arose on the correct semantics: should ecpg_freeStmtCacheEntry() refuse to clean up when con == NULL (return -1), or should it still perform the local memory/slot cleanup? Shruthi argued correctly that cache cleanup must proceed regardless: the slot is being reclaimed, the memory must be freed to prevent leaks, and a NULL connection just means the server-side DEALLOCATE is unsafe/unnecessary. The NULL check should gate only the remote deallocation, not the local bookkeeping. This is the right call — conflating "can I talk to the server?" with "can I free local memory?" would cause leaks.

Locale-cleanup trap in ECPGget_desc()

Nishant caught a latent resource leak in the proposed ECPGget_desc() fix: the function allocates stmt.oldlocale = ecpg_strdup(...) and calls setlocale() before the point where the NULL check was originally proposed. Returning early at that point would leak the strdup'd locale and leave the thread's locale altered. Shruthi's fix was structurally cleaner than adding cleanup goto: move the connection check above the locale manipulation so the early return happens before any resource acquisition. This respects the RAII-like ordering rule that validation precede acquisition.

Test Coverage

Mahendrakar requested crash tests; Nishant enforced ECPG's local test conventions:

  • Files named testN.pgc (not descriptive names), with descriptions inside.
  • Expected outputs in src/interfaces/ecpg/test/expected/connect-testN.c.
  • Appropriate entries in .gitignore and meson.build.

The test exercises ECPGget_desc, ECPGdeallocate_all, ECPGprepared_statement, and the connection-less ECPGdo path. The ecpg_freeStmtCacheEntry() path could not be reached by black-box testing — Shruthi noted no existing test covers that line either, which suggests the statement cache's eviction path is broadly under-tested.

Back-branch considerations: meson wasn't present in REL_15/REL_14, requiring a separate v3_test_v15 patch stripping meson bits. Also a cleanup ROLLBACK was added before DISCONNECT to avoid leaving test table test1 behind.

Post-Commit Fallout

After Andrew Dunstan pushed the combined patches on 2026-05-01, Alexander Lakhin reported that the buildfarm animal dikkop was still segfaulting in thread/prep and thread/alloc — reproducible by constraining max_connections = 10 and running the ecpg tests. The crash signature is different from the original bug:

  1. In thread/prep, deallocate_one() crashes at this->stmt->connection->connection because this->stmt itself is corrupted (0x242028205345554c — that's ASCII " ( SELEC", i.e., fragments of a query string), indicating a use-after-free or concurrent-mutation issue in the prepared-statement list.
  2. In thread/alloc, pqRowProcessor crashes because conn->result is NULL — libpq-level state corruption.

Tom Lane confirmed reproduction and observed "sorry, too many clients already" in the postmaster log: when multiple threads attempt connections in parallel and some fail, the test programs don't tolerate the failure — they proceed to use prepared statements / execute queries on partially-initialized or concurrently-torn-down state. Tom explicitly noted this is not new in v19; v18 behaves the same.

So the original NULL-connection fix is correct and complete for its stated scope, but it exposed a deeper, pre-existing concurrency bug in ecpg's thread-aware test programs (and possibly in libecpg's thread-local connection bookkeeping). This is a classic outcome of defensive hardening: fixing the easy NULL-deref shifts the failure mode to the more serious race. Andrew committed to digging further.

Architectural Observations

  1. libecpg uses two parallel idioms for connection validation (ecpg_init() vs. inline ecpg_get_connection() + ecpg_raise()). Neither is wrong, but the inconsistency is precisely what allowed these sites to be overlooked for a decade-plus. A static-analysis audit or a wrapper macro enforcing the check would prevent recurrence.

  2. The statement cache holds its own connection pointer, independent of the caller's current connection. Any cleanup path through the cache must treat that pointer as potentially stale. This is the same pattern that causes bugs in libpq's async result management.

  3. Resource-acquisition ordering matters in error paths. The ECPGget_desc() locale-save issue shows that fixes to "return earlier" can silently leak if validation is inserted after acquisition. Moving the check up is structurally safer than adding cleanup.

  4. The dikkop failures suggest libecpg's thread-safe connection handling has a use-after-free when connections fail under contention. The corrupted this->stmt pointer indicates either an unlocked list mutation or premature free of a struct statement while another thread holds a pointer into the prepared-statements list.