2026-05-07 · opus 4.7

Assertion Failure When Raising Errors from XACT_EVENT_PREPARE Callback

Core Problem

The reporter (Kirill Reshke, working on a sharding extension for PostgreSQL — pg-sharding) is building a two-phase commit (2PC) implementation layered on PostgreSQL shards. To test error paths in his distributed 2PC coordinator, he needs to simulate the case where PREPARE TRANSACTION fails on one of the participating shards — a realistic failure mode when max_prepared_transactions = 0 or other resource limits are hit on a shard.

To inject such a failure deterministically, he wrote a small extension (twopc_aux_tester) that registers an XactCallback and calls elog(ERROR, ...) when the event type is XACT_EVENT_PREPARE. The result on master (commit 5cdec42319) is not a clean error — it is an Assert failure:

TransactionIdIsValid(proc->xid)  -- procarray.c:673

triggered from ProcArrayEndTransaction during AbortTransaction cleanup after the elog'd error propagates up.

Interestingly, raising an error from XACT_EVENT_PRE_PREPARE works cleanly. Only XACT_EVENT_PREPARE triggers the assertion.

Why This Happens (Architectural Analysis)

To understand the failure, one must look at the ordering of operations inside PrepareTransaction() in src/backend/access/transam/xact.c:

CallXactCallbacks(XACT_EVENT_PRE_PREPARE) — fires before any state transition. The transaction is still a normal live xact. An ERROR here rolls back through AbortTransaction, which still sees a valid MyProc->xid and correctly clears it via ProcArrayEndTransaction. Everything is consistent.
The code then performs the heavy lifting of prepare: it writes the 2PC state file, inserts the GXACT into TwoPhaseState, and crucially transfers ownership of the XID from the backend's PGPROC to the dummy PGPROC associated with the prepared transaction (MarkAsPreparing / PostPrepare_Twophase machinery). After this transfer, MyProc->xid is InvalidTransactionId — from the procarray's perspective, this backend no longer owns the XID; the prepared-xact's dummy proc does.
CallXactCallbacks(XACT_EVENT_PREPARE) — fires after the XID has been handed off. At this point, the transaction is, for most bookkeeping purposes, already "prepared." If a callback throws ERROR here, control unwinds into AbortTransaction, which calls ProcArrayEndTransaction(MyProc, latestXid). That function asserts TransactionIdIsValid(proc->xid) because it expects to be clearing an XID that the backend still owns — but it doesn't; the XID has been moved to the GXACT's dummy proc.

So the assertion is telling the truth about an invariant: once the XID has been transferred to the prepared-xact's dummy PGPROC, the normal abort path is no longer valid for cleaning up this backend. Either the callback must not fail, or the abort path needs to know how to deal with a half-prepared transaction.

Is It "Illegal" to elog() from XACT_EVENT_PREPARE?

Strictly speaking, the behavior is undocumented. Looking at the comments around CallXactCallbacks and the XactEvent enum in xact.h, there is no explicit contract about which events tolerate errors. By convention:

PRE_* events (PRE_COMMIT, PRE_PREPARE) are the documented "safe places to fail." Indeed, PRE_PREPARE was specifically introduced so extensions (notably FDWs and logical decoding output plugins) have a place to validate or flush state while errors can still cleanly abort.
The non-PRE_ variants (COMMIT, ABORT, PREPARE, PARALLEL_*) are effectively "notification" callbacks that run after the point of no return. Throwing from them has long been considered unsafe — the canonical example is that throwing from XACT_EVENT_COMMIT leaves a committed-on-disk transaction that the backend then tries to abort in memory, producing WARNING messages and worse.

The XACT_EVENT_PREPARE case sits in the same category as XACT_EVENT_COMMIT: it runs after the state transition is logically complete. The assertion failure here makes this architectural expectation more explicit (in assert-enabled builds) but in a production build the behavior would be equally undefined — you'd likely get either the same WARN/ERROR spiral seen with post-commit errors, or silent state corruption of the procarray.

Design Implications

This raises a broader API-cleanliness question that has come up periodically on pgsql-hackers:

Should the documentation explicitly state which XactEvents may raise? Right now extension authors must infer this from reading xact.c. A comment block on RegisterXactCallback enumerating which events are "pre-commit-safe" vs. "notification-only" would prevent this class of bug.
Should PostgreSQL defensively PG_TRY() around post-point-of-no-return callbacks and downgrade errors to WARNING? This is what is done in some other notification paths (e.g., certain rmgr cleanup hooks). The downside is it masks extension bugs.
Should the assertion in ProcArrayEndTransaction be softened, or should AbortTransaction detect the "XID already transferred to GXACT" state and take a different cleanup path? For the prepare case specifically, one could imagine abort logic that rolls back the prepared transaction (calling FinishPreparedTransaction with isCommit=false semantics on the just-created GXACT). But that significantly complicates the abort path for a corner case.

For the reporter's actual testing goal, the right answer is almost certainly to inject the error from XACT_EVENT_PRE_PREPARE instead, which is the documented injection point. His observation that PRE_PREPARE works fine confirms this path is viable for his test harness.

What a Fix Might Look Like

A minimal, non-behavior-changing patch would:

Add a comment in xact.h near the XactEvent enum documenting which events permit ereport(ERROR).
Possibly upgrade the assertion in ProcArrayEndTransaction to an elog(WARNING) plus early return when called after XID transfer, so production builds don't silently corrupt procarray state.

A more ambitious change would be to wrap the XACT_EVENT_PREPARE callback invocation in a critical section or PG_TRY that demotes errors, matching the treatment given to post-commit callbacks. But that's a semantics change that would need consensus.

Status of the Thread

This is a single-message report with no responses yet in the supplied corpus. It's essentially a bug report / API-clarification question awaiting triage. The likely outcome, based on similar past threads, is: documentation clarification + possibly relaxing the assertion, rather than making post-prepare callbacks error-tolerant.

raising errors from Xact leads to assertion failure