Assertion Failure When Raising Errors from XACT_EVENT_PREPARE Callback
Core Problem
The reporter (Kirill Reshke, working on a sharding extension for PostgreSQL — pg-sharding) is building a two-phase commit (2PC) implementation layered on PostgreSQL shards. To test error paths in his distributed 2PC coordinator, he needs to simulate the case where PREPARE TRANSACTION fails on one of the participating shards — a realistic failure mode when max_prepared_transactions = 0 or other resource limits are hit on a shard.
To inject such a failure deterministically, he wrote a small extension (twopc_aux_tester) that registers an XactCallback and calls elog(ERROR, ...) when the event type is XACT_EVENT_PREPARE. The result on master (commit 5cdec42319) is not a clean error — it is an Assert failure:
TransactionIdIsValid(proc->xid) -- procarray.c:673
triggered from ProcArrayEndTransaction during AbortTransaction cleanup after the elog'd error propagates up.
Interestingly, raising an error from XACT_EVENT_PRE_PREPARE works cleanly. Only XACT_EVENT_PREPARE triggers the assertion.
Why This Happens (Architectural Analysis)
To understand the failure, one must look at the ordering of operations inside PrepareTransaction() in src/backend/access/transam/xact.c:
-
CallXactCallbacks(XACT_EVENT_PRE_PREPARE)— fires before any state transition. The transaction is still a normal live xact. An ERROR here rolls back throughAbortTransaction, which still sees a validMyProc->xidand correctly clears it viaProcArrayEndTransaction. Everything is consistent. -
The code then performs the heavy lifting of prepare: it writes the 2PC state file, inserts the GXACT into
TwoPhaseState, and crucially transfers ownership of the XID from the backend's PGPROC to the dummy PGPROC associated with the prepared transaction (MarkAsPreparing/PostPrepare_Twophasemachinery). After this transfer,MyProc->xidisInvalidTransactionId— from the procarray's perspective, this backend no longer owns the XID; the prepared-xact's dummy proc does. -
CallXactCallbacks(XACT_EVENT_PREPARE)— fires after the XID has been handed off. At this point, the transaction is, for most bookkeeping purposes, already "prepared." If a callback throws ERROR here, control unwinds intoAbortTransaction, which callsProcArrayEndTransaction(MyProc, latestXid). That function assertsTransactionIdIsValid(proc->xid)because it expects to be clearing an XID that the backend still owns — but it doesn't; the XID has been moved to the GXACT's dummy proc.
So the assertion is telling the truth about an invariant: once the XID has been transferred to the prepared-xact's dummy PGPROC, the normal abort path is no longer valid for cleaning up this backend. Either the callback must not fail, or the abort path needs to know how to deal with a half-prepared transaction.
Is It "Illegal" to elog() from XACT_EVENT_PREPARE?
Strictly speaking, the behavior is undocumented. Looking at the comments around CallXactCallbacks and the XactEvent enum in xact.h, there is no explicit contract about which events tolerate errors. By convention:
PRE_*events (PRE_COMMIT,PRE_PREPARE) are the documented "safe places to fail." Indeed,PRE_PREPAREwas specifically introduced so extensions (notably FDWs and logical decoding output plugins) have a place to validate or flush state while errors can still cleanly abort.- The non-
PRE_variants (COMMIT,ABORT,PREPARE,PARALLEL_*) are effectively "notification" callbacks that run after the point of no return. Throwing from them has long been considered unsafe — the canonical example is that throwing fromXACT_EVENT_COMMITleaves a committed-on-disk transaction that the backend then tries to abort in memory, producing WARNING messages and worse.
The XACT_EVENT_PREPARE case sits in the same category as XACT_EVENT_COMMIT: it runs after the state transition is logically complete. The assertion failure here makes this architectural expectation more explicit (in assert-enabled builds) but in a production build the behavior would be equally undefined — you'd likely get either the same WARN/ERROR spiral seen with post-commit errors, or silent state corruption of the procarray.
Design Implications
This raises a broader API-cleanliness question that has come up periodically on pgsql-hackers:
-
Should the documentation explicitly state which XactEvents may raise? Right now extension authors must infer this from reading xact.c. A comment block on
RegisterXactCallbackenumerating which events are "pre-commit-safe" vs. "notification-only" would prevent this class of bug. -
Should PostgreSQL defensively
PG_TRY()around post-point-of-no-return callbacks and downgrade errors to WARNING? This is what is done in some other notification paths (e.g., certain rmgr cleanup hooks). The downside is it masks extension bugs. -
Should the assertion in
ProcArrayEndTransactionbe softened, or shouldAbortTransactiondetect the "XID already transferred to GXACT" state and take a different cleanup path? For the prepare case specifically, one could imagine abort logic that rolls back the prepared transaction (callingFinishPreparedTransactionwithisCommit=falsesemantics on the just-created GXACT). But that significantly complicates the abort path for a corner case.
For the reporter's actual testing goal, the right answer is almost certainly to inject the error from XACT_EVENT_PRE_PREPARE instead, which is the documented injection point. His observation that PRE_PREPARE works fine confirms this path is viable for his test harness.
What a Fix Might Look Like
A minimal, non-behavior-changing patch would:
- Add a comment in
xact.hnear theXactEventenum documenting which events permit ereport(ERROR). - Possibly upgrade the assertion in
ProcArrayEndTransactionto anelog(WARNING)plus early return when called after XID transfer, so production builds don't silently corrupt procarray state.
A more ambitious change would be to wrap the XACT_EVENT_PREPARE callback invocation in a critical section or PG_TRY that demotes errors, matching the treatment given to post-commit callbacks. But that's a semantics change that would need consensus.
Status of the Thread
This is a single-message report with no responses yet in the supplied corpus. It's essentially a bug report / API-clarification question awaiting triage. The likely outcome, based on similar past threads, is: documentation clarification + possibly relaxing the assertion, rather than making post-prepare callbacks error-tolerant.