Race Conditions in Logical Decoding: CLOG vs Snapshot Consistency
The Core Problem
This thread exposes a fundamental race condition in PostgreSQL's logical decoding infrastructure that can cause data corruption on both subscriber and publisher.
The Race Window
The issue lies in the ordering of operations during transaction commit:
RecordTransactionCommit()writes the COMMIT WAL record- The CLOG is updated to mark the transaction as committed
- The transaction waits for synchronous replica ACK (if configured)
- The transaction is removed from ProcArray
Logical decoding reads WAL records asynchronously. When DecodeCommit() processes a COMMIT record, it calls SnapBuildCommitTxn() → SnapBuildAddCommittedTxn(), which adds the XID to the snapshot builder's committed.xip[] array. However, the CLOG may not yet reflect the committed status at this point, because the WAL write (step 1) precedes the CLOG update (step 2).
When SnapBuildBuildSnapshot() subsequently constructs a snapshot, it includes this XID in snapshot->xip as a committed transaction. If the snapshot is then used for visibility checks before CLOG is updated, HeapTupleSatisfiesMVCC (or similar) will consult CLOG and get the wrong answer — the transaction appears uncommitted or in-progress.
Consequences
- Wrong data on subscriber: Logical replication delivers incorrect tuple visibility results
- Corrupted publisher table: Hint bits are set based on incorrect visibility determinations. Once hint bits are written, the corruption persists even after CLOG is eventually updated
- Amplified by REPACK: The pg_repack extension's concurrent operations create stress patterns that expose this race more frequently than standard logical replication
Why This Is Architecturally Significant
PostgreSQL's MVCC correctness depends on a critical invariant: if a transaction appears committed in a snapshot, then CLOG must confirm it as committed when consulted. For ProcArray-based snapshots, this is guaranteed because:
- XID is not removed from ProcArray until after CLOG is updated
- Snapshot construction reads the running-transaction list from ProcArray
Logical decoding snapshots bypass this mechanism entirely — they derive committed transaction lists from WAL records, creating a new code path that violates the invariant.
Proposed Solutions
Approach 1: Wait for CLOG in SnapBuildBuildSnapshot() (Preferred — 0001)
Before finalizing the snapshot, iterate through all XIDs in builder->committed.xip[] and verify via TransactionIdDidCommit() that CLOG confirms the commit. If not, sleep and retry.
Álvaro's optimized version:
TransactionId *stillrunning;
int nstillrunning = builder->committed.xcnt;
stillrunning = palloc(sizeof(TransactionId) * builder->committed.xcnt);
memcpy(stillrunning, builder->committed.xip, sizeof(TransactionId) * nstillrunning);
for (;;) {
int next = 0;
if (nstillrunning == 0) break;
for (int i = 0; i < nstillrunning; i++) {
if (!TransactionIdDidCommit(stillrunning[i]))
stillrunning[next++] = stillrunning[i];
}
if (next == 0) break;
nstillrunning = next;
WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
10L, WAIT_EVENT_SNAPBUILD_CLOG);
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
}
This batches the wait — testing all remaining XIDs per iteration, sleeping once, then retesting only those still not confirmed. Antonin notes this is likely over-engineering since the race window is tiny and almost always resolves after one check, but doesn't object.
Key tradeoff: Using TransactionIdDidCommit() without first checking ProcArray is unusual in PostgreSQL. Normally CLOG is only consulted after confirming a transaction is no longer in ProcArray. The justification here is that logical decoding has its own mechanism for determining committed status (WAL records), so the CLOG check is purely a synchronization fence to ensure the CLOG write has landed.
Approach 2: Use TransactionIdIsInProgress() to omit still-running XIDs (0002)
Instead of waiting, simply skip any XID from committed.xip[] that TransactionIdIsInProgress() reports as still running. The logic: if it's still in ProcArray, its CLOG update hasn't landed, so don't include it in this snapshot — it'll be picked up in a future snapshot.
Problems identified:
- Causes subscription test failures (snapshot correctness issues)
TransactionIdIsInProgress()compares againstRecentXminbefore accessing CLOG, which interacts badly withPROC_IN_SAFE_ICflag — transactions may be incorrectly deemed "not in progress" while still uncommitted in CLOG- Conceptually questionable: a decoded COMMIT record definitively means the transaction committed; omitting it creates an inconsistent view
Rejected Approach: XactLockTableWait() in SnapBuildAddCommittedTxn()
Antonin's initial attempt caused deadlocks with synchronous replication:
- Publisher decodes COMMIT, waits for transaction lock release
- But the committing transaction is waiting for synchronous replica ACK
- The replica ACK requires the subscriber to confirm receipt of decoded data
- Circular dependency: Publisher can't send data because it's waiting → subscriber can't ACK → publisher's commit can't complete → lock never releases
This same deadlock affects any approach using XactLockTableWait() or TransactionIdIsInProgress() with synchronous replication.
Why Approach 1 Avoids Deadlock
The TransactionIdDidCommit() poll-and-sleep approach doesn't acquire any locks and doesn't block the committing transaction. It simply reads CLOG state. The committing transaction will update CLOG regardless of what the decoder is doing (CLOG update precedes sync replication wait in the commit sequence). So the decoder will see the commit in CLOG almost immediately, well before the sync replication wait completes.
Wait — actually re-reading the commit sequence: WAL write → CLOG update → sync rep wait → ProcArray removal. So CLOG is updated before sync rep wait. This means the race window is actually between WAL write and CLOG update, which is extremely narrow. The poll approach with 10ms timeout is more than sufficient.
Key Design Decisions and Open Questions
-
Where to place the fix: Álvaro asked whether the fix belongs in
DecodeCommit()(wait before adding to committed array) vsSnapBuildBuildSnapshot()(wait before using the array). The latter was chosen because it's the point where correctness matters. -
Commentary needed: Álvaro notes that using
TransactionIdDidCommit()without ProcArray verification needs careful documentation explaining why it's safe in this context. -
New wait event:
WAIT_EVENT_SNAPBUILD_CLOGwould need to be added to track this specific wait condition. -
Bug-fix vs feature: Antonin notes this is a bug fix for existing code, not subject to feature freeze, though it was eventually added to CommitFest for tracking.
Relationship to pg_repack
The bug was discovered through stress testing of the REPACK patch (CF #5117). REPACK performs concurrent table reorganization that creates high-frequency snapshot construction during logical decoding, dramatically increasing the probability of hitting this race window. While the bug exists in core PostgreSQL logical replication, REPACK makes it practically reproducible rather than merely theoretical.