Commit Sequence Numbers and Visibility: Deep Technical Analysis
Core Problem
PostgreSQL's current snapshot mechanism is built on the procarray — a shared-memory structure listing all active transaction IDs. When a backend acquires a snapshot, it scans the procarray to determine which transactions are "in progress" and therefore whose effects should be invisible. This mechanism has three fundamental architectural problems:
-
Opacity and non-determinism: The snapshot's contents depend on the instantaneous state of the procarray, which is difficult to observe, reason about, or reproduce externally. Two snapshots taken at nearly the same wall-clock time can see radically different sets of committed transactions depending on the exact timing of procarray removals.
-
Primary-replica divergence: Some snapshots that are valid on the primary are impossible to reproduce on the replica. The replica applies WAL sequentially, so its visibility is inherently LSN-ordered. The procarray-based snapshot on the primary can include transaction A but exclude transaction B even if B committed (wrote its commit record) before A — simply because B hadn't yet been removed from the procarray. This creates a fundamental impedance mismatch for logical replication, standby query consistency, and any system trying to reason about causal ordering across primary and replica.
-
Scalability: Acquiring a procarray-based snapshot requires scanning all active transaction slots under a lock (LWLock in shared mode), which becomes a contention bottleneck at high connection counts. This is a well-known scalability problem that has motivated years of incremental improvements (snapshot-too-old, CSN proposals, etc.).
Proposed Solution: CSN-Based Snapshots
The proposal redefines snapshot semantics around a single scalar value — the Commit Sequence Number (CSN), which is simply a WAL LSN:
A transaction's changes are visible if and only if its commit record LSN ≤ the snapshot's CSN.
This is elegant because it collapses a complex set-membership question (is xid X in the snapshot?) into a simple scalar comparison. It also aligns perfectly with the replica's natural visibility model (everything up to the last-applied LSN is visible).
The Two Key Design Questions
Question 1: How to Choose the CSN for a New Snapshot
Three constraints bound the CSN choice:
- (a) Monotonicity: CSN ≥ any previously acquired snapshot's CSN (prevents going "back in time")
- (b) Upper bound: CSN ≤ last WAL insert position (can't see the future)
- (c) Completeness: All transactions that have released locks must be visible (preserves the guarantee that if you can observe a transaction's effects through any side-channel — e.g., the lock is no longer held — you will see its changes)
The proposal: Choose the minimum CSN satisfying constraint (c), defined as the highest commit record LSN among all transactions that have finished and released locks. This is tracked via a new shared atomic variable maxTransactionFinishedPtr.
Why not the WAL insert pointer? Previous proposals in the referenced thread [1] used the WAL insert pointer. Jeff argues this has no advantage and makes Question 2 harder — because the insert pointer races ahead of what's been flushed/replicated, creating a larger gap between "what the CSN includes" and "what's durable."
The minimum-CSN approach is conservative: it only advances when transactions actually complete, which means the CSN naturally stays close to what has already met its durability requirements.
Question 2: Durability Requirements Before Snapshot Use
This is the subtler and more consequential question. The current procarray system has an implicit (arguably buggy) property: a synchronous-replication transaction T1 can see the effects of an async transaction T0 before T0 is flushed to disk, because T0 might have left the procarray (committed from its own perspective) before T1 takes its snapshot.
The proposal: Before using a snapshot with CSN = L, the backend must wait until LSN L is durable to the standard of the transaction using the snapshot. A synchronous_commit = on reader must wait until the CSN is replicated to the standby. A synchronous_commit = local reader must wait until the CSN is flushed locally.
This is a semantic shift: durability obligations are now defined by the reader rather than the writer. Jeff argues this is "least-surprising" — a sync transaction should never be able to read uncommitted-from-its-own-durability-perspective data.
Critical Tradeoff: The Mixed-Durability Worst Case
The most significant regression risk comes from mixed-durability workloads. The concrete scenario:
- Sync transaction T1 writes commit record at LSN 122 (not yet flushed)
- Async transaction T2 writes commit record at LSN 123, finishes, releases locks
maxTransactionFinishedPtradvances to 123- Sync transaction T3 takes snapshot with CSN = 123
- CSN 123 includes T1 (commit LSN 122 < 123), but T1 hasn't flushed yet
- T3 must wait for LSN 122 to flush before using its snapshot
With procarray-based snapshots, T1 would still be in the procarray (it hasn't released locks because it's waiting for flush), so it would simply be excluded from T3's snapshot. CSN-based snapshots cannot do this — once you define visibility as "commit LSN ≤ CSN", there's no way to selectively exclude a transaction whose commit LSN falls below the CSN.
The proposed mitigation is to track additional metadata about which unflushed transactions are async vs. sync, allowing sync readers to skip waiting when the only unflushed commits below the CSN are from async writers. But the worst case — frequent async writers pushing maxTransactionFinishedPtr forward, interleaved with infrequent sync writers that haven't flushed, and frequent sync readers that must wait — may be unavoidable.
Implementation Architecture
The core implementation is surprisingly minimal:
/* Updated in ProcArrayEndTransaction, AFTER clog update and durability wait,
BEFORE lock release */
pg_atomic_monotonic_advance_u64(&maxTransactionFinishedPtr, myCommitLSN);
The ordering is critical:
- After CLOG update: Otherwise a snapshot user could see the commit via CSN but find CLOG says "in progress" — a visibility inconsistency. (Alternatively, visibility checks could be modified to not rely solely on CLOG, as previous implementations in [1] did.)
- After durability requirement met: So that the variable's value represents what's actually durable to the writer's standard.
- Before lock release: Prevents the race where a transaction releases locks (making its effects observable via lock absence) but hasn't yet updated
maxTransactionFinishedPtr.
The use of pg_atomic_monotonic_advance_u64() is lock-free and uses a compare-and-swap loop that only succeeds if advancing the value, making it both correct under concurrency and extremely cheap (no LWLock needed).
Compatibility Considerations
-
SERIALIZABLE/SSI: Jeff notes that SSI should be compatible because it operates on conflict detection between transactions, which is orthogonal to how snapshots are defined. The serializable order can differ from CSN order, but this is already true (serializable order can differ from commit order in the procarray world).
-
Subtransactions & 2PC: Claimed to have no major impact. This seems correct — subtransaction commits don't write WAL commit records (they write subxact records), and the top-level commit LSN is what matters for CSN. 2PC's COMMIT PREPARED writes a commit record like any other transaction.
-
Replica behavior: Trivially correct — the replica's CSN is simply the last-applied WAL LSN, which is already how hot standby visibility works conceptually.
Architectural Significance
This proposal represents a fundamental shift in PostgreSQL's consistency model from set-based (snapshot = set of visible XIDs) to scalar-based (snapshot = single LSN threshold). This has profound implications:
- Snapshot acquisition becomes O(1) instead of O(active_transactions)
- Cross-system consistency becomes trivially expressible (give me everything up to LSN X)
- Logical decoding and external CDC systems can reason about exactly what a snapshot contains
- Replica snapshots and primary snapshots become identical in semantics, eliminating an entire class of replication anomalies
The cost is the mixed-durability regression and a conceptual loss: you can no longer have a snapshot that "skips" a committed transaction while including a later one. This is actually a feature — it eliminates a class of anomalies — but it means some workloads that relied on the procarray's accidental behavior may see latency changes.