Deep Technical Analysis: 64-bit XIDs in PostgreSQL
The Architectural Problem
PostgreSQL's 32-bit TransactionId caps how many XIDs can be consumed before old tuples must be frozen: once roughly 2 billion XIDs (2^31) have been used, the system must complete anti-wraparound vacuuming or stop assigning new XIDs to prevent data corruption. Under high-throughput OLTP workloads on modern hardware, this ceiling can be reached in hours rather than years. The Postgres Pro fork has deployed 64-bit XIDs in production for customers whose workloads would wrap a 32-bit counter every 1-5 days, which motivated Maxim Orlov's proposal to upstream this work for PG15.
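The "hours rather than years" figure follows from simple arithmetic. The sketch below assumes a sustained load in which every transaction consumes exactly one XID; the function name and the sample rates are illustrative and do not come from the thread.

```python
# Usable XID distance before wraparound protection must act: about 2^31.
XID_HORIZON = 2**31


def hours_to_exhaust(xids_per_second: float) -> float:
    """Hours of sustained load needed to consume XID_HORIZON XIDs."""
    return XID_HORIZON / xids_per_second / 3600.0


if __name__ == "__main__":
    # Sample rates are hypothetical workloads, not measurements.
    for rate in (1_000, 10_000, 100_000):
        print(f"{rate:>7} XIDs/s -> {hours_to_exhaust(rate):8.1f} hours")
```

At 100,000 XIDs per second the horizon is consumed in roughly six hours; at 1,000 per second it takes about 25 days.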
The architectural tension is fundamental:
- Tuple header space is precious. HeapTupleHeader has t_xmin and t_xmax as 32-bit fields. Naively widening them to 64-bit adds 8 bytes to every tuple, which is unacceptable disk bloat.
- Wraparound is not just a counter issue. It drives freezing, which in turn drives SLRU truncation (pg_xact, pg_multixact). Any design must decide what to do about the freezing regime.
- pg_upgrade must work. Billions of existing pages cannot be rewritten at upgrade time. Any on-disk format change must be convertible lazily.
The Proposed Design (Postgres Pro Lineage)
The patch (derived from Alexander Korotkov's earlier work and used commercially by Postgres Pro) takes a page-level base XID approach:
- TransactionId is redefined as int64 throughout the code.
- On disk, tuple headers retain 32-bit xmin/xmax (now ShortTransactionId), interpreted as offsets from per-page pd_xid_base and pd_multi_base values stored in the page's special area (16 bytes added).
- Because upgraded pages may lack space for the 16-byte special, a transitional "double xmax" format is introduced: xmin is treated as virtual FrozenTransactionId and xmax occupies both 32-bit slots as a full 64-bit value. Pages are converted to the normal format lazily by VACUUM/pruning when space becomes available.
- In-memory HeapTuple carries precomputed 64-bit xmin/xmax copied from the page under buffer content lock.
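The page-level base scheme can be sketched in a few lines. This is a simplified model and not the patch's code: the helper names are hypothetical, the pass-through of reserved XIDs (values below 3) is an assumption, and so is the choice of which 32-bit slot holds the high half in the double-xmax format.

```python
from dataclasses import dataclass
from typing import Tuple

FROZEN_XID = 2        # FrozenTransactionId
FIRST_NORMAL_XID = 3  # XIDs 0..2 are reserved and carry no page offset


@dataclass
class PageSpecial:
    pd_xid_base: int    # 64-bit base for xmin/xmax stored on this page
    pd_multi_base: int  # 64-bit base for multixact IDs stored on this page


def short_to_full(short_xid: int, base: int) -> int:
    """Expand an on-disk 32-bit XID to 64 bits using the page base."""
    if short_xid < FIRST_NORMAL_XID:
        return short_xid  # assumption: reserved values pass through unchanged
    return base + short_xid


def full_to_short(xid: int, base: int) -> int:
    """Compress a 64-bit XID to its on-disk 32-bit form, or fail if out of range."""
    if xid < FIRST_NORMAL_XID:
        return xid
    short = xid - base
    if not FIRST_NORMAL_XID <= short < 2**32:
        raise ValueError("XID does not fit on this page with the current base")
    return short


def double_xmax_decode(slot_hi: int, slot_lo: int) -> Tuple[int, int]:
    """Transitional format: xmin is virtually frozen, xmax spans both slots.

    Which slot holds the high 32 bits is an assumption of this sketch.
    """
    return FROZEN_XID, (slot_hi << 32) | slot_lo
```

The point of the design is visible in full_to_short: a tuple keeps its 32-bit on-disk fields, and the cost of 64-bit XIDs is paid once per page rather than once per tuple.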
Key Design Debates and Disagreements
Debate 1: The "double xmax" transitional format
Jim Finnerty (Amazon) pushed hard for eliminating double-xmax by either
pre-upgrade preparation (pg_repack-style rewriting) or post-upgrade
conversion before completion. Korotkov responded that Postgres Pro
tried this approach (pg_pageprep) and found it "very difficult and
unreliable" — double-xmax was the pragmatic fallback. Robert Haas sided
against pre-upgrade preparation on rollout grounds: you can't ship a
feature that requires waiting multiple major versions. This debate was
eventually resolved in 2026 by dropping double-xmax entirely in favor of
a PD_HAS_NO_SPECIAL flag that overlays base values onto existing page
header fields (pd_prune_xid, pd_pagesize_version/pd_special).
Debate 2: 32-bit limit inside a page
Stephen Frost raised an early concern: even with 64-bit XIDs globally,
a single page can only hold XIDs within ~2^32 of its base. A long-running
transaction (e.g. a week-long OLAP query on a 100K TPS system) could
make it impossible to insert new tuples onto pages containing its XIDs,
causing user-visible errors. Peter Geoghegan deepened this critique:
freezing requires the XID be committed AND visible to all snapshots, so
you can't just freeze your way out. The patch's answer — throw an error
on insert when heap_page_prepare_for_xid can't shift the base — is
regarded as a user-hostile "gotcha."
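The failure mode can be made concrete. The sketch below is a hypothetical model of the decision the thread attributes to heap_page_prepare_for_xid, not the patch's implementation: it looks for a base that covers both the oldest XID still on the page and the XID about to be written, and raises when the two are more than a 32-bit offset apart. It assumes both inputs are normal 64-bit XIDs (3 or greater).

```python
FIRST_NORMAL_XID = 3
MAX_SHORT_XID = 2**32 - 1


class PageXidRangeError(Exception):
    """No single base covers both the oldest XID on the page and the new one."""


def choose_base(oldest_xid_on_page: int, new_xid: int) -> int:
    """Return a base mapping both XIDs into [FIRST_NORMAL_XID, MAX_SHORT_XID]."""
    lo = min(oldest_xid_on_page, new_xid)
    hi = max(oldest_xid_on_page, new_xid)
    base = lo - FIRST_NORMAL_XID  # lowest base that still represents lo
    if hi - base > MAX_SHORT_XID:
        raise PageXidRangeError(
            f"XIDs {lo} and {hi} are too far apart for one page"
        )
    return base


# XIDs consumed during a week-long transaction at 100K TPS (Frost's example).
WEEK_OF_XIDS = 7 * 24 * 3600 * 100_000
```

WEEK_OF_XIDS is about 6.0e10, well above 2^32 (about 4.3e9), so a page that still holds an unfreezable XID from the start of that week cannot accept a tuple stamped with a current XID.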
Debate 3: Should we remove xidStopLimit / the failsafe?
This became the dominant philosophical argument. Robert Haas argued forcefully: xidStopLimit's sole purpose is preventing XID reuse that corrupts data; once that's impossible, keeping the shutdown is like "sending checks to the mortgage company after paying off the loan." Peter Geoghegan countered that xidStopLimit, while not designed for it, effectively bounds freeze debt, and removing it without replacement hides the real issue (autovacuum starvation, broken monitoring, etc.). He suggested throttling rather than stopping, akin to LSM tree write stalls in MyRocks. Chris Travers advocated repurposing the wraparound warnings as configurable "XID lag" warnings so DBAs still get an early-warning signal. Consensus landed somewhere in the middle: keep a warning mechanism (perhaps every N million XIDs), remove the hard stop.
Debate 4: TransactionId == FullTransactionId?
This became the committer showstopper. The patch redefines
TransactionId as 64-bit throughout, making FullTransactionId
redundant. Andres Freund, Heikki Linnakangas, and Robert Haas all
independently objected:
- FullTransactionId was introduced deliberately for places needing epoch awareness; flattening the two erases a meaningful distinction.
- Making every in-memory XID 64-bit doubles ProcArray footprint, regresses Andres's recent scalability improvements, and bloats WAL records by 4 bytes each (measured at ~5% overhead on small-record workloads).
- It makes the patch dramatically larger and harder to review.
Maxim Orlov's argument in 2026 that "if we had 64-bit XIDs from the start we wouldn't have needed FullTransactionId" was directly rebutted by Haas: "every professor in college would take style points off" for redefining an existing type rather than widening the narrower places incrementally.
Debate 5: 33-bit vs 64-bit page base
Haas proposed using half-epochs (equivalently, a 33-bit base) so that
the usual case of "all tuples on a page are in the current half-epoch"
requires no tuple adjustment when advancing. Linnakangas refined this:
33 bits is the minimum needed to represent any two XIDs that are less
than 2^31 apart. The 2026 roadmap settled on storing a 32-bit base
computed as xid64 = (base << 31) + xid32, which accepts a 63-bit
effective XID space (fine, given 64-bit LSNs are the real ceiling).
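A short sketch of the half-epoch encoding, using the formula xid64 = (base << 31) + xid32 from the roadmap. The helper names are hypothetical, and picking the base from the smallest XID on the page is an assumption made for illustration.

```python
HALF_EPOCH_BITS = 31


def decode(base: int, xid32: int) -> int:
    """xid64 = (base << 31) + xid32."""
    return (base << HALF_EPOCH_BITS) + xid32


def base_for_page(min_xid64: int) -> int:
    """Assumed policy: use the half-epoch that contains the oldest XID."""
    return min_xid64 >> HALF_EPOCH_BITS


def encode(base: int, xid64: int) -> int:
    """Return the 32-bit on-page value, or fail if it does not fit."""
    xid32 = xid64 - (base << HALF_EPOCH_BITS)
    if not 0 <= xid32 < 2**32:
        raise ValueError("XID is outside the range covered by this base")
    return xid32
```

Because the base counts units of 2^31 while xid32 spans 2^32 values, one base covers two adjacent half-epochs. Any two XIDs less than 2^31 apart therefore fit under the base chosen from the older one, which is the property Linnakangas pointed out.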
Debate 6: Bufmgr layering violation
Andres Freund's late review exposed a serious architectural bug: the
patch calls convert_page() inside buffer_readv_complete_one(), and
smuggles a Relation* pointer through IO handles to determine relkind.
This is fundamentally broken for AIO because the backend completing the
IO may not be the one that issued it, the relation may be closed, and
bufmgr has no business knowing about heap semantics. Conversion must
live in the heap AM layer.
The Evolution of the Plan
The thread spans ~4 years and demonstrates how a large invasive patch gradually gets decomposed:
- Phase 1 (2022): XID_FMT refactoring and int64 GUCs split off as independent committable pieces.
- Phase 2 (2022-2024): 64-bit SLRU indexing extracted to its own thread; this is widely agreed to be committable and reliability-improving on its own merits. Eventually merged for PG17.
- Phase 3 (2024-2025): 64-bit Multixact offsets split off.
- Phase 4 (2026): Fundamental redesign per committer guidance. Korotkov posts the roadmap: keep TransactionId 32-bit, keep the 2^31 running-transaction distance limit, introduce page-level epoch FIRST (giving immediate benefit: lazy freeze without dirtying pages when no dead tuples exist), THEN expand CLOG separately, THEN drop anti-wraparound vacuum. Eliminate double-xmax using pd_flags overlay.
Deep Technical Insights
Insight: Freezing decoupling from wraparound
Heikki Linnakangas articulated the single biggest architectural win:
once the page has an epoch, encountering an XID older than
relfrozenxid implicitly means "committed, visible to all" — no page
modification needed. This decouples freezing from wraparound and makes
aggressive vacuum a purely SLRU-space-management concern, not a
correctness concern. This is arguably more valuable than 64-bit XIDs
themselves.
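The idea of implied freezing can be sketched as follows. This is a deliberately simplified model: it covers only the "is xmin committed" question, ignores snapshots and xmax, and uses a plain dictionary as a stand-in for pg_xact. All names are hypothetical.

```python
from typing import Dict


def xmin_committed(xid64: int, relfrozenxid64: int, clog: Dict[int, bool]) -> bool:
    """Decide whether the inserting transaction committed.

    An XID older than relfrozenxid is treated as committed and visible to
    all without consulting the commit log and without touching the page.
    """
    if xid64 < relfrozenxid64:
        return True  # implied frozen: no lookup, no page write
    return clog[xid64]  # recent XID: commit status must be looked up


# The commit log only needs entries at or above relfrozenxid.
clog = {150: True, 160: False}
```

Note that XID 100 has no entry in clog at all; once relfrozenxid has advanced past it, the status can be dropped, which is why advancing relfrozenxid becomes a question of SLRU space rather than of correctness.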
Insight: Why XID-space is the wrong unit
Peter Geoghegan's recurring point: freeze debt should be measured in
unfrozen pages, not XID age. Two tables at age(relfrozenxid) = 1 billion can have radically different actual freeze costs. The current
xidStopLimit mechanism fires on the wrong signal. More XID runway
doesn't fix this miscalibration; it just defers symptoms.
Insight: Replica/read-only conversion hazard
The patch originally modified pages in memory during read-only access (converting to 64-bit format on first read), without WAL logging. Andres identified this as catastrophically unsafe on promotion. The fix (REGBUF_CONVERTED buffer descriptor bit triggering FPW on transition to read-write) is itself questionable — conversion probably shouldn't happen lazily at all; it should be a heap-AM operation triggered on write paths only.
Insight: Heap/INSERT+INIT replay inconsistency
Evgeny Voropaev discovered a subtle bug: when pruning with
repairFragmentation=false leaves an empty-but-fragmented page, a
subsequent heap_insert may set XLOG_HEAP_INIT_PAGE (because the new
tuple is at FirstOffsetNumber), but the primary's resulting page
differs from the replica's replayed page because redo initializes the
page cleanly. The fix is to set XLOG_HEAP_INIT_PAGE only when the
page is genuinely empty (phdr->pd_special == phdr->pd_upper).
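A minimal sketch of the corrected condition. The field names follow the page header, but the helper functions are hypothetical, and the sketch assumes the check is evaluated against the header as it was before the new tuple was placed.

```python
from dataclasses import dataclass

FIRST_OFFSET_NUMBER = 1


@dataclass
class PageHeader:
    pd_lower: int    # end of the line pointer array
    pd_upper: int    # start of tuple data
    pd_special: int  # start of the special area


def page_is_genuinely_empty(phdr: PageHeader) -> bool:
    """No tuple space is in use when tuple data starts at the special area."""
    return phdr.pd_upper == phdr.pd_special


def may_log_init_page(phdr: PageHeader, new_offnum: int) -> bool:
    """Allow the INIT_PAGE flag only if redo would rebuild an identical page."""
    return new_offnum == FIRST_OFFSET_NUMBER and page_is_genuinely_empty(phdr)
```

For a page that was pruned without defragmentation, pd_upper still sits below pd_special, so may_log_init_page returns False even though the new tuple lands at the first offset.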
Insight: TOAST chunk size compatibility
On 32-bit architectures, adding the 16-byte heap special reduces
TOAST_MAX_CHUNK_SIZE, which would break pg_upgrade because existing
TOAST chunks have a different size. Solution: give TOAST pages their
own (smaller) special area type without pd_multi_base, preserving
the chunk size.
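The effect can be checked with the arithmetic behind TOAST_MAX_CHUNK_SIZE. The sketch below assumes the default 8 kB block size, treats "32-bit" as a platform with 4-byte maximum alignment and "64-bit" as 8-byte, and assumes the special area is simply subtracted from the usable page space; how the patch actually accounts for it is not shown in the discussion.

```python
BLCKSZ = 8192
SIZE_OF_PAGE_HEADER = 24      # SizeOfPageHeaderData
ITEM_ID_SIZE = 4              # sizeof(ItemIdData)
HEAP_TUPLE_HEADER_SIZE = 23   # SizeofHeapTupleHeader
EXTERN_TUPLES_PER_PAGE = 4
CHUNK_OVERHEAD = 4 + 4 + 4    # sizeof(Oid) + sizeof(int32) + VARHDRSZ


def maxalign(n: int, align: int) -> int:
    """Round n up to a multiple of align."""
    return (n + align - 1) // align * align


def maxalign_down(n: int, align: int) -> int:
    """Round n down to a multiple of align."""
    return n // align * align


def toast_max_chunk_size(align: int, special_size: int = 0) -> int:
    """Chunk size for a given maximum alignment and special-area size."""
    header = maxalign(
        SIZE_OF_PAGE_HEADER + EXTERN_TUPLES_PER_PAGE * ITEM_ID_SIZE, align
    )
    # Assumption: the special area reduces usable space before the division.
    usable = BLCKSZ - header - maxalign(special_size, align)
    per_tuple = maxalign_down(usable // EXTERN_TUPLES_PER_PAGE, align)
    return per_tuple - maxalign(HEAP_TUPLE_HEADER_SIZE, align) - CHUNK_OVERHEAD
```

Under these assumptions, with 8-byte alignment the chunk size is 1996 with or without a 16-byte special, because the loss is absorbed by rounding. With 4-byte alignment it drops from 2000 to 1996 with a 16-byte special but stays at 2000 with an 8-byte one, which is consistent with giving TOAST pages a smaller special area.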
Participant Weight
- Robert Haas, Andres Freund, Heikki Linnakangas: all three committers converged independently on rejecting the TransactionId = FullTransactionId merger and endorsing incremental decomposition. When three committers align, the patch either conforms or dies.
- Peter Geoghegan: VACUUM/freezing domain expert; his first-principles critique (XID space is the wrong currency) reframes the whole problem. Not a blocker, but shapes long-term direction.
- Alexander Korotkov, Maxim Orlov, Pavel Borisov: principal authors from the Postgres Pro/Supabase ecosystem; bring production deployment experience from the fork.
- Stephen Frost, Bruce Momjian, Simon Riggs: early design reviewers; Simon's TAM-based "heap64" suggestion was floated but not pursued.
- Chris Travers: operational DBA perspective; insistence on retained warning mechanisms shaped the xidStopLimit debate.