Add 64-bit XIDs into PostgreSQL 15

First seen: 2021-12-30 12:15:16+00:00 · Messages: 228 · Participants: 41

Latest Update

2026-05-06 · opus 4.7

Deep Technical Analysis: 64-bit XIDs in PostgreSQL

The Architectural Problem

PostgreSQL's 32-bit TransactionId caps how many XIDs can be consumed before old tuples must be frozen: once ~2 billion XIDs have been consumed, the system must perform anti-wraparound vacuuming or shut down to prevent data corruption. On modern hardware with high-throughput OLTP workloads, this ceiling can be reached in hours rather than years. The Postgres Pro fork has deployed 64-bit XIDs in production with customers wrapping every 1-5 days, which motivated Maxim Orlov's proposal to upstream this work for PG15.
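
As a back-of-the-envelope check on "hours rather than years", the sketch below divides the 2^31 XID horizon by a sustained consumption rate. The function name, and the simplification of a constant rate with no freezing in between, are illustrative assumptions rather than anything taken from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Roughly 2 billion XIDs can be consumed before wraparound protection kicks in. */
#define XID_WRAP_HORIZON (UINT64_C(1) << 31)

/*
 * Hypothetical helper: hours until the horizon is consumed at a constant
 * rate of XID-assigning transactions per second.
 */
static double
hours_to_exhaust_xids(double xids_per_second)
{
    assert(xids_per_second > 0.0);
    return (double) XID_WRAP_HORIZON / xids_per_second / 3600.0;
}
```

At 100K XIDs per second this comes out just under six hours; at 10K per second it is about two and a half days, which is in line with the 1-5 day range quoted above.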

The architectural tension is fundamental:

The Proposed Design (Postgres Pro Lineage)

The patch (derived from Alexander Korotkov's earlier work and used commercially by Postgres Pro) takes a page-level base XID approach:

  1. TransactionId is redefined as int64 throughout the code.
  2. On disk, tuple headers retain 32-bit xmin/xmax (now ShortTransactionId), interpreted as offsets from per-page pd_xid_base and pd_multi_base values stored in the page's special area (16 bytes added).
  3. Because upgraded pages may lack space for the 16-byte special, a transitional "double xmax" format is introduced: xmin is treated as virtual FrozenTransactionId and xmax occupies both 32-bit slots as a full 64-bit value. Pages are converted to the normal format lazily by VACUUM/pruning when space becomes available.
  4. In-memory HeapTuple carries precomputed 64-bit xmin/xmax copied from the page under buffer content lock.
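
The page-base scheme in items 2 and 4 can be sketched roughly as follows. The type and function names here are hypothetical. Passing the special values below FirstNormalTransactionId (3) through without adding the base is an assumption about how the patch handles them. The double-xmax layout is not modeled.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Xid64;     /* in-memory 64-bit XID */
typedef uint32_t ShortXid;  /* on-disk 32-bit xmin/xmax */

/* XIDs 0, 1 and 2 are Invalid, Bootstrap and Frozen in PostgreSQL. */
#define FIRST_NORMAL_XID 3u

/* Illustrative stand-in for the 16-byte heap special area. */
typedef struct HeapPageSpecialSketch
{
    Xid64 pd_xid_base;
    Xid64 pd_multi_base;
} HeapPageSpecialSketch;

/* Expand an on-disk value to 64 bits; special values pass through (assumption). */
static Xid64
short_xid_to_full(Xid64 base, ShortXid xid)
{
    if (xid < FIRST_NORMAL_XID)
        return xid;
    return base + xid;
}

/* Inverse mapping; the caller must ensure the XID fits the page's window. */
static ShortXid
full_xid_to_short(Xid64 base, Xid64 xid)
{
    if (xid < FIRST_NORMAL_XID)
        return (ShortXid) xid;
    assert(xid >= base + FIRST_NORMAL_XID);
    assert(xid - base <= UINT32_MAX);
    return (ShortXid) (xid - base);
}
```

In this model the in-memory tuple would carry the result of short_xid_to_full() for xmin and xmax, computed once while the buffer content lock is held.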

Key Design Debates and Disagreements

Debate 1: The "double xmax" transitional format

Jim Finnerty (Amazon) pushed hard for eliminating double-xmax by either pre-upgrade preparation (pg_repack-style rewriting) or post-upgrade conversion before completion. Korotkov responded that Postgres Pro tried this approach (pg_pageprep) and found it "very difficult and unreliable" — double-xmax was the pragmatic fallback. Robert Haas sided against pre-upgrade preparation on rollout grounds: you can't ship a feature that requires waiting multiple major versions. This debate was eventually resolved in 2026 by dropping double-xmax entirely in favor of a PD_HAS_NO_SPECIAL flag that overlays base values onto existing page header fields (pd_prune_xid, pd_pagesize_version/pd_special).

Debate 2: 32-bit limit inside a page

Stephen Frost raised an early concern: even with 64-bit XIDs globally, a single page can only hold XIDs within ~2^32 of its base. A long-running transaction (e.g. a week-long OLAP query on a 100K TPS system) could make it impossible to insert new tuples onto pages containing its XIDs, causing user-visible errors. Peter Geoghegan deepened this critique: freezing requires the XID be committed AND visible to all snapshots, so you can't just freeze your way out. The patch's answer — throw an error on insert when heap_page_prepare_for_xid can't shift the base — is regarded as a user-hostile "gotcha."
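
The failure mode can be illustrated with a deliberately simplified model. It tracks only the oldest and newest unfreezable XIDs on a page and tries to pick a base that covers both them and a new XID. The struct and function below are hypothetical and are not the patch's heap_page_prepare_for_xid. Freezing and pruning, which would normally be tried before giving up, are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIRST_NORMAL_OFFSET UINT64_C(3)          /* offsets 0..2 are reserved */
#define MAX_OFFSET          UINT64_C(0xFFFFFFFF) /* largest 32-bit offset */

/* Hypothetical summary of the XIDs that must stay representable on a page. */
typedef struct PageXidsSketch
{
    uint64_t base;     /* current page base */
    uint64_t min_xid;  /* oldest XID that cannot be frozen */
    uint64_t max_xid;  /* newest XID on the page */
} PageXidsSketch;

/*
 * Make room for new_xid, shifting the base if necessary. Returns false when
 * no base can cover both the existing XIDs and the new one, which is the
 * case where an insert would have to raise an error.
 */
static bool
page_prepare_for_xid_sketch(PageXidsSketch *page, uint64_t new_xid)
{
    uint64_t lo = new_xid < page->min_xid ? new_xid : page->min_xid;
    uint64_t hi = new_xid > page->max_xid ? new_xid : page->max_xid;

    assert(lo >= FIRST_NORMAL_OFFSET);

    /* The span of live XIDs must fit in the usable 32-bit offset range. */
    if (hi - lo > MAX_OFFSET - FIRST_NORMAL_OFFSET)
        return false;

    /* Shift the base only if the current one does not already cover [lo, hi]. */
    if (lo < page->base + FIRST_NORMAL_OFFSET || hi - page->base > MAX_OFFSET)
        page->base = lo - FIRST_NORMAL_OFFSET;

    page->min_xid = lo;
    page->max_xid = hi;
    return true;
}
```

As long as a long-running transaction keeps min_xid pinned, no choice of base helps once new XIDs are more than about 2^32 ahead of it, which is the "gotcha" criticized above.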

Debate 3: Should we remove xidStopLimit / the failsafe?

This became the dominant philosophical argument. Robert Haas argued forcefully: xidStopLimit's sole purpose is preventing XID reuse that corrupts data; once that's impossible, keeping the shutdown is like "sending checks to the mortgage company after paying off the loan." Peter Geoghegan countered that xidStopLimit, while not designed for it, effectively bounds freeze debt, and removing it without replacement hides the real issue (autovacuum starvation, broken monitoring, etc.). He suggested throttling rather than stopping, akin to LSM tree write stalls in MyRocks. Chris Travers advocated repurposing the wraparound warnings as configurable "XID lag" warnings so DBAs still get an early-warning signal. Consensus landed somewhere in the middle: keep a warning mechanism (perhaps every N million XIDs), remove the hard stop.

Debate 4: TransactionId == FullTransactionId?

This became the committer showstopper. The patch redefines TransactionId as 64-bit throughout, making FullTransactionId redundant. Andres Freund, Heikki Linnakangas, and Robert Haas all independently objected.

Maxim Orlov's argument in 2026 that "if we had 64-bit XIDs from the start we wouldn't have needed FullTransactionId" was directly rebutted by Haas: "every professor in college would take style points off" for redefining an existing type rather than widening the narrower places incrementally.

Debate 5: 33-bit vs 64-bit page base

Haas proposed using half-epochs (equivalently, a 33-bit base) so that the usual case of "all tuples on a page are in the current half-epoch" requires no tuple adjustment when advancing. Linnakangas refined this: 33 bits is the minimum needed to represent any two XIDs that are less than 2^31 apart. The 2026 roadmap settled on storing a 32-bit base computed as xid64 = (base << 31) + xid32, which accepts a 63-bit effective XID space (fine, given 64-bit LSNs are the real ceiling).
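
The formula xid64 = (base << 31) + xid32 can be written out directly. In the sketch below the function names are hypothetical. It treats the base as a count of half-epochs (2^31 XIDs each) and shows that, because xid32 spans a full 32 bits, the same XID can be reached from two adjacent bases.

```c
#include <assert.h>
#include <stdint.h>

/* Decode: the base counts half-epochs of 2^31 XIDs each. */
static uint64_t
xid64_from_page(uint32_t base, uint32_t xid32)
{
    return ((uint64_t) base << 31) + xid32;
}

/* Encode relative to a given base; the XID must fall in that base's window. */
static uint32_t
xid32_for_base(uint32_t base, uint64_t xid64)
{
    uint64_t start = (uint64_t) base << 31;

    assert(xid64 >= start);
    assert(xid64 - start <= UINT32_MAX);
    return (uint32_t) (xid64 - start);
}
```

Each base covers a 2^32-wide window that overlaps the next one by half, so an XID in the upper half of one window is also in the lower half of the next. For example, base 0 with offset 0x80000005 and base 1 with offset 5 decode to the same value.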

Debate 6: Bufmgr layering violation

Andres Freund's late review exposed a serious architectural bug: the patch calls convert_page() inside buffer_readv_complete_one(), and smuggles a Relation* pointer through IO handles to determine relkind. This is fundamentally broken for AIO because the backend completing the IO may not be the one that issued it, the relation may be closed, and bufmgr has no business knowing about heap semantics. Conversion must live in the heap AM layer.

The Evolution of the Plan

The thread spans ~4 years and demonstrates how a large invasive patch gradually gets decomposed:

  1. Phase 1 (2022): XID_FMT refactoring and int64 GUCs split off as independent committable pieces.
  2. Phase 2 (2022-2024): 64-bit SLRU indexing extracted to its own thread; this is widely agreed to be committable and reliability-improving on its own merits. Eventually merged for PG17.
  3. Phase 3 (2024-2025): 64-bit Multixact offsets split off.
  4. Phase 4 (2026): Fundamental redesign per committer guidance. Korotkov posts the roadmap: keep TransactionId 32-bit, keep the 2^31 running-transaction distance limit, introduce page-level epoch FIRST (giving immediate benefit: lazy freeze without dirtying pages when no dead tuples exist), THEN expand CLOG separately, THEN drop anti-wraparound vacuum. Eliminate double-xmax using pd_flags overlay.

Deep Technical Insights

Insight: Freezing decoupling from wraparound

Heikki Linnakangas articulated the single biggest architectural win: once the page has an epoch, encountering an XID older than relfrozenxid implicitly means "committed, visible to all" — no page modification needed. This decouples freezing from wraparound and makes aggressive vacuum a purely SLRU-space-management concern, not a correctness concern. This is arguably more valuable than 64-bit XIDs themselves.
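
A minimal sketch of that shortcut, assuming a hypothetical per-page epoch that combines with the 32-bit tuple XID in the same way FullTransactionId combines an epoch and an XID (epoch in the high 32 bits). The enum and function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

typedef enum XminStatusSketch
{
    XMIN_IMPLICITLY_FROZEN, /* committed and visible to all, no lookup needed */
    XMIN_NEEDS_LOOKUP       /* must consult snapshot and commit status */
} XminStatusSketch;

/* Same layout as FullTransactionId: epoch in the high half, XID in the low. */
static uint64_t
full_xid_from_epoch(uint32_t epoch, uint32_t xid)
{
    return ((uint64_t) epoch << 32) | xid;
}

/*
 * With an epoch on the page, an xmin older than the table's 64-bit
 * relfrozenxid can be treated as frozen without modifying the page.
 */
static XminStatusSketch
classify_xmin(uint32_t page_epoch, uint32_t xmin32, uint64_t relfrozenxid64)
{
    assert(relfrozenxid64 >= 3); /* relfrozenxid is always a normal XID */

    if (full_xid_from_epoch(page_epoch, xmin32) < relfrozenxid64)
        return XMIN_IMPLICITLY_FROZEN;
    return XMIN_NEEDS_LOOKUP;
}
```

Because the comparison is on the full 64-bit value, the page never has to be rewritten just to keep an old XID from becoming ambiguous.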

Insight: Why XID-space is the wrong unit

Peter Geoghegan's recurring point: freeze debt should be measured in unfrozen pages, not XID age. Two tables at age(relfrozenxid) = 1 billion can have radically different actual freeze costs. The current XidStopLimit mechanism fires on the wrong signal. More XID runway doesn't fix this miscalibration; it just defers symptoms.

Insight: Replica/read-only conversion hazard

The patch originally modified pages in memory during read-only access (converting to 64-bit format on first read), without WAL logging. Andres identified this as catastrophically unsafe on promotion. The fix (REGBUF_CONVERTED buffer descriptor bit triggering FPW on transition to read-write) is itself questionable — conversion probably shouldn't happen lazily at all; it should be a heap-AM operation triggered on write paths only.

Insight: Heap/INSERT+INIT replay inconsistency

Evgeny Voropaev discovered a subtle bug: when pruning with repairFragmentation=false leaves an empty-but-fragmented page, a subsequent heap_insert may set XLOG_HEAP_INIT_PAGE (because the new tuple is at FirstOffsetNumber), but the primary's resulting page differs from the replica's replayed page because redo initializes the page cleanly. The fix is to set XLOG_HEAP_INIT_PAGE only when the page is genuinely empty, i.e. when phdr->pd_special == phdr->pd_upper.
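
The emptiness test can be modeled with a stripped-down header that keeps only the three offsets involved. The struct and helper names below are hypothetical, and evaluating the check against the page state before the insert is an assumption about where the fix applies it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FIRST_OFFSET_NUMBER 1

/* Only the PageHeaderData fields relevant to the check. */
typedef struct PageHeaderSketch
{
    uint16_t pd_lower;    /* end of the line pointer array */
    uint16_t pd_upper;    /* start of tuple data */
    uint16_t pd_special;  /* start of the special area */
} PageHeaderSketch;

/* No tuple bytes are in use when tuple space starts at the special area. */
static bool
page_is_genuinely_empty(const PageHeaderSketch *phdr)
{
    return phdr->pd_upper == phdr->pd_special;
}

/*
 * Decide whether the insert's WAL record may ask redo to reinitialize the
 * page. Being first on the page is not enough; leftover fragmentation
 * would make the replayed page differ from the primary's.
 */
static bool
may_log_init_page(const PageHeaderSketch *before_insert, uint16_t new_offnum)
{
    assert(before_insert->pd_lower <= before_insert->pd_upper);
    assert(before_insert->pd_upper <= before_insert->pd_special);

    return new_offnum == FIRST_OFFSET_NUMBER &&
           page_is_genuinely_empty(before_insert);
}
```

A page whose line pointers were all cleared but whose tuple space was never compacted still has pd_upper below pd_special, so it is rejected here even though the new tuple lands at FirstOffsetNumber.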

Insight: TOAST chunk size compatibility

On 32-bit architectures, adding the 16-byte heap special reduces TOAST_MAX_CHUNK_SIZE, which would break pg_upgrade because existing TOAST chunks have a different size. Solution: give TOAST pages their own (smaller) special area type without pd_multi_base, preserving the chunk size.
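
The arithmetic behind this can be reproduced with a small model of how PostgreSQL derives TOAST_MAX_CHUNK_SIZE for 8 kB pages: four chunks per page, a 24-byte page header, 4-byte line pointers, a 23-byte tuple header, and 12 bytes of per-chunk fields. Subtracting the aligned special-area size from the usable page space is an assumption about how the patch changes the formula, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ                 8192
#define PAGE_HEADER_SIZE       24  /* SizeOfPageHeaderData */
#define ITEM_ID_SIZE           4   /* sizeof(ItemIdData) */
#define HEAP_TUPLE_HEADER_SIZE 23  /* SizeofHeapTupleHeader */
#define CHUNKS_PER_PAGE        4   /* EXTERN_TUPLES_PER_PAGE */
#define CHUNK_FIELDS_SIZE      12  /* chunk_id Oid + chunk_seq int32 + varlena header */

static size_t
align_up(size_t x, size_t a)
{
    return (x + a - 1) & ~(a - 1);
}

static size_t
align_down(size_t x, size_t a)
{
    return x & ~(a - 1);
}

/* Chunk size for a given maximum alignment (4 or 8) and special-area size. */
static size_t
toast_max_chunk_size(size_t maxalign, size_t special_size)
{
    size_t overhead;
    size_t per_tuple;

    assert(maxalign == 4 || maxalign == 8);

    overhead = align_up(PAGE_HEADER_SIZE + CHUNKS_PER_PAGE * ITEM_ID_SIZE, maxalign)
             + align_up(special_size, maxalign);
    per_tuple = align_down((BLCKSZ - overhead) / CHUNKS_PER_PAGE, maxalign);

    return per_tuple - align_up(HEAP_TUPLE_HEADER_SIZE, maxalign) - CHUNK_FIELDS_SIZE;
}
```

Under these assumptions, 8-byte alignment absorbs a 16-byte special area (the chunk size stays at 1996), while 4-byte alignment drops from 2000 to 1996. An 8-byte special area without pd_multi_base keeps the 4-byte-aligned value at 2000, which matches the reasoning for giving TOAST pages a smaller special.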

Participant Weight