2026-05-18 · claude-opus-4-6

Lazy Snapshot Distribution in Logical Decoding

Core Problem: O(N²) Snapshot Growth in the Reorder Buffer

This patch addresses a scalability problem in PostgreSQL's logical decoding infrastructure: a long-running transaction (or any other xmin holder) combined with frequent catalog-modifying commits causes quadratic growth in reorder buffer spill files.

The Chain Reaction

The problem manifests through a four-stage chain reaction:

VACUUM triggers catalog invalidations: When autovacuum updates pg_class statistics (relpages, reltuples, relallvisible) via heap_inplace_update_and_unlock(), it triggers CacheInvalidateHeapTuple(). At command end, LogLogicalInvalidations() emits XLOG_XACT_INVALIDATIONS WAL records.
Decoding marks transactions as catalog-changing: xact_decode() processes these records and calls ReorderBufferXidSetCatalogChanges(), marking the VACUUM transaction as having catalog changes — even though pg_class stat updates don't affect tuple decoding semantics.
Snapshot distribution on commit: When the VACUUM commits, SnapBuildCommitTxn() sees the catalog-change flag, builds a new snapshot via SnapBuildBuildSnapshot(), and calls SnapBuildDistributeSnapshotAndInval() to distribute the snapshot to ALL in-progress transactions.
Monotonically growing snapshots: Since a long-running transaction prevents SnapBuildPurgeOlderTxn() from cleaning builder->committed.xip, each successive snapshot is larger than the last. Snapshot i contains i XIDs and occupies 192 + 4*i bytes, so total disk usage after N such commits is Σ(i=1..N)(192 + 4*i) = 192*N + 2*N*(N+1) = O(N²).
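The arithmetic in the last step can be checked with a small model. This is only a sketch of the size formula quoted above (192 bytes of fixed overhead plus 4 bytes per XID) for a single receiving transaction; the function names are illustrative and do not come from the patch.

```python
HEADER_BYTES = 192  # fixed per-snapshot overhead quoted in the text
XID_BYTES = 4       # bytes per committed XID quoted in the text


def snapshot_bytes(i: int) -> int:
    """Size of the i-th distributed snapshot under the quoted model."""
    return HEADER_BYTES + XID_BYTES * i


def total_bytes_loop(n: int) -> int:
    """Sum the sizes of snapshots 1..n directly."""
    return sum(snapshot_bytes(i) for i in range(1, n + 1))


def total_bytes_closed(n: int) -> int:
    """Closed form of the same sum: 192*n + 2*n*(n+1)."""
    return HEADER_BYTES * n + XID_BYTES * n * (n + 1) // 2


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(n, total_bytes_closed(n))
```

Doubling N roughly quadruples the total once the 2*N² term dominates, which is the quadratic behaviour the chain reaction describes.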

Why This Matters Architecturally

The problem is not academic. With N=100K catalog-modifying commits (realistic in a vacuum storm across many tables), the spill files reach ~80GB. This causes:

Premature spilling: INTERNAL_SNAPSHOT entries consume logical_decoding_work_mem budget, forcing spills far earlier than data changes would require.
I/O amplification: Each snapshot is serialized during spill and deserialized during commit-time processing.
CPU overhead: ReorderBufferProcessTXN() iterates all changes via a binary heap; 100K extra INTERNAL_SNAPSHOT entries mean 100K additional heap-pop operations.
Replication lag: The combined overhead directly increases commit-to-decode latency.

Two Necessary Conditions

The trigger requires:

Something preventing xmin advancement (the long write transaction itself, pg_dump with REPEATABLE READ, a slow logical replication consumer)
A transaction with at least one data change present in the reorder buffer (pure read-only sessions don't appear in the reorder buffer)

Proposed Solution: Generation-Counter-Based Lazy Distribution

Design

Instead of distributing a snapshot to every in-progress transaction on each catalog-modifying commit (eager), the patch defers distribution until a transaction actually needs to decode a data change.

Key mechanism:

A uint64 snapshot_generation counter in SnapBuild is incremented each time a new catalog snapshot is built.
Each transaction tracks uint64 last_snapshot_generation in ReorderBufferTXN.
SnapBuildProcessChange() checks whether the transaction's generation is behind the builder's and distributes the current snapshot only if needed.
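The mechanism can be sketched as a toy model. This is not the patch's C code: the Builder and Txn classes below are hypothetical stand-ins for SnapBuild and ReorderBufferTXN, and a snapshot is reduced to a tuple of committed XIDs. Base snapshot handling is ignored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Txn:
    """Stand-in for ReorderBufferTXN (toy model)."""
    xid: int
    last_snapshot_generation: int = 0
    changes: List[Tuple[str, object]] = field(default_factory=list)


@dataclass
class Builder:
    """Stand-in for SnapBuild (toy model)."""
    snapshot_generation: int = 0
    committed: List[int] = field(default_factory=list)

    def commit_catalog_txn(self, xid: int) -> None:
        # A catalog-modifying commit builds a new snapshot: bump the counter,
        # but do not touch any in-progress transaction.
        self.committed.append(xid)
        self.snapshot_generation += 1

    def process_change(self, txn: Txn, change: str) -> None:
        # Distribute the current snapshot only if this transaction is behind.
        if txn.last_snapshot_generation < self.snapshot_generation:
            txn.changes.append(("snapshot", tuple(self.committed)))
            txn.last_snapshot_generation = self.snapshot_generation
        txn.changes.append(("data", change))


if __name__ == "__main__":
    builder = Builder()
    long_txn = Txn(xid=100)
    for xid in (101, 102, 103):        # three catalog-modifying commits
        builder.commit_catalog_txn(xid)
    builder.process_change(long_txn, "INSERT 1")
    builder.process_change(long_txn, "INSERT 2")
    print(long_txn.changes)
```

In this model three catalog-modifying commits followed by two inserts queue a single snapshot entry rather than three, because the second insert finds the transaction's generation already current.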

Complexity Analysis

Metric	Before (Eager)	After (Lazy)
Snapshot distributions	G (per commit)	min(K, G) (per data change)
Disk usage growth	O(N²)	O(min(K, G) × N)

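Where N = committed XID array size, G = number of catalog-modifying commits during the long transaction, and K = number of data changes in the long transaction. While xmin is held back, N grows with every catalog-modifying commit, so the eager cost of G distributions of size up to N is the O(N²) shown in the table.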

For the common production case where K << G (e.g., a long batch INSERT during a vacuum storm), the number of distributed snapshots collapses from G to at most K, so spill volume drops from quadratic to roughly K × N.
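A rough way to see the min(K, G) bound is to count distributions over an event stream. The sketch below assumes a single long transaction and reuses the 192 + 4*N snapshot size quoted earlier; the event encoding ('C' for a catalog-modifying commit, 'D' for a data change of the long transaction) is invented for illustration.

```python
from typing import Tuple

HEADER_BYTES = 192  # per-snapshot overhead quoted in the text
XID_BYTES = 4       # bytes per committed XID quoted in the text


def distribution_cost(events: str) -> Tuple[int, int, int, int]:
    """Return (eager_count, lazy_count, eager_bytes, lazy_bytes).

    events is a string in LSN order: 'C' = catalog-modifying commit,
    'D' = data change in the long-running transaction.
    """
    committed = 0      # current size of the committed XID array
    pending = False    # True if a newer snapshot exists than the txn has seen
    eager_count = lazy_count = eager_bytes = lazy_bytes = 0
    for ev in events:
        if ev == "C":
            committed += 1
            pending = True
            eager_count += 1
            eager_bytes += HEADER_BYTES + XID_BYTES * committed
        elif ev == "D":
            if pending:
                lazy_count += 1
                lazy_bytes += HEADER_BYTES + XID_BYTES * committed
                pending = False
        else:
            raise ValueError(f"unknown event {ev!r}")
    return eager_count, lazy_count, eager_bytes, lazy_bytes


if __name__ == "__main__":
    print(distribution_cost("C" * 1000 + "D"))  # K << G
    print(distribution_cost("CD" * 1000))       # K == G, no savings
```

The second call shows the degenerate case: when every catalog-modifying commit is followed by a data change, lazy and eager distribution queue the same snapshots.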

Correctness Argument

The lazy approach is equivalent to eager because:

LSN-order processing: When SnapBuildProcessChange() is called at LSN X, builder->snapshot reflects exactly all catalog-modifying commits with LSN < X — neither too new nor too old.
Monotonic committed array: The array only grows while a long transaction holds xmin back. Snapshot N is a superset of snapshots 1..N-1. Skipping intermediate snapshots doesn't affect visibility.
Unused intermediate snapshots: In the original code, intermediate snapshots with no data changes between them are effectively dead. Lazy distribution simply avoids creating these entries.
Streaming mode compatibility: SnapBuildProcessChange() is always called before each data change regardless of streaming/buffering mode.
Subtransaction safety: The lazy snapshot is distributed to the toplevel transaction (using txn->xid). Eager invalidation distribution always places an invalidation entry in the toplevel's list at the DDL commit LSN (< data change LSN). During binary heap iteration, the invalidation entry ensures correct ordering.
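The first three points can be exercised with a toy equivalence check. The sketch below builds both the eager and the lazy change list for one transaction from the same LSN-ordered event stream, replays each, and compares the snapshot in effect at every data change. It models snapshots as tuples of committed XIDs and ignores subtransactions, invalidations, and streaming; all names are illustrative.

```python
from typing import List, Tuple

Event = Tuple[str, object]  # ("commit", xid) or ("data", payload)


def build_change_lists(events: List[Event]) -> Tuple[List[Event], List[Event]]:
    """Return the (eager, lazy) change lists for one long transaction."""
    committed: List[int] = []
    eager: List[Event] = []
    lazy: List[Event] = []
    generation = 0        # builder-side counter
    seen_generation = 0   # transaction-side counter
    for kind, value in events:
        if kind == "commit":
            committed.append(value)
            generation += 1
            # Eager: every catalog-modifying commit queues a snapshot.
            eager.append(("snapshot", tuple(committed)))
        else:
            # Lazy: queue a snapshot only if the transaction is behind.
            if seen_generation < generation:
                lazy.append(("snapshot", tuple(committed)))
                seen_generation = generation
            eager.append(("data", value))
            lazy.append(("data", value))
    return eager, lazy


def replay(changes: List[Event]) -> List[Tuple[object, tuple]]:
    """Pair each data change with the snapshot active when it is replayed."""
    current: tuple = ()   # stands in for the base snapshot
    seen = []
    for kind, value in changes:
        if kind == "snapshot":
            current = value
        else:
            seen.append((value, current))
    return seen


if __name__ == "__main__":
    stream = [("commit", 101), ("commit", 102), ("data", "a"),
              ("commit", 103), ("data", "b"), ("data", "c")]
    eager, lazy = build_change_lists(stream)
    print(len(eager), len(lazy), replay(eager) == replay(lazy))
```

Intermediate snapshots that no data change ever observes are the ones the lazy list omits, which is why the two replays agree.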

Why Not Eliminate ReorderBufferAddSnapshot Entirely?

The patch explains why snapshots in the change list cannot be eliminated altogether: DecodeInsert/DecodeUpdate only enqueue raw WAL tuples. Actual catalog lookups (RelationIdGetRelation) happen during commit-time processing in ReorderBufferProcessTXN(). At that point, builder->snapshot has advanced past all changes. The INTERNAL_SNAPSHOT entries serve as markers telling commit-time processing "switch to this catalog snapshot from here on."
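A toy replay illustrates why the markers are needed. In the sketch below, a hypothetical XID 200 adds a column between two inserts of the long transaction; replaying with snapshot markers decodes each row against the schema visible at its LSN, while replaying everything under the builder's final snapshot fails on the first row. The catalog model and all names are invented for illustration and are far simpler than real catalog visibility.

```python
# Toy catalog history: (committing XID, column names). None = always visible.
CATALOG_HISTORY = [
    (None, ("id",)),
    (200, ("id", "note")),  # hypothetical XID 200 added a column
]


def visible_columns(snapshot):
    """Newest column list whose committing XID the snapshot considers committed."""
    columns = None
    for xid, cols in CATALOG_HISTORY:
        if xid is None or xid in snapshot:
            columns = cols
    return columns


def decode(row, snapshot):
    """Interpret a raw tuple using the schema visible under the snapshot."""
    columns = visible_columns(snapshot)
    if len(columns) != len(row):
        raise ValueError(f"row {row!r} does not match columns {columns!r}")
    return dict(zip(columns, row))


def replay_with_markers(changes):
    """Switch catalog snapshot whenever a snapshot marker is encountered."""
    snapshot = frozenset()
    decoded = []
    for kind, value in changes:
        if kind == "snapshot":
            snapshot = value
        else:
            decoded.append(decode(value, snapshot))
    return decoded


def replay_with_final_snapshot(changes, final_snapshot):
    """Decode every row under the builder's latest snapshot (incorrect)."""
    return [decode(v, final_snapshot) for k, v in changes if k == "data"]


CHANGES = [
    ("data", (1,)),                  # written before the ALTER committed
    ("snapshot", frozenset({200})),  # marker queued before the next change
    ("data", (2, "x")),              # written after the ALTER committed
]

if __name__ == "__main__":
    print(replay_with_markers(CHANGES))
    try:
        replay_with_final_snapshot(CHANGES, frozenset({200}))
    except ValueError as exc:
        print("final-snapshot replay failed:", exc)
```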

What About Filtering VACUUM?

The patch explicitly argues against narrower fixes like filtering VACUUM-generated invalidations. The O(N²) problem applies to any workload with many catalog-modifying commits: batch partition creation, online migration tools, extension installations, etc. A general solution at the distribution layer is needed.

Key Implementation Details

Files Modified

snapbuild_internal.h: Adds uint64 snapshot_generation to struct SnapBuild
reorderbuffer.h: Adds uint64 last_snapshot_generation to ReorderBufferTXN; exports ReorderBufferTXNByXid()
snapbuild.c: Increments generation counter in SnapBuildCommitTxn(); renames SnapBuildDistributeSnapshotAndInval() → SnapBuildDistributeInval() (removes snapshot distribution, keeps invalidation); adds lazy distribution logic in SnapBuildProcessChange() with a single hash lookup (eliminating 2-4 redundant lookups in original code)
reorderbuffer.c: Removes static from ReorderBufferTXNByXid()
SNAPBUILD_VERSION: Bumped 6 → 7 (existing serialized snapshots auto-invalidated on upgrade)

Invalidation Handling Preserved

Invalidation messages continue to be distributed eagerly because:

They are small (100K vacuum commits ≈ 14MB)
They have overflow protection via MAX_DISTR_INVAL_MSG_PER_TXN
They cannot be deduplicated like snapshots (each carries unique cache invalidation semantics)

This preserves the correctness fix from Tomas Vondra's 2023 "long-standing data loss bug" thread, which added invalidation distribution to ensure in-progress transactions invalidate their relcache when concurrent publication DDL changes the set of published tables.

Relationship to Prior Work

Vondra/Kapila's data loss fix (2023/2025): Renamed SnapBuildDistributeNewCatalogSnapshot() to SnapBuildDistributeSnapshotAndInval() and added invalidation distribution. Acknowledged "some performance regression." This patch addresses the snapshot side while keeping invalidation semantics intact.
Sawada's per-transaction memory contexts: Addresses GenerationContext fragmentation — a different root cause of memory bloat. Complementary.
Tachoires's LZ4 compression for spill files: Reduces cost of whatever does spill. Complementary — this patch reduces what needs to be spilled in the first place.

Performance Results

At logical_decoding_work_mem = 64MB (production-like):

N=5000: decode time 629ms → 261ms (2.4x), spill 207MB → 0
N=10000: decode time 1865ms → 814ms (2.3x), spill 813MB → 0

At logical_decoding_work_mem = 64kB (stress):

N=5000: spill 247MB → 25MB (10x reduction), decode time 607ms → 289ms (2.1x)
Master's spill scales as O(N²): going from N=1000 (9.3MB) to N=5000 (247MB) predicts a ratio of 5000²/1000² = 25; the measured ratio is 247/9.3 ≈ 26.6 ✓
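The quadratic scaling check can be repeated for any pair of measurements. The helper below is a minimal sketch using only the master-branch spill figures quoted above; the 9.3 value is taken to be the spill in MB at N=1000, as the ratio in the text implies.

```python
def predicted_ratio(n_small: int, n_large: int) -> float:
    """Spill ratio expected if spill grows as N squared."""
    return (n_large / n_small) ** 2


def measured_ratio(small_mb: float, large_mb: float) -> float:
    """Spill ratio actually observed between two runs."""
    return large_mb / small_mb


if __name__ == "__main__":
    # 64kB work_mem: N=1000 vs N=5000
    print(predicted_ratio(1000, 5000), measured_ratio(9.3, 247))
    # 64MB work_mem: N=5000 vs N=10000
    print(predicted_ratio(5000, 10000), measured_ratio(207, 813))
```

The 64MB pair (207MB to 813MB for a doubling of N) lands close to the predicted factor of 4 as well.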

The hot path (SnapBuildProcessChange) does fewer hash lookups than before, so there is no regression for the common case.

Limitations and Future Work

When the long-running transaction itself has continuous data changes (K approaches G, e.g., bulk COPY during vacuum storm), lazy distribution degenerates toward eager behavior. The root cause in that scenario is that VACUUM's pg_class statistics updates are treated as catalog changes even though they don't affect tuple decoding. A complementary optimization filtering non-schema-affecting catalog modifications is identified as future work.

Testing

Isolation test: lazy_snapshot_distribution.spec — three scenarios (ALTER TABLE + INSERT, many CREATE TABLE + INSERT, subtransaction + DDL + INSERT)
TAP test: 002_lazy_snapshot_spill.pl — verifies spill_bytes=0 with 200 catalog-modifying DDLs between two INSERTs
Full regression suite: All existing test_decoding, subscription, recovery, and pg_logicalinspect tests pass unchanged
