Lazy snapshot distribution in logical decoding

First seen: 2026-05-12 08:55:23+00:00 · Messages: 1 · Participants: 1

Latest Update

2026-05-18 · claude-opus-4-6

Core Problem: O(N²) Snapshot Growth in the Reorder Buffer

This patch addresses a scalability problem in PostgreSQL's logical decoding infrastructure: when a long-running transaction (or any other xmin holder) coincides with frequent catalog-modifying commits, reorder buffer spill files grow quadratically.

The Chain Reaction

The problem manifests through a four-stage chain reaction:

  1. VACUUM triggers catalog invalidations: When autovacuum updates pg_class statistics (relpages, reltuples, relallvisible) via heap_inplace_update_and_unlock(), it triggers CacheInvalidateHeapTuple(). At command end, LogLogicalInvalidations() emits XLOG_XACT_INVALIDATIONS WAL records.

  2. Decoding marks transactions as catalog-changing: xact_decode() processes these records and calls ReorderBufferXidSetCatalogChanges(), marking the VACUUM transaction as having catalog changes — even though pg_class stat updates don't affect tuple decoding semantics.

  3. Snapshot distribution on commit: When the VACUUM commits, SnapBuildCommitTxn() sees the catalog-change flag, builds a new snapshot via SnapBuildBuildSnapshot(), and calls SnapBuildDistributeSnapshotAndInval() to distribute the snapshot to ALL in-progress transactions.

  4. Monotonically growing snapshots: Since a long-running transaction prevents SnapBuildPurgeOlderTxn() from cleaning builder->committed.xip, each successive snapshot is larger than the last. Snapshot i contains i XIDs and occupies 192 + 4*i bytes, so after N commits the total disk usage is Σ(i=1..N)(192 + 4*i) = 192*N + 2*N*(N+1) = O(N²).
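
The arithmetic in step 4 can be checked with a short calculation. This is a back-of-the-envelope sketch, not PostgreSQL code: the function names are hypothetical, the 192-byte fixed cost and 4 bytes per XID are taken from the text above, and the model covers a single in-progress transaction that receives every snapshot.

```python
def eager_spill_bytes(n_commits: int, header: int = 192, xid_size: int = 4) -> int:
    """Bytes queued for ONE in-progress transaction that receives every
    snapshot, assuming snapshot i carries i committed XIDs (as in the text)."""
    return sum(header + xid_size * i for i in range(1, n_commits + 1))


def eager_spill_bytes_closed_form(n_commits: int, header: int = 192, xid_size: int = 4) -> int:
    # Arithmetic series: header*N + xid_size*N*(N+1)/2.
    return header * n_commits + xid_size * n_commits * (n_commits + 1) // 2


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        print(n, eager_spill_bytes_closed_form(n))
```

Under these assumptions, doubling N roughly quadruples the total, and N = 100,000 gives about 2 × 10^10 bytes for each transaction that receives every snapshot. Because step 3 distributes to all in-progress transactions, the overall figure scales with the number of such transactions.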

Why This Matters Architecturally

The problem is not academic. With N=100K catalog-modifying commits (realistic in a vacuum storm across many tables), the spill files reach ~80GB. This causes:

Two Necessary Conditions

The trigger requires:

  1. Something preventing xmin advancement (the long write transaction itself, pg_dump with REPEATABLE READ, a slow logical replication consumer)
  2. A transaction with at least one data change present in the reorder buffer (pure read-only sessions don't appear in the reorder buffer)

Proposed Solution: Generation-Counter-Based Lazy Distribution

Design

Instead of distributing a snapshot to every in-progress transaction on each catalog-modifying commit (eager), the patch defers distribution until a transaction actually needs to decode a data change.
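
A toy model may help picture the deferral. The sketch below is an assumption-laden illustration, not the patch's code: the class names, the `generation` and `seen_generation` fields, and the method names are all hypothetical, and the base snapshot a transaction normally receives is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    xid: int
    seen_generation: int = 0            # hypothetical per-transaction marker
    changes: list = field(default_factory=list)


@dataclass
class Builder:
    generation: int = 0                 # hypothetical counter, bumped per catalog commit
    committed: tuple = ()               # stands in for builder->committed.xip

    def catalog_commit(self, xid: int) -> None:
        # Eager code would queue a snapshot on every in-progress txn here.
        # The lazy variant only records that a newer snapshot now exists.
        self.committed = self.committed + (xid,)
        self.generation += 1

    def process_change(self, txn: Txn, change: str) -> None:
        # Runs before each data change, in the role of SnapBuildProcessChange().
        if txn.seen_generation < self.generation:
            txn.changes.append(("snapshot", self.committed))
            txn.seen_generation = self.generation
        txn.changes.append(("change", change))


if __name__ == "__main__":
    builder, long_txn = Builder(), Txn(xid=100)
    for xid in range(1, 6):
        builder.catalog_commit(xid)
    builder.process_change(long_txn, "insert-1")
    print([kind for kind, _ in long_txn.changes])
```

In this model five catalog-modifying commits followed by one data change leave a single snapshot entry in the transaction's change list, rather than five.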

Key mechanism:

Complexity Analysis

Metric                   Before (Eager)    After (Lazy)
Snapshot distributions   G (per commit)    min(K, G) (per data change)
Disk usage growth        O(N²)             O(min(K, G) × N)

Where: N = committed XID array size, G = catalog-modifying commits during the long txn, K = data changes in the long txn.

For the common production case where K << G (e.g., a long batch INSERT during a vacuum storm), the number of snapshot distributions drops from G to about K, and disk usage falls from quadratic to roughly K × N.
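
The two cost profiles can be compared with a small calculation. This is a rough model under stated assumptions, not a measurement: the function names are hypothetical, snapshot size reuses the 192 + 4*i bytes figure from earlier in this summary, and each data change is described only by how many catalog-modifying commits precede it.

```python
def snapshot_bytes(n_xids, header=192, xid_size=4):
    return header + xid_size * n_xids


def eager_cost(g):
    """(distributions, bytes) when each of g catalog commits queues a snapshot."""
    return g, sum(snapshot_bytes(i) for i in range(1, g + 1))


def lazy_cost(commits_before_each_change):
    """(distributions, bytes) when a snapshot is queued only before a data
    change that has not yet seen the latest catalog-modifying commit."""
    count, total, seen = 0, 0, 0
    for pos in sorted(commits_before_each_change):
        if pos > seen:
            count += 1
            total += snapshot_bytes(pos)
            seen = pos
    return count, total


if __name__ == "__main__":
    print(eager_cost(10_000))                  # G = 10,000
    print(lazy_cost([2_500, 5_000, 10_000]))   # K = 3
    print(lazy_cost(range(1, 101)) == eager_cost(100))  # K = G
```

With G = 10,000 and K = 3 the model gives 201,940,000 bytes for eager distribution against 70,576 bytes for lazy. When one data change follows every commit (K = G), the two costs coincide, which is the degenerate case.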

Correctness Argument

The lazy approach is equivalent to eager because:

  1. LSN-order processing: When SnapBuildProcessChange() is called at LSN X, builder->snapshot reflects exactly all catalog-modifying commits with LSN < X — neither too new nor too old.

  2. Monotonic committed array: The array only grows while a long transaction holds xmin back. Snapshot N is a superset of snapshots 1..N-1. Skipping intermediate snapshots doesn't affect visibility.

  3. Unused intermediate snapshots: In the original code, intermediate snapshots with no data changes between them are effectively dead. Lazy distribution simply avoids creating these entries.

  4. Streaming mode compatibility: SnapBuildProcessChange() is always called before each data change regardless of streaming/buffering mode.

  5. Subtransaction safety: The lazy snapshot is distributed to the toplevel transaction (using txn->xid). Eager invalidation distribution always places an invalidation entry in the toplevel's list at the DDL commit LSN (< data change LSN). During binary heap iteration, the invalidation entry ensures correct ordering.
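
Points 1 to 3 can be exercised with a toy replay that builds the long transaction's change list both ways and compares the snapshot each data change would be decoded under. This is a simplified model, not PostgreSQL code: the helper names and the generation counters are hypothetical, there is a single transaction, no subtransactions, and the committed array only grows.

```python
import random


def visible_at_changes(queue):
    """Walk a change list as commit-time replay would: a snapshot marker
    replaces the current snapshot, a data change is paired with it."""
    current = ()
    seen = []
    for kind, payload in queue:
        if kind == "snapshot":
            current = payload
        else:
            seen.append((payload, current))
    return seen


def build_queue(events, lazy):
    """Build the change list from an LSN-ordered stream of
    ("catalog_commit", xid) and ("change", label) events."""
    committed = ()        # only grows, modelling a held-back xmin
    generation = 0        # hypothetical count of catalog-modifying commits
    seen_generation = 0   # hypothetical per-transaction marker
    queue = []
    for kind, payload in events:
        if kind == "catalog_commit":
            committed = committed + (payload,)
            generation += 1
            if not lazy:
                queue.append(("snapshot", committed))
        else:
            if lazy and seen_generation < generation:
                queue.append(("snapshot", committed))
                seen_generation = generation
            queue.append(("change", payload))
    return queue


def random_events(n, change_ratio, seed):
    rng = random.Random(seed)
    return [("change", f"c{i}") if rng.random() < change_ratio
            else ("catalog_commit", 1000 + i) for i in range(n)]


if __name__ == "__main__":
    events = random_events(500, 0.05, seed=1)
    eager, lazy = build_queue(events, False), build_queue(events, True)
    print(visible_at_changes(eager) == visible_at_changes(lazy))
    print(sum(k == "snapshot" for k, _ in eager), sum(k == "snapshot" for k, _ in lazy))
```

In this model every data change sees the same committed set under both schemes, while the lazy list never holds more snapshot entries than the eager one.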

Why Not Eliminate ReorderBufferAddSnapshot Entirely?

The patch explains why snapshots in the change list cannot be eliminated altogether: DecodeInsert/DecodeUpdate only enqueue raw WAL tuples. Actual catalog lookups (RelationIdGetRelation) happen during commit-time processing in ReorderBufferProcessTXN(). At that point, builder->snapshot has advanced past all changes. The INTERNAL_SNAPSHOT entries serve as markers telling commit-time processing "switch to this catalog snapshot from here on."

What About Filtering VACUUM?

The patch explicitly argues against narrower fixes like filtering VACUUM-generated invalidations. The O(N²) problem applies to any workload with many catalog-modifying commits: batch partition creation, online migration tools, extension installations, etc. A general solution at the distribution layer is needed.

Key Implementation Details

Files Modified

Invalidation Handling Preserved

Invalidation messages continue to be distributed eagerly because:

This preserves the correctness fix from Tomas Vondra's 2023 "long-standing data loss bug" thread, which added invalidation distribution to ensure in-progress transactions invalidate their relcache when concurrent publication DDL changes the set of published tables.

Relationship to Prior Work

  1. Vondra/Kapila's data loss fix (2023/2025): Renamed SnapBuildDistributeNewCatalogSnapshot to SnapBuildDistributeSnapshotAndInval and added invalidation distribution. Acknowledged "some performance regression." This patch addresses the snapshot side while keeping invalidation semantics intact.

  2. Sawada's per-transaction memory contexts: Addresses GenerationContext fragmentation — a different root cause of memory bloat. Complementary.

  3. Tachoires's LZ4 compression for spill files: Reduces cost of whatever does spill. Complementary — this patch reduces what needs to be spilled in the first place.

Performance Results

At logical_decoding_work_mem = 64MB (production-like):

At logical_decoding_work_mem = 64kB (stress):

The hot path (SnapBuildProcessChange) does fewer hash lookups than before, so there is no regression for the common case.

Limitations and Future Work

When the long-running transaction itself has continuous data changes (K approaches G, e.g., bulk COPY during vacuum storm), lazy distribution degenerates toward eager behavior. The root cause in that scenario is that VACUUM's pg_class statistics updates are treated as catalog changes even though they don't affect tuple decoding. A complementary optimization filtering non-schema-affecting catalog modifications is identified as future work.

Testing