Lazy Snapshot Distribution in Logical Decoding
Core Problem: O(N²) Snapshot Growth in the Reorder Buffer
This patch addresses a fundamental scalability problem in PostgreSQL's logical decoding infrastructure where the combination of a long-running transaction (or any xmin holder) and frequent catalog-modifying commits causes quadratic growth in reorder buffer spill files.
The Chain Reaction
The problem manifests through a four-stage chain reaction:
- VACUUM triggers catalog invalidations: When autovacuum updates pg_class statistics (relpages, reltuples, relallvisible) via heap_inplace_update_and_unlock(), it triggers CacheInvalidateHeapTuple(). At command end, LogLogicalInvalidations() emits XLOG_XACT_INVALIDATIONS WAL records.
- Decoding marks transactions as catalog-changing: xact_decode() processes these records and calls ReorderBufferXidSetCatalogChanges(), marking the VACUUM transaction as having catalog changes, even though pg_class stat updates don't affect tuple decoding semantics.
- Snapshot distribution on commit: When the VACUUM commits, SnapBuildCommitTxn() sees the catalog-change flag, builds a new snapshot via SnapBuildBuildSnapshot(), and calls SnapBuildDistributeSnapshotAndInval() to distribute the snapshot to ALL in-progress transactions.
- Monotonically growing snapshots: Since a long-running transaction prevents SnapBuildPurgeOlderTxn() from cleaning builder->committed.xip, each successive snapshot is larger than the last. Snapshot N contains N XIDs at 192 + 4*N bytes. Total disk usage = Σ(i=1..N)(192 + 4*i) = O(N²).
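The sum in the last step can be checked numerically. The sketch below assumes only the cost model stated above (192 bytes of fixed overhead plus 4 bytes per XID); the function and constant names are illustrative and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cost model from the text: snapshot i holds i XIDs and
 * occupies 192 + 4*i bytes. All names here are illustrative. */
enum { SNAPSHOT_FIXED_BYTES = 192, BYTES_PER_XID = 4 };

/* Total bytes for snapshots 1..n, summed term by term. */
uint64_t spill_bytes_sum(uint64_t n)
{
    uint64_t total = 0;

    for (uint64_t i = 1; i <= n; i++)
        total += SNAPSHOT_FIXED_BYTES + BYTES_PER_XID * i;
    return total;
}

/* Closed form of the same sum: 192*n + 4*n*(n+1)/2 = 194*n + 2*n^2. */
uint64_t spill_bytes_closed(uint64_t n)
{
    assert(n <= 1000000000ULL);   /* keeps 2*n*n within uint64_t */
    return 194 * n + 2 * n * n;
}
```

The 2*n^2 term dominates for large n, so under this model doubling the number of catalog-modifying commits roughly quadruples the bytes written for each receiving transaction.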
Why This Matters Architecturally
The problem is not academic. With N=100K catalog-modifying commits (realistic in a vacuum storm across many tables), the spill files reach ~80GB. This causes:
- Premature spilling: INTERNAL_SNAPSHOT entries consume logical_decoding_work_mem budget, forcing spills far earlier than data changes would require.
- I/O amplification: Each snapshot is serialized during spill and deserialized during commit-time processing.
- CPU overhead: ReorderBufferProcessTXN() iterates all changes via binary heap; 100K extra INTERNAL_SNAPSHOT entries mean 100K additional heap-pop operations.
- Replication lag: The combined overhead directly increases commit-to-decode latency.
Two Necessary Conditions
The trigger requires:
- Something preventing xmin advancement (the long write transaction itself, pg_dump with REPEATABLE READ, a slow logical replication consumer)
- A transaction with at least one data change present in the reorder buffer (pure read-only sessions don't appear in the reorder buffer)
Proposed Solution: Generation-Counter-Based Lazy Distribution
Design
Instead of distributing a snapshot to every in-progress transaction on each catalog-modifying commit (eager), the patch defers distribution until a transaction actually needs to decode a data change.
Key mechanism:
- A uint64 snapshot_generation counter in SnapBuild is incremented each time a new catalog snapshot is built.
- Each transaction tracks uint64 last_snapshot_generation in ReorderBufferTXN.
- SnapBuildProcessChange() checks whether the transaction's generation is behind the builder's and distributes the current snapshot only if needed.
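The mechanism above can be sketched as a toy model. This is not the patch's code: ToyBuilder, ToyTxn, and both functions are hypothetical stand-ins that model only the two counters named in the text, and base snapshot handling for a transaction's first change is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for SnapBuild and ReorderBufferTXN. */
typedef struct ToyBuilder {
    uint64_t snapshot_generation;      /* bumped per catalog snapshot built */
} ToyBuilder;

typedef struct ToyTxn {
    uint64_t last_snapshot_generation; /* generation last handed to this txn */
    unsigned snapshots_queued;         /* snapshot entries added to the txn */
} ToyTxn;

/* A catalog-modifying commit: advance the generation, touch no transaction. */
void toy_catalog_commit(ToyBuilder *builder)
{
    builder->snapshot_generation++;
}

/* Called before queuing a data change. Returns true if a snapshot had
 * to be distributed to the transaction first. */
bool toy_process_change(const ToyBuilder *builder, ToyTxn *txn)
{
    /* A transaction can never be ahead of the builder. */
    assert(txn->last_snapshot_generation <= builder->snapshot_generation);

    if (txn->last_snapshot_generation == builder->snapshot_generation)
        return false;            /* already current: nothing to add */

    txn->snapshots_queued++;     /* queue the builder's current snapshot */
    txn->last_snapshot_generation = builder->snapshot_generation;
    return true;
}
```

In this model, any number of catalog commits between two data changes costs the transaction one queued snapshot, because only the generation in effect at the next data change is ever compared.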
Complexity Analysis
| Metric | Before (Eager) | After (Lazy) |
|---|---|---|
| Snapshot distributions | G (per commit) | min(K, G) (per data change) |
| Disk usage growth | O(N²) | O(min(K, G) × N) |
Where: N = committed XID array size, G = catalog-modifying commits during the long txn, K = data changes in the long txn.
For the common production case where K << G (e.g., a long batch INSERT during a vacuum storm), the number of distributed snapshots drops from G to at most K, so disk usage grows linearly in N rather than quadratically.
Correctness Argument
The lazy approach is equivalent to eager because:
- LSN-order processing: When SnapBuildProcessChange() is called at LSN X, builder->snapshot reflects exactly all catalog-modifying commits with LSN < X, neither too new nor too old.
- Monotonic committed array: The array only grows while a long transaction holds xmin back. Snapshot N is a superset of snapshots 1..N-1. Skipping intermediate snapshots doesn't affect visibility.
- Unused intermediate snapshots: In the original code, intermediate snapshots with no data changes between them are effectively dead. Lazy distribution simply avoids creating these entries.
- Streaming mode compatibility: SnapBuildProcessChange() is always called before each data change regardless of streaming/buffering mode.
- Subtransaction safety: The lazy snapshot is distributed to the toplevel transaction (using txn->xid). Eager invalidation distribution always places an invalidation entry in the toplevel's list at the DDL commit LSN (< data change LSN). During binary heap iteration, the invalidation entry ensures correct ordering.
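The first three points can be illustrated with a small simulation. It is a sketch under strong simplifying assumptions: one toplevel transaction, no subtransactions, no invalidations, and a snapshot reduced to a generation number. The event encoding and every name in it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DATA_CHANGES 64

typedef struct ReplayResult {
    size_t   snapshots_queued;        /* snapshot entries in the change list */
    size_t   ndata;                   /* data changes recorded */
    uint64_t seen[MAX_DATA_CHANGES];  /* generation in effect per data change */
} ReplayResult;

/* events: 'C' = catalog-modifying commit, 'D' = data change in the long txn.
 * lazy == 0 queues a snapshot on every 'C'; lazy != 0 defers to the next 'D'. */
ReplayResult toy_decode(const char *events, int lazy)
{
    ReplayResult r = {0};
    uint64_t builder_gen = 0;   /* generation of the builder's snapshot */
    uint64_t queued_gen = 0;    /* newest generation queued to the txn */

    for (const char *e = events; *e != '\0'; e++) {
        if (*e == 'C') {
            builder_gen++;
            if (!lazy) {                    /* eager: queue on every commit */
                queued_gen = builder_gen;
                r.snapshots_queued++;
            }
        } else if (*e == 'D') {
            if (lazy && queued_gen != builder_gen) {
                queued_gen = builder_gen;   /* lazy: catch up just in time */
                r.snapshots_queued++;
            }
            assert(r.ndata < MAX_DATA_CHANGES);
            /* The change list is in LSN order, so replay would decode this
             * change with the most recently queued snapshot. */
            r.seen[r.ndata++] = queued_gen;
        }
    }
    return r;
}
```

For the stream "CCCDCCDDC", eager mode queues 6 snapshots and lazy mode queues 2, while every data change is decoded under the same generation in both modes (3, 5, 5).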
Why Not Eliminate ReorderBufferAddSnapshot Entirely?
The patch explains why snapshots in the change list cannot be eliminated altogether: DecodeInsert/DecodeUpdate only enqueue raw WAL tuples. Actual catalog lookups (RelationIdGetRelation) happen during commit-time processing in ReorderBufferProcessTXN(). At that point, builder->snapshot has advanced past all changes. The INTERNAL_SNAPSHOT entries serve as markers telling commit-time processing "switch to this catalog snapshot from here on."
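The marker role described above can be pictured as a commit-time replay loop. This is a minimal sketch with hypothetical types (ToyChange, ToyChangeKind) and integer snapshot ids; the real ReorderBufferProcessTXN() handles many more change kinds and uses actual snapshot structures.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical change kinds; the real reorder buffer has many more. */
typedef enum { TOY_INTERNAL_SNAPSHOT, TOY_INSERT } ToyChangeKind;

typedef struct ToyChange {
    ToyChangeKind kind;
    int snapshot_id;   /* TOY_INTERNAL_SNAPSHOT: snapshot to switch to */
    int decoded_with;  /* TOY_INSERT: filled in during replay */
} ToyChange;

/* Replay a transaction's change list in order at commit time.
 * base_snapshot is the snapshot in effect before the first marker. */
void toy_replay(ToyChange *changes, size_t n, int base_snapshot)
{
    int current = base_snapshot;

    assert(changes != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (changes[i].kind == TOY_INTERNAL_SNAPSHOT)
            current = changes[i].snapshot_id;   /* marker: switch snapshot */
        else
            changes[i].decoded_with = current;  /* catalog lookup uses this */
    }
}
```

Without the marker entries, the loop would have no way to know which snapshot applied at each point in the list, since the builder's own snapshot has already moved past all of the transaction's changes by commit time.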
What About Filtering VACUUM?
The patch explicitly argues against narrower fixes like filtering VACUUM-generated invalidations. The O(N²) problem applies to any workload with many catalog-modifying commits: batch partition creation, online migration tools, extension installations, etc. A general solution at the distribution layer is needed.
Key Implementation Details
Files Modified
- snapbuild_internal.h: Adds uint64 snapshot_generation to struct SnapBuild
- reorderbuffer.h: Adds uint64 last_snapshot_generation to ReorderBufferTXN; exports ReorderBufferTXNByXid()
- snapbuild.c: Increments generation counter in SnapBuildCommitTxn(); renames SnapBuildDistributeSnapshotAndInval() → SnapBuildDistributeInval() (removes snapshot distribution, keeps invalidation); adds lazy distribution logic in SnapBuildProcessChange() with a single hash lookup (eliminating 2-4 redundant lookups in original code)
- reorderbuffer.c: Removes static from ReorderBufferTXNByXid()
- SNAPBUILD_VERSION: Bumped 6 → 7 (existing serialized snapshots auto-invalidated on upgrade)
Invalidation Handling Preserved
Invalidation messages continue to be distributed eagerly because:
- They are small (100K vacuum commits ≈ 14MB)
- They have overflow protection via MAX_DISTR_INVAL_MSG_PER_TXN
- They cannot be deduplicated like snapshots (each carries unique cache invalidation semantics)
This preserves the correctness fix from Tomas Vondra's 2023 "long-standing data loss bug" thread, which added invalidation distribution to ensure in-progress transactions invalidate their relcache when concurrent publication DDL changes the set of published tables.
Relationship to Prior Work
- Vondra/Kapila's data loss fix (2023/2025): Renamed SnapBuildDistributeNewCatalogSnapshot and added invalidation distribution. Acknowledged "some performance regression." This patch addresses the snapshot side while keeping invalidation semantics intact.
- Sawada's per-transaction memory contexts: Addresses GenerationContext fragmentation, a different root cause of memory bloat. Complementary.
- Tachoires's LZ4 compression for spill files: Reduces cost of whatever does spill. Complementary: this patch reduces what needs to be spilled in the first place.
Performance Results
At logical_decoding_work_mem = 64MB (production-like):
- N=5000: decode time 629ms → 261ms (2.4x), spill 207MB → 0
- N=10000: decode time 1865ms → 814ms (2.3x), spill 813MB → 0
At logical_decoding_work_mem = 64kB (stress):
- N=5000: spill 247MB → 25MB (10x reduction), decode time 607ms → 289ms (2.1x)
- Master's spill scales as O(N²): 5000²/1000² = 25, actual ratio 247/9.3 = 26.6 ✓
The hot path (SnapBuildProcessChange) does fewer hash lookups than before, so there is no regression for the common case.
Limitations and Future Work
When the long-running transaction itself has continuous data changes (K approaches G, e.g., bulk COPY during vacuum storm), lazy distribution degenerates toward eager behavior. The root cause in that scenario is that VACUUM's pg_class statistics updates are treated as catalog changes even though they don't affect tuple decoding. A complementary optimization filtering non-schema-affecting catalog modifications is identified as future work.
Testing
- Isolation test: lazy_snapshot_distribution.spec, covering three scenarios (ALTER TABLE + INSERT, many CREATE TABLE + INSERT, subtransaction + DDL + INSERT)
- TAP test: 002_lazy_snapshot_spill.pl, which verifies spill_bytes=0 with 200 catalog-modifying DDLs between two INSERTs
- Full regression suite: All existing test_decoding, subscription, recovery, and pg_logicalinspect tests pass unchanged