REPACK [CONCURRENTLY]: Architectural Analysis
Core Problem
PostgreSQL has long lacked an in-core facility for concurrent table rewriting. The existing tools have significant deficiencies:
- VACUUM FULL and CLUSTER both acquire AccessExclusiveLock for the entire duration of the rewrite, making them impractical for large, hot tables.
- pg_repack (external extension) uses a "weird internal implementation" involving triggers and manual change capture, and is considered ancient/fragile.
- pg_squeeze (external, by Antonin Houska) uses logical decoding to capture concurrent changes during rewrite — a cleaner architecture.
The naming is also confusing: VACUUM FULL behaves nothing like VACUUM (it rewrites the table), and CLUSTER confusingly conflicts with the notion of a "database cluster." The patch series therefore introduces a new command REPACK that subsumes both, with an optional CONCURRENTLY flag.
Architectural Approach
The concurrent implementation is essentially pg_squeeze merged into core:
- Acquire ShareUpdateExclusiveLock (SUEL) on the target table, which allows readers and writers.
- Set up a logical replication slot with a custom output plugin (pgoutput_repack, later renamed pgrepack) that only decodes changes for one specific relation.
- A background worker consumes WAL in parallel, decoding changes into a shared file (using SharedFileSet), so WAL reservation doesn't grow unboundedly.
- Main backend performs the initial copy from old heap to new heap via table_relation_copy_for_cluster().
- Build indexes on new heap.
- Replay decoded concurrent changes from the shared file onto the new heap.
- Upgrade to AccessExclusiveLock (AEL) and swap relfilenodes (finish_heap_swap).
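The steps above can be sketched as a toy model. This is a minimal illustration, not the PostgreSQL implementation: the heap is a Python dict, the shared file is a list, and the names ToyTable, apply_change and toy_repack_concurrently are hypothetical. It only shows why copying first and replaying captured changes afterwards leaves the new heap equal to the live table.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Stand-in for a heap: maps a primary key to a row value."""
    rows: dict = field(default_factory=dict)


def apply_change(heap, change):
    """Apply one (op, key, value) change to a dict-based toy heap."""
    op, key, value = change
    if op == "delete":
        heap.pop(key, None)
    else:  # "insert" or "update"
        heap[key] = value


def toy_repack_concurrently(table, concurrent_changes):
    """Return the new heap that would be swapped in for table.rows."""
    # Initial copy: reflects the table as it looked when the copy started.
    new_heap = dict(table.rows)

    # Writers keep changing the old heap during the copy; the decoding
    # worker captures each change (a list here instead of a shared file).
    change_log = []
    for change in concurrent_changes:
        apply_change(table.rows, change)
        change_log.append(change)

    # (Indexes on the new heap would be built at this point.)

    # Replay the captured changes so the new heap catches up.
    for change in change_log:
        apply_change(new_heap, change)

    # The real command would now take the exclusive lock and swap storage.
    return new_heap


table = ToyTable({1: "a", 2: "b"})
changes = [("update", 1, "A"), ("delete", 2, None), ("insert", 3, "c")]
new_heap = toy_repack_concurrently(table, changes)
```

The model ignores everything that makes the real feature hard (visibility, locking, changes arriving after the replay), which is what the design decisions below deal with.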
Key Design Decisions & Tradeoffs
1. Syntax Unification & Deprecation
The patch reshapes CLUSTER and VACUUM FULL as obsolete spellings of REPACK. Alvaro chose a non-disruptive path: keep old commands working indefinitely, mark as "obsolete" (not "deprecated"), and retain pg_stat_progress_cluster as a view layered on pg_stat_progress_repack.
2. pg_repackdb Shell Utility — Dropped
Originally proposed as a symlink to vacuumdb (with per-name behavior), later as a standalone binary. Andres Freund objected strongly ("don't think the whole thing of having a single executable with multiple names is worth doing"). Alvaro eventually removed it from the patchset entirely, deferring to a future release.
3. MVCC Safety — Deferred
The committed implementation is not MVCC-safe: tuples in the new heap carry REPACK's xmin, not the original. Snapshots taken before the swap can see an "empty" table if they examine the new relfilenode. Mihail Nikalayeu repeatedly pushed for either:
- A preserve-visibility approach (keep original xmin/xmax), or
- A relcheckxmin mechanism (raise an error rather than silently return wrong data for old snapshots).
Antonin considered true MVCC safety too complex for PG19 (would require reworking rewriteheap.c to interact with logical decoding). Deferred to PG20.
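The visibility problem and the two proposed remedies can be illustrated with a deliberately simplified snapshot model. Everything here is an assumption made for illustration: Snapshot, scan, scan_checked and StaleSnapshotError are hypothetical names, commit status and xmax are ignored, and the check in scan_checked is only one plausible reading of how a relcheckxmin-style test could behave.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    xmax: int                             # first XID not yet assigned at snapshot time
    in_progress: frozenset = frozenset()  # XIDs still running at snapshot time

    def sees(self, xid):
        """True if work done by xid is visible (commit status is ignored)."""
        return xid < self.xmax and xid not in self.in_progress


def scan(tuples, snapshot):
    """Return payloads of (xmin, payload) tuples visible to the snapshot."""
    return [payload for xmin, payload in tuples if snapshot.sees(xmin)]


class StaleSnapshotError(Exception):
    """Raised when a snapshot predates the rewrite of the relation."""


def scan_checked(tuples, snapshot, rewrite_xid):
    """Hypothetical relcheckxmin-style scan: fail instead of returning nothing."""
    if not snapshot.sees(rewrite_xid):
        raise StaleSnapshotError("snapshot is older than the table rewrite")
    return scan(tuples, snapshot)


old_heap = [(100, "a"), (101, "b")]
old_snapshot = Snapshot(xmax=150)   # taken before the rewrite
repack_xid = 200

rewritten = [(repack_xid, payload) for _, payload in old_heap]  # new xmin
preserved = list(old_heap)                                       # original xmin kept
```

With these values the old snapshot sees both rows in the old heap and in the preserved-xmin copy, but nothing in the rewritten copy; scan_checked turns that silent empty result into an error.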
4. xmin Horizon Pinning
REPACK holds XID for its entire duration → blocks VACUUM's horizon advancement on all tables. Proposed mitigations (for PG20):
- PROC_IN_REPACK flag (like PROC_IN_VACUUM) to exclude from data horizon but keep for catalog horizon.
- Cache-only relations: create transient catalog entries that don't hit disk, so no XID needed until the final swap.
- Snapshot resetting: use multiple snapshots during copy (patch 0006 in v28+), reducing how long any single snapshot is held.
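The effect of the first mitigation can be shown with a small horizon calculation. This is a sketch under stated assumptions: Backend, data_horizon and catalog_horizon are hypothetical names, and the horizon is modelled simply as the minimum xmin across backends, which is far coarser than what the server computes.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    xmin: int                # oldest XID this backend still needs to see
    in_repack: bool = False  # stands in for the proposed PROC_IN_REPACK flag


def data_horizon(backends, honor_repack_flag):
    """Oldest xmin that limits cleanup of ordinary tables."""
    xmins = [b.xmin for b in backends
             if not (honor_repack_flag and b.in_repack)]
    return min(xmins, default=None)


def catalog_horizon(backends):
    """Oldest xmin that limits cleanup of catalogs; flagged backends still count."""
    return min((b.xmin for b in backends), default=None)


backends = [
    Backend("repack", xmin=1000, in_repack=True),  # long-running REPACK
    Backend("oltp", xmin=5000),
]
```

In this model the long-running REPACK holds the data horizon at 1000 unless the flag is honored, in which case it advances to 5000 while the catalog horizon stays at 1000.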
5. Replication Slot Contention
Every REPACK consumes a logical replication slot, which competes with actual replication. Alvaro added max_repack_replication_slots GUC (default 5) as a separate pool. Matthias van de Meent pushed back: REPLICATION privilege exists precisely because slots affect effective_wal_level cluster-wide. The compromise: REPACK does not require the REPLICATION privilege, but it draws its slots from the dedicated pool.
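A toy accounting model shows what the separate pool buys. The SlotPools class and its methods are hypothetical; the repack pool size of 5 comes from the text above, and the replication pool size of 10 is assumed here to match the usual max_replication_slots default.

```python
class SlotPools:
    """Two independent counters: one for replication, one for REPACK."""

    def __init__(self, max_replication_slots=10, max_repack_replication_slots=5):
        self.free = {
            "replication": max_replication_slots,
            "repack": max_repack_replication_slots,
        }

    def acquire(self, pool):
        """Take one slot from the named pool or fail if it is exhausted."""
        if self.free[pool] == 0:
            raise RuntimeError(f"all {pool} slots are in use")
        self.free[pool] -= 1

    def release(self, pool):
        """Return one slot to the named pool."""
        self.free[pool] += 1


pools = SlotPools()
for _ in range(5):          # five concurrent REPACK runs
    pools.acquire("repack")
```

After five concurrent REPACK runs the repack pool is exhausted, yet every replication slot is still available to real subscribers.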
6. Database-Specific Snapshot Building (commit 0d3dba38c777 — REVERTED)
To avoid REPACK-in-database-A blocking REPACK-setup-in-database-B, a mechanism was added to let SnapBuildProcessRunningXacts use database-filtered xl_running_xacts records. This was reverted near the end of the thread: Antonin discovered that COMMIT records for transactions listed in xl_running_xacts are not guaranteed to follow that record in WAL order, so the cleanup relied upon in SnapBuildCommitTxn can fail to run, leaving transactions incorrectly marked as running. The correctness issue manifested as duplicate keys and division-by-zero in Mihail's stress tests. Alvaro committed the revert.
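The ordering hazard can be reproduced in a few lines. This is only an illustration of the reasoning, not of how snapbuild.c tracks state: the record kinds, the replay function and the XIDs are all hypothetical. A consumer that marks XIDs from a running-xacts record as running, and clears them only when it later sees their commit, ends up with a stale entry if the commit record precedes the running-xacts record.

```python
def replay(wal):
    """Return the set of XIDs still considered running after reading wal.

    wal is a list of ("running_xacts", [xids]) and ("commit", xid) records.
    """
    running = set()
    for kind, payload in wal:
        if kind == "running_xacts":
            running |= set(payload)
        elif kind == "commit":
            running.discard(payload)   # cleanup happens only on commit
    return running


# Commits follow the running-xacts record: everything is cleaned up.
wal_expected = [("running_xacts", [10, 11]), ("commit", 10), ("commit", 11)]

# The commit of XID 10 was written before the record that lists it as
# running, so the cleanup for XID 10 never runs.
wal_reordered = [("commit", 10), ("running_xacts", [10, 11]), ("commit", 11)]

stale = replay(wal_reordered)
```

In the reordered stream XID 10 stays marked as running indefinitely, which is the kind of stale state the paragraph above describes.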
7. Lock Upgrade Deadlock Hazard
REPACK takes SUEL, works for hours, then upgrades to AEL. A concurrent SELECT in a transaction blocks the upgrade → deadlock detector fires → current code kills REPACK, wasting hours of work.
Multiple approaches were debated:
- Alvaro's initial prototype: deadlock detector kills anything waiting on a REPACK-held relation (PROC_IN_CONCURRENT_REPACK). Andres objected: too aggressive, would make REPACK unsafe for production.
- Andres's proposal: teach deadlock detector about the planned lock upgrade as a hypothetical edge, so conflicts are detected/cancelled early (when the blocker first waits), not after REPACK has finished all its work and tries to upgrade.
- Amit Kapila's proposal: release SUEL before AEL request, using a rel_in_use_by_repack flag checked via CheckTableNotInUse. Andres rejected: CheckTableNotInUse is not called by all lock paths; would cause spurious errors.
- Mihail's FutureWaitLock POC: declare future AEL intent in PGPROC; deadlock detector treats as hard edge. Complications around fast-path locks for SUEL from other backends.
The committed state went with a narrower form of Alvaro's approach combined with dedicated deadlock-detector changes. This remains an area with rough edges.
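The hypothetical-edge idea can be sketched with a plain wait-for graph. The scenario is an assumption made for illustration: a session already holds a lock that will conflict with REPACK's future AEL, and then starts waiting for a lock that conflicts with REPACK's SUEL. The has_cycle function and the process names are hypothetical and do not reflect the deadlock detector's actual data structures.

```python
def has_cycle(edges):
    """Detect a cycle in a wait-for graph given as (waiter, holder) pairs."""
    graph = {}
    for waiter, holder in edges:
        graph.setdefault(waiter, set()).add(holder)

    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the DFS stack, finished
    state = {}

    def visit(node):
        state[node] = GREY
        for nxt in graph.get(node, ()):
            colour = state.get(nxt, WHITE)
            if colour == GREY:     # back edge: a cycle exists
                return True
            if colour == WHITE and visit(nxt):
                return True
        state[node] = BLACK
        return False

    return any(state.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


# Real edge: the session waits for a lock that REPACK's SUEL blocks.
real_edges = {("session", "repack")}

# Hypothetical edge: REPACK will later wait for the session, because the
# session holds a lock that conflicts with the planned AEL.
planned_edges = {("repack", "session")}

detected_now = has_cycle(real_edges)
detected_early = has_cycle(real_edges | planned_edges)
```

Without the planned edge the graph has no cycle until REPACK actually requests the AEL, hours later; with it, the conflict shows up as soon as the session starts waiting.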
8. Index Progress Reporting
REPACK builds indexes internally and should not clobber pg_stat_progress_create_index. Antonin added a flag INDEX_CREATE_REPORT_PROGRESS (inverted: suppress when unset from REPACK). General infrastructure for nested progress reporting deferred.
9. need_shared_catalogs Output Plugin Flag
To let REPACK's snapshot builder skip waiting on unrelated transactions, a new output-plugin option was introduced declaring the plugin won't touch shared catalogs. Andres raised concerns that cache invalidation callbacks from third-party extensions could violate this silently ("You just need some shared_preload_library extension to register a relcache invalidation callback that accesses shared catalog"). The assertion check was hardened but not converted to a runtime elog(ERROR).
Implementation Hotspots
- src/backend/commands/repack.c (renamed from cluster.c): core logic, including rebuild_relation_finish_concurrent() for lock upgrade and swap.
- src/backend/commands/repack_worker.c: the bgworker running logical decoding.
- src/backend/replication/pgrepack/pgrepack.c: minimal output plugin that deforms tuples and writes to shared file.
- src/backend/replication/logical/snapbuild.c: modified for database-specific snapshots (later reverted).
- New GUC: max_repack_replication_slots.
- New PROC flag: PROC_IN_CONCURRENT_REPACK (in PROC_VACUUM_STATE_MASK).
Testing & Stress Testing
Mihail Nikalayeu and Srinath Reddy Sadipiralla ran extensive stress tests:
- Pgbench with 30+ concurrent clients while REPACK runs in a loop.
- Found multiple bugs: snapshot initialization order (need_full_snapshot), missing XID filtering in the output plugin (repacked_rel_locator.relNumber being InvalidOid during setup), PROC_IN_VACUUM misuse causing visibility breakage, memory leaks in change application, speculative-insert CONFIRM/ABORT filtering, and more.
Andres noted O(N) performance regression from AssertCouldGetRelation() in debug builds during heap_insert()-per-tuple copy — suggested using multi_insert for efficiency (not yet done).
Post-Commit Issues
After initial commit (around 28d534e2a, early April 2026):
- Valgrind complaint on skink from uninitialized padding in SerializedSnapshotData; fixed by zero-initializing the stack struct.
- 32-bit build warnings (VARSIZE_ANY on a union sized by sizeof(void *)); fixed using uint64.
- thorntail (wal_level=minimal) failure; tests moved to contrib/test_decoding.
- Stale bgworker_die reference.
- REPLICATION privilege incorrectly required for table owner (not invoking user), discovered by Justin Pryzby.
- Dangling min_dynamic_shared_memory crash due to uninitialized bool field; fixed.
- Alexander Lakhin's test hitting an assertion in parallel workers cross-database; related to 0d3dba38c777, fixed by the revert.
Status at End of Thread (May 2026)
The base REPACK and REPACK CONCURRENTLY features are committed. The 0d3dba38c777 database-specific-snapshots optimization has been reverted due to correctness issues. MVCC safety, xmin horizon pinning fixes, and snapshot resetting are deferred to PG20. Lock-upgrade deadlock handling remains somewhat unsatisfying; Mihail's and Andres's proposals for deadlock-detector-level fixes remain under discussion.