2026-05-06 · opus 4.7

REPACK [CONCURRENTLY]: Architectural Analysis

Core Problem

PostgreSQL has long lacked an in-core facility for concurrent table rewriting. The existing tools have significant deficiencies:

VACUUM FULL and CLUSTER both acquire AccessExclusiveLock for the entire duration of the rewrite, making them impractical for large, hot tables.
pg_repack (external extension) uses a "weird internal implementation" involving triggers and manual change capture, and is considered ancient/fragile.
pg_squeeze (external, by Antonin Houska) uses logical decoding to capture concurrent changes during rewrite — a cleaner architecture.

The naming is also confusing: VACUUM FULL behaves nothing like VACUUM (it rewrites the table), and CLUSTER conflicts with the notion of a "database cluster." The patch series therefore introduces a new command, REPACK, that subsumes both, with an optional CONCURRENTLY flag.

Architectural Approach

The concurrent implementation is essentially pg_squeeze merged into core:

Acquire ShareUpdateExclusiveLock (SUEL) on target table — allows readers and writers.
Set up a logical replication slot with a custom output plugin (pgoutput_repack → later renamed pgrepack) that only decodes changes for one specific relation.
A background worker consumes WAL in parallel, decoding changes into a shared file (using SharedFileSet), so WAL reservation doesn't grow unboundedly.
Main backend performs the initial copy from old heap to new heap via table_relation_copy_for_cluster().
Build indexes on new heap.
Replay decoded concurrent changes from the shared file onto the new heap.
Upgrade to AccessExclusiveLock (AEL) and swap relfilenodes (finish_heap_swap).
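
The copy-then-replay sequence above can be illustrated with a toy in-memory model. This is only a sketch under loud assumptions: dictionaries stand in for heaps, a Python list stands in for the shared file of decoded changes, and every name (Change, apply_change, repack_concurrently) is hypothetical rather than a PostgreSQL API. Locking, indexes, and WAL are not modelled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Change:
    """One decoded change; kind is 'insert', 'update' or 'delete'."""
    kind: str
    key: int
    value: Optional[str] = None


def apply_change(heap: Dict[int, str], change: Change) -> None:
    """Apply a single change to a heap modelled as a key -> value dict."""
    if change.kind == "delete":
        heap.pop(change.key, None)
    else:  # insert and update both leave the latest value in place
        heap[change.key] = change.value


def repack_concurrently(old_heap: Dict[int, str],
                        concurrent_writes: List[Change]) -> Dict[int, str]:
    """Toy model: copy, capture concurrent writes, replay them, then swap."""
    change_log: List[Change] = []  # stands in for the shared file

    # The initial copy sees the table as of the start of the operation.
    new_heap = dict(old_heap)

    # Writers keep modifying the old heap while the copy and index builds
    # run; the decoding worker records each change in the log.
    for change in concurrent_writes:
        apply_change(old_heap, change)
        change_log.append(change)

    # Replay the captured changes onto the new heap, in log order.
    for change in change_log:
        apply_change(new_heap, change)

    # After the lock upgrade, the new heap would replace the old one.
    return new_heap


old = {1: "a", 2: "b", 3: "c"}
writes = [Change("update", 2, "B"), Change("delete", 3), Change("insert", 4, "d")]
new = repack_concurrently(old, writes)
```

In this model the new heap ends up identical to the old heap's final contents, which is the property the replay step exists to guarantee before the swap.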

Key Design Decisions & Tradeoffs

1. Syntax Unification & Deprecation

The patch reshapes CLUSTER and VACUUM FULL as obsolete spellings of REPACK. Alvaro chose a non-disruptive path: keep old commands working indefinitely, mark as "obsolete" (not "deprecated"), and retain pg_stat_progress_cluster as a view layered on pg_stat_progress_repack.

2. pg_repackdb Shell Utility — Dropped

Originally proposed as a symlink to vacuumdb (with per-name behavior), later as a standalone binary. Andres Freund objected strongly ("don't think the whole thing of having a single executable with multiple names is worth doing"). Alvaro eventually removed it from the patchset entirely, deferring to a future release.

3. MVCC Safety — Deferred

The committed implementation is not MVCC-safe: tuples in the new heap carry REPACK's xmin, not the original. Snapshots taken before the swap can see an "empty" table if they examine the new relfilenode. Mihail Nikalayeu repeatedly pushed for either:

A preserve-visibility approach (keep original xmin/xmax), or
A relcheckxmin mechanism (raise an error rather than silently return wrong data for old snapshots).

Antonin considered true MVCC safety too complex for PG19 (would require reworking rewriteheap.c to interact with logical decoding). Deferred to PG20.
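
The visibility problem described above can be illustrated with a deliberately simplified model. This is a sketch, not PostgreSQL's visibility logic: it ignores xmax, hint bits, subtransactions, aborted transactions, and XID wraparound, and all names (Snapshot, HeapTuple, visible, rewrite_with_new_xmin) are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Snapshot:
    xmax: int                    # first XID not yet assigned when taken
    in_progress: FrozenSet[int]  # XIDs still running when taken


@dataclass(frozen=True)
class HeapTuple:
    xmin: int                    # XID that inserted the tuple
    payload: str


def visible(tup: HeapTuple, snap: Snapshot) -> bool:
    """Simplified rule: the inserter must have finished before the snapshot."""
    return tup.xmin < snap.xmax and tup.xmin not in snap.in_progress


def scan(heap: List[HeapTuple], snap: Snapshot) -> List[str]:
    """Return the payloads of all tuples the snapshot can see."""
    return [t.payload for t in heap if visible(t, snap)]


def rewrite_with_new_xmin(heap: List[HeapTuple], repack_xid: int) -> List[HeapTuple]:
    """Model of the non-MVCC-safe rewrite: every tuple gets REPACK's XID."""
    return [HeapTuple(repack_xid, t.payload) for t in heap]


old_heap = [HeapTuple(100, "a"), HeapTuple(101, "b")]
old_snapshot = Snapshot(xmax=150, in_progress=frozenset())   # taken before REPACK
new_heap = rewrite_with_new_xmin(old_heap, repack_xid=200)

rows_before_swap = scan(old_heap, old_snapshot)  # both rows
rows_after_swap = scan(new_heap, old_snapshot)   # nothing: xmin 200 >= xmax 150
```

A preserve-visibility rewrite would correspond to copying the tuples with their original xmin, in which case the old snapshot would keep seeing both rows.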

4. xmin Horizon Pinning

REPACK holds an XID for its entire duration, which blocks VACUUM's horizon advancement on all tables. Proposed mitigations (for PG20):

PROC_IN_REPACK flag (like PROC_IN_VACUUM) to exclude from data horizon but keep for catalog horizon.
Cache-only relations: create transient catalog entries that don't hit disk, so no XID needed until the final swap.
Snapshot resetting: use multiple snapshots during copy (patch 0006 in v28+), reducing how long any single snapshot is held.
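
The effect of the proposed PROC_IN_REPACK flag can be sketched as a horizon computation over a list of backends. This is a toy model under stated assumptions: the Backend class, the in_repack field, and both functions are hypothetical, XID wraparound is ignored, and the real horizon computation in PostgreSQL considers much more state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Backend:
    name: str
    xmin: Optional[int]      # oldest XID this backend still needs, if any
    in_repack: bool = False  # hypothetical PROC_IN_REPACK-style flag


def data_horizon(backends: List[Backend], honor_flag: bool) -> Optional[int]:
    """Oldest XID that VACUUM on ordinary tables must still preserve."""
    xmins = [b.xmin for b in backends
             if b.xmin is not None and not (honor_flag and b.in_repack)]
    return min(xmins) if xmins else None


def catalog_horizon(backends: List[Backend]) -> Optional[int]:
    """Catalog horizon: flagged backends are always taken into account."""
    xmins = [b.xmin for b in backends if b.xmin is not None]
    return min(xmins) if xmins else None


backends = [Backend("repack", 1000, in_repack=True), Backend("oltp", 5000)]
pinned = data_horizon(backends, honor_flag=False)   # held back by REPACK
relaxed = data_horizon(backends, honor_flag=True)   # REPACK excluded
catalog = catalog_horizon(backends)                 # still held back
```

With the flag honored, the long-running REPACK no longer holds back cleanup of ordinary tables in this model, while the catalog horizon stays where REPACK needs it.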

5. Replication Slot Contention

Every REPACK consumes a logical replication slot, which competes with actual replication. Alvaro added max_repack_replication_slots GUC (default 5) as a separate pool. Matthias van de Meent pushed back: REPLICATION privilege exists precisely because slots affect effective_wal_level cluster-wide. Under the compromise, REPACK does not require the REPLICATION privilege, but its slots come from the dedicated pool.

6. Database-Specific Snapshot Building (commit 0d3dba38c777 — REVERTED)

To avoid REPACK-in-database-A blocking REPACK-setup-in-database-B, a mechanism was added to let SnapBuildProcessRunningXacts use database-filtered xl_running_xacts records. This was reverted near the end of the thread: Antonin discovered that COMMIT records for transactions listed in xl_running_xacts are not guaranteed to follow that record in WAL order, so the cleanup relied upon in SnapBuildCommitTxn can fail to run, leaving transactions incorrectly marked as running. The correctness issue manifested as duplicate keys and division-by-zero in Mihail's stress tests. Alvaro committed the revert.
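
The ordering hazard can be illustrated with a toy WAL stream. This is a sketch of the failure mode as summarized above, not of snapbuild.c: the WAL is a list of (kind, payload) tuples, the function name is hypothetical, and the only cleanup modelled is removal of an XID when its COMMIT is seen after the starting record.

```python
from typing import List, Set, Tuple, Union

WalRecord = Tuple[str, Union[int, List[int]]]


def xids_left_running(wal: List[WalRecord], start_index: int) -> Set[int]:
    """Decode from a running_xacts record; return XIDs still believed running."""
    kind, payload = wal[start_index]
    if kind != "running_xacts":
        raise ValueError("decoding must start at a running_xacts record")
    running = set(payload)
    for kind, payload in wal[start_index + 1:]:
        if kind == "commit":
            running.discard(payload)  # cleanup happens only on a later COMMIT
    return running


# Assumed ordering: every listed transaction commits after the record.
wal_expected = [("running_xacts", [10, 11]), ("commit", 10), ("commit", 11)]

# Problematic ordering: XID 10 is listed as running, but its COMMIT record
# already precedes the running_xacts record in WAL.
wal_problem = [("commit", 10), ("running_xacts", [10, 11]), ("commit", 11)]

ok = xids_left_running(wal_expected, 0)
stuck = xids_left_running(wal_problem, 1)
```

In the second stream XID 10 never leaves the running set, which corresponds to the "incorrectly marked as running" state described above.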

7. Lock Upgrade Deadlock Hazard

REPACK takes SUEL, works for hours, then upgrades to AEL. A concurrent SELECT in a transaction blocks the upgrade → deadlock detector fires → current code kills REPACK, wasting hours of work.

Multiple approaches were debated:

Alvaro's initial prototype: deadlock detector kills anything waiting on a REPACK-held relation (PROC_IN_CONCURRENT_REPACK). Andres objected: too aggressive, would make REPACK unsafe for production.
Andres's proposal: teach deadlock detector about the planned lock upgrade as a hypothetical edge, so conflicts are detected/cancelled early (when the blocker first waits), not after REPACK has finished all its work and tries to upgrade.
Amit Kapila's proposal: release SUEL before AEL request, using a rel_in_use_by_repack flag checked via CheckTableNotInUse. Andres rejected: CheckTableNotInUse is not called by all lock paths; would cause spurious errors.
Mihail's FutureWaitLock POC: declare future AEL intent in PGPROC; deadlock detector treats as hard edge. Complications around fast-path locks for SUEL from other backends.

The committed state went with a narrower form of Alvaro's approach combined with dedicated deadlock-detector changes. This remains an area with rough edges.
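
The idea behind treating the planned lock upgrade as a hypothetical edge can be sketched with a small waits-for graph. This is an illustration only, assuming a single table and two processes; has_cycle and with_planned_upgrade are hypothetical helpers, and PostgreSQL's deadlock detector works on lock queues rather than on a plain graph like this.

```python
from typing import Dict, Iterable, Set

Graph = Dict[str, Set[str]]  # waiter -> processes it is waiting for


def has_cycle(edges: Graph) -> bool:
    """Depth-first search for a cycle in a waits-for graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    nodes = set(edges) | {n for targets in edges.values() for n in targets}
    colour = {n: WHITE for n in nodes}

    def visit(node: str) -> bool:
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            if colour[nxt] == GREY:
                return True  # back edge: a cycle exists
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in nodes)


def with_planned_upgrade(edges: Graph, upgrader: str,
                         conflicting_holders: Iterable[str]) -> Graph:
    """Add edges for a lock upgrade that has been announced but not requested."""
    merged = {k: set(v) for k, v in edges.items()}
    merged.setdefault(upgrader, set()).update(conflicting_holders)
    return merged


# "session" ran a SELECT on the table (so it holds a lock that conflicts
# with AEL) and now waits for a lock that conflicts with REPACK's SUEL.
real_edges = {"session": {"repack"}}

seen_today = has_cycle(real_edges)
seen_early = has_cycle(with_planned_upgrade(real_edges, "repack", {"session"}))
```

Without the hypothetical edge, the cycle only appears once REPACK actually requests AEL, after all the work is done. With it, the conflict is visible as soon as the other session starts waiting.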

8. Index Progress Reporting

REPACK builds indexes internally and should not clobber pg_stat_progress_create_index. Antonin added a flag, INDEX_CREATE_REPORT_PROGRESS, with inverted sense: index builds report progress only when it is set, and REPACK leaves it unset. General infrastructure for nested progress reporting was deferred.

9. need_shared_catalogs Output Plugin Flag

To let REPACK's snapshot builder skip waiting on unrelated transactions, a new output-plugin option was introduced declaring the plugin won't touch shared catalogs. Andres raised concerns that cache invalidation callbacks from third-party extensions could violate this silently ("You just need some shared_preload_library extension to register a relcache invalidation callback that accesses shared catalog"). The assertion check was hardened but not converted to a runtime elog(ERROR).

Implementation Hotspots

src/backend/commands/repack.c (renamed from cluster.c): core logic, including rebuild_relation_finish_concurrent() for lock upgrade and swap.
src/backend/commands/repack_worker.c: the bgworker running logical decoding.
src/backend/replication/pgrepack/pgrepack.c: minimal output plugin — deforms tuples and writes to shared file.
src/backend/replication/logical/snapbuild.c: modified for database-specific snapshots (later reverted).
New GUC: max_repack_replication_slots.
New PROC flag: PROC_IN_CONCURRENT_REPACK (in PROC_VACUUM_STATE_MASK).

Testing & Stress Testing

Mihail Nikalayeu and Srinath Reddy Sadipiralla ran extensive stress tests:

Pgbench with 30+ concurrent clients while REPACK runs in a loop.
Found multiple bugs: snapshot initialization order (need_full_snapshot), missing XID filtering in the output plugin (repacked_rel_locator.relNumber being InvalidOid during setup), PROC_IN_VACUUM misuse causing visibility breakage, memory leaks in change application, speculative-insert CONFIRM/ABORT filtering, and more.

Andres noted an O(N) performance regression from AssertCouldGetRelation() in debug builds during the heap_insert()-per-tuple copy, and suggested using multi_insert for efficiency (not yet done).

Post-Commit Issues

After initial commit (around 28d534e2a, early April 2026):

Valgrind complaint on skink from uninitialized padding in SerializedSnapshotData — fixed by zero-initializing the stack struct.
32-bit build warnings (VARSIZE_ANY on a union sized by sizeof(void *)) — fixed using uint64.
thorntail (wal_level=minimal) failure — tests moved to contrib/test_decoding.
Stale bgworker_die reference.
REPLICATION privilege incorrectly required for table owner (not invoking user), discovered by Justin Pryzby.
Dangling min_dynamic_shared_memory crash due to uninitialized bool field — fixed.
Alexander Lakhin's test hitting an assertion in parallel workers cross-database — related to 0d3dba38c777, fixed by the revert.

Status at End of Thread (May 2026)

The base REPACK and REPACK CONCURRENTLY features are committed. The 0d3dba38c777 database-specific-snapshots optimization has been reverted due to correctness issues. MVCC safety, xmin horizon pinning fixes, and snapshot resetting are deferred to PG20. Lock-upgrade deadlock handling remains somewhat unsatisfying; Mihail's and Andres's proposals for deadlock-detector-level fixes remain under discussion.

Adding REPACK [concurrently]
