Adding REPACK [concurrently]

First seen: 2025-07-26 21:56:04+00:00 · Messages: 348 · Participants: 29

Latest Update

2026-05-06 · opus 4.7

REPACK [CONCURRENTLY]: Architectural Analysis

Core Problem

PostgreSQL has long lacked an in-core facility for concurrent table rewriting, and the existing tools have significant deficiencies.

The naming is also confusing: VACUUM FULL behaves nothing like plain VACUUM (it rewrites the table rather than cleaning it in place), and CLUSTER collides with the unrelated notion of a "database cluster." The patch series therefore introduces a new command, REPACK, that subsumes both, with an optional CONCURRENTLY flag.

Architectural Approach

The concurrent implementation is essentially pg_squeeze merged into core:

  1. Acquire ShareUpdateExclusiveLock (SUEL) on target table — allows readers and writers.
  2. Set up a logical replication slot with a custom output plugin (pgoutput_repack → later renamed pgrepack) that only decodes changes for one specific relation.
  3. A background worker consumes WAL in parallel, decoding changes into a shared file (using SharedFileSet), so WAL reservation doesn't grow unboundedly.
  4. Main backend performs the initial copy from old heap to new heap via table_relation_copy_for_cluster().
  5. Build indexes on new heap.
  6. Replay decoded concurrent changes from the shared file onto the new heap.
  7. Upgrade to AccessExclusiveLock (AEL) and swap relfilenodes (finish_heap_swap).
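
In user-facing terms, the pipeline above is driven by a single command. A hedged sketch of the invocation, with the caveat that the exact option grammar comes from the patch series and may differ in any released version:

```sql
-- Plain rewrite (takes AccessExclusiveLock for the whole operation,
-- like VACUUM FULL today):
REPACK my_table;

-- Concurrent rewrite: only ShareUpdateExclusiveLock until the final
-- relfilenode swap, so readers and writers keep running:
REPACK (CONCURRENTLY) my_table;

-- Cluster-style rewrite, ordering the new heap by an index:
REPACK (CONCURRENTLY) my_table USING INDEX my_table_pkey;
```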

Key Design Decisions & Tradeoffs

1. Syntax Unification & Deprecation

The patch recasts CLUSTER and VACUUM FULL as obsolete spellings of REPACK. Alvaro chose a non-disruptive path: keep the old commands working indefinitely, mark them as "obsolete" (not "deprecated"), and retain pg_stat_progress_cluster as a view layered on pg_stat_progress_repack.
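
The old-to-new correspondence, sketched under the same caveat that the exact grammar is the patch's and may change:

```sql
-- Obsolete spelling                    -- REPACK equivalent
VACUUM FULL my_table;                   -- REPACK my_table;
CLUSTER my_table USING my_index;        -- REPACK my_table USING INDEX my_index;
```

Both old spellings continue to work; they are simply documented as obsolete surface syntax for the same operation.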

2. pg_repackdb Shell Utility — Dropped

Originally proposed as a symlink to vacuumdb (with behavior keyed on the invoked name), later as a standalone binary. Andres Freund objected strongly ("don't think the whole thing of having a single executable with multiple names is worth doing"). Alvaro eventually removed it from the patchset entirely, deferring it to a future release.

3. MVCC Safety — Deferred

The committed implementation is not MVCC-safe: tuples in the new heap carry REPACK's xmin, not the original one. Snapshots taken before the swap can see an "empty" table if they examine the new relfilenode. Mihail Nikalayeu repeatedly pushed for a proper fix.

Antonin considered true MVCC safety too complex for PG19 (would require reworking rewriteheap.c to interact with logical decoding). Deferred to PG20.

4. xmin Horizon Pinning

REPACK holds its XID for its entire duration → blocks VACUUM's horizon advancement on all tables in the cluster. Proposed mitigations were deferred to PG20.
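
The pin is visible from standard catalog views. A minimal monitoring query (pg_stat_activity and its backend_xmin column are stock PostgreSQL; only the query-text filter assumes REPACK syntax):

```sql
-- While a long REPACK runs, its backend_xmin stays fixed; every table's
-- dead tuples accumulated after that point cannot be vacuumed away.
SELECT pid, backend_xid, backend_xmin, state, query
FROM pg_stat_activity
WHERE query ILIKE 'repack%';
```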

5. Replication Slot Contention

Every REPACK consumes a logical replication slot, which competes with actual replication. Alvaro added a max_repack_replication_slots GUC (default 5) as a separate pool. Matthias van de Meent pushed back: the REPLICATION privilege exists precisely because slots affect effective_wal_level cluster-wide. The compromise: REPLICATION privilege is not required, but repacks draw from the dedicated pool.
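
Sketched configuration and observability, assuming the GUC name from the thread (pg_replication_slots itself is stock PostgreSQL):

```sql
-- postgresql.conf (GUC name per the patch discussion; default 5):
--   max_repack_replication_slots = 5

-- Repack-owned slots appear alongside ordinary logical slots:
SELECT slot_name, plugin, slot_type, active
FROM pg_replication_slots;
```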

6. Database-Specific Snapshot Building (commit 0d3dba38c777 — REVERTED)

To avoid REPACK-in-database-A blocking REPACK-setup-in-database-B, a mechanism was added to let SnapBuildProcessRunningXacts use database-filtered xl_running_xacts records. This was reverted near the end of the thread: Antonin discovered that COMMIT records for transactions listed in xl_running_xacts are not guaranteed to follow that record in WAL order, so the cleanup relied upon in SnapBuildCommitTxn can fail to run, leaving transactions incorrectly marked as running. The correctness issue manifested as duplicate keys and division-by-zero in Mihail's stress tests. Alvaro committed the revert.

7. Lock Upgrade Deadlock Hazard

REPACK takes SUEL, works for hours, then upgrades to AEL. A concurrent SELECT in a transaction blocks the upgrade → deadlock detector fires → current code kills REPACK, wasting hours of work.
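
The hazard can be illustrated with two sessions (an illustrative sketch; the REPACK syntax is the patch's):

```sql
-- Session A: long-running concurrent repack; eventually needs AEL
-- for the relfilenode swap.
REPACK (CONCURRENTLY) big_table;

-- Session B: an open transaction holding AccessShareLock on the table.
BEGIN;
SELECT count(*) FROM big_table;
-- ...transaction left open. A's SUEL→AEL upgrade queues behind B;
-- if B now requests anything that waits on A, the deadlock detector
-- fires and (in the committed code) cancels A, discarding the rewrite.
```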

Multiple approaches were debated.

The committed state went with a narrower form of Alvaro's approach combined with dedicated deadlock-detector changes. This remains an area with rough edges.

8. Index Progress Reporting

REPACK builds indexes internally and should not clobber pg_stat_progress_create_index. Antonin added an INDEX_CREATE_REPORT_PROGRESS flag, which REPACK leaves unset to suppress reporting. General infrastructure for nested progress reporting was deferred.
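
How this looks from the monitoring side (pg_stat_progress_create_index and pg_stat_progress_cluster are stock views; pg_stat_progress_repack is assumed from the patch):

```sql
-- A running repack reports here:
SELECT * FROM pg_stat_progress_repack;

-- ...and, for legacy tooling, through the compatibility view:
SELECT * FROM pg_stat_progress_cluster;

-- REPACK's internal index builds are suppressed, so they should
-- NOT appear as phantom rows here:
SELECT * FROM pg_stat_progress_create_index;
```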

9. need_shared_catalogs Output Plugin Flag

To let REPACK's snapshot builder skip waiting on unrelated transactions, a new output-plugin option was introduced declaring the plugin won't touch shared catalogs. Andres raised concerns that cache invalidation callbacks from third-party extensions could violate this silently ("You just need some shared_preload_library extension to register a relcache invalidation callback that accesses shared catalog"). The assertion check was hardened but not converted to a runtime elog(ERROR).

Implementation Hotspots

Testing & Stress Testing

Mihail Nikalayeu and Srinath Reddy Sadipiralla ran extensive stress tests.

Andres noted an O(N) performance regression in debug builds, caused by AssertCouldGetRelation() being invoked for every heap_insert() during the per-tuple copy; he suggested using multi_insert for efficiency (not yet done).

Post-Commit Issues

After the initial commit (around 28d534e2a, early April 2026), several issues surfaced.

Status at End of Thread (May 2026)

The base REPACK and REPACK CONCURRENTLY features are committed. The 0d3dba38c777 database-specific-snapshots optimization was reverted due to correctness issues. MVCC safety, xmin-horizon pinning fixes, and snapshot resetting are deferred to PG20. Lock-upgrade deadlock handling remains somewhat unsatisfying; Mihail's and Andres's proposals for deadlock-detector-level fixes remain under discussion.