Thread Analysis: Disabling Heap-Only Tuples / local_update_limit / COMPACT
The Architectural Problem
PostgreSQL's HOT (Heap-Only Tuple) optimization is a cornerstone of MVCC update performance: when an UPDATE can place the new tuple version on the same heap page as the old, and no indexed columns changed, the update avoids touching indexes entirely and can be cleaned up by in-page pruning (`heap_page_prune`). This reduces WAL volume, index bloat, and I/O.
However, HOT has an unintended anti-compaction consequence. Even without HOT, `heap_update` (via the local-page preference in the `RelationGetBufferForTuple` path) strongly prefers keeping the new tuple on the old tuple's page. The practical effect: free space in low-numbered blocks of a bloated relation is never used by UPDATEs on tuples located in high-numbered blocks. A table that has been mass-deleted in its head cannot naturally shrink via organic DML, because tuples at the tail stay at the tail. The only remedies are:
- `VACUUM FULL`/`CLUSTER` — takes AccessExclusiveLock, rewrites the whole relation, doubles peak disk usage.
- `pg_repack`/`pg_squeeze` — external, fragile, also needs roughly live-size scratch space, and has known edge cases.
- Manual `UPDATE ... WHERE ctid > ...` loops — rely on exhausting local page space so the update spills to FSM-chosen pages, which is unreliable and slow.
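The same-page preference described above can be illustrated with a toy model. This is illustrative Python, not PostgreSQL's actual heapam/hio logic; the page-capacity constant and function names are invented:

```python
# Toy model of heap_update's same-page preference (illustrative only;
# the real logic lives in PostgreSQL's heapam.c / hio.c).
TUPLES_PER_PAGE = 4

def place_update(heap, old_page):
    """Place a new tuple version: prefer the old tuple's page."""
    if len(heap[old_page]) < TUPLES_PER_PAGE:   # room on the same page?
        heap[old_page].append("new")            # local (possibly HOT) update
        return old_page
    # Only when the local page is full does the FSM get consulted.
    for page, slots in enumerate(heap):
        if len(slots) < TUPLES_PER_PAGE:
            slots.append("new")
            return page
    raise RuntimeError("relation would be extended")

# Pages 0-1 were mass-deleted (empty); live tuples sit on tail pages 2-3.
heap = [[], [], ["t"] * 3, ["t"] * 3]
# Updating a tail tuple stays on its own page: head free space is never used.
assert place_update(heap, 3) == 3
```

Only by filling the tail page completely (the "exhaust local page space" trick) does an update spill forward, which is exactly why the manual ctid loops are unreliable.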
Thom Brown opened the thread proposing a superuser-level ability to disable HOT on a table to enable such manual compaction. Matthias van de Meent immediately reframed the problem correctly: disabling HOT is the wrong lever — the relevant mechanism is `heap_update`'s preference for same-page placement, which operates independently of HOT. The real request is "force UPDATEs to go through the FSM."
Design Evolution: From Boolean to local_update_limit
Matthias's first POC patch (2023-07-06) introduced a reloption `max_local_update` (later renamed `local_update_limit`) expressed in MiB. Semantics: for blocks numbered higher than the limit, skip the local-page fast path in `heap_update` and force an FSM lookup. Lock level: ShareUpdateExclusiveLock (same as `fillfactor`). Thom's test run demonstrated it works end-to-end: a 730 MB table with 13.83% free space was shrunk to 626 MB with 0.39% free space via a scripted ctid-descending UPDATE loop.
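The POC's decision rule, as described in the thread, can be sketched like this (Python rather than the patch's C; function and parameter names are hypothetical):

```python
BLCKSZ = 8192  # default PostgreSQL heap page size

def prefers_local_update(old_block, local_update_limit_mib):
    """Sketch of the POC's rule: tuples on blocks past the limit must go
    through the FSM; below it, heap_update keeps its same-page preference."""
    if local_update_limit_mib is None:          # reloption unset: old behavior
        return True
    limit_blocks = local_update_limit_mib * 1024 * 1024 // BLCKSZ
    return old_block < limit_blocks

# With local_update_limit = 8 MiB (1024 blocks of 8 KiB):
assert prefers_local_update(100, 8) is True     # head block: local update OK
assert prefers_local_update(5000, 8) is False   # tail block: forced FSM lookup
```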
Several refinements followed:
- Low-fillfactor interaction (Matthias, 2023-07-07): FSM lookups always respect `fillfactor`, while local updates ignore it. So `local_update_limit` combined with a low fillfactor would actually bloat the table, because every update past the limit would be confined to (1 - fillfactor) space per page. Matthias patched this (2023-08-30): if the FSM-selected target page is ≥ the old tuple's page, fall back to a local update. This closed the worst footgun — failed compaction attempts now degrade to ordinary HOT updates rather than forced non-HOT updates to the same or a later page.
- Naming debate: `max_local_update` → `avoid_hot_above_size_mb` (Laurenz) → `local_update_limit` (Matthias's argument: the parameter affects all updates including non-HOT, since same-page preference is not HOT-specific).
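The 2023-08-30 fallback rule can be sketched as follows (hypothetical Python, not the patch's C; `None` models an FSM miss):

```python
def choose_target_page(old_page, fsm_target):
    """Sketch of the 2023-08-30 fix: a forced FSM lookup that would land on
    the same or a later page is useless for compaction, so degrade to an
    ordinary local update instead of a forced non-HOT move."""
    if fsm_target is None or fsm_target >= old_page:
        return old_page          # fall back: ordinary (possibly HOT) update
    return fsm_target            # genuine forward progress toward the head

assert choose_target_page(500, 10) == 10      # moves toward the head
assert choose_target_page(500, 700) == 500    # would move backward: stay local
```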
The Central Disagreement: Is This a Footgun?
Robert Haas (committer, heavy influence) became the principal skeptic from 2023-08-28 onward. His critique has two parts:
(a) The user cannot compute the correct value. The right setting is roughly "the minimum size the table can be compacted to," which depends on live tuple count, average tuple width, fillfactor, and index requirements — information the user typically does not have, and which shifts as the workload changes. Robert's work_mem analogy: a parameter whose correct value you must continuously re-derive from system state is a parameter the user will get wrong.
(b) Misconfiguration doesn't just underperform — it amplifies bloat. Robert's key scenario (2023-09-19): on a heavily-updated table with `local_update_limit` set too low, occasional free slots appear in low-numbered pages (because other tuples moved out). Updates will opportunistically grab those slots as non-HOT updates, filling the low pages. The next update of those tuples cannot stay local and must spill to a high page as non-HOT. Net: HOT updates that would have been free get converted to non-HOT updates in both directions, generating index bloat and dead line pointers that only full VACUUM can reclaim. An "anti-bloat feature that becomes a bloat-amplification feature."
Matthias's rebuttals: (1) the 2023-08-30 patch prevents the "moved to same or later page" case; (2) every tuning knob is a footgun if misused (fillfactor, autovacuum parameters); (3) in the absence of this, users reach for strictly worse tools (pg_repack, VACUUM FULL). Laurenz Albe backed Matthias: users only set this during emergency compaction and unset it after.
Alvaro's Auto-Reset Idea and Andres's Broader Objection
Alvaro Herrera proposed (2023-09-19) that the reloption auto-clear itself once the table has shrunk to ~1.2× the configured limit (tied to autovacuum_vacuum_scale_factor). This would neutralize the "DBA set this a decade ago and forgot" failure mode. Laurenz suggested doing the reset inside vacuum truncation, piggybacking on the AEL already held.
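Alvaro's auto-clear condition is a simple threshold check; a sketch in Python, with the ~1.2× factor from the thread modeled as a `scale` parameter (all names here are illustrative, not a proposed API):

```python
def should_auto_clear(table_blocks, limit_blocks, scale=0.2):
    """Sketch of Alvaro's idea: clear local_update_limit automatically once
    the table has shrunk to within ~(1 + scale) of the configured limit,
    with scale modeled on autovacuum_vacuum_scale_factor."""
    return table_blocks <= limit_blocks * (1 + scale)

assert should_auto_clear(1150, 1000) is True    # within ~1.2x: reloption clears
assert should_auto_clear(5000, 1000) is False   # still bloated: keep the limit
```

Laurenz's refinement would run this check during vacuum truncation, where the AccessExclusiveLock needed to rewrite the reloption is already held.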
Andres Freund (committer, storage/executor authority) rejected the entire reloption-based framing (2023-09-19, 2023-09-21):
- Controlling compaction by table size is fundamentally wrong; it should be controlled by FSM free-space fraction. The value `(relation_size - known_fsm_free) * factor` gives a self-adjusting threshold that scales with the table and doesn't need retuning.
- SQL-driven compaction via user UPDATE loops is awkward and slow; it requires tooling to generate the updates for tuples that the application itself never touches.
- The right architecture is explicit compaction inside VACUUM or a dedicated command, with rate limiting (to avoid WAL explosion and FSM contention on low-numbered pages), scanning the relation backwards and moving tuples forward until FSM targets are exhausted.
- MVCC constraint: moving a tuple requires a transaction — you cannot physically relocate a tuple without xmin/xmax bookkeeping, else you break visibility (seeing the same tuple twice or never).
- Side remark on reducing the AEL requirement for relation truncation using a future "shared smgrrelation" infrastructure (Thomas Munro's work), storing separate filesystem-size and valid-size, letting truncation proceed without blocking readers.
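Andres's threshold formula is self-adjusting in an obvious way once written out (the `factor` value below is an arbitrary illustration; the thread does not fix one):

```python
MiB = 1024 * 1024

def compaction_threshold(relation_size, known_fsm_free, factor=0.8):
    """Andres's self-adjusting cutoff: compact tuples living beyond
    (relation_size - known_fsm_free) * factor bytes. As the FSM records
    more free space, the threshold moves down automatically, so it needs
    no retuning as the table grows or shrinks."""
    return (relation_size - known_fsm_free) * factor

# A 1 GiB relation with 400 MiB of known free space: cutoff ~499 MiB.
assert compaction_threshold(1024 * MiB, 400 * MiB, 0.8) == (1024 - 400) * MiB * 0.8
```

Contrast with `local_update_limit`: there the DBA must re-derive the right number from live-tuple statistics every time the workload shifts; here the FSM supplies the moving part.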
Stephen Frost reinforced the "VACUUM should drive this" view with a concrete sketch: VACUUM scans the FSM/VM, sets a flag on the relation if second-half-fits-in-first-half, and that flag nudges UPDATEs and VACUUM itself to migrate tail tuples forward. All-frozen pages serve as the "old enough to move" criterion.
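Stephen's "second half fits in the first half" trigger reduces to a sum over per-page FSM/VM data; a hedged sketch (function name and data layout invented for illustration):

```python
def should_flag_for_compaction(pages):
    """Sketch of Stephen Frost's VACUUM-driven heuristic. pages is a
    head-to-tail list of (live_bytes, free_bytes) per heap page, as VACUUM
    could derive from the FSM and visibility map. Flag the relation when
    everything live in the second half would fit into the free space of
    the first half."""
    half = len(pages) // 2
    head_free = sum(free for _live, free in pages[:half])
    tail_live = sum(live for live, _free in pages[half:])
    return tail_live <= head_free

# Head pages nearly empty, tail pages full: worth migrating tail tuples.
assert should_flag_for_compaction([(100, 7900), (100, 7900), (6000, 0), (6000, 0)]) is True
```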
Resolution and the 2026 COMPACT POC
Matthias withdrew the patch in March 2024, acknowledging he lacked bandwidth to build the FSM-driven/statistics-driven variant the committers wanted. A TODO was added.
In May 2026, James Locke posted a POC of exactly the architecture Andres had advocated: a server-side COMPACT command built on new primitives:
- `heap_relocate`: a stripped-down sibling of `heap_update` that moves a byte-identical tuple to a caller-chosen page. Skips HOT detection, modified-attribute analysis, TOAST, and RI extraction. Sets xmin of the new tuple and xmax of the old to the same xid — crucially, this preserves MVCC visibility: any snapshot either sees both "as running" or both "as committed" (Alvaro's seqscan concern is resolved by this invariant, identical to a cross-page UPDATE).
- `XLH_UPDATE_RELOCATED`: a flag on `xl_heap_update` so logical decoding (`DecodeUpdate`) filters these records out — subscribers see no phantom UPDATE events. This is a notable subtlety: without it, logical replicas would receive meaningless replication traffic.
- `lazy_compact_heap`: a new VACUUM phase walking pages high-to-low, calling `heap_relocate` with FSM-chosen low-numbered targets and inserting matching index entries via `index_insert` (`UNIQUE_CHECK_NO`). Monotonic progress invariant: only place tuples on pages strictly lower than the source.
- `RelationGetSpecificBufferForTuple`: a new `hio.c` primitive for placing tuples on an explicit target block.
- `COMPACT` top-level command: runs compact → prune+truncate → analyze as three `vacuum()` invocations. Only a brief AEL during truncation.
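The high-to-low walk and its monotonic progress invariant can be modeled in a few lines (a toy model, not the POC's C; the dict-based FSM and function names are invented):

```python
def compact(heap_free, live_pages):
    """Toy model of the POC's lazy_compact_heap loop: walk pages high-to-low
    and relocate tuples only to FSM-chosen pages strictly lower than the
    source (the monotonic progress invariant), stopping when the FSM has
    no suitable targets left. heap_free[p] = free tuple slots on page p."""
    moves = []
    for src in sorted(live_pages, reverse=True):        # high-to-low scan
        # FSM lookup, restricted to pages strictly below the source page.
        dst = next((p for p in range(src) if heap_free.get(p, 0) > 0), None)
        if dst is None:
            break                                       # FSM targets exhausted
        heap_free[dst] -= 1
        moves.append((src, dst))                        # heap_relocate(src, dst)
    return moves

# Free slots on head pages 0-1; live tuples on tail pages 3 and 4.
assert compact({0: 1, 1: 1, 2: 0}, [3, 4]) == [(4, 0), (3, 1)]
```

The strictly-lower rule is what guarantees termination and forward progress: a relocation can never recreate work for a later iteration of the scan.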
Tradeoffs, per James's own benchmark on a 24 MB / 100K-row workload: COMPACT writes ~24% more WAL than VACUUM FULL / `pg_repack` (one cross-page update plus index inserts per tuple vs. a bulk rewrite) but needs only ~1 page of peak extra disk space instead of ~live-size. Index bloat is unavoidable and documented: REINDEX CONCURRENTLY is the recommended follow-up. No tuning parameter.
Known rough edges: the three-pass `vacuum()` structure is ugly and really wants a single invocation that commits mid-flight to get a fresh snapshot; and the POC calls `index_insert` directly instead of `ExecInsertIndexTuples`, because the executor scaffolding (`es_snapshot`, `ECxt`, range table) can't be assembled cleanly from outside the executor.
Key Technical Insights
- HOT was never the right lever. The same-page preference in `heap_update` operates orthogonally to HOT. Any "disable HOT" story is actually a "disable same-page preference" story.
- Relation size is the wrong control variable. FSM free-space fraction is the right one, because it self-adjusts and because it actually reflects the question the system is trying to answer ("is there reclaimable space in low pages?").
- MVCC forces tuple movement to cost an xid pair. Any compaction mechanism — userland UPDATE loop, VACUUM phase, or dedicated command — must either use full UPDATE semantics or a specialized primitive like `heap_relocate` that still does xmin/xmax bookkeeping. You cannot "physically move" a tuple.
- Index bloat is intrinsic to non-rewriting compaction. Moving tuples without rewriting indexes means old index entries persist until a subsequent VACUUM. REINDEX CONCURRENTLY is the modern answer to this concern (Laurenz's point), which makes the tradeoff much more tolerable than in the old-VACUUM-FULL days.
- Logical decoding is a first-class concern for new heap operations. The `XLH_UPDATE_RELOCATED` flag in James's POC reflects the lesson that any new type of heap record must be filterable from logical decoding output to avoid polluting subscribers.
- AEL for relation truncation is increasingly seen as obsolete. Andres's side remark about shared-smgrrelation-based truncation points to a larger ongoing direction to eliminate the AEL from vacuum truncation, motivated by hot-standby workloads.