[Bug]Vacuum full silently NULL out fast default columns

First seen: 2026-05-04 09:17:08+00:00 · Messages: 6 · Participants: 3

Latest Update

2026-05-06 · opus 4.7

Core Problem: Fast Default Materialization Lost During Table Rewrite

Background: Fast Defaults (ALTER TABLE ADD COLUMN ... DEFAULT)

Since PostgreSQL 11 (commit 16828d5c0273), ALTER TABLE ADD COLUMN ... DEFAULT <const> avoids rewriting the entire table for constant (or immutable) defaults. Instead of physically touching every existing tuple, the default value is stashed in pg_attribute.attmissingval and the attribute's atthasmissing flag is set. When the heap is read, tuples whose HeapTupleHeader.t_natts is less than the relation's current natts have the "missing" trailing attributes synthesized on the fly by slot_getmissingattrs() / heap_deform_tuple() logic, which consults TupleDesc.attrs[i].attmissingval.

This is why the storage format allows "short" tuples: a tuple written before the ADD COLUMN has fewer physical attributes than the current tuple descriptor, and the missing columns are virtualized from catalog metadata.

The Bug

The invariant that couples short tuples with attmissingval is silently broken by VACUUM FULL / CLUSTER / REPACK (and now REPACK CONCURRENTLY, recently added in commit 28d534e2ae). The rewrite path in heapam_handler.c's reform_tuple() has a fast path: if the source relation has no dropped columns requiring reformation, it simply copies the source tuple byte-for-byte into the new heap.

That fast path is wrong when the source tuple is shorter than the new tuple descriptor. The rewritten heap then contains short tuples — but finish_heap_swap() subsequently calls RelationClearMissing() on the rewritten relation, zeroing out attmissingval / atthasmissing under the assumption that any remaining short tuples must have already been reformed to full width. The short tuples persist, but the source of their trailing values has been erased. On read, those columns now materialize as NULL.

Severity

The bug is particularly insidious because it bypasses constraint enforcement:

Tom Lane's reaction — "a seriously awful bug" — reflects this. Fortunately bisect pinned it to 28d534e2ae (the REPACK CONCURRENTLY commit) on HEAD, so no released branch is affected. However, the underlying reform_tuple() fast-path logic presumably predates that commit; the REPACK CONCURRENTLY work likely exposed or broadened the path that triggers it (or added test surface that was never covered). The fix touches the shared rewrite loop, so it implicitly covers VACUUM FULL and CLUSTER as well.

The Fix

The patch forces reform_tuple() to take the slow path — actually deforming and reforming the tuple with the new descriptor, materializing attmissingval into physical attribute slots — whenever HeapTupleHeaderGetNatts(tuple) < newTupDesc->natts. This guarantees that by the time RelationClearMissing() runs, no short tuples remain in the rewritten heap that depend on catalog-side missing values.

v1 → v2 Iteration

Tom objected to v1 on stylistic grounds: Satya had crammed the short-tuple check into an existing loop condition. Tom: "I don't like putting extra checks into a loop condition like this." This is a maintainability concern — overloaded loop predicates obscure control flow and make the rewrite invariants harder to reason about. v2 pulls the check out and, as a side fix, corrects an early-exit bug in the same loop (the loop was continuing past the point where a decision could have been made). Álvaro then contributed comment improvements and added a regression test specifically for REPACK CONCURRENTLY, closing the test-coverage gap that allowed the original bug to ship in the first place.

Regression Test Surface

The added tests in fast_default.sql exercise:

Key Technical Insights

  1. The atthasmissing mechanism has a hidden invariant with table rewrite: any code path that rewrites a heap must either preserve atthasmissing or materialize missing values into every rewritten tuple before clearing it. finish_heap_swap() assumes the latter; reform_tuple()'s fast path violated that assumption. This is exactly the kind of cross-module invariant that fast-default work (2018) and table-rewrite infrastructure (much older) each understand locally but neither enforces globally.

  2. The REPACK CONCURRENTLY commit is the proximate cause but possibly not the root cause: the reform_tuple() fast path predates this change. Investigation into whether CLUSTER / VACUUM FULL were ever affected on earlier branches would be warranted; the quick bisect only establishes where it became reproducible on HEAD.

  3. Silent NOT NULL / CHECK bypass is the real severity driver: had the bug only produced NULLs in nullable columns, it would be a correctness bug. The zero-value substitution on NOT NULL columns means downstream queries, application logic, and referential integrity can all be subtly wrong with no error surfaced.

  4. Fast path optimizations in tuple-reformation code need short-tuple awareness: any future optimization that tries to skip heap_deform_tupleheap_form_tuple round-tripping during a rewrite must check t_natts against the target descriptor, not just look for dropped columns in the source.

Participant Dynamics

The cycle from report → push was approximately 24 hours, which is typical for a non-released-branch data-corruption bug where there is no back-patch coordination required.