Core Problem: Fast Default Materialization Lost During Table Rewrite
Background: Fast Defaults (ALTER TABLE ADD COLUMN ... DEFAULT)
Since PostgreSQL 11 (commit 16828d5c0273), ALTER TABLE ADD COLUMN ... DEFAULT <const> avoids rewriting the entire table for constant (or immutable) defaults. Instead of physically touching every existing tuple, the default value is stashed in pg_attribute.attmissingval and the attribute's atthasmissing flag is set. When the heap is read, tuples whose HeapTupleHeader.t_natts is less than the relation's current natts have the "missing" trailing attributes synthesized on the fly by slot_getmissingattrs() / heap_deform_tuple() logic, which consults TupleDesc.attrs[i].attmissingval.
This is why the storage format allows "short" tuples: a tuple written before the ADD COLUMN has fewer physical attributes than the current tuple descriptor, and the missing columns are virtualized from catalog metadata.
The Bug
The invariant that couples short tuples with attmissingval is silently broken by VACUUM FULL / CLUSTER / REPACK (and now REPACK CONCURRENTLY, recently added in commit 28d534e2ae). The rewrite path in heapam_handler.c's reform_tuple() has a fast path: if the source relation has no dropped columns requiring reformation, it simply copies the source tuple byte-for-byte into the new heap.
That fast path is wrong when the source tuple is shorter than the new tuple descriptor. The rewritten heap then contains short tuples — but finish_heap_swap() subsequently calls RelationClearMissing() on the rewritten relation, zeroing out attmissingval / atthasmissing under the assumption that any remaining short tuples must have already been reformed to full width. The short tuples persist, but the source of their trailing values has been erased. On read, those columns now materialize as NULL.
Severity
The bug is particularly insidious because it bypasses constraint enforcement:
- For nullable columns: rows silently become NULL despite a non-NULL default having been observed moments earlier.
- For
NOT NULLcolumns: the read path appears to substitute the type's zero value (per Satya's report), which means the NOT NULL constraint and any CHECK constraint referencing the column are silently violated without an error. This is a data-integrity class bug, not merely a visible regression.
Tom Lane's reaction — "a seriously awful bug" — reflects this. Fortunately bisect pinned it to 28d534e2ae (the REPACK CONCURRENTLY commit) on HEAD, so no released branch is affected. However, the underlying reform_tuple() fast-path logic presumably predates that commit; the REPACK CONCURRENTLY work likely exposed or broadened the path that triggers it (or added test surface that was never covered). The fix touches the shared rewrite loop, so it implicitly covers VACUUM FULL and CLUSTER as well.
The Fix
The patch forces reform_tuple() to take the slow path — actually deforming and reforming the tuple with the new descriptor, materializing attmissingval into physical attribute slots — whenever HeapTupleHeaderGetNatts(tuple) < newTupDesc->natts. This guarantees that by the time RelationClearMissing() runs, no short tuples remain in the rewritten heap that depend on catalog-side missing values.
v1 → v2 Iteration
Tom objected to v1 on stylistic grounds: Satya had crammed the short-tuple check into an existing loop condition. Tom: "I don't like putting extra checks into a loop condition like this." This is a maintainability concern — overloaded loop predicates obscure control flow and make the rewrite invariants harder to reason about. v2 pulls the check out and, as a side fix, corrects an early-exit bug in the same loop (the loop was continuing past the point where a decision could have been made). Álvaro then contributed comment improvements and added a regression test specifically for REPACK CONCURRENTLY, closing the test-coverage gap that allowed the original bug to ship in the first place.
Regression Test Surface
The added tests in fast_default.sql exercise:
VACUUM FULLon a fast-default columnCLUSTERon sameREPACKandREPACK CONCURRENTLY- A
NOT NULLcolumn with aCHECKconstraint, which acts as a tripwire: if any future refactor reintroduces the missingval loss, the constraint violation will surface immediately rather than as silent data corruption.
Key Technical Insights
-
The
atthasmissingmechanism has a hidden invariant with table rewrite: any code path that rewrites a heap must either preserveatthasmissingor materialize missing values into every rewritten tuple before clearing it.finish_heap_swap()assumes the latter;reform_tuple()'s fast path violated that assumption. This is exactly the kind of cross-module invariant that fast-default work (2018) and table-rewrite infrastructure (much older) each understand locally but neither enforces globally. -
The REPACK CONCURRENTLY commit is the proximate cause but possibly not the root cause: the
reform_tuple()fast path predates this change. Investigation into whether CLUSTER / VACUUM FULL were ever affected on earlier branches would be warranted; the quick bisect only establishes where it became reproducible on HEAD. -
Silent NOT NULL / CHECK bypass is the real severity driver: had the bug only produced NULLs in nullable columns, it would be a correctness bug. The zero-value substitution on NOT NULL columns means downstream queries, application logic, and referential integrity can all be subtly wrong with no error surfaced.
-
Fast path optimizations in tuple-reformation code need short-tuple awareness: any future optimization that tries to skip
heap_deform_tuple→heap_form_tupleround-tripping during a rewrite must checkt_nattsagainst the target descriptor, not just look for dropped columns in the source.
Participant Dynamics
- Satya (reporter/author): Did the diagnostic work, including root-causing to
reform_tuple()and identifying the interaction withRelationClearMissing(). Produced v1 and v2 patches. - Tom Lane (committer, reviewer): Immediately ran bisect, pinned the commit, and gave a concise style objection that shaped v2. His weight here is both as a senior committer and as an authority on data-corruption-class bugs.
- Álvaro Herrera (committer, author of the offending commit): Took ownership of the fix — appropriate since he authored 28d534e2ae. Added additional comments and REPACK CONCURRENTLY test coverage, and pushed the final patch. His handling (acknowledging the issue quickly, adding test coverage beyond the minimum fix) is consistent with committer responsibility for regressions in their own work.
The cycle from report → push was approximately 24 hours, which is typical for a non-released-branch data-corruption bug where there is no back-patch coordination required.