Core Problem: Error-Raising Expressions in Extended Statistics Cause Autovacuum Livelock
The Architectural Issue
PostgreSQL 19 extended ANALYZE to compute statistics on virtual generated columns (columns whose values are computed on-the-fly from an expression rather than stored). When combined with extended statistics (CREATE STATISTICS), this creates a new failure mode: because extended stats computation must materialize the generated column's value for every sampled row, any row whose expression evaluation raises a runtime error (e.g., division by zero, domain violation, type cast failure) will abort the entire ANALYZE command.
The autovacuum launcher, operating on a coarse schedule, will then re-trigger ANALYZE on the table indefinitely, because the staleness counters (modifications accumulated since the last successful ANALYZE) keep flagging it as needing attention. Each retry fails the same way, producing log churn and wasted CPU/IO.
Critically, the OP (Satya) identifies an important asymmetry:
- Per-column ANALYZE on the generated column succeeds — the single-column stats path evidently tolerates, or avoids, evaluating the expression uniformly on every sample row.
- Plain ANALYZE t without extended stats succeeds.
- Adding CREATE STATISTICS t_stat ON a, gen FROM t causes a hard failure.
This isolates the bug to the extended-statistics build path, specifically the sample-materialization step that evaluates the virtual column expression on each sampled tuple.
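Sketched as SQL, the reported asymmetry would look roughly like this. The table, column names, and data are illustrative reconstructions, not copied from the thread, and the block assumes the VIRTUAL generated-column syntax introduced in recent PostgreSQL releases:

```sql
-- Hypothetical repro: a virtual generated column whose expression
-- errors for some rows (division by zero when b = 0).
CREATE TABLE t (
    a   int,
    b   int,
    gen int GENERATED ALWAYS AS (a / b) VIRTUAL
);
INSERT INTO t VALUES (1, 1), (2, 0);   -- second row poisons the sample

ANALYZE t (gen);   -- succeeds: single-column code path
ANALYZE t;         -- succeeds: no extended stats defined yet

CREATE STATISTICS t_stat ON a, gen FROM t;
ANALYZE t;         -- fails: the extended-stats build evaluates a/b
                   -- for every sampled row and hits division by zero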
Why This Matters Beyond Generated Columns
Dean Rasheed (committer, optimizer/stats domain expert) correctly notes this is not novel — the same livelock has been theoretically possible since PG14 introduced expression statistics (CREATE STATISTICS ... ON (expr) FROM t). Any expression that raises an error on some subset of rows produces the same pathology. Virtual generated columns merely make it easier to stumble into because users may not realize CREATE STATISTICS over a generated column implicitly re-evaluates the generator expression.
This reframes the fix: it is not a generated-columns patch but rather a robustness improvement for the entire extended-statistics subsystem.
Proposed Solutions and Tradeoffs
Two options were floated:
- Skip the offending row from the sample. This would preserve partial statistics but risks biased histograms/MCVs if errors correlate with the data distribution (very likely — the error is a property of the data).
- Skip the offending statistics object with a WARNING. Coarser: the user loses that stats object entirely until they fix the expression, but the table's other stats (per-column, and other non-failing extended-stats objects) remain usable and ANALYZE completes.
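Under option (2), the intended user-visible contract is roughly the following — the WARNING wording here is a guess, not the patch's actual message text:

```sql
ANALYZE t;
-- WARNING:  skipping statistics object "t_stat": expression
--           evaluation failed   (hypothetical wording)
--
-- ANALYZE still completes: per-column stats and any non-failing
-- extended-stats objects on t are rebuilt normally.
```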
Satya's draft patch implements option (2). Yugo Nagata endorses this as the right user-contract: "treat this as the user's responsibility to notice the warning and address the underlying issue." Dean Rasheed pushes back mildly on urgency, pointing out:
- The autovacuum retry interval is 1 minute by default — not a "flood."
- If the failure rate depends on sampling (e.g., only 1 row in millions causes division-by-zero), retries might eventually succeed by chance.
Dean's skepticism is worth weighing: option (1)'s appeal is precisely that sampling is probabilistic, so skipping bad rows degrades gracefully. But option (2)'s determinism — same input always produces the same outcome — is easier to reason about operationally.
Patch Implementation Details
v1 Patch Structure
The v1 patch wraps extended-stats computation in BuildRelationExtStatistics() in a PG_TRY/PG_CATCH block, using a child ResourceOwner to contain resources allocated during the failed build so they can be cleanly released without corrupting the outer transaction state. On catch, it emits a WARNING and continues to the next statistics object.
Yugo's Review: Push TRY/CATCH Deeper
Yugo's review (2026-04-28) argues the child ResourceOwner is unnecessary overhead and suggests pushing the PG_TRY blocks down into the two functions that actually invoke expression evaluation:
- make_build_data() — materializes the sample tuples into the build format, calling ExecEvalExpr() on each expression attribute.
- compute_expr_stats() — the per-expression statistics computation path.
His suggested shape:
```c
PG_TRY();
{
    datum = ExecEvalExpr(exprstate, ..., &isnull);
}
PG_CATCH();
{
    /* Release executor resources before propagating the error */
    ExecDropSingleTupleTableSlot(slot);
    FreeExecutorState(estate);
    PG_RE_THROW();
}
PG_END_TRY();
```
Note: this inner snippet re-throws — it's only a resource-cleanup shim. The actual WARN-and-skip decision is made at a higher level. This layering is cleaner than a ResourceOwner subtree because:
- It narrows the PG_TRY scope to exactly the risky call.
- It avoids the bookkeeping cost of creating and destroying a child ResourceOwner per stats object on tables with many objects.
- It keeps resource ownership conventional — the executor state is freed explicitly by the code that created it.
Test Coverage
Yugo also pushes for broader test coverage that decouples the fix from the triggering scenario (generated columns), exercising:
- Extended stats defined directly on error-raising expressions (e.g., ((a/0))).
- Mixed stats objects where some fail and some succeed — verifying that the non-failing objects still get their statistics built (visible via pg_stats_ext) after ANALYZE.
This is important regression coverage: it enforces the invariant that one bad stats object does not prevent computation of its siblings. The SELECT statistics_name FROM pg_stats_ext ... ORDER BY ROW(x.*) check validates exactly which objects succeeded.
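A regression check of that shape might be sketched as follows; the patch's actual test query (the one using ORDER BY ROW(x.*)) may differ in detail:

```sql
-- With one failing and one succeeding stats object defined on t,
-- only the succeeding object should show built statistics here.
SELECT statistics_name
FROM pg_stats_ext
WHERE tablename = 't'
ORDER BY statistics_name;
```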
Design Tension: Error Containment Granularity
The deeper architectural question is at what granularity should stats computation be transactional? The pre-patch behavior treated all of a table's extended stats as one atomic unit — an all-or-nothing batch. The patch moves the boundary to per-stats-object. One could argue for even finer granularity (per-expression within an object, or per-row skipping per option (1)), but each step finer:
- Increases the surface area of PG_TRY/PG_CATCH (which has non-trivial cost and constrains what can be done inside the block).
- Moves further from user-meaningful units — a user can DROP STATISTICS on a failing object, but cannot easily tell the system "skip rows where expression X fails."
Per-stats-object skipping is the sweet spot: it aligns with a DDL-level unit the user controls.
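That DDL-level alignment is what makes the granularity defensible: once the WARNING names the failing object, remediation is a single statement (object name illustrative):

```sql
-- Drop the failing object, or recreate it with a corrected expression.
DROP STATISTICS t_stat;
```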
Secondary Concern: Why Didn't Single-Column ANALYZE Fail?
Satya's repro shows ANALYZE t(gen) succeeds on the same data where extended-stats ANALYZE t fails. This suggests the per-column code path either:
- Catches expression errors and treats the row as NULL, or
- Evaluates expressions lazily / only on rows that contribute to the stats,
while the extended-stats path materializes the full virtual column eagerly across the whole sample. This inconsistency is itself worth understanding — the thread does not dig into it, but aligning the two behaviors (both either tolerant or both strict) would be a more principled fix than papering over only the extended-stats path. The committed approach effectively brings extended-stats to the same level of robustness as single-column stats, but via WARNING-and-skip rather than row-level tolerance.
Assessment
The patch is a pragmatic, well-scoped robustness fix. Dean's observation that the underlying issue predates PG19 means this could be backported to PG14+, though the thread does not explicitly discuss that. The v2 patch (2026-05-03) incorporates Yugo's structural feedback (moving PG_TRY deeper, removing the child ResourceOwner) and expands test coverage as requested. No committer has yet signed off; Dean's earlier lukewarm reaction ("I'm not sure") suggests this may need further advocacy to land, possibly with a rationale for why option (2) is preferable to option (1) or to the status quo.