Add heap and index vacuum timings to pg_stat_progress_vacuum

First seen: 2026-05-04 05:41:32+00:00 · Messages: 1 · Participants: 1

Latest Update

2026-05-06 · opus 4.7

Adding Heap and Index Vacuum Timings to pg_stat_progress_vacuum

The Observability Gap

pg_stat_progress_vacuum has steadily grown in coverage since its introduction, but it remains a fundamentally structural view: it tells you what phase VACUUM is in and how much work remains (blocks scanned, dead-tuple bytes, index_vacuum_count), yet it says nothing about where time is actually being spent. For long-running vacuums — the exact case where operators most urgently need to reason about progress — this is a glaring hole.
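For reference, this is roughly what the view offers an operator today (column set as of PG17/PG18; exact columns vary by major version): a phase label and counters, but no notion of where the elapsed time went.

  SELECT pid,
         relid::regclass       AS relation,
         phase,
         heap_blks_total,
         heap_blks_scanned,
         heap_blks_vacuumed,
         index_vacuum_count,
         dead_tuple_bytes,
         max_dead_tuple_bytes
  FROM pg_stat_progress_vacuum;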

The concrete pathology Bharath highlights is the classic interaction between:

  1. Bounded dead-TID storage (maintenance_work_mem, now backing the TID store / TidStore radix-tree introduced in PG17), and
  2. Indexes that must be fully scanned once per overflow cycle.

When the TID store fills, VACUUM must pause heap scanning, perform a complete ambulkdelete pass over every index, then resume. With N indexes and K overflow cycles, total index work is O(N·K·index_size). The user's reproducer makes this brutally obvious: maintenance_work_mem=256kB on a 3.9 GB table with 4 indexes produced 13,188 overflow cycles × 4 indexes = 52,752 full index scans, 34 minutes in, and only 13% done. Heap vacuum time: 2.1s. Index vacuum time: 2006s. Three orders of magnitude.
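A hypothetical reproducer in that spirit, offered only as a sketch: the table and index names are invented here, the row count is approximate, and the thread's actual script is not reproduced.

  -- A few-GB table with four indexes and a starved TID store.
  CREATE TABLE bloat_test (id bigint, a int, b int, c int, d int);
  INSERT INTO bloat_test
      SELECT g, g, g, g, g FROM generate_series(1, 50000000) g;
  CREATE INDEX ON bloat_test (a);
  CREATE INDEX ON bloat_test (b);
  CREATE INDEX ON bloat_test (c);
  CREATE INDEX ON bloat_test (d);
  DELETE FROM bloat_test;                    -- every TID becomes dead
  SET maintenance_work_mem = '256kB';        -- tiny TID store, as in the thread
  VACUUM (VERBOSE) bloat_test;               -- index_vacuum_count climbs into the thousands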

Today, an operator confronting this from pg_stat_progress_vacuum alone can see index_vacuum_count climbing, but can neither quantify the cost per cycle nor project completion. They must fall back on external tooling (perf, manual sampling of phase transitions; pg_stat_statements is no help here) to work out that raising maintenance_work_mem is the correct lever.

What the Patch Proposes

Three new cumulative-timing columns are added to pg_stat_progress_vacuum: heap_vacuum_time, index_vacuum_time, and index_cleanup_time.

Critically, these are accumulators, not per-phase gauges that reset. That is the right design for this use case: the signal operators need is "how much total wall-clock has index work cost me so far," which is directly comparable to heap scan/vacuum time and lets you compute a straightforward ETA via linear extrapolation (as the companion estimate_vacuum_completion() SQL function demonstrates: pct_done derived from heap_blks_scanned / heap_blks_total, then scaled).
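The estimate_vacuum_completion() helper itself isn't quoted in this summary, but the linear extrapolation attributed to it looks roughly like the following sketch, which borrows the vacuum's start time from pg_stat_activity and guards the cold-start division with nullif:

  SELECT v.pid,
         v.relid::regclass                                    AS relation,
         round(100.0 * v.heap_blks_scanned
                     / nullif(v.heap_blks_total, 0), 1)       AS pct_done,
         clock_timestamp() - a.xact_start                     AS elapsed,
         (clock_timestamp() - a.xact_start)
             * ((v.heap_blks_total - v.heap_blks_scanned)::float8
                / nullif(v.heap_blks_scanned, 0))             AS est_remaining
  FROM pg_stat_progress_vacuum v
  JOIN pg_stat_activity a USING (pid);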

Accumulation semantics

The accumulation across overflow cycles is the patch's most important semantic choice. If it reset per cycle, the view would be nearly useless — by the time you SELECT it, you'd probably catch a cycle mid-flight and see a meaningless partial number. Cumulative timings, in contrast, compose with heap_blks_scanned to give a rate (seconds of index work per heap block scanned) that is stable and projectable.
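Assuming the proposed column names land as described (and that, like delay_time, they are reported in milliseconds, which is an assumption rather than something stated in the thread), that rate is a one-line expression:

  SELECT pid,
         index_vacuum_time::float8 / nullif(heap_blks_scanned, 0)   AS index_ms_per_heap_blk,
         (heap_blks_total - heap_blks_scanned)
             * index_vacuum_time::float8
             / nullif(heap_blks_scanned, 0)                         AS projected_index_ms_remaining
  FROM pg_stat_progress_vacuum;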

The GUC Question

Bharath floats track_vacuum_timing as a potential opt-in, modeled on track_io_timing / track_wal_io_timing. His own preference is to avoid it. The tradeoff is the familiar one: either you frame the timing calls as per-operation instrumentation in a hot path (the track_io_timing situation, where a clock read per I/O can hurt), or as a handful of clock reads per vacuum, taken only at phase boundaries.

The second framing strongly supports the no-GUC path. The precedent is closer to the unconditional timing already collected in pgstat_progress_* transitions than to track_io_timing's per-read overhead. Adding a GUC here is probably over-engineering, and committers have increasingly pushed back on new tracking GUCs that fragment observability.

Architectural Implications

A few deeper points worth drawing out:

  1. This makes the cost of overflow cycles a first-class observable quantity. Right now, the guidance "raise maintenance_work_mem if index_vacuum_count > 1" is folklore. With index_vacuum_time exposed, you can quantify the savings: if index work is 99% of vacuum time and there are 13k cycles, collapsing to 1 cycle is roughly a 13,000× win on the index portion. This is particularly valuable on PG17+ where the TID store's compressed representation makes "how much memory do I actually need?" non-obvious.

  2. Interaction with parallel index vacuum. VACUUM can parallelize index processing across workers (max_parallel_maintenance_workers, PARALLEL option). The patch presumably measures elapsed wall-clock in the leader, which is the right thing for operator-facing ETA — but the sum heap_vacuum_time + index_vacuum_time + index_cleanup_time will not equal CPU time, and this should be documented explicitly to head off the kind of confusion that has long surrounded EXPLAIN (ANALYZE) loop counts and parallel workers.

  3. Autovacuum coverage. The view already covers autovacuum (started_by = 'autovacuum'). These timings will surface autovacuum pathologies that today are invisible without log_autovacuum_min_duration spelunking. That's a significant operational win, particularly for fleets.

  4. Relationship to delay_time. The existing delay_time column (cost-based delay accumulator) is conceptually a sibling of these new columns — they're all "cumulative time spent in category X." The three new columns cleanly partition the non-delay, non-scan time of vacuum, which gives a coherent taxonomy: scan time (derivable from phase transitions), heap vacuum time, index vacuum time, index cleanup time, delay time. The only remaining unaccounted bucket would be the initial heap scan itself (heap_scan_time?), which might be a natural follow-up.
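To make the taxonomy concrete, a query sketch under the same assumptions (proposed column names, delay_time-style millisecond units); the per-cycle figure is what turns the folklore in item 1 into a number:

  SELECT pid,
         relid::regclass                                             AS relation,
         delay_time,
         heap_vacuum_time,
         index_vacuum_time,
         index_cleanup_time,
         index_vacuum_time::float8 / nullif(index_vacuum_count, 0)   AS index_ms_per_cycle
  FROM pg_stat_progress_vacuum;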

What's Missing / Likely Review Pushback

Verdict on the Proposal

The change is small, well-motivated, and fills a real diagnostic gap that every operator running VACUUM on large indexed tables has hit. The design choice to accumulate across overflow cycles is correct. The GUC question should almost certainly resolve to "no GUC" given that timing calls happen at phase boundaries, not in hot loops. The most likely review trajectory is: expand to cover heap scan and truncate phases for symmetry, sharpen documentation around parallel workers and cumulative semantics, then commit.