Add autovacuum_warning to surface concurrent vacuum collisions

First seen: 2026-06-01 12:33:33+00:00 · Messages: 1 · Participants: 1

Latest Update

2026-06-04 · claude-opus-4-6

Technical Analysis: Add autovacuum_warning to Surface Concurrent Vacuum Collisions

Core Problem

PostgreSQL's autovacuum system has an observability gap between "everything is fine" and "the system is in crisis." Today, DBAs are alerted only after the situation has already deteriorated:

  1. Anti-wraparound vacuum warnings: emitted only once transaction age has gone well past autovacuum_freeze_max_age (the point at which forced anti-wraparound vacuums begin) and is approaching the wraparound limit, meaning the system is already dangerously close to a transaction ID wraparound shutdown.
  2. Severe table bloat: observable only after the fact through pg_stat_user_tables or external monitoring, by which point significant storage waste has accumulated.

The fundamental issue is that autovacuum worker contention, where multiple workers attempt to process the same table, is an early indicator of autovacuum saturation, yet this signal is completely invisible to operators.

Proposed Solution

The patch introduces a new boolean GUC autovacuum_warning (default: off) that, when enabled, causes autovacuum workers to emit a LOG-level message whenever they skip a table because another worker is already vacuuming it.

Architectural Placement

The check occurs in the autovacuum worker's table selection logic. In src/backend/postmaster/autovacuum.c, when a worker iterates through its candidate table list, it checks whether another worker is already processing that relation and, if so, skips it. Normally this skip is silent; with the patch, it optionally logs the collision.
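
The skip path described above can be modeled in a few lines. This is a simplified Python model for illustration, not the C code in autovacuum.c: the function name pick_tables, the claimed_by_others set, and the message wording are all hypothetical, and the autovacuum_warning argument merely stands in for the proposed GUC.

```python
import logging

logger = logging.getLogger("autovacuum_model")


def pick_tables(candidates, claimed_by_others, autovacuum_warning=False):
    """Split a worker's candidate list into tables to process and tables skipped.

    candidates: ordered list of table names this worker considers.
    claimed_by_others: set of table names other workers are processing.
    autovacuum_warning: stand-in for the proposed GUC (default off).
    """
    to_process = []
    skipped = []
    for table in candidates:
        if table in claimed_by_others:
            skipped.append(table)
            # Without the patch (or with the GUC off) the skip is silent.
            if autovacuum_warning:
                # Hypothetical wording; server LOG level is approximated by INFO.
                logger.info(
                    'skipping vacuum of "%s": another worker is processing it',
                    table,
                )
            continue
        to_process.append(table)
    return to_process, skipped
```

In this model the set of tables processed is identical whether the flag is on or off; only the logging side effect differs, which matches the proposal's intent of adding visibility without changing scheduling.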

Design Decisions and Tradeoffs

1. Off by default: This mirrors checkpoint_warning's philosophy. No log noise for users who don't need it, but trivial to enable for those who do. This is a conservative choice that avoids any controversy about default log verbosity.

2. LOG level (not WARNING): Appropriate because a single collision is informational, not necessarily problematic. Persistent collisions indicate a problem, but a single event does not. LOG level also means it won't appear on the client connection, only in server logs.

3. Generic HINT text: The author explicitly chose not to point at specific parameters (e.g., autovacuum_max_workers, autovacuum_vacuum_cost_delay, table-level settings). This is defensible because collisions can stem from multiple root causes, so no single parameter is the right fix in every case.

4. Boolean GUC rather than threshold-based: Unlike checkpoint_warning, which uses a time threshold (warn if checkpoints occur less than N seconds apart), this is a simple on/off toggle. A potential enhancement would be a cooldown period to avoid log flooding (e.g., don't warn about the same table more than once per N minutes).
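
A per-table cooldown of the kind suggested in point 4 could look roughly like the following. This is a minimal Python sketch of the idea only; the class name CollisionLogLimiter and its parameters are hypothetical and do not appear in the patch, which as proposed has no rate limiting.

```python
import time


class CollisionLogLimiter:
    """Allow at most one collision message per table per cooldown window."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock  # injectable so the behavior can be tested
        self._last_logged = {}  # table name -> time of last emitted message

    def should_log(self, table):
        now = self.clock()
        last = self._last_logged.get(table)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window: suppress
        self._last_logged[table] = now
        return True
```

The cooldown is tracked per table, so a flood of collisions on one hot table does not hide a first collision on another. A real implementation would also need to decide where this state lives, since each autovacuum worker is a separate process.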

Potential Concerns Not Yet Raised

Since this is the initial proposal with no responses yet, several likely review concerns can be anticipated:

  1. Log flooding: In a heavily loaded system with many collisions, this could generate enormous log volume. A rate-limiting mechanism (per-table cooldown, or aggregate reporting) might be needed.

  2. Is a GUC the right interface? Some might argue this should always log (at DEBUG level) or use the existing log_autovacuum_min_duration infrastructure rather than adding a new GUC.

  3. Naming: autovacuum_warning is somewhat vague — it could be confused with other autovacuum warning conditions. A more specific name like log_autovacuum_worker_collision or autovacuum_collision_warning might be clearer.

  4. Interaction with log_autovacuum_min_duration: The relationship between this new GUC and the existing autovacuum logging infrastructure should be clarified.

  5. Statistical approach: Rather than logging each collision, an alternative design could track collision counts in pg_stat_* views, providing a queryable metric rather than log-based alerting.
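
To make the counter-based alternative in point 5 concrete, here is a small Python model of per-table collision counting. It only illustrates the shape of a queryable metric; the CollisionStats class is hypothetical, and nothing here corresponds to an existing pg_stat_* view or column.

```python
from collections import Counter


class CollisionStats:
    """Accumulate collision counts per table instead of logging each event."""

    def __init__(self):
        self._counts = Counter()

    def record(self, table):
        # Called once per skipped table; cost is constant per collision.
        self._counts[table] += 1

    def top(self, n=5):
        # Tables with the most collisions first, as (table, count) pairs.
        return self._counts.most_common(n)

    def total(self):
        return sum(self._counts.values())
```

Compared with per-event logging, a counter has bounded output regardless of collision rate, and a monitoring system can sample it periodically and alert on the rate of change.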

Comparison to checkpoint_warning

The analogy to checkpoint_warning is apt but imperfect:

Aspect          | checkpoint_warning              | autovacuum_warning (proposed)
----------------+---------------------------------+----------------------------------------
Signal          | Checkpoints too frequent        | Worker collisions
Type            | Time threshold (seconds)        | Boolean toggle
Default         | 30s (effectively on)            | off
Actionability   | Clear: raise max_wal_size       | Ambiguous: multiple parameters
Frequency risk  | Low (one per checkpoint cycle)  | High (potentially per-table per-cycle)

The key difference is that checkpoint_warning has natural rate limiting (at most one warning per checkpoint) while vacuum collisions could be very frequent in a saturated system — precisely when you least want additional log I/O overhead.
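
The time-threshold style used by checkpoint_warning can be illustrated with a short Python sketch. It is a simplified model, not PostgreSQL's implementation: the function name too_frequent is hypothetical, and the model ignores details such as which kinds of checkpoint are eligible for the warning.

```python
def too_frequent(event_times, warning_seconds):
    """Return the event times that arrived less than warning_seconds
    after the previous event.

    event_times: timestamps in seconds, in ascending order.
    warning_seconds: threshold; 0 disables the check, since no gap is < 0.
    """
    flagged = []
    for prev, cur in zip(event_times, event_times[1:]):
        if cur - prev < warning_seconds:
            flagged.append(cur)
    return flagged
```

Each event can be flagged at most once, which is the natural rate limit referred to above. A boolean collision toggle has no comparable bound, because the number of messages scales with the number of skipped tables rather than with the number of cycles.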

Assessment

This is a small, focused observability improvement that fills a real gap in autovacuum monitoring. The implementation appears straightforward (likely a few lines in autovacuum.c's worker loop). The main design questions will likely center around rate limiting and whether a GUC is the best mechanism versus pg_stat_* exposure.