Report oldest xmin source when autovacuum cannot remove tuples

First seen: 2025-10-31 06:31:40+00:00 · Messages: 22 · Participants: 8

Latest Update

2026-05-25 · claude-opus-4-6

Technical Analysis: Report Oldest Xmin Source When Autovacuum Cannot Remove Tuples

Core Problem

PostgreSQL's VACUUM process cannot remove dead tuples that are still visible to any active transaction horizon (the "oldest xmin"). When VACUUM encounters dead tuples it cannot reclaim, the log output indicates the oldest xmin value but provides no diagnostic information about WHY that xmin is being held back. This is a critical operational gap because:

  1. Table bloat is one of PostgreSQL's most common production issues, and identifying the root cause requires correlating multiple volatile system views (pg_stat_activity, pg_prepared_xacts, pg_replication_slots, pg_stat_replication) at the exact moment the VACUUM ran.
  2. Retroactive diagnosis is nearly impossible — by the time an operator notices bloat from logs, the blocking transaction may have already completed, leaving no trace of the original cause.
  3. Multiple distinct sources can hold back the xmin horizon: active transactions, idle-in-transaction sessions, prepared transactions (2PC), replication slots, and hot standby feedback from replicas.

Architectural Context

The xmin horizon computation lives in ComputeXidHorizons() within procarray.c. This function walks the entire ProcArray under shared lock, computing various horizon values (shared, catalog, data, temp) that determine which tuples are safe to remove. The computed horizons are consumed by the visibility map logic (GlobalVisUpdate) and ultimately by VACUUM to determine its "removable cutoff" XID.

The key architectural insight is that ComputeXidHorizons() already examines every backend's xid and xmin to compute the minimum, but discards the identity of the backend that set that minimum. The feature request is to retain and report this attribution.

Evolution of the Technical Approach

v1: Inline in ComputeXidHorizons() (Original Proposal)

The initial patch modified ComputeXidHorizons() directly to track an OldestXminInfo struct alongside each horizon computation. This had the advantage of zero additional overhead (the information was gathered during the existing ProcArray scan) but had a critical correctness flaw identified by Sami Imseih:

The PID attribution was incorrect for write transactions. When multiple backends share the same backend_xmin (because they all started after the blocking transaction), the loop picks the first PID it encounters with the cutoff XID value. However, the actual blocker might be a different backend whose xid (not xmin) equals the horizon. The demonstration showed VACUUM reporting PID 267064 when the actual blocker was PID 267090 (an idle-in-transaction session that had consumed the oldest XID via txid_current()).

v2: Separate ProcArray Scan (Adopted Approach)

The accepted approach introduces two new functions:

The priority/selection logic is crucial: xid-match always wins over xmin-match. A backend whose xid equals the horizon is the actual root cause (it allocated that transaction ID). Backends whose xmin equals the horizon are merely "victims" — they started transactions that reference the same snapshot but didn't create the oldest XID. This distinction eliminates the false attribution problem from v1.

Blocker Types (XidHorizonBlockerType enum)

ACTIVE_TRANSACTION        — currently executing
IDLE_IN_TRANSACTION       — session with open transaction, not executing
PREPARED_TRANSACTION      — 2PC prepared but not committed/aborted
REPLICATION_SLOT          — logical or physical slot holding back xmin
HOT_STANDBY_FEEDBACK      — standby's xmin feedback via walsender

The priority ordering within the xid-match group is effectively moot (as explained by the author): a given xid is owned by exactly one backend, so there's never ambiguity within that group. The ordering matters only for the xmin-match fallback group, where multiple backends may share the same xmin.

Key Design Decisions and Tradeoffs

1. Single Blocker vs. All Blockers in Log Output

The thread debated whether to report all blockers or just one. The decision was to report only the root cause (highest-priority single blocker) in VACUUM logs because:

2. Additional ProcArray Scan Overhead

The adopted approach requires a second ProcArray scan beyond ComputeXidHorizons(). This is acceptable because:

A proposed optimization (not implemented) was to only perform the scan when the cutoff xmin hasn't advanced between consecutive vacuums, which would require tracking the last cutoff value in relation-level statistics. This was deferred.

3. Time Lag Between Horizon Computation and Blocker Identification

A theoretical concern raised by the original author: the blocker scan happens after ComputeXidHorizons(), so the blocking backend could have committed in between. In practice this is acceptable because:

4. Active vs. Idle-in-Transaction Distinction

The thread agreed to distinguish between active and idle-in-transaction states, as idle-in-transaction is a much more actionable finding (the session is doing nothing but holding back the horizon, and can potentially be terminated). This was added in v2.

Integration Points

Output Format

oldest xmin source: active transaction (pid=12345)
oldest xmin source: idle in transaction (pid=12346)
oldest xmin source: prepared transaction
oldest xmin source: replication slot
oldest xmin source: hot standby feedback (pid=12347)

PID is included for transaction-based blockers (where a backend exists) but omitted for replication slots and prepared transactions (pid=0).