Report oldest xmin source when autovacuum cannot remove tuples

First seen: 2025-10-31 06:31:40+00:00 · Messages: 33 · Participants: 11

Latest Update

2026-06-04 · claude-opus-4-6

Incremental Update: Report Oldest Xmin Source — Debate Shifts to Log vs. SQL View

Summary

A significant architectural debate has emerged about whether the blocker information belongs in VACUUM logs at all, or should instead be exposed via a SQL-visible view. This represents a potential challenge to the patch's fundamental premise, though the thread ultimately converges on "both are needed." A committer (Jacob Champion) has weighed in for the first time, lending support to the logging approach while also revealing he'd independently been working on a similar solution.

Key Architectural Debate: Log Output vs. SQL View

horikyota.ntt raised a substantive objection to the patch's core approach:

Temporal mismatch: A DBA investigating bloat wants to know what's blocking now, not what blocked a VACUUM at some earlier point. A live view is more useful for that.
Epistemological concern: VACUUM output has traditionally reported facts observed during the operation. The blocker information is "reconstructed afterwards" and "explicitly best-effort," making it categorically different from other VACUUM log content.
The existing signal is sufficient: VACUUM VERBOSE already reports the removable cutoff and its age — that's enough to tell you the horizon is held back. The resolution step should use a live view.

Sami Imseih endorsed this position, referencing his earlier comment that SQL exposure is better for proactive monitoring.

Shinya Kato's Counter-Argument (Patch Author)

Kato defended the logging approach on durability grounds: a view shows only the current state, whereas logs are the only lasting record for post-hoc analysis. Once a blocker commits, there is no way to determine retroactively what held back a past VACUUM.

However, Kato also acknowledged the epistemological concern and proposed a potential compromise: capture the exact blocker at the moment OldestXmin is computed (inline during ComputeXidHorizons()) rather than reconstructing it afterwards. This would make the information a fact observed during the operation rather than a best-effort guess.

This is architecturally significant because it represents a potential return to a v1-style inline approach, but presumably with the correctness fixes (xid-vs-xmin priority) that motivated the v2 separate-scan design. The tension between these approaches remains unresolved.

Committer Input: Jacob Champion

Jacob Champion disclosed he'd independently been working on a similar patch and supports the logging approach:

Use case: During high-stress support escalations, logs are essential because the blocking state may be transient
Design preference: He also chose to track the origin during horizon calculation (inline approach) rather than reconstructing it afterwards
Acknowledged problem: He hit the same "collision issue" (xid-vs-xmin ambiguity) that led Kato to the v2 separate-scan design, and hadn't resolved it before setting the patch down
Practical stance: "There's nothing wrong with getting a good solution in and working on a better one, as long as the first patch doesn't make the second one harder"

This is notable because a committer expressing preference for the inline approach may influence the final architecture.

Scott Ray's View Proposal

Scott Ray revealed he's been working on a companion view that shows:

Horizon contribution for each backend, prepared transaction, replication slot, and hot_standby_feedback walsender
Breakdown by class (shared, catalog, data, temp)
Delta analysis: for each contributor, how the horizon would shift if that holder were removed
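
The delta analysis in the last item can be illustrated with a small sketch. This is not Scott Ray's implementation, which the thread does not show; the HorizonHolder struct and both function names are hypothetical, and XIDs are treated as plain 64-bit integers, whereas PostgreSQL compares 32-bit XIDs modulo 2^32. The idea is simply to recompute the minimum with one holder left out and compare it to the current minimum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: one entity that pins the xmin horizon. */
typedef struct HorizonHolder
{
    const char *label;   /* illustrative description, e.g. "backend" */
    uint64_t    xmin;    /* oldest XID this holder still needs */
} HorizonHolder;

/*
 * Horizon over all holders except index `skip`.  Pass skip == n to include
 * every holder.  When no holder remains, the horizon is `next_xid`, i.e.
 * everything older than the next XID to be assigned would be removable.
 */
uint64_t
horizon_without(const HorizonHolder *h, size_t n, size_t skip,
                uint64_t next_xid)
{
    uint64_t result = next_xid;

    for (size_t i = 0; i < n; i++)
    {
        if (i == skip)
            continue;
        if (h[i].xmin < result)
            result = h[i].xmin;
    }
    return result;
}

/* Number of XIDs by which the horizon would advance if holder idx vanished. */
uint64_t
horizon_delta(const HorizonHolder *h, size_t n, size_t idx,
              uint64_t next_xid)
{
    assert(idx < n);

    uint64_t current = horizon_without(h, n, n, next_xid);
    uint64_t without = horizon_without(h, n, idx, next_xid);

    return without - current;
}
```

With holders at 100, 250 and 180, removing the first one moves the horizon from 100 to 180 (a delta of 80), while removing either of the others changes nothing. A delta of zero therefore marks a holder that is not currently the binding constraint.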

He proposed sharing a single underlying function between the log output and the view. Kato expressed mild preference to keep both discussions in this thread to maintain alignment on the shared function design.

Emerging Consensus

The thread converges on: both log output and a SQL view are needed, and they serve different purposes (post-hoc vs. real-time). Multiple participants (+1s from qiuwenhuifx, japinli) support having both. The open question is whether the log blocker identification should be inline (captured during horizon computation) or separate (reconstructed afterwards).

History (2 prior analyses)

2026-06-01 · claude-opus-4-6

Incremental Update: Report Oldest Xmin Source When Autovacuum Cannot Remove Tuples

Summary

Limited new technical substance since the last analysis. The thread saw one bug identification by the patch author and one architectural counter-proposal from a new participant, which was declined.

New Issue: Misleading Report for Physical Slot + hot_standby_feedback

Shinya Kato identified a correctness bug in the current patch's blocker classification. When hot_standby_feedback=on is used with a physical replication slot, the standby's xmin is recorded on the replication slot (not the walsender's PGPROC). The current patch reports this as XHB_REPLICATION_SLOT with message "logical replication slot", which is factually incorrect — it's a physical slot holding back xmin due to standby feedback, not a logical slot. A fix is planned but not yet posted.

The two distinct code paths for hot_standby_feedback:

Without physical slot: xmin held on walsender's PGPROC → correctly reported as "hot standby feedback (pid=N)"
With physical slot: xmin held on the slot itself → currently misreported as "logical replication slot" (needs fix)

Architectural Counter-Proposal Rejected

Scott Ray proposed computing the blocker inline during ComputeXidHorizons() and caching it for consumption at log time, avoiding the second ProcArray scan. Shinya Kato declined this, citing Sami Imseih's earlier rationale that separating the responsibilities of horizon computation and blocker identification is a cleaner design. This reaffirms the v2 architecture (separate scan) as the accepted approach.

2026-05-25 · claude-opus-4-6

Technical Analysis: Report Oldest Xmin Source When Autovacuum Cannot Remove Tuples

Core Problem

PostgreSQL's VACUUM cannot remove dead tuples that might still be visible to some snapshot, that is, tuples deleted by transactions that are not older than the "oldest xmin" horizon. When VACUUM encounters dead tuples it cannot reclaim, the log output reports the oldest xmin value but gives no diagnostic information about why that xmin is being held back. This is a critical operational gap because:

Table bloat is one of PostgreSQL's most common production issues, and identifying the root cause requires correlating multiple volatile system views (pg_stat_activity, pg_prepared_xacts, pg_replication_slots, pg_stat_replication) at the exact moment the VACUUM ran.
Retroactive diagnosis is nearly impossible — by the time an operator notices bloat from logs, the blocking transaction may have already completed, leaving no trace of the original cause.
Multiple distinct sources can hold back the xmin horizon: active transactions, idle-in-transaction sessions, prepared transactions (2PC), replication slots, and hot standby feedback from replicas.

Architectural Context

The xmin horizon computation lives in ComputeXidHorizons() within procarray.c. This function walks the entire ProcArray while holding ProcArrayLock in shared mode, computing several horizon values (shared, catalog, data, temp) that determine which tuples are safe to remove. The computed horizons are consumed by the global visibility state logic (GlobalVisUpdate) and ultimately by VACUUM to determine its "removable cutoff" XID.

The key architectural insight is that ComputeXidHorizons() already examines every backend's xid and xmin to compute the minimum, but discards the identity of the backend that set that minimum. The feature request is to retain and report this attribution.

Evolution of the Technical Approach

v1: Inline in ComputeXidHorizons() (Original Proposal)

The initial patch modified ComputeXidHorizons() directly to track an OldestXminInfo struct alongside each horizon computation. This had the advantage of zero additional overhead (the information was gathered during the existing ProcArray scan) but had a critical correctness flaw identified by Sami Imseih:

The PID attribution was incorrect for write transactions. When multiple backends share the same backend_xmin (because they all started after the blocking transaction), the loop picks the first PID it encounters with the cutoff XID value. However, the actual blocker might be a different backend whose xid (not xmin) equals the horizon. The demonstration showed VACUUM reporting PID 267064 when the actual blocker was PID 267090 (an idle-in-transaction session that had consumed the oldest XID via txid_current()).
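
The misattribution can be reproduced with a toy model of the scan. The sketch below is not the v1 patch code: BackendEntry, NaiveResult and naive_oldest_xmin() are hypothetical names, XIDs are compared as plain integers (no wraparound handling), and 0 stands in for "no XID". It only shows how recording the first backend that produces the minimum can name the wrong PID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INVALID_XID 0u   /* stand-in for "no XID assigned" */

/* Hypothetical, simplified view of one ProcArray entry. */
typedef struct BackendEntry
{
    int      pid;
    uint32_t xid;    /* XID the backend itself allocated, if any */
    uint32_t xmin;   /* oldest XID its snapshot still treats as running */
} BackendEntry;

typedef struct NaiveResult
{
    uint32_t horizon;  /* minimum over all valid xid/xmin values */
    int      pid;      /* first backend that produced that minimum */
} NaiveResult;

/* Single pass that tracks the minimum and whoever set it first. */
NaiveResult
naive_oldest_xmin(const BackendEntry *procs, size_t n)
{
    NaiveResult r = { INVALID_XID, 0 };

    assert(procs != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        uint32_t values[2] = { procs[i].xid, procs[i].xmin };

        for (int k = 0; k < 2; k++)
        {
            if (values[k] == INVALID_XID)
                continue;
            /* Strict "<": a later backend with the same value never wins. */
            if (r.horizon == INVALID_XID || values[k] < r.horizon)
            {
                r.horizon = values[k];
                r.pid = procs[i].pid;
            }
        }
    }
    return r;
}
```

Suppose PID 267064 is scanned first with xmin 1000 and no xid, followed by PID 267090 with xid 1000 (the value 1000 is made up; the PIDs are those from the thread's demonstration). The function returns horizon 1000 attributed to 267064, even though 267090 owns the XID and is the session that actually has to end.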

v2: Separate ProcArray Scan (Adopted Approach)

The accepted approach introduces two new functions:

GetXidHorizonBlockers(): Performs a dedicated ProcArray scan (under shared lock) to collect ALL potential blockers of a given xmin horizon. This returns a list of candidates with their type, PID, and whether they match via xid or xmin.
GetXidHorizonBlocker(): Selects the single highest-priority blocker from the candidates for log output.

The priority/selection logic is crucial: an xid-match always wins over an xmin-match. A backend whose xid equals the horizon is the actual root cause, because it allocated that transaction ID. Backends whose xmin equals the horizon are merely "victims": their snapshots still treat that XID as in progress, but they did not allocate it. This distinction eliminates the false attribution problem from v1.

Blocker Types (XidHorizonBlockerType enum)

ACTIVE_TRANSACTION        — currently executing
IDLE_IN_TRANSACTION       — session with open transaction, not executing
PREPARED_TRANSACTION      — 2PC prepared but not committed/aborted
REPLICATION_SLOT          — logical or physical slot holding back xmin
HOT_STANDBY_FEEDBACK      — standby's xmin feedback via walsender

The priority ordering within the xid-match group is effectively moot (as explained by the author): a given xid is owned by exactly one backend, so there's never ambiguity within that group. The ordering matters only for the xmin-match fallback group, where multiple backends may share the same xmin.
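
The selection rule can be sketched as follows. This is an illustration under stated assumptions, not the patch's GetXidHorizonBlocker(): the type and function names are hypothetical, and the tie-break among xmin-matches (enum declaration order, following the list above) is an assumption, since the thread summary does not spell out the real ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the blocker types listed above. */
typedef enum BlockerType
{
    BLOCKER_ACTIVE_TRANSACTION,
    BLOCKER_IDLE_IN_TRANSACTION,
    BLOCKER_PREPARED_TRANSACTION,
    BLOCKER_REPLICATION_SLOT,
    BLOCKER_HOT_STANDBY_FEEDBACK
} BlockerType;

typedef struct BlockerCandidate
{
    BlockerType type;
    int         pid;          /* 0 when no backend process exists */
    bool        matches_xid;  /* true: its xid equals the horizon;
                               * false: only its xmin does */
} BlockerCandidate;

/*
 * Pick one candidate for the log line.  An xid-match always beats an
 * xmin-match.  Among candidates of the same match kind, the lower enum
 * value wins (assumed ordering, for illustration only).
 * Returns NULL when there are no candidates.
 */
const BlockerCandidate *
pick_blocker(const BlockerCandidate *c, size_t n)
{
    const BlockerCandidate *best = NULL;

    assert(c != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (best == NULL)
        {
            best = &c[i];
            continue;
        }
        if (c[i].matches_xid != best->matches_xid)
        {
            if (c[i].matches_xid)
                best = &c[i];
            continue;
        }
        if (c[i].type < best->type)
            best = &c[i];
    }
    return best;
}
```

Given an active backend that matches only by xmin and an idle-in-transaction backend that matches by xid, the idle-in-transaction backend is chosen regardless of scan order, which is the behaviour the v2 design is meant to guarantee.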

Key Design Decisions and Tradeoffs

1. Single Blocker vs. All Blockers in Log Output

The thread debated whether to report all blockers or just one. The decision was to report only the root cause (highest-priority single blocker) in VACUUM logs because:

In concurrent workloads (e.g., pgbench with many clients), dozens of backends may share the same xmin, producing excessive log noise
The root cause (xid owner) is what the DBA needs to resolve; any remaining blockers will surface in subsequent VACUUM cycles
The full list is architecturally reserved for a future dynamic statistics view via GetXidHorizonBlockers()

2. Additional ProcArray Scan Overhead

The adopted approach requires a second ProcArray scan beyond ComputeXidHorizons(). This is acceptable because:

It only occurs during VACUUM VERBOSE or when log_autovacuum_min_duration triggers logging
The scan holds only a shared lock (not exclusive)
Laurenz Albe explicitly endorsed this as acceptable overhead for the default case

A proposed optimization (not implemented) was to only perform the scan when the cutoff xmin hasn't advanced between consecutive vacuums, which would require tracking the last cutoff value in relation-level statistics. This was deferred.

3. Time Lag Between Horizon Computation and Blocker Identification

A theoretical concern raised by the original author: the blocker scan happens after ComputeXidHorizons(), so the blocking backend could have committed in between. In practice this is acceptable because:

The horizon value is still valid (it was computed atomically)
If the blocker committed, the next VACUUM will advance the horizon
The information is diagnostic, not transactional

4. Active vs. Idle-in-Transaction Distinction

The thread agreed to distinguish between active and idle-in-transaction states, as idle-in-transaction is a much more actionable finding (the session is doing nothing but holding back the horizon, and can potentially be terminated). This was added in v2.

Integration Points

VACUUM log output: New lines appended to verbose vacuum output showing blocker type and PID
TAP tests: Dedicated test file (010_autovacuum_oldest_xmin_reason.pl) covering all blocker types including serializable transactions
Future: pg_stat views: The GetXidHorizonBlockers() API is designed to support a future system view exposing all current horizon blockers
Future: pg_stat_*_tables columns: Proposed last_vacuum_oldest_xmin and last_vacuum_oldest_xmin_source columns for persistent tracking

Output Format

oldest xmin source: active transaction (pid=12345)
oldest xmin source: idle in transaction (pid=12346)
oldest xmin source: prepared transaction
oldest xmin source: replication slot
oldest xmin source: hot standby feedback (pid=12347)

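The PID is included for blockers that have a backend process (active transactions, idle-in-transaction sessions, and hot standby feedback walsenders) and omitted for replication slots and prepared transactions, for which the recorded pid is 0.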
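
As an illustration of the PID rule, the five sample lines can be produced by a formatter along these lines. The enum and function names are hypothetical and the real patch presumably goes through PostgreSQL's own logging and translation machinery rather than snprintf(); only the output strings are taken from the samples above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical source kinds, in the order of the sample output above. */
typedef enum XminSource
{
    SRC_ACTIVE_TRANSACTION,
    SRC_IDLE_IN_TRANSACTION,
    SRC_PREPARED_TRANSACTION,
    SRC_REPLICATION_SLOT,
    SRC_HOT_STANDBY_FEEDBACK
} XminSource;

/* Only sources backed by a process carry a meaningful PID. */
static bool
source_has_pid(XminSource src)
{
    switch (src)
    {
        case SRC_ACTIVE_TRANSACTION:
        case SRC_IDLE_IN_TRANSACTION:
        case SRC_HOT_STANDBY_FEEDBACK:
            return true;
        default:
            return false;
    }
}

/* Write one "oldest xmin source" line into buf (truncating if needed). */
void
format_xmin_source(char *buf, size_t buflen, XminSource src, int pid)
{
    static const char *const labels[] = {
        "active transaction",
        "idle in transaction",
        "prepared transaction",
        "replication slot",
        "hot standby feedback"
    };

    assert(buf != NULL && buflen > 0);

    snprintf(buf, buflen, "oldest xmin source: %s", labels[src]);
    if (source_has_pid(src))
    {
        /* Append the PID suffix after whatever fit in the buffer. */
        size_t used = strlen(buf);

        snprintf(buf + used, buflen - used, " (pid=%d)", pid);
    }
}
```

Calling format_xmin_source() with SRC_PREPARED_TRANSACTION ignores the pid argument entirely, so a caller can pass the stored 0 without it leaking into the log line.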