2026-06-01 · claude-opus-4-6

Technical Analysis: Fix Recomputation of Required LSN on restart_lsn-only Advancement

Core Problem

The issue resides in LogicalConfirmReceivedLocation(), a function in PostgreSQL's logical replication infrastructure that processes confirmations from downstream consumers about which WAL positions they have consumed. This function manages two independent advancement conditions:

updated_restart — the slot's restart_lsn has advanced (the point from which WAL replay must begin if the slot is reused)
updated_xmin — the slot's catalog xmin has advanced (the transaction horizon below which catalog tuples can be vacuumed)

When either condition is true, the slot is marked dirty and ReplicationSlotSave() persists the updated state. However, the critical call to ReplicationSlotsComputeRequiredLSN() — which recomputes the global minimum LSN across all replication slots and propagates it to XLogSetReplicationSlotMinimumLSN() — is gated exclusively by if (updated_xmin).

Why This Matters Architecturally

ReplicationSlotsComputeRequiredLSN() derives the system-wide WAL retention floor. This value directly controls:

WAL segment recycling — segments below this LSN are candidates for removal
pg_replication_slots.restart_lsn effectiveness — individual slot advancement is meaningless for WAL reclamation if the global minimum isn't recomputed
Disk space management — stale global minimums cause unnecessary WAL accumulation

When restart_lsn advances but xmin does not change, the per-slot state is correctly persisted, but the system-wide WAL retention point remains stale. WAL that is no longer needed by any slot continues to be retained until some unrelated event (e.g., an xmin update on any slot, a checkpoint, or another slot's activity) triggers recomputation.

Severity Assessment

In typical logical replication scenarios, xmin updates frequently alongside restart_lsn advances, so the window of staleness is small. However, there are pathological cases:

PG 19 REPACK — REPACK intentionally prevents xmin advancement, meaning restart_lsn can advance many times without ever triggering the global recomputation. This creates unbounded WAL retention.
Idle subscribers with periodic restart_lsn confirmations — if the workload doesn't generate catalog-modifying transactions, xmin may remain static while restart_lsn moves forward.
Many slots scenario — even brief staleness multiplied across many slots can compound into significant excess WAL retention.

Proposed Solution

The fix is minimal and surgical: move ReplicationSlotsComputeRequiredLSN() from inside the if (updated_xmin) block to inside if (updated_restart). This correctly aligns the recomputation trigger with the state it actually depends on — restart_lsn is the input to ReplicationSlotsComputeRequiredLSN(), not xmin.

// Before (incorrect gating):
if (updated_xmin) {
    ReplicationSlotsComputeRequiredXmin(...);
    ReplicationSlotsComputeRequiredLSN();  // LSN recomputation wrongly tied to xmin
}

// After (correct gating):
if (updated_restart) {
    ReplicationSlotsComputeRequiredLSN();  // LSN recomputation tied to restart_lsn
}
if (updated_xmin) {
    ReplicationSlotsComputeRequiredXmin(...);
}

This is logically clean: each recomputation function is called exactly when its corresponding input state changes.

Key Technical Insights

Separation of Concerns in Slot State

Replication slots track two independent retention dimensions:

WAL retention via restart_lsn → governed by ReplicationSlotsComputeRequiredLSN()
Catalog retention via catalog_xmin → governed by ReplicationSlotsComputeRequiredXmin()

The bug conflates these two dimensions by using xmin changes as a proxy for LSN changes. While they are often correlated in practice (both advance as replication progresses), they are architecturally independent.

Interaction with REPACK (PG 19)

The REPACK feature in PG 19 deliberately holds back xmin advancement (to preserve catalog visibility guarantees during table reorganization). This breaks the implicit assumption that xmin advances "frequently enough" to keep the global LSN minimum fresh. The bug becomes a real WAL bloat issue in REPACK scenarios, not merely a theoretical concern.

Backpatch Considerations

The discussion reveals tension between:

Correctness: The code is demonstrably wrong regardless of version — the gating condition doesn't match the semantic dependency
Practical impact: In versions before PG 19 (without REPACK), the bug is largely masked by frequent xmin updates during normal logical replication

Álvaro Herrera's question about whether backpatching is warranted reflects the committer perspective that backpatches should address real-world issues, not just theoretical correctness problems.

Design Decision: Why Was It Written This Way?

The original code likely grouped both recomputation calls together as an optimization — "if something changed, recompute everything." Over time, the conditions were refactored to be more granular (updated_restart vs updated_xmin), but the recomputation calls weren't correspondingly separated. This is a classic refactoring oversight where the control flow diverged from the semantic intent.

logical: fix recomputation required LSN on restart_lsn-only advancement

Latest Update