2026-06-01 · claude-opus-4-6

Fix safe_wal_size for Slots Without restart_lsn

Core Problem

The pg_replication_slots system view exposes a safe_wal_size column that indicates how much WAL can be written before a replication slot's reserved WAL is at risk of being removed (when max_slot_wal_keep_size is configured). This value is computed from the slot's restart_lsn — the oldest WAL position the slot needs to retain.

The bug occurs when a replication slot exists but has never reserved any WAL, meaning its restart_lsn is InvalidXLogRecPtr, which the view displays as NULL. In this state:

wal_status correctly shows NULL
restart_lsn correctly shows NULL
But safe_wal_size incorrectly shows a non-null numeric value

The root cause is in the WAL availability computation logic. The code checks for the WALAVAIL_REMOVED case (where WAL has already been removed past the slot's position) and returns NULL for safe_wal_size in that scenario. However, it fails to check for WALAVAIL_INVALID_LSN, the state representing a slot that never had a valid restart_lsn in the first place. When this case falls through, the arithmetic proceeds on InvalidXLogRecPtr (the zero LSN, printed as 0/0), producing a nonsensical but non-null result.

Architectural Context

The bug sits at the intersection of two subsystems:

Replication slot management (src/backend/replication/slot.c and related): Slots track restart_lsn to prevent WAL recycling. A slot that's been created but not yet activated (e.g., a logical slot awaiting its initial snapshot) legitimately has no restart_lsn.
The pg_replication_slots view (src/backend/catalog/system_views.sql and the underlying C function): This view surfaces slot metadata including WAL safety margins. The safe_wal_size computation involves max_slot_wal_keep_size - (current_wal_position - restart_lsn), which is meaningless when restart_lsn is invalid.

The fix is straightforward: add a check for WALAVAIL_INVALID_LSN alongside the existing WALAVAIL_REMOVED check, returning NULL for safe_wal_size in both cases. This maintains the semantic invariant that safe_wal_size is only meaningful when the slot actually has WAL reserved.

Proposed Solution

The patch adds a condition to return NULL for safe_wal_size when the WAL availability state is WALAVAIL_INVALID_LSN. This is the logical counterpart to the existing WALAVAIL_REMOVED handling — both represent states where computing a distance from restart_lsn is meaningless:

WALAVAIL_REMOVED: The slot's WAL has already been removed; the slot is already in a "lost" state.
WALAVAIL_INVALID_LSN: The slot has never established a WAL position; there's nothing to measure distance from.
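
The two cases above can be expressed as a single guard. The following is a minimal sketch of that condition, not the patch itself: the helper name safe_wal_size_is_null is hypothetical, the enum is a simplified stand-in, and the treatment of a negative max_slot_wal_keep_size as "no limit configured" is stated here as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the WAL availability states (assumption). */
typedef enum WALAvailability
{
    WALAVAIL_INVALID_LSN,   /* slot never established a WAL position */
    WALAVAIL_RESERVED,
    WALAVAIL_EXTENDED,
    WALAVAIL_UNRESERVED,
    WALAVAIL_REMOVED        /* slot's WAL is already gone */
} WALAvailability;

/*
 * Hypothetical helper: decide whether safe_wal_size should be NULL.
 * Assumes a negative max_slot_wal_keep_size_mb means no limit is set.
 */
bool
safe_wal_size_is_null(WALAvailability state, int max_slot_wal_keep_size_mb)
{
    assert(max_slot_wal_keep_size_mb >= -1);

    return state == WALAVAIL_REMOVED ||
           state == WALAVAIL_INVALID_LSN ||   /* the newly added case */
           max_slot_wal_keep_size_mb < 0;
}
```

Keeping both states in one condition means the distance arithmetic is only reached when restart_lsn is a position the slot actually holds.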

Patch Evolution

v1: Initial fix containing the core logic change; it had an indentation issue.
v2: Fixed indentation, updated the comment explaining the NULL return, and updated documentation. Also added a regression test.
v3: Improved readability of the new regression test.

The addition of a test is notable: the author acknowledges this is a small fix but provides test coverage anyway. The test queries safe_wal_size for a freshly created slot that has not yet acquired a restart_lsn.

Risk Assessment

This is a low-risk, narrowly-scoped fix:

It only affects the display/reporting path, not WAL retention decisions themselves
The change adds a bail-out condition before arithmetic, with no side effects
Slots that have valid restart_lsn values are completely unaffected
The fix aligns the behavior of safe_wal_size with wal_status (both now correctly return NULL in this state)

Potential Back-patch Consideration

Since this affects the accuracy of a monitoring view and could confuse monitoring tools (non-null safe_wal_size with null restart_lsn is contradictory), this is a reasonable candidate for back-patching to supported branches where max_slot_wal_keep_size and safe_wal_size exist.
