Fix safe_wal_size for Slots Without restart_lsn
Core Problem
The pg_replication_slots system view exposes a safe_wal_size column that indicates how much WAL can be written before a replication slot's reserved WAL is at risk of being removed (when max_slot_wal_keep_size is configured). This value is computed from the slot's restart_lsn — the oldest WAL position the slot needs to retain.
The bug occurs when a replication slot exists but has never reserved any WAL, meaning its restart_lsn is InvalidXLogRecPtr (displayed as NULL in the view). In this state:
- wal_status correctly shows NULL
- restart_lsn correctly shows NULL
- But safe_wal_size incorrectly shows a non-null numeric value
The root cause is in the WAL availability computation logic. The code checks for the WALAVAIL_REMOVED case (where WAL has already been removed past the slot's position) and returns NULL for safe_wal_size in that scenario. However, it fails to check for WALAVAIL_INVALID_LSN — the state representing a slot that never had a valid restart_lsn in the first place. When this case falls through, the arithmetic proceeds on InvalidXLogRecPtr (which is 0/0), producing a nonsensical but non-null result.
Architectural Context
The bug lives at the intersection of two subsystems:
- Replication slot management (src/backend/replication/slot.c and related): Slots track restart_lsn to prevent WAL recycling. A slot that's been created but not yet activated (e.g., a logical slot awaiting its initial snapshot) legitimately has no restart_lsn.
- The pg_replication_slots view (src/backend/catalog/system_views.sql and the underlying C function): This view surfaces slot metadata including WAL safety margins. The safe_wal_size computation involves max_slot_wal_keep_size - (current_wal_position - restart_lsn), which is meaningless when restart_lsn is invalid.
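To see why the arithmetic is meaningless for an invalid restart_lsn, the simplified formula above can be worked through in a few lines of C. This is only a sketch of that formula, not the PostgreSQL implementation: the function name simplified_safe_wal_size and the byte values are hypothetical, and the real code may work on WAL segment boundaries rather than raw byte offsets.

```c
#include <stdint.h>

/*
 * Hypothetical helper modelling the simplified formula from the text:
 *   max_slot_wal_keep_size - (current_wal_position - restart_lsn)
 * All values are byte offsets. This is an illustration only, not
 * PostgreSQL source code.
 */
int64_t
simplified_safe_wal_size(uint64_t max_keep_bytes,
                         uint64_t current_pos,
                         uint64_t restart_lsn)
{
    /* WAL written since the position the slot claims to need */
    uint64_t consumed = current_pos - restart_lsn;

    return (int64_t) max_keep_bytes - (int64_t) consumed;
}
```

With a 64 MB limit and the current position at 48 MB, a slot whose restart_lsn is at 32 MB gets 48 MB of headroom. Passing 0 for restart_lsn (the invalid value) still yields a number, 16 MB, but that number only measures distance from the start of the WAL stream and says nothing about any WAL the slot has reserved.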
The fix is straightforward: add a check for WALAVAIL_INVALID_LSN alongside the existing WALAVAIL_REMOVED check, returning NULL for safe_wal_size in both cases. This maintains the semantic invariant that safe_wal_size is only meaningful when the slot actually has WAL reserved.
Proposed Solution
The patch adds a condition to return NULL for safe_wal_size when the WAL availability state is WALAVAIL_INVALID_LSN. This is the logical counterpart to the existing WALAVAIL_REMOVED handling — both represent states where computing a distance from restart_lsn is meaningless:
- WALAVAIL_REMOVED: The slot's WAL has already been removed; the slot is already in a "lost" state.
- WALAVAIL_INVALID_LSN: The slot has never established a WAL position; there's nothing to measure distance from.
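The shape of the change can be sketched as a pair of predicates that decide whether safe_wal_size should be NULL. This is a minimal model under stated assumptions, not the patch itself: the enum is a local stand-in that reuses the state names from the text, the function names are hypothetical, and the sketch assumes that a negative max_slot_wal_keep_size setting means no limit is configured.

```c
#include <stdbool.h>

/* Local stand-in for the WAL availability states named in the text. */
typedef enum
{
    WALAVAIL_INVALID_LSN,   /* slot never had a valid restart_lsn */
    WALAVAIL_RESERVED,      /* slot's WAL is still retained */
    WALAVAIL_REMOVED        /* slot's WAL has already been removed */
} WalAvailModel;

/* Before the fix: only the "removed" state short-circuits to NULL. */
bool
safe_wal_size_is_null_before(WalAvailModel state, int max_keep_mb)
{
    /* assumption: max_keep_mb < 0 means no limit is configured */
    return state == WALAVAIL_REMOVED || max_keep_mb < 0;
}

/* After the fix: a slot without a valid restart_lsn also yields NULL. */
bool
safe_wal_size_is_null_after(WalAvailModel state, int max_keep_mb)
{
    return state == WALAVAIL_REMOVED ||
           state == WALAVAIL_INVALID_LSN ||
           max_keep_mb < 0;
}
```

In this model the two predicates differ only for WALAVAIL_INVALID_LSN with a configured limit, which matches the narrow scope the write-up describes: every slot with a valid restart_lsn gets the same answer before and after.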
Patch Evolution
- v1: Initial fix containing the core logic change, but with an indentation issue.
- v2: Fixed indentation, updated the comment explaining the NULL return, and updated documentation. Also added a regression test.
- v3: Improved readability of the new regression test.
The addition of a test is notable — the author acknowledges this is a small fix but provides test coverage anyway, which exercises the scenario of querying safe_wal_size for a freshly-created slot that hasn't yet acquired a restart_lsn.
Risk Assessment
This is a low-risk, narrowly-scoped fix:
- It only affects the display/reporting path, not WAL retention decisions themselves
- The change adds a bail-out condition before arithmetic, with no side effects
- Slots that have valid restart_lsn values are completely unaffected
- The fix aligns the behavior of safe_wal_size with wal_status (both now correctly return NULL in this state)
Potential Back-patch Consideration
Since this affects the accuracy of a monitoring view and could confuse monitoring tools (non-null safe_wal_size with null restart_lsn is contradictory), this is a reasonable candidate for back-patching to supported branches where max_slot_wal_keep_size and safe_wal_size exist.