Fix safe_wal_size for slots without restart_lsn

First seen: 2026-05-27 10:55:18+00:00 · Messages: 3 · Participants: 1

Latest Update

2026-06-01 · claude-opus-4-6

Core Problem

The pg_replication_slots system view exposes a safe_wal_size column that indicates how much WAL can be written before a replication slot's reserved WAL is at risk of being removed (when max_slot_wal_keep_size is configured). This value is computed from the slot's restart_lsn — the oldest WAL position the slot needs to retain.

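The bug occurs when a replication slot exists but has never reserved any WAL, so its restart_lsn is InvalidXLogRecPtr and the view reports it as NULL. In this state the view still reports a non-null safe_wal_size, even though the slot has no reserved WAL position to measure from.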

The root cause is in the WAL availability computation logic. The code checks for the WALAVAIL_REMOVED case (where WAL has already been removed past the slot's position) and returns NULL for safe_wal_size in that scenario. However, it fails to check for WALAVAIL_INVALID_LSN — the state representing a slot that never had a valid restart_lsn in the first place. When this case falls through, the arithmetic proceeds on InvalidXLogRecPtr (which is 0/0), producing a nonsensical but non-null result.

Architectural Context

The bug sits at the intersection of two subsystems:

  1. Replication slot management (src/backend/replication/slot.c and related): Slots track restart_lsn to prevent WAL recycling. A slot that's been created but not yet activated (e.g., a logical slot awaiting its initial snapshot) legitimately has no restart_lsn.

  2. The pg_replication_slots view (src/backend/catalog/system_views.sql and the underlying C function): This view surfaces slot metadata including WAL safety margins. The safe_wal_size computation involves max_slot_wal_keep_size - (current_wal_position - restart_lsn), which is meaningless when restart_lsn is invalid.
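To make the failure concrete, the simplified formula above can be evaluated directly. The sketch below is a standalone model, not the server's code: the typedef and macro are local stand-ins for PostgreSQL's XLogRecPtr and InvalidXLogRecPtr, the function name simplified_safe_wal_size is hypothetical, and the LSNs and the 1 GiB limit used afterwards are made-up sample values.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for PostgreSQL's XLogRecPtr / InvalidXLogRecPtr. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: evaluates the simplified margin
 *     keep_size - (current - restart_lsn)
 * with every quantity expressed in bytes. It deliberately performs no
 * validity check on restart_lsn, to show what the unchecked arithmetic yields.
 */
int64_t
simplified_safe_wal_size(int64_t keep_size_bytes,
                         XLogRecPtr current,
                         XLogRecPtr restart_lsn)
{
    /* The current write position can never be behind the slot's position. */
    assert(current >= restart_lsn);

    return keep_size_bytes - (int64_t) (current - restart_lsn);
}
```

With a 1 GiB limit (1073741824 bytes), a current position of 0x3000000 and a restart_lsn of 0x2000000, the helper returns 1056964608. Passing InvalidXLogRecPtr as restart_lsn returns 1023410176 instead: a plausible-looking number that measures distance from the start of the WAL address space rather than from anything the slot holds.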

The fix is straightforward: add a check for WALAVAIL_INVALID_LSN alongside the existing WALAVAIL_REMOVED check, returning NULL for safe_wal_size in both cases. This maintains the semantic invariant that safe_wal_size is only meaningful when the slot actually has WAL reserved.

Proposed Solution

The patch adds a condition to return NULL for safe_wal_size when the WAL availability state is WALAVAIL_INVALID_LSN. This is the logical counterpart to the existing WALAVAIL_REMOVED handling, since both represent states where computing a distance from restart_lsn is meaningless.
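The shape of the change can be sketched as a before/after predicate. This is an illustration under stated assumptions, not the patch itself: the enum is a local stand-in that reuses the server's state names, both function names are hypothetical, and treating a negative max_slot_wal_keep_size as "no limit configured" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the server's WAL availability states. */
typedef enum
{
    WALAVAIL_INVALID_LSN,   /* slot never had a valid restart_lsn */
    WALAVAIL_RESERVED,      /* slot's WAL is still retained */
    WALAVAIL_REMOVED        /* WAL past the slot's position is already gone */
} WALAvailability;

/*
 * Hypothetical model of the pre-patch check described in this summary:
 * only WALAVAIL_REMOVED (or no configured limit) produces NULL.
 */
bool
safe_wal_size_is_null_old(WALAvailability state, int max_slot_wal_keep_size_mb)
{
    return state == WALAVAIL_REMOVED || max_slot_wal_keep_size_mb < 0;
}

/*
 * Hypothetical model of the patched check: WALAVAIL_INVALID_LSN is treated
 * the same way as WALAVAIL_REMOVED, so no arithmetic runs on an invalid LSN.
 */
bool
safe_wal_size_is_null_new(WALAvailability state, int max_slot_wal_keep_size_mb)
{
    return state == WALAVAIL_REMOVED ||
           state == WALAVAIL_INVALID_LSN ||
           max_slot_wal_keep_size_mb < 0;
}
```

For a slot in WALAVAIL_INVALID_LSN with a 1024 MB limit, the old predicate returns false (a value is computed) and the new one returns true (NULL is reported). The two predicates agree for every other state, which is why the change stays narrow.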

Patch Evolution

The patch also adds a test. The author acknowledges that the fix is small but provides coverage anyway: the test queries safe_wal_size for a freshly created slot that has not yet acquired a restart_lsn.

Risk Assessment

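This is a low-risk, narrowly scoped fix. It changes only the safe_wal_size value that pg_replication_slots reports for slots without a valid restart_lsn.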

Potential Back-patch Consideration

Since this affects the accuracy of a monitoring view and could confuse monitoring tools (non-null safe_wal_size with null restart_lsn is contradictory), this is a reasonable candidate for back-patching to supported branches where max_slot_wal_keep_size and safe_wal_size exist.