Checkpoint Replication Slots Later — Technical Analysis
The Core Problem
During a PostgreSQL checkpoint, the CheckpointGuts() function performs a sequence of sub-operations that flush various pieces of state to stable storage. Among these is CheckPointReplicationSlots(), which persists the on-disk state of replication slots (their restart_lsn, confirmed_flush_lsn, etc.). Currently this call sits near the beginning of CheckpointGuts(), before CheckPointBuffers().
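For orientation, here is a condensed, paraphrased view of the relevant ordering in xlog.c. The exact call list and argument shapes vary somewhat across versions, so treat this as a sketch rather than the verbatim source:

```c
/* Paraphrased sketch of the current CheckPointGuts() ordering (xlog.c). */
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
	CheckPointRelationMap();
	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN); /* slot state persisted early */
	CheckPointSnapBuild();
	CheckPointLogicalRewriteHeap();
	CheckPointReplicationOrigin();

	/* The expensive part: SLRUs and the shared buffer pool. */
	CheckPointCLOG();
	CheckPointCommitTs();
	CheckPointSUBTRANS();
	CheckPointMultiXact();
	CheckPointPredicate();
	CheckPointBuffers(flags);           /* spread over most of checkpoint_timeout */

	ProcessSyncRequests();              /* complete all queued fsyncs */
	CheckPointTwoPhase(checkPointRedo); /* deliberately delayed to last */
}
```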
CheckPointBuffers() is the dominant cost of a checkpoint. Under the default "spread checkpoint" behavior (controlled by checkpoint_completion_target, effectively 0.9), buffer writing is deliberately stretched out to consume most of checkpoint_timeout (typically minutes, not seconds) in order to smooth the I/O impact on foreground workload. That means between the moment the slot state is captured and the moment the checkpoint actually finishes, a very long wall-clock interval can elapse.
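For concreteness: with the defaults checkpoint_timeout = 5min and checkpoint_completion_target = 0.9, the buffer-write phase is paced to take roughly 0.9 × 300 s ≈ 270 s, and installations that raise checkpoint_timeout to 15 or 30 minutes stretch that window proportionally. The slot state captured at the start of CheckpointGuts() can therefore be many minutes old by the time the checkpoint finishes.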
The asymmetry Aasma is pointing out lies in how WAL retention is computed:
- CheckPointReplicationSlots() effectively snapshots the slots early.
- Near the end of the checkpoint, RemoveOldXlogFiles() is called and consults ReplicationSlotsComputeRequiredLSN() to decide how far back WAL must be retained for consumers (logical decoding, physical standbys via slots).
- But by that time, slot advancement that occurred during the spread-checkpoint window (potentially tens of gigabytes of WAL worth of progress by fast consumers) is not reflected in what the checkpoint considers "required" (see the sketch after this list).
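The other half of the asymmetry sits at the tail of CreateCheckPoint(), after the checkpoint record has been written. The following is paraphrased from xlog.c with argument lists elided and simplified (they differ across versions); the point is only where the retention decision happens relative to the slot snapshot taken much earlier:

```c
	/* Paraphrased tail of CreateCheckPoint() (xlog.c); arguments simplified. */

	/* Decide the oldest WAL segment that must be kept... */
	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
	KeepLogSeg(recptr, &_logSegNo);      /* consults the slot-required LSN */

	/* ...invalidating slots that would exceed max_slot_wal_keep_size... */
	if (InvalidateObsoleteReplicationSlots(/* RS_INVAL_WAL_REMOVED, ... */))
	{
		/* Some slots were invalidated, so less WAL needs to be kept. */
		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
		KeepLogSeg(recptr, &_logSegNo);
	}

	/* ...and recycle or remove everything older than that. */
	_logSegNo--;
	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr /* , timeline */);
```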
The concrete user-visible symptom: pg_wal holds onto far more segments than necessary for an entire checkpoint cycle. In steady state this means baseline WAL footprint is inflated by roughly the amount of WAL generated during one checkpoint_timeout window, which on busy systems with aggressive replication slots can be substantial. The segments would only be recycled on the next checkpoint — unless that one also snapshots the slots early, perpetuating the lag.
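To put a number on it under assumed conditions: a workload generating 50 MB/s of WAL with checkpoint_timeout = 15min produces about 45 GB of WAL per checkpoint cycle, so the early slot snapshot can keep roughly that much extra WAL pinned in pg_wal even when every slot consumer has long since caught up. These figures are illustrative, not measurements from the thread.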
Why the Current Ordering Exists (and Whether It Matters)
The natural question is why CheckPointReplicationSlots() was placed early in the first place. Looking at the durability contract: the checkpoint's redo pointer is established before CheckpointGuts() runs. Anything persisted during CheckpointGuts() only needs to be consistent with that redo pointer and with WAL replayed up to the checkpoint record. Slot state files (pg_replslot/*/state) are not ordered with respect to buffer flushes — they're independent metadata. So from a crash-recovery standpoint, moving the call to the end of CheckpointGuts() (but still before the checkpoint record is written and before RemoveOldXlogFiles()) preserves correctness: on crash, we still restore slots to a state no more advanced than WAL that was durably flushed.
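Under that reading, the change could be as small as relocating one call. A minimal sketch of the reordered CheckPointGuts(), assuming the patch takes the most direct form (the actual patch may place the call or its comment differently):

```c
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
	CheckPointRelationMap();
	CheckPointSnapBuild();
	CheckPointLogicalRewriteHeap();
	CheckPointReplicationOrigin();

	/* SLRUs and shared buffers, spread over most of checkpoint_timeout. */
	CheckPointCLOG();
	CheckPointCommitTs();
	CheckPointSUBTRANS();
	CheckPointMultiXact();
	CheckPointPredicate();
	CheckPointBuffers(flags);

	ProcessSyncRequests();

	/*
	 * Persist slot state as late as possible, so that the positions it
	 * captures, and the WAL-retention decision made moments later in
	 * CreateCheckPoint(), reflect consumer progress made while buffers
	 * were being written.
	 */
	CheckPointReplicationSlots(flags & CHECKPOINT_IS_SHUTDOWN);

	CheckPointTwoPhase(checkPointRedo);
}
```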
Aasma's assertion "as far as I could tell there is no reason for this to happen early" is the key design claim. If true, this is essentially free — a one-line reordering that tightens the coupling between observed slot positions and WAL retention decisions made in the same checkpoint.
Implications and Subtleties
Several aspects worth flagging for reviewers:
- Interaction with RemoveOldXlogFiles(): The retention calculation ultimately goes through KeepLogSeg(), which combines wal_keep_size, max_slot_wal_keep_size, and the slot minimum computed by ReplicationSlotsComputeRequiredLSN() (sketched after this list). Moving the slot checkpoint later means the retention computation sees a fresher restart_lsn, which strictly reduces retention or leaves it unchanged; it cannot cause premature WAL removal that breaks a consumer, because slot advancement is monotonic.
- Slot invalidation under max_slot_wal_keep_size: Slot invalidation (InvalidateObsoleteReplicationSlots()) happens around WAL removal. Later slot persistence should not interfere: invalidation mutates slot state in shared memory and marks it dirty, so a subsequent CheckPointReplicationSlots() at the new, later position would correctly flush the invalidated state. Arguably this is better than the current ordering, where an invalidation decision made during this same checkpoint would not be persisted until the next one.
- Synced slots on standbys: With the PG17 slot synchronization feature, standby servers persist synchronized logical slots. The timing of CheckPointReplicationSlots() on a standby (invoked via restartpoints) should follow the same logic; if anything, late persistence is even more valuable there because slot sync is continuous.
- Shutdown checkpoints: For CHECKPOINT_IS_SHUTDOWN, buffer writing is not spread, so the staleness window is negligible. The reordering is effectively a no-op for shutdown checkpoints but remains correct.
- Two-phase files, logical rewrite mappings, etc.: The other items in CheckpointGuts() have their own ordering rationales (e.g., CheckPointTwoPhase() is deliberately called last, after the buffer flush and sync, so that as few 2PC state files as possible still need writing). Slot state has no ordering dependency on buffer contents, which is why the move is safe.
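To ground the first bullet, here is a condensed paraphrase of the retention arithmetic in KeepLogSeg() (xlog.c); the identifiers are from the real source, but the body is simplified and omits edge cases:

```c
static void
KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
{
	XLogSegNo	currSegNo;
	XLogSegNo	segno;
	XLogRecPtr	keep;

	XLByteToSeg(recptr, currSegNo, wal_segment_size);
	segno = currSegNo;

	/* Start from the minimum LSN any replication slot still requires. */
	keep = XLogGetReplicationSlotMinimumLSN();
	if (keep != InvalidXLogRecPtr && keep < recptr)
	{
		XLByteToSeg(keep, segno, wal_segment_size);

		/* ...but cap how much WAL slots may pin. */
		if (max_slot_wal_keep_size_mb >= 0)
		{
			XLogSegNo	slot_keep_segs =
				ConvertToXSegs(max_slot_wal_keep_size_mb, wal_segment_size);

			if (currSegNo - segno > slot_keep_segs)
				segno = currSegNo - slot_keep_segs;
		}
	}

	/* Independently, always retain at least wal_keep_size of recent WAL. */
	if (wal_keep_size_mb > 0)
	{
		uint64		keep_segs =
			ConvertToXSegs(wal_keep_size_mb, wal_segment_size);

		if (currSegNo - segno < keep_segs)
			segno = (currSegNo <= keep_segs) ? 1 : currSegNo - keep_segs;
	}

	/* Only ever move the caller's cutoff backwards (retain more, never less). */
	if (segno < *logSegNo)
		*logSegNo = segno;
}
```

A fresher slot minimum can only move keep forward, which moves segno forward or leaves it alone; hence the claim that the reordering can reduce retention but never remove WAL a consumer still needs.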
Design Decision Summary
The patch is a narrow optimization with a clear rationale: align the observation point of replication slot positions with the decision point for WAL segment recycling, eliminating an artifact of the spread-checkpoint design that causes chronic WAL retention bloat. The change is minimal (reordering within CheckpointGuts()), has no on-disk format implications, and does not change recovery semantics.
The weight of this proposal depends on (a) confirming no subtle ordering dependency was overlooked — particularly around logical decoding snapshots and slot invalidation interplay — and (b) whether the community considers this a bug fix worth backpatching or strictly a master-only improvement. Given that it changes checkpoint-internal ordering, master-only is the likely outcome.
Open Questions for the Thread
- Is there any historical reason (perhaps from the original slot introduction in 9.4) for the early placement that isn't documented in comments?
- Should ReplicationSlotsComputeRequiredLSN() itself be called later / recomputed, independent of when slot state is persisted? These are two separable concerns: persisting slot files vs. observing slot LSNs for retention.
- Does this interact with pg_basebackup-style operations that might read slot state during a running checkpoint?