Checkpoint replication slots later

First seen: 2026-05-07 10:28:07+00:00 · Messages: 1 · Participants: 1

Latest Update

2026-05-08 · opus 4.7

Checkpoint Replication Slots Later — Technical Analysis

The Core Problem

During a PostgreSQL checkpoint, the CheckpointGuts() function performs a sequence of sub-operations that flush various pieces of state to stable storage. Among these is CheckPointReplicationSlots(), which writes each replication slot's state (its restart_lsn, confirmed_flush_lsn, etc.) to its on-disk state file. Currently this call sits near the beginning of CheckpointGuts(), before CheckPointBuffers().
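The ordering at issue can be sketched schematically. The lists below are an illustrative model only, not the actual call sequence from the PostgreSQL source; the step names are borrowed for readability:

```python
# Illustrative model of the sub-operation ordering inside CheckpointGuts()
# as described above -- NOT the real PostgreSQL call sequence.
CURRENT_ORDER = [
    "CheckPointReplicationSlots",  # slot state persisted early
    "CheckPointBuffers",           # dominant cost; spread over minutes
    "...",                         # other subsystem flushes
]

PROPOSED_ORDER = [
    "CheckPointBuffers",
    "...",
    "CheckPointReplicationSlots",  # persisted just before WAL recycling decides retention
]
```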

CheckPointBuffers() is the dominant cost of a checkpoint. Under the default "spread checkpoint" behavior (controlled by checkpoint_completion_target, effectively 0.9), buffer writing is deliberately stretched out to consume most of checkpoint_timeout (typically minutes, not seconds) in order to smooth the I/O impact on foreground workload. That means between the moment the slot state is captured and the moment the checkpoint actually finishes, a very long wall-clock interval can elapse.
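A back-of-envelope calculation shows how much WAL can accumulate inside that window. The numbers below are assumptions for illustration (a 300 s checkpoint_timeout and a 50 MB/s WAL generation rate), not measurements:

```python
# How long the spread write phase lasts, and how much WAL a busy system
# can generate while it runs. All inputs are assumed example values.
checkpoint_timeout = 300            # seconds (common setting)
checkpoint_completion_target = 0.9  # default spread factor
wal_rate_mb_s = 50                  # assumed workload WAL rate

window_s = checkpoint_timeout * checkpoint_completion_target  # spread window
wal_in_window_gb = wal_rate_mb_s * window_s / 1024            # WAL produced in it

print(f"write window: {window_s:.0f} s, WAL generated: {wal_in_window_gb:.1f} GB")
```

At these assumed rates the window is 270 s and roughly 13 GB of WAL is generated between the early slot snapshot and the end of the checkpoint.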

The asymmetry Aasma is pointing out lies in how WAL retention is computed:

  1. CheckPointReplicationSlots() effectively snapshots the slots early.
  2. Near the end of the checkpoint, RemoveOldXlogFiles() is called and consults ReplicationSlotsComputeRequiredLSN() to decide how far back WAL must be retained for consumers (logical decoding, physical standbys via slots).
  3. But by that time, slot advancement that occurred during the spread-checkpoint window — potentially tens of gigabytes of WAL worth of progress by fast consumers — is not reflected in what the checkpoint considers "required."

The concrete user-visible symptom: pg_wal holds onto far more segments than necessary for an entire checkpoint cycle. In steady state this means baseline WAL footprint is inflated by roughly the amount of WAL generated during one checkpoint_timeout window, which on busy systems with aggressive replication slots can be substantial. The segments would only be recycled on the next checkpoint — unless that one also snapshots the slots early, perpetuating the lag.
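A toy model (all positions, rates, and timings assumed, not taken from PostgreSQL) makes the steady-state effect concrete: if a consumer keeps pace with WAL generation during the spread write phase, observing the slot early retains a full window's worth of WAL, while observing it late retains almost none:

```python
# Toy model -- not PostgreSQL code -- of how the observation point for slot
# positions affects WAL retention. WAL positions are plain byte offsets.
WAL_RATE = 50 * 1024 * 1024   # bytes of WAL per second (assumed)
CHECKPOINT_WINDOW = 270       # seconds of spread buffer writing (assumed)

def retained_after_checkpoint(slot_lsn_at_start, observe_late):
    # The consumer advances its slot while CheckPointBuffers() spreads its
    # writes; here it keeps up with WAL generation exactly.
    slot_lsn_at_end = slot_lsn_at_start + WAL_RATE * CHECKPOINT_WINDOW
    # WAL recycling keeps everything at or above the "required" LSN, which
    # depends on when the slot position was observed.
    required_lsn = slot_lsn_at_end if observe_late else slot_lsn_at_start
    current_wal = slot_lsn_at_end  # consumer is caught up to current WAL
    return current_wal - required_lsn  # bytes retained for this slot

early = retained_after_checkpoint(0, observe_late=False)
late = retained_after_checkpoint(0, observe_late=True)
print(f"early snapshot retains: {early / 2**30:.1f} GiB")
print(f"late snapshot retains:  {late / 2**30:.1f} GiB")
```

In this model the early snapshot leaves ~13 GiB of already-consumed WAL pinned until the next checkpoint, which is exactly the chronic inflation described above.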

Why the Current Ordering Exists (and Whether It Matters)

The natural question is why CheckPointReplicationSlots() was placed early in the first place. Looking at the durability contract: the checkpoint's redo pointer is established before CheckpointGuts() runs. Anything persisted during CheckpointGuts() only needs to be consistent with that redo pointer and with WAL replayed up to the checkpoint record. Slot state files (pg_replslot/*/state) are not ordered with respect to buffer flushes — they're independent metadata. So from a crash-recovery standpoint, moving the call to the end of CheckpointGuts() (but still before the checkpoint record is written and before RemoveOldXlogFiles()) preserves correctness: on crash, we still restore slots to a state no more advanced than WAL that was durably flushed.
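The correctness argument reduces to a single invariant, restated below as an illustrative sketch (not PostgreSQL code): a slot position restored after a crash must not point past WAL that was durably flushed, and that holds whether the state file is written early or late in the checkpoint, because WAL is flushed ahead of any slot advancement it records:

```python
# Illustrative restatement of the crash-recovery invariant discussed above.
def crash_recovery_ok(persisted_slot_lsn, flushed_wal_lsn):
    # A slot restored from its on-disk state file must point at (or before)
    # WAL that survived the crash; otherwise the consumer would skip records.
    return persisted_slot_lsn <= flushed_wal_lsn

# Persisting later means persisting a *more advanced* position, but since
# slots only advance past WAL that has already been durably written, the
# invariant is preserved by the reordering.
print(crash_recovery_ok(persisted_slot_lsn=1000, flushed_wal_lsn=1500))
```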

Aasma's assertion "as far as I could tell there is no reason for this to happen early" is the key design claim. If true, this is essentially free — a one-line reordering that tightens the coupling between observed slot positions and WAL retention decisions made in the same checkpoint.

Implications and Subtleties

Several aspects are worth flagging for reviewers; the concrete ones are collected in the open questions below.

Design Decision Summary

The patch is a narrow optimization with a clear rationale: align the observation point of replication slot positions with the decision point for WAL segment recycling, eliminating an artifact of the spread-checkpoint design that causes chronic WAL retention bloat. The change is minimal (reordering within CheckpointGuts()), has no on-disk format implications, and does not change recovery semantics.

The weight of this proposal depends on (a) confirming no subtle ordering dependency was overlooked — particularly around logical decoding snapshots and slot invalidation interplay — and (b) whether the community considers this a bug fix worth backpatching or strictly a master-only improvement. Given that it changes checkpoint-internal ordering, master-only is the likely outcome.

Open Questions for the Thread

  1. Is there any historical reason (perhaps from the original slot introduction in 9.4) for the early placement that isn't documented in comments?
  2. Should ReplicationSlotsComputeRequiredLSN() itself be called later / recomputed, independent of when slot state is persisted? These are two separable concerns: persisting slot files vs. observing slot LSNs for retention.
  3. Does this interact with pg_basebackup-style operations that might read slot state during a running checkpoint?