Fix the Race Condition for Updating Slot Minimum LSN
Core Problem
This thread addresses a race condition in PostgreSQL's replication slot WAL reservation mechanism that can lead to premature WAL removal and subsequent slot invalidation. The bug exists in the interaction between slot creation, slot advancement, and checkpoint processing.
Architectural Context
PostgreSQL's replication slots guarantee that WAL files required by a consumer (logical or physical replication) are retained. The system maintains a global "minimum LSN" (kept in shared memory as XLogCtl->replicationSlotMinLSN) that represents the oldest WAL position any slot still needs. During checkpoints, WAL segments older than this minimum are eligible for removal.
The minimum LSN is computed by ReplicationSlotsComputeRequiredLSN(), which scans all slots and takes the minimum of their restart_lsn values. Separately, XLogSetReplicationSlotMinimumLSN() atomically updates the global minimum LSN value that checkpoints consult via XLogGetReplicationSlotMinimumLSN().
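The scan described above can be sketched as a small model. This is not the PostgreSQL source: ModelSlot and model_compute_required_lsn() are hypothetical names, and the slot struct is reduced to the two fields that matter here. The point it illustrates is that a slot whose restart_lsn is still InvalidXLogRecPtr contributes nothing to the minimum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;                    /* WAL positions are 64-bit values */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical, stripped-down slot; the real ReplicationSlot has many more fields. */
typedef struct ModelSlot
{
    bool        in_use;
    XLogRecPtr  restart_lsn;
} ModelSlot;

/*
 * Return the oldest valid restart_lsn across all in-use slots, or
 * InvalidXLogRecPtr when no slot currently reserves WAL.
 */
static XLogRecPtr
model_compute_required_lsn(const ModelSlot *slots, size_t nslots)
{
    XLogRecPtr  min_required = InvalidXLogRecPtr;

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++)
    {
        XLogRecPtr  restart_lsn = slots[i].restart_lsn;

        /* Unused slots and slots without a reservation are skipped. */
        if (!slots[i].in_use || restart_lsn == InvalidXLogRecPtr)
            continue;

        if (min_required == InvalidXLogRecPtr || restart_lsn < min_required)
            min_required = restart_lsn;
    }

    return min_required;
}
```

In this model, a slot that has been created but has not yet stored its restart_lsn is invisible to the scan, which is the window the race described in this thread depends on.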
The Race Condition
The race involves three concurrent operations:
- Backend A: Creating a new slot, specifically in ReplicationSlotReserveWal(), where it determines and sets the slot's initial restart_lsn.
- Backend B: Advancing an existing slot and calling ReplicationSlotsComputeRequiredLSN() followed by XLogSetReplicationSlotMinimumLSN().
- Checkpoint process: Reading the global minimum LSN to determine which WAL to remove.
The dangerous interleaving:
| Step | Action | State |
|---|---|---|
| 1 | Backend A creates slot s, determines its restart_lsn = LSN_old but hasn't written it yet | Global min LSN is stale |
| 2 | Backend B advances slot advtest to LSN_new (much newer), then calls ReplicationSlotsComputeRequiredLSN() | This scans slots; slot s either has InvalidXLogRecPtr (skipped) or hasn't been updated yet |
| 3 | Backend A writes restart_lsn = LSN_old to slot s and calls ReplicationSlotsComputeRequiredLSN() → sets global min to LSN_old | Correct momentarily |
| 4 | Backend B's XLogSetReplicationSlotMinimumLSN() executes after Backend A's, overwriting global min with LSN_new | Global min now too recent |
| 5 | Checkpoint reads global min = LSN_new, removes WAL segments before it | WAL needed by slot s at LSN_old is removed |
| 6 | Slot s is invalidated because its required WAL no longer exists | Data loss for consumer |
The fundamental issue is a TOCTOU (time-of-check-time-of-use) problem: the computation of the minimum LSN and the update of the global minimum are not atomic with respect to slot restart_lsn modifications.
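To make the interleaving concrete, the following sketch replays steps 1 through 4 deterministically in a single thread. It is a toy model, not PostgreSQL code: ModelState, model_scan() and model_replay_race() are hypothetical names, and the two slots stand in for advtest and s from the table. Each backend's scan result is held in a local variable, which models the gap between computing the minimum and publishing it.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

enum { SLOT_ADVTEST = 0, SLOT_S = 1, NSLOTS = 2 };

/* Hypothetical shared state: one restart_lsn per slot plus the global minimum. */
typedef struct ModelState
{
    XLogRecPtr  restart_lsn[NSLOTS];
    XLogRecPtr  global_min;
} ModelState;

/* Minimum over valid restart_lsn values; slots without a reservation are skipped. */
static XLogRecPtr
model_scan(const ModelState *st)
{
    XLogRecPtr  min_required = InvalidXLogRecPtr;

    for (int i = 0; i < NSLOTS; i++)
    {
        XLogRecPtr  lsn = st->restart_lsn[i];

        if (lsn == InvalidXLogRecPtr)
            continue;
        if (min_required == InvalidXLogRecPtr || lsn < min_required)
            min_required = lsn;
    }
    return min_required;
}

/* Replay steps 1-4 of the table and return the state a checkpoint would then see. */
static ModelState
model_replay_race(XLogRecPtr lsn_old, XLogRecPtr lsn_new)
{
    ModelState  st = { { lsn_old, InvalidXLogRecPtr }, lsn_old };
    XLogRecPtr  a_chosen;
    XLogRecPtr  a_computed;
    XLogRecPtr  b_computed;

    assert(lsn_old != InvalidXLogRecPtr && lsn_old < lsn_new);

    /* Step 1: Backend A picks LSN_old for slot s but has not stored it yet. */
    a_chosen = lsn_old;

    /* Step 2: Backend B advances advtest and scans; slot s is still invalid. */
    st.restart_lsn[SLOT_ADVTEST] = lsn_new;
    b_computed = model_scan(&st);

    /* Step 3: Backend A stores its restart_lsn, scans, and publishes LSN_old. */
    st.restart_lsn[SLOT_S] = a_chosen;
    a_computed = model_scan(&st);
    st.global_min = a_computed;

    /* Step 4: Backend B publishes its stale result, overwriting Backend A's. */
    st.global_min = b_computed;

    return st;
}
```

Calling model_replay_race(100, 900) leaves slot s at 100 while the global minimum is 900, so a checkpoint consulting the global minimum would consider WAL between 100 and 900 removable even though slot s still needs it.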
Proposed Solution
Approach: Serialization via ReplicationSlotControlLock
The patch applies the same pattern used in commit 2a5225b (which fixed an analogous race for slot_xmin updates):
- Acquire ReplicationSlotControlLock in exclusive mode when updating slot.restart_lsn during WAL reservation in ReplicationSlotReserveWal().
- Place XLogSetReplicationSlotMinimumLSN() under ReplicationSlotControlLock protection: specifically, the lock must be held from the point where ReplicationSlotsComputeRequiredLSN() scans slots through the point where XLogSetReplicationSlotMinimumLSN() writes the global minimum.
This serialization ensures that:
- If Backend A is writing a new restart_lsn, Backend B's ReplicationSlotsComputeRequiredLSN() either runs after the write and sees the new value, or completes both its scan and its XLogSetReplicationSlotMinimumLSN() call before Backend A can write, so it can no longer overwrite a correct older minimum afterward.
- The global minimum LSN can never be advanced past a value that a concurrent slot creation is trying to reserve.
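The effect of the serialization can be sketched with a toy model in which each critical section runs to completion before the other backend may enter. This is an illustration under assumptions, not the patch itself: ModelState, model_lock_acquire(), backend_a_reserve() and backend_b_advance() are hypothetical names, and a boolean flag stands in for ReplicationSlotControlLock. The model does not cover a checkpoint running between the two critical sections.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

enum { SLOT_ADVTEST = 0, SLOT_S = 1, NSLOTS = 2 };

typedef struct ModelState
{
    XLogRecPtr  restart_lsn[NSLOTS];
    XLogRecPtr  global_min;
    bool        lock_held;      /* stand-in for an exclusive lock */
} ModelState;

static ModelState
model_init(XLogRecPtr advtest_lsn)
{
    ModelState  st = { { advtest_lsn, InvalidXLogRecPtr }, advtest_lsn, false };

    return st;
}

/* Single-threaded model: acquiring a held lock would be a bug in the replay. */
static void model_lock_acquire(ModelState *st) { assert(!st->lock_held); st->lock_held = true; }
static void model_lock_release(ModelState *st) { assert(st->lock_held); st->lock_held = false; }

/* Scan and publish in one step; callers must hold the lock throughout. */
static void
model_compute_and_set_locked(ModelState *st)
{
    XLogRecPtr  min_required = InvalidXLogRecPtr;

    assert(st->lock_held);
    for (int i = 0; i < NSLOTS; i++)
    {
        XLogRecPtr  lsn = st->restart_lsn[i];

        if (lsn == InvalidXLogRecPtr)
            continue;
        if (min_required == InvalidXLogRecPtr || lsn < min_required)
            min_required = lsn;
    }
    st->global_min = min_required;
}

/* Backend A: store the new slot's restart_lsn and recompute under the lock. */
static void
backend_a_reserve(ModelState *st, XLogRecPtr lsn_old)
{
    model_lock_acquire(st);
    st->restart_lsn[SLOT_S] = lsn_old;
    model_compute_and_set_locked(st);
    model_lock_release(st);
}

/* Backend B: advance advtest, then scan and publish without dropping the lock. */
static void
backend_b_advance(ModelState *st, XLogRecPtr lsn_new)
{
    st->restart_lsn[SLOT_ADVTEST] = lsn_new;
    model_lock_acquire(st);
    model_compute_and_set_locked(st);
    model_lock_release(st);
}
```

Whichever backend enters its critical section first, once both have finished the published minimum equals slot s's restart_lsn: running backend_a_reserve(&st, 100) and backend_b_advance(&st, 900) in either order on a state from model_init(100) ends with global_min at 100. The stale overwrite from step 4 has no counterpart here because a scan result is never published after the lock has been released.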
Files Modified
- slot.c: Adding exclusive ReplicationSlotControlLock acquisition around restart_lsn updates in ReplicationSlotReserveWal().
- slotsync.c: Similar protection for slot synchronization paths that update restart_lsn.
Design Tradeoffs
Lock contention: ReplicationSlotControlLock is already used for slot creation/deletion and xmin computation. Adding another exclusive acquisition during WAL reservation increases contention, but:
- Slot creation is infrequent relative to normal operations
- The critical section is short (just the LSN write + computation)
- This mirrors the accepted approach from commit 2a5225b
Alternative not taken: One could imagine a version counter or retry loop, but the LWLock approach is simpler, follows the xmin precedent, and should cost little because slots are created infrequently.
Analysis of copy_replication_slot() Safety
Surya Poondla raises an important question about whether copy_replication_slot() in slotfuncs.c has the same vulnerability. The analysis concludes it does not, for two reasons:
- No InvalidXLogRecPtr window: When copying a slot, create_logical_replication_slot() is called with a valid src_restart_lsn. Inside CreateInitDecodingContext(), because restart_lsn is already valid, ReplicationSlotReserveWal() is skipped entirely. The slot's restart_lsn is set directly to src_restart_lsn.
- Monotonicity guarantee: The code errors out if copy_restart_lsn < src_restart_lsn, so the write never moves restart_lsn backward. Any concurrent scan will see a valid LSN that is at least as old as what the source slot had.
This is a valid analysis — the race requires a window where a slot exists but has InvalidXLogRecPtr as its restart_lsn, causing scanners to skip it. The copy path never creates such a window.
Relationship to Prior Work
This fix is directly related to:
- Commit 2a5225b: Fixed the analogous race for effective_xmin/effective_catalog_xmin updates, establishing the pattern of using ReplicationSlotControlLock for serialization.
- The referenced thread about invalidation of newly created slots: The broader investigation that uncovered this specific race condition.
The consistency of approach (same lock, same pattern) is architecturally sound — it creates a uniform contract that any modification to slot state that affects global minima must be serialized against the computation of those minima.