2026-06-04 · claude-opus-4-6

Incremental Update: Fix Race in ReplicationSlotRelease for Ephemeral Slots

Main Patch Committed

Fujii committed the original fix for the ephemeral slot race condition. He decided against including a regression test, stating that the proposed injection-point test was "a bit too narrow" and not worth adding. This settles the testing debate from the prior analysis along the lines of Hou's reservations (such tests have limited long-term value).

New Adjacent Bug Identified: `drop_local_obsolete_slots`

Xuneng Zhou identified a related race condition in drop_local_obsolete_slots() (used in logical replication slot synchronization). The same root cause applies: after ReplicationSlotDropAcquired(false) frees a slot's shared-memory cell, subsequent code continues to reference the now-freed cell. In this case, the post-drop code reads (rather than writes) from the slot — specifically for unlock operations and log messages.

Severity assessment by Zhou: Less severe than the original bug because reads don't corrupt other backends' slot state. However, it can still produce incorrect log messages and unlock operations that reference a completely different slot that has reused the same array cell.

Zhou provided both a reproduction via injection points and a fix patch. The fix captures necessary slot identity information (slot name, database OID) before the drop, so post-drop code uses local copies rather than dereferencing the freed shared-memory cell.
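The capture-before-drop idea can be sketched in isolation. The block below is a minimal illustration, not Zhou's patch: IdentSlot, ident_drop_and_report() and the simulate_reuse flag are hypothetical names invented for this example, and the log text is only illustrative. It shows that once the name and database OID are copied to locals, the post-drop message stays correct even if the cell is reused immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified slot cell; not PostgreSQL's ReplicationSlot. */
typedef struct IdentSlot
{
    bool         in_use;
    char         name[64];
    unsigned int database;      /* stand-in for the database OID */
} IdentSlot;

/* Drop the slot, then build the log line from copies taken beforehand. */
static void
ident_drop_and_report(IdentSlot *slot, bool simulate_reuse,
                      char *msg, size_t msglen)
{
    char         slot_name[64];
    unsigned int slot_database;

    assert(slot != NULL && slot->in_use);

    /* Capture identity while the cell still belongs to us. */
    memcpy(slot_name, slot->name, sizeof(slot_name));
    slot_database = slot->database;

    /* "Drop": the cell is free from here on. */
    memset(slot, 0, sizeof(*slot));

    /* Another backend may claim the cell before we get any further. */
    if (simulate_reuse)
    {
        slot->in_use = true;
        strcpy(slot->name, "test_slot_created");
        slot->database = 99999;
    }

    /* Post-drop code uses only the local copies, never *slot. */
    snprintf(msg, msglen,
             "dropped replication slot \"%s\" of database with OID %u",
             slot_name, slot_database);
}
```

Copying the fields costs a few bytes of stack and removes any dependence on what the shared cell holds after the drop.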

Fujii's Review Questions on the Adjacent Fix

Fujii raised two safety concerns about the proposed fix for drop_local_obsolete_slots:

1. Is reading slot_database before acquiring the database lock safe? The patch saves local_slot->data.database into a local variable before certain lock acquisitions, raising ordering concerns.
2. Is local_sync_slot_required() safe to access local_slot without holding a lock? This is a broader question about whether the existing code (pre-patch) already has an unprotected read problem on the slot pointer that could read stale/wrong data if another backend recycles the cell during the check.

These questions suggest potential scope expansion — the race window may be wider than Zhou's patch addresses.

Reproduction Confirmed

Zhou independently reproduced the original ephemeral slot bug using lldb on Mac, demonstrating that after ReplicationSlotDropAcquired(), the slot cell had already been reused by a completely different slot (test_slot_created, persistent, with active_proc = 126) — confirming the corruption scenario described in the prior analysis is real and observable.

2026-06-01 · claude-opus-4-6

Fix Race in ReplicationSlotRelease for Ephemeral Slots

Core Problem

The bug exists in ReplicationSlotRelease(), a function responsible for releasing a backend's hold on a replication slot. The function has a critical use-after-free-style race condition specific to ephemeral slots: slots in the RS_EPHEMERAL state, which are dropped automatically on release or on error. A slot is in this state while it is still being created and has not yet been marked persistent (for example, partway through pg_create_logical_replication_slot()). This is distinct from a temporary slot, which lives until its session ends.

The Race Condition in Detail

The execution flow in ReplicationSlotRelease() for ephemeral slots is:

1. Call ReplicationSlotDropAcquired(), which marks the slot's shared memory entry as free in the ReplicationSlotCtl->replication_slots[] array.
2. Execute common post-release cleanup code that dereferences the now-freed slot pointer to update shared memory fields like active_proc (set to NULL) and effective_xmin (potentially reset).

Between steps 1 and 2, another backend can immediately allocate the same slot array entry for a completely new, unrelated replication slot. The original backend then blindly writes to that memory, corrupting the new slot's state:

Setting active_proc to NULL makes the new slot appear unacquired, allowing a third backend to spuriously acquire it — violating the single-owner invariant.
Writing invalid effective_xmin values could affect vacuum's visibility calculations for the new slot's consumer.

This is a stale-pointer hazard in shared-memory slot management, in the same family as a TOCTOU (time-of-check to time-of-use) race: the pointer was valid when the backend acquired the slot, but not by the time the cleanup code uses it. The slot array is a fixed-size shared memory structure (max_replication_slots entries), and free entries are found by a linear scan, so a recently freed entry is likely to be reused quickly under load.
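The interleaving can be replayed deterministically in a single thread. The following is a toy model under stated assumptions, not PostgreSQL code: RaceSlot, race_slot_create() and the other race_* names are hypothetical, the owner field stands in for the slot's ownership marker, and locking is omitted entirely. It only demonstrates why a first-free linear scan hands the just-freed cell to the next creator, and what the stale write then does.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define RACE_MAX_SLOTS 4        /* stand-in for max_replication_slots */

/* Hypothetical, simplified slot cell. */
typedef struct RaceSlot
{
    bool in_use;
    char name[64];
    int  owner;                 /* 0 means "not acquired" */
} RaceSlot;

static RaceSlot race_slots[RACE_MAX_SLOTS];

/* Claim the first free cell found by a linear scan. */
static RaceSlot *
race_slot_create(const char *name, int owner)
{
    for (int i = 0; i < RACE_MAX_SLOTS; i++)
    {
        RaceSlot *s = &race_slots[i];

        if (!s->in_use)
        {
            s->in_use = true;
            strncpy(s->name, name, sizeof(s->name) - 1);
            s->name[sizeof(s->name) - 1] = '\0';
            s->owner = owner;
            return s;
        }
    }
    return NULL;                /* array is full */
}

/* Free the cell by zeroing it. */
static void
race_slot_drop(RaceSlot *s)
{
    memset(s, 0, sizeof(*s));
}

/* Replays the bad interleaving; returns true if the new slot was corrupted. */
static bool
race_replay_stale_write(void)
{
    RaceSlot *mine;
    RaceSlot *theirs;

    memset(race_slots, 0, sizeof(race_slots));

    mine = race_slot_create("ephemeral_slot", 100);
    assert(mine != NULL);

    race_slot_drop(mine);                                   /* step 1 */
    theirs = race_slot_create("test_slot_created", 200);    /* reuse  */
    mine->owner = 0;                                        /* step 2 */

    /* Same cell, and the new owner's claim has been wiped. */
    return theirs == mine && theirs->owner == 0;
}
```

In this model race_replay_stale_write() reports corruption because both pointers refer to cell 0, which is the situation the debugger session in the later update observed on a real server.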

Architectural Significance

Replication slots are critical infrastructure: they prevent WAL removal and catalog cleanup that consumers need. Corrupting a slot's effective_xmin could allow premature vacuum of rows still needed by a logical subscriber, causing data loss or replication failures. Corrupting active_proc violates mutual exclusion, potentially allowing concurrent access to slot state that assumes single-writer semantics.

Proposed Solution

The fix is structurally simple: wrap the post-drop shared memory cleanup code in a conditional that only executes for non-ephemeral slots. For ephemeral slots, once ReplicationSlotDropAcquired() returns, the function must not touch the slot's shared memory state at all because the slot no longer conceptually exists.

if (!slot_was_ephemeral)
{
    /* Safe to update shared memory — slot still exists */
    SpinLockAcquire(&slot->mutex);
    slot->active_pid = 0;
    SpinLockRelease(&slot->mutex);
    /* ... effective_xmin updates ... */
}

This is correct because:

For ephemeral slots: ReplicationSlotDropAcquired() already handles all necessary cleanup internally (zeroing out the slot, marking it free) before releasing the spinlock. No further shared-memory operations are needed.
For persistent/temporary (non-ephemeral) slots: The slot remains allocated after release; only the active_pid ownership is relinquished. The post-release cleanup is still necessary and safe.
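The two cases above can be put side by side in a small stand-alone model. This is a sketch assuming a simplified slot structure, not the committed patch: RelSlot, rel_release() and rel_drop_acquired() are hypothetical names, the owner field stands in for the ownership marker, and spinlocks are left out. The point it illustrates is that the ephemeral check is made before the drop and that nothing reads or writes the cell afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef enum RelPersistency
{
    REL_PERSISTENT,
    REL_EPHEMERAL,
    REL_TEMPORARY
} RelPersistency;

/* Hypothetical, simplified slot cell. */
typedef struct RelSlot
{
    bool           in_use;
    RelPersistency persistency;
    int            owner;       /* 0 means "not acquired" */
} RelSlot;

static RelSlot *rel_my_slot = NULL;     /* slot held by this backend */

/* Free the cell; after this call it may belong to someone else. */
static void
rel_drop_acquired(RelSlot *slot)
{
    memset(slot, 0, sizeof(*slot));
}

static void
rel_release(void)
{
    RelSlot *slot = rel_my_slot;
    bool     was_ephemeral;

    assert(slot != NULL && slot->in_use);

    /* Decide before the drop; afterwards the cell cannot be trusted. */
    was_ephemeral = (slot->persistency == REL_EPHEMERAL);

    if (was_ephemeral)
        rel_drop_acquired(slot);    /* cell is gone; do not touch it again */
    else
        slot->owner = 0;            /* slot survives; just give up ownership */

    rel_my_slot = NULL;
}
```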

Backpatching Considerations

Fujii confirmed this should be backpatched to all supported branches, which suggests the bug is present in every supported release rather than being a recent regression. The fix is minimal and low-risk, making it appropriate for stable branch inclusion.

Testing Debate

A secondary discussion arose about whether to add a regression test:

Srinath argues for adding an injection-point-based test. His reasoning: the fix is just an else branch, which future refactoring could easily remove or restructure, silently reintroducing the corruption. An injection point between ReplicationSlotDropAcquired() and the cleanup code would make the race deterministic and testable.
Hou is ambivalent: injection points themselves can be invalidated by refactoring that moves code around, limiting long-term value. The case is also extremely rare in practice.

The decision was deferred to Fujii as the committer.
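To make the trade-off concrete, here is a rough model of what such a test does. PostgreSQL has a real injection point facility; the function pointer below is only a stand-in for it, and HookSlot, hook_release(), hook_reuse_cell() and the guard_cleanup flag are all hypothetical names for this sketch. The hook runs at the exact spot between the drop and the cleanup, which is also why the test is tied to code layout: move the cleanup and the hook no longer sits in the race window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified slot cell. */
typedef struct HookSlot
{
    bool in_use;
    bool ephemeral;
    int  owner;                 /* 0 means "not acquired" */
} HookSlot;

/* Hypothetical test hook, run right after the drop; NULL in normal use. */
static void (*hook_after_drop)(HookSlot *cell) = NULL;

static void
hook_release(HookSlot *slot, bool guard_cleanup)
{
    bool was_ephemeral;

    assert(slot != NULL && slot->in_use);
    was_ephemeral = slot->ephemeral;

    if (was_ephemeral)
    {
        memset(slot, 0, sizeof(*slot));     /* drop: cell is now free */
        if (hook_after_drop != NULL)
            hook_after_drop(slot);          /* force the race window open */
    }

    /* guard_cleanup = true models the fix, false models the old code. */
    if (!guard_cleanup || !was_ephemeral)
        slot->owner = 0;
}

/* Plays the role of the second backend: reuse the freed cell. */
static void
hook_reuse_cell(HookSlot *cell)
{
    cell->in_use = true;
    cell->ephemeral = false;
    cell->owner = 200;
}

/* Returns the owner of the reused cell after the release completes. */
static int
hook_owner_after_release(bool guard_cleanup)
{
    HookSlot cell = { true, true, 100 };

    hook_after_drop = hook_reuse_cell;
    hook_release(&cell, guard_cleanup);
    hook_after_drop = NULL;

    return cell.owner;
}
```

With the guard in place the second backend keeps ownership (200); without it the owner is reset to 0, which is the regression Srinath wanted a test to catch.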

Key Design Insight

The fundamental issue is that ReplicationSlotRelease() conflated two distinct operations into one code path:

Release ownership of a slot that continues to exist (persistent/temporary slots)
Destroy a slot that should cease to exist (ephemeral slots)

These have fundamentally different post-conditions regarding shared memory validity. The fix correctly separates these paths. A more defensive future approach might be to NULL out the local slot pointer immediately after the drop call to make any subsequent dereference a clear programming error (crash rather than silent corruption).
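One way to express that defensive idea is to have the drop take a pointer to the caller's pointer and clear it. This is a sketch of the pattern only, with hypothetical names (NullSlot, null_drop_and_forget()); it is not something proposed in the thread in this form. Dereferencing a null pointer is undefined behavior in C, but on typical platforms it faults immediately, which is far easier to diagnose than a silent write into another slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified slot cell. */
typedef struct NullSlot
{
    bool in_use;
    int  owner;
} NullSlot;

/* Drop through a pointer-to-pointer so the caller's handle is cleared too. */
static void
null_drop_and_forget(NullSlot **slotp)
{
    assert(slotp != NULL && *slotp != NULL);

    memset(*slotp, 0, sizeof(**slotp));     /* free the cell */
    *slotp = NULL;                          /* caller can no longer reach it */
}
```

A caller would write null_drop_and_forget(&slot); any later slot->owner access then fails at the point of misuse instead of corrupting a reused cell.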
