2026-05-22 · claude-opus-4-6

Deadlock Detector Fails to Activate on Hot Standby Replica

Core Problem

This thread identifies a subtle but serious bug in PostgreSQL's hot standby recovery: deadlock detection is never triggered, so the startup process loops indefinitely without making recovery progress.

The Architectural Context

On a hot standby replica, the startup process replays WAL records to keep the replica up to date. When replaying certain WAL records (specifically XLOG_HEAP2_PRUNE_*), the startup process may need exclusive access to a buffer page via LockBufferForCleanup(). If a backend process on the replica holds a pin on that buffer, a deadlock can occur: the startup process waits for the backend to release the pin, while the backend waits for a lock held by the startup process.

PostgreSQL's normal resolution mechanism works as follows:

ResolveRecoveryConflictWithBufferPin() arms a timeout that expires after deadlock_timeout (default 1000ms; the report uses 3000ms for illustration)
When the timeout fires, the startup process sends the conflicting backend a signal to run deadlock detection
The backend detects the deadlock and cancels itself, releasing the pin

The Bug Mechanism

The bug was introduced in PostgreSQL 15 with the log_startup_progress_interval feature (providing periodic logging during long startup operations). The root cause lies in an optimization in timeout.c where setitimer() is not called if the closest deadline for active timeouts is later than the already-scheduled timer interrupt.

The failure sequence:

Startup progress timer is set to 1000ms (via log_startup_progress_interval)
Startup process enters ResolveRecoveryConflictWithBufferPin() and registers deadlock_timeout at 3000ms
The progress timer's SIGALRM fires at ~1000ms (the timeout was disabled but the real OS timer was not reset due to the optimization)
handle_sig_alarm() sets the process latch unconditionally — even though no active timeout has been reached
ProcWaitForSignal() returns due to the latch being set
ResolveRecoveryConflictWithBufferPin() disables all timeouts and returns without sending the deadlock-check signal
LockBufferForCleanup() sees the buffer is still pinned and calls ResolveRecoveryConflictWithBufferPin() again
The new deadlock_timeout is set to 3000ms, but the OS timer (from the optimization) will fire in ~2000ms
SIGALRM fires at 2000ms — again before deadlock_timeout is reached
This cycle repeats indefinitely: each spurious SIGALRM wakes the process early, the timeouts are disabled, and deadlock_timeout is re-armed from scratch on the next call, so it is never reached
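The sequence above can be reproduced in a small virtual-time model. The sketch below is not PostgreSQL code: SimState, sim_run and the other sim_* names are hypothetical, only one logical timeout is tracked at a time, and the handler is assumed to re-arm the modelled OS timer for the nearest active timeout when nothing is due. The exact wake-up times it produces are artifacts of the model; the point is only that the deadlock timeout never fires while the skip is in place.

```c
#include <stdbool.h>

#define NO_DEADLINE (-1L)

/* Hypothetical model state; all times are virtual milliseconds. */
typedef struct {
    long now_ms;        /* virtual clock */
    long os_timer_ms;   /* when the modelled OS timer raises SIGALRM */
    long timeout_ms;    /* deadline of the single active logical timeout */
    bool timeout_fired; /* set when the logical timeout is actually reached */
} SimState;

/* Arm a logical timeout. Unless rearm_always is set, the OS timer is left
 * untouched when it is already due earlier than the new deadline (the
 * optimization described in the text). */
static void sim_enable_timeout(SimState *s, long delay_ms, bool rearm_always)
{
    s->timeout_ms = s->now_ms + delay_ms;
    s->timeout_fired = false;
    if (rearm_always || s->os_timer_ms == NO_DEADLINE ||
        s->os_timer_ms > s->timeout_ms)
        s->os_timer_ms = s->timeout_ms;
}

/* Disable the logical timeout; the OS timer is deliberately left pending. */
static void sim_disable_timeout(SimState *s)
{
    s->timeout_ms = NO_DEADLINE;
}

/* Jump to the next SIGALRM and run the modelled handler. Returning from this
 * function stands for the latch being set unconditionally. */
static void sim_wait_for_signal(SimState *s)
{
    if (s->os_timer_ms == NO_DEADLINE)
        return; /* nothing armed in this model */
    s->now_ms = s->os_timer_ms;
    if (s->timeout_ms != NO_DEADLINE && s->timeout_ms <= s->now_ms) {
        s->timeout_fired = true;
        s->timeout_ms = NO_DEADLINE;
        s->os_timer_ms = NO_DEADLINE;
    } else {
        /* Nothing due: re-arm for the nearest active timeout, if any. */
        s->os_timer_ms = s->timeout_ms;
    }
}

/* Run `rounds` iterations of the retry loop and return how many times the
 * deadlock timeout was actually reached. */
int sim_run(long progress_ms, long deadlock_ms, int rounds, bool rearm_always)
{
    SimState s = { 0, NO_DEADLINE, NO_DEADLINE, false };
    int deadlock_checks = 0;

    /* The progress timeout is armed and then disabled; its OS timer stays. */
    sim_enable_timeout(&s, progress_ms, rearm_always);
    sim_disable_timeout(&s);

    for (int i = 0; i < rounds; i++) {
        sim_enable_timeout(&s, deadlock_ms, rearm_always);
        sim_wait_for_signal(&s);
        if (s.timeout_fired)
            deadlock_checks++;
        sim_disable_timeout(&s); /* "disables all timeouts and returns" */
    }
    return deadlock_checks;
}
```

In this model, sim_run(1000, 3000, 10, false) never reaches the deadlock timeout, while passing true for rearm_always (always re-arming the timer) lets it fire on every round.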

The critical insight is that the setitimer optimization (commit 09cf1d52267) avoids re-setting the OS timer when a new timeout's deadline is further in the future than the current timer. This means stale timer firings can occur after a timeout is disabled, creating "phantom" SIGALRMs that wake the process without any timeout actually being reached.

Impact

The startup process loops forever in LockBufferForCleanup() making no recovery progress
WAL replay stalls completely
The only escape is a finite max_standby_streaming_delay (which eventually cancels conflicting queries), but many production deployments set it to -1 (wait forever) to avoid query cancellation
Affects PostgreSQL 15+ consistently

Proposed Solutions

The author identifies three possible approaches:

Solution 1: Always call `setitimer()` (Demo Patch Provided)

Remove the optimization that skips setitimer() calls, so that the OS timer always reflects the closest active timeout. This is simple, but it costs extra syscalls and does not protect the caller against spurious latch wakeups from any other source.

Solution 2: Redesign `LockBufferForCleanup` Logic

Make the buffer cleanup waiting logic robust against spurious wakeups. This is architecturally more correct (spurious wakeups should always be handled) but requires more invasive changes.

Solution 3: Only `SetLatch` When a Timeout Actually Fires

Modify handle_sig_alarm() to not set the latch if no timeout has actually been reached. This is the simplest fix but conflicts with the design philosophy that any SIGALRM should wake the process.
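A minimal sketch of the difference between the current handler behaviour and Solution 3, written as a standalone model rather than as a patch to timeout.c. ModelTimeout, model_handle_sig_alarm and the latch_only_when_due switch are hypothetical names introduced for illustration; the return value stands for "the latch would be set".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry in the active-timeout list. */
typedef struct {
    bool active;
    long fire_at_ms;
} ModelTimeout;

/* Deactivate every timeout that is due at now_ms; report whether any was. */
static bool model_fire_due_timeouts(ModelTimeout *timeouts, size_t n,
                                    long now_ms)
{
    bool any_fired = false;

    for (size_t i = 0; i < n; i++) {
        if (timeouts[i].active && timeouts[i].fire_at_ms <= now_ms) {
            timeouts[i].active = false;
            any_fired = true;
        }
    }
    return any_fired;
}

/* Modelled SIGALRM handler. Returns true when the latch would be set.
 * latch_only_when_due = false: latch is set on every SIGALRM (as described
 *                              for the current code).
 * latch_only_when_due = true:  latch is set only if a timeout was reached
 *                              (the idea behind Solution 3). */
bool model_handle_sig_alarm(ModelTimeout *timeouts, size_t n, long now_ms,
                            bool latch_only_when_due)
{
    bool fired = model_fire_due_timeouts(timeouts, n, now_ms);

    if (latch_only_when_due)
        return fired;
    return true;
}
```

With a single timeout due at 3000ms, a stale SIGALRM at 1000ms sets the latch in the first mode but not in the second, so the waiter would stay asleep until the timeout is genuinely reached.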

The v2 Patch Approach (Solution 2 variant)

The second email proposes a patch that makes ResolveRecoveryConflictWithBufferPin() resilient to spurious wakeups by:

After ProcWaitForSignal() returns, checking whether the buffer is still pinned by another backend
If still pinned and no timeout fired, continuing to wait (looping back to ProcWaitForSignal()) without resetting timeouts
Only breaking out of the loop when either the pin is released or a timeout genuinely fires
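The loop described above can be sketched as follows. This is an illustrative model, not the v2 patch itself: the wakeups are replayed from a script instead of coming from ProcWaitForSignal(), and WaitModel, WakeReason and model_resolve_conflict are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { WAKE_SPURIOUS, WAKE_PIN_RELEASED, WAKE_TIMEOUT } WakeReason;
typedef enum {
    RESULT_PIN_RELEASED,
    RESULT_TIMEOUT,
    RESULT_SCRIPT_EXHAUSTED
} WaitResult;

/* Hypothetical model of the waiter's view of the world. */
typedef struct {
    const WakeReason *script; /* scripted sequence of wakeups */
    size_t len;
    size_t pos;
    bool pinned_by_other;     /* does another backend still hold a pin? */
    bool timeout_fired;       /* has a real timeout been reached? */
    int timeouts_armed;       /* how many times the timeouts were armed */
} WaitModel;

/* Stand-in for a latch wait: consume one scripted wakeup and apply its
 * side effect. Returns false when the script has no more wakeups. */
static bool model_wait_for_signal(WaitModel *m)
{
    if (m->pos >= m->len)
        return false;
    switch (m->script[m->pos++]) {
    case WAKE_PIN_RELEASED:
        m->pinned_by_other = false;
        break;
    case WAKE_TIMEOUT:
        m->timeout_fired = true;
        break;
    case WAKE_SPURIOUS:
        break; /* latch set, nothing actually changed */
    }
    return true;
}

/* Arm the timeouts once, then keep waiting until the pin is released or a
 * timeout genuinely fires; spurious wakeups do not reset anything. */
WaitResult model_resolve_conflict(WaitModel *m)
{
    m->timeouts_armed++;
    for (;;) {
        if (!model_wait_for_signal(m))
            return RESULT_SCRIPT_EXHAUSTED;
        if (!m->pinned_by_other)
            return RESULT_PIN_RELEASED;
        if (m->timeout_fired)
            return RESULT_TIMEOUT;
        /* Spurious wakeup: loop back without re-arming the timeouts. */
    }
}
```

The important property is that timeouts_armed stays at 1 no matter how many spurious wakeups precede the real event, which is what lets the deadline eventually be reached.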

Key Technical Insights

The `setitimer` Optimization Problem

The optimization in timeout.c (commit 09cf1d52267644cdbdb734294012cf1228745aaa) avoids calling setitimer() when the new timeout deadline is later than the current OS timer deadline. The intent is to reduce syscall overhead. However, when a timeout is disabled, the OS timer is not necessarily cancelled — meaning a SIGALRM can arrive for a timeout that is no longer active. The handle_sig_alarm() handler then unconditionally sets the process latch, causing spurious wakeups.

Buffer Pin Waiter Protocol

LockBufferForCleanup() uses a protocol where only one backend can be the "pin count waiter" at a time (BM_PIN_COUNT_WAITER flag + wait_backend_pgprocno). The waiter sets itself up, then calls ProcWaitForSignal(). When the last other pinner calls UnpinBuffer(), it wakes the waiter. The current code assumes that any wakeup from ProcWaitForSignal() means either the pin was released or a timeout fired — this assumption is violated by the spurious SIGALRM.
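A simplified model of this protocol, assuming the unpinning side clears the waiter flag when only the waiter's own pin remains. The struct and function names (ModelBuffer, model_register_waiter, model_unpin, model_still_waiting) and the flag value are hypothetical; the real buffer header packs this state differently and uses locking that is omitted here.

```c
#include <stdbool.h>

#define MODEL_PIN_COUNT_WAITER 0x1u /* stand-in for BM_PIN_COUNT_WAITER */

/* Hypothetical, unlocked model of the relevant buffer header fields. */
typedef struct {
    unsigned flags;
    int refcount;      /* total pins, including the waiter's own */
    int waiter_procno; /* stand-in for wait_backend_pgprocno */
    bool waiter_woken; /* set when the unpinning side signals the waiter */
} ModelBuffer;

/* Register as the single pin-count waiter; fails if one already exists. */
bool model_register_waiter(ModelBuffer *buf, int procno)
{
    if (buf->flags & MODEL_PIN_COUNT_WAITER)
        return false;
    buf->flags |= MODEL_PIN_COUNT_WAITER;
    buf->waiter_procno = procno;
    return true;
}

/* Drop one pin; if only the waiter's pin is left, clear the flag and wake
 * the waiter. */
void model_unpin(ModelBuffer *buf)
{
    buf->refcount--;
    if ((buf->flags & MODEL_PIN_COUNT_WAITER) && buf->refcount == 1) {
        buf->flags &= ~MODEL_PIN_COUNT_WAITER;
        buf->waiter_woken = true;
    }
}

/* After any wakeup, the waiter can tell a real release from a spurious
 * wakeup by checking whether it is still registered as the waiter. */
bool model_still_waiting(const ModelBuffer *buf, int procno)
{
    return (buf->flags & MODEL_PIN_COUNT_WAITER) != 0 &&
           buf->waiter_procno == procno;
}
```

A wakeup that arrives while model_still_waiting() is still true cannot have come from the unpinning side, so under this model it must be either a timeout or a spurious wakeup.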

Fujii's Review Feedback

Fujii's review suggests a cleaner implementation:

Use BM_PIN_COUNT_WAITER and wait_backend_pgprocno to check if we're still the waiter, rather than raw refcount
Restructure the loop for clarity with explicit break conditions
Fix the misleading comment about what can wake up ProcWaitForSignal()

Design Considerations

The fundamental question is: whose responsibility is it to handle spurious wakeups?

If timeout.c guarantees no spurious SIGALRMs → Solution 1 or 3
If callers must be robust against spurious wakeups → Solution 2

The PostgreSQL convention generally follows the principle that latch waits should always recheck their conditions (similar to condition variables in threading). This argues for Solution 2, which Fujii's review reinforces by suggesting a clean loop structure that explicitly rechecks conditions after each wakeup.

However, the timeout.c optimization creating phantom wakeups is arguably a latent bug that could affect other callers too, suggesting Solution 1 or 3 may also be warranted as defense-in-depth.

Deadlock detector fails to activate on a hot standby replica

Latest Update

Incremental Update: v3 Patch Discussion

New Patch Version (v3) from Vitaly Davydov

Fujii's Review of v3
