Deadlock Detector Fails to Activate on Hot Standby Replica
Core Problem
This thread identifies a subtle but serious bug in PostgreSQL's hot standby recovery mechanism where the deadlock detection system fails to trigger, causing the startup process to loop indefinitely without making recovery progress.
The Architectural Context
On a hot standby replica, the startup process replays WAL records to keep the replica up to date. When replaying certain WAL records (specifically XLOG_HEAP2_PRUNE_*), the startup process may need exclusive access to a buffer page via LockBufferForCleanup(). If a backend process on the replica holds a pin on that buffer, a deadlock can occur: the startup process waits for the backend to release the pin, while the backend waits for a lock held by the startup process.
PostgreSQL's normal resolution mechanism works as follows:
- ResolveRecoveryConflictWithBufferPin() sets a deadlock_timeout (default 1000ms, though the report uses 3000ms for illustration)
- When the timeout fires, the startup process sends the conflicting backend a signal to run deadlock detection
- The backend detects the deadlock and cancels itself, releasing the pin
The Bug Mechanism
The bug was introduced in PostgreSQL 15 with the log_startup_progress_interval feature (providing periodic logging during long startup operations). The root cause lies in an optimization in timeout.c where setitimer() is not called if the closest deadline for active timeouts is later than the already-scheduled timer interrupt.
The failure sequence:
- Startup progress timer is set to 1000ms (via log_startup_progress_interval)
- Startup process enters ResolveRecoveryConflictWithBufferPin() and registers deadlock_timeout at 3000ms
- The progress timer's SIGALRM fires at ~1000ms (the timeout was disabled but the real OS timer was not reset due to the optimization)
- handle_sig_alarm() sets the process latch unconditionally, even though no active timeout has been reached
- ProcWaitForSignal() returns due to the latch being set
- ResolveRecoveryConflictWithBufferPin() disables all timeouts and returns without sending the deadlock-check signal
- LockBufferForCleanup() sees the buffer is still pinned and calls ResolveRecoveryConflictWithBufferPin() again
- The new deadlock_timeout is set to 3000ms, but the OS timer (from the optimization) will fire in ~2000ms
- SIGALRM fires ~2000ms later, again before deadlock_timeout is reached
- This cycle repeats indefinitely: the deadlock timeout is never reached because spurious SIGALRMs keep waking the process early
The critical insight is that the setitimer optimization (commit 09cf1d52267) avoids re-setting the OS timer when a new timeout's deadline is further in the future than the current timer. This means stale timer firings can occur after a timeout is disabled, creating "phantom" SIGALRMs that wake the process without any timeout actually being reached.
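The failure sequence can be reproduced with a small toy model. The sketch below is not PostgreSQL code: the class ToyTimeouts, the function simulate, and the always_rearm flag are hypothetical names invented for illustration, and the model assumes a single OS timer, instant processing, and times measured in milliseconds from the start of the run. It only mirrors the behavior described above: disabling a timeout leaves the OS timer armed, and a new timeout does not re-arm the timer when an earlier interrupt is already pending.

```python
class ToyTimeouts:
    """Toy model of one OS timer shared by several logical timeouts."""

    def __init__(self, always_rearm=False):
        self.active = {}         # timeout name -> absolute deadline in ms
        self.os_timer_at = None  # when the pending SIGALRM will arrive, if any
        self.always_rearm = always_rearm  # hypothetical switch, for contrast only

    def enable(self, name, now, delay):
        self.active[name] = now + delay
        self._schedule()

    def disable_all(self):
        # Forgets the logical timeouts but leaves the OS timer armed.
        self.active.clear()

    def _schedule(self):
        if not self.active:
            return
        nearest = min(self.active.values())
        pending = self.os_timer_at
        if not self.always_rearm and pending is not None and nearest >= pending:
            return  # the optimization: an earlier interrupt is already scheduled
        self.os_timer_at = nearest

    def fire(self, now):
        """Deliver SIGALRM; return the names of timeouts that really expired."""
        self.os_timer_at = None
        expired = [n for n, due in self.active.items() if due <= now]
        for n in expired:
            del self.active[n]
        self._schedule()
        return expired


def simulate(rounds, progress_ms=1000, deadlock_ms=3000, always_rearm=False):
    """Return a list of (wakeup time, deadlock timeout reached?) pairs."""
    timers = ToyTimeouts(always_rearm)
    now = 0
    timers.enable("progress", now, progress_ms)
    timers.disable_all()  # progress timeout disabled, OS timer still armed
    log = []
    for _ in range(rounds):
        timers.enable("deadlock", now, deadlock_ms)
        now = timers.os_timer_at  # sleep until the next SIGALRM
        fired = "deadlock" in timers.fire(now)
        log.append((now, fired))
        if fired:
            break  # the deadlock-check signal would be sent here
        timers.disable_all()  # spurious wakeup: tear down and retry
    return log


print(simulate(4))                     # [(1000, False), (3000, False), (4000, False), (6000, False)]
print(simulate(4, always_rearm=True))  # [(3000, True)]
```

In this model every wakeup arrives at the deadline registered in the previous round, so the freshly registered deadline is always still in the future. With the timer re-armed on every schedule call, the first wakeup coincides with the deadlock deadline.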
Impact
- The startup process loops forever in LockBufferForCleanup(), making no recovery progress
- WAL replay stalls completely
- The only escape is if max_standby_streaming_delay is set (which eventually cancels conflicting queries), but many production deployments set this to -1 (infinite) to avoid query cancellation
- Affects PostgreSQL 15+ consistently
Proposed Solutions
The author identifies three possible approaches:
Solution 1: Always call setitimer() (Demo Patch Provided)
Remove the optimization that skips setitimer() calls. This ensures the OS timer always reflects the actual closest active timeout. Simple but may have performance implications (more syscalls) and doesn't protect against other code paths that might similarly interfere.
Solution 2: Redesign LockBufferForCleanup Logic
Make the buffer cleanup waiting logic robust against spurious wakeups. This is architecturally more correct (spurious wakeups should always be handled) but requires more invasive changes.
Solution 3: Only SetLatch When a Timeout Actually Fires
Modify handle_sig_alarm() to not set the latch if no timeout has actually been reached. This is the simplest fix but conflicts with the design philosophy that any SIGALRM should wake the process.
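As a rough illustration of Solution 3, the sketch below contrasts the two handler policies. It is a toy model rather than the real handle_sig_alarm(): the function handle_alarm and its latch_only_on_expiry parameter are hypothetical, and active timeouts are represented as a plain dictionary of deadlines in milliseconds.

```python
def handle_alarm(now_ms, active, latch_only_on_expiry):
    """Toy SIGALRM handler.

    Returns (names of expired timeouts, whether the latch gets set).
    """
    expired = sorted(name for name, due in active.items() if due <= now_ms)
    for name in expired:
        del active[name]
    # Current behavior: always set the latch.
    # Solution 3: set it only if some timeout was actually reached.
    latch_set = bool(expired) or not latch_only_on_expiry
    return expired, latch_set


print(handle_alarm(1000, {"deadlock": 3000}, latch_only_on_expiry=False))  # ([], True)
print(handle_alarm(1000, {"deadlock": 3000}, latch_only_on_expiry=True))   # ([], False)
print(handle_alarm(3000, {"deadlock": 3000}, latch_only_on_expiry=True))   # (['deadlock'], True)
```

Under the second policy the stale interrupt at 1000ms no longer wakes the waiter, so the wait continues until the deadline at 3000ms.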
The v2 Patch Approach (Solution 2 variant)
The second email proposes a patch that makes ResolveRecoveryConflictWithBufferPin() resilient to spurious wakeups by:
- After ProcWaitForSignal() returns, checking whether the buffer is still pinned by another backend
- If still pinned and no timeout fired, continuing to wait (looping back to ProcWaitForSignal()) without resetting timeouts
- Only breaking out of the loop when either the pin is released or a timeout genuinely fires
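The control flow of such a loop can be sketched as follows. This is a simplified model, not the patch itself: the function wait_for_cleanup_lock and the representation of wakeups as (time_ms, still_pinned) pairs are assumptions made for illustration. The key property is that the deadline is armed once before the loop and never reset by a spurious wakeup.

```python
def wait_for_cleanup_lock(wakeups, deadlock_timeout_ms):
    """Process wakeups in order; each is a (time_ms, still_pinned) pair.

    Returns (outcome, time_ms, number of spurious wakeups ignored).
    """
    deadline = deadlock_timeout_ms  # armed once, before entering the loop
    spurious = 0
    for time_ms, still_pinned in wakeups:
        if not still_pinned:
            return ("pin released", time_ms, spurious)
        if time_ms >= deadline:
            return ("send deadlock check", time_ms, spurious)
        # Neither exit condition holds: keep waiting, deadline untouched.
        spurious += 1
    return ("still waiting", None, spurious)


# Two phantom SIGALRMs, then the genuine timeout.
print(wait_for_cleanup_lock([(1000, True), (2000, True), (3000, True)], 3000))
# ('send deadlock check', 3000, 2)

# One phantom SIGALRM, then the other backend drops its pin.
print(wait_for_cleanup_lock([(1000, True), (1500, False)], 3000))
# ('pin released', 1500, 1)
```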
Key Technical Insights
The setitimer Optimization Problem
The optimization in timeout.c (commit 09cf1d52267644cdbdb734294012cf1228745aaa) avoids calling setitimer() when the new timeout deadline is later than the current OS timer deadline. The intent is to reduce syscall overhead. However, when a timeout is disabled, the OS timer is not necessarily cancelled — meaning a SIGALRM can arrive for a timeout that is no longer active. The handle_sig_alarm() handler then unconditionally sets the process latch, causing spurious wakeups.
Buffer Pin Waiter Protocol
LockBufferForCleanup() uses a protocol where only one backend can be the "pin count waiter" at a time (BM_PIN_COUNT_WAITER flag + wait_backend_pgprocno). The waiter sets itself up, then calls ProcWaitForSignal(). When the last other pinner calls UnpinBuffer(), it wakes the waiter. The current code assumes that any wakeup from ProcWaitForSignal() means either the pin was released or a timeout fired — this assumption is violated by the spurious SIGALRM.
Fujii's Review Feedback
Fujii's review suggests a cleaner implementation:
- Use BM_PIN_COUNT_WAITER and wait_backend_pgprocno to check if we're still the waiter, rather than raw refcount
- Restructure the loop for clarity with explicit break conditions
- Fix the misleading comment about what can wake up ProcWaitForSignal()
Design Considerations
The fundamental question is: whose responsibility is it to handle spurious wakeups?
- If timeout.c guarantees no spurious SIGALRMs → Solution 1 or 3
- If callers must be robust against spurious wakeups → Solution 2
The PostgreSQL convention generally follows the principle that latch waits should always recheck their conditions (similar to condition variables in threading). This argues for Solution 2, which Fujii's review reinforces by suggesting a clean loop structure that explicitly rechecks conditions after each wakeup.
However, the timeout.c optimization creating phantom wakeups is arguably a latent bug that could affect other callers too, suggesting Solution 1 or 3 may also be warranted as defense-in-depth.
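The condition-variable analogy for rechecking after every wakeup looks like this in Python's threading module. The PinState class and the demo function are hypothetical and exist only to show the idiom; nothing here corresponds to actual PostgreSQL data structures. For brevity the timeout applies to each individual wait rather than to the whole operation.

```python
import threading


class PinState:
    """Toy stand-in for 'is the buffer still pinned by someone else?'."""

    def __init__(self):
        self.cond = threading.Condition()
        self.pinned = True
        self.wakeups = 0

    def wait_until_unpinned(self, timeout_s=5.0):
        with self.cond:
            while self.pinned:  # recheck the condition after every wakeup
                if not self.cond.wait(timeout_s):
                    return False  # this wait timed out
                self.wakeups += 1
            return True

    def poke(self):
        # Wake the waiter without changing the condition (a spurious wakeup).
        with self.cond:
            self.cond.notify_all()

    def unpin(self):
        with self.cond:
            self.pinned = False
            self.cond.notify_all()


def demo():
    state = PinState()
    result = []
    waiter = threading.Thread(
        target=lambda: result.append(state.wait_until_unpinned())
    )
    waiter.start()
    state.poke()   # must not cause the waiter to give up
    state.unpin()  # the real event the waiter is interested in
    waiter.join()
    return result[0]


print(demo())  # True
```

Because the waiter treats a wakeup only as a hint and decides based on the rechecked state, the extra notification is harmless regardless of thread timing.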