Monthly Summary: Startup Process Deadlock — WaitForProcSignalBarriers vs Aux Process (May 2026)
Overview
This thread resolved a long-latent race condition in PostgreSQL's ProcSignalBarrier (PSB) machinery that causes a deadlock during startup. The bug exists in v15–master but was recently unmasked on master by commit 67c20979c (Dec 2025), which added an unconditional EmitProcSignalBarrier call during every StartupXLOG, turning a rare edge case into a reproducible every-boot failure.
The Race Condition
The ProcSignalBarrier protocol uses per-slot pss_pid (occupancy flag) and pss_barrierGeneration (catch-up counter) fields. The critical interleaving:
- Newcomer sets pss_barrierGeneration = global_gen under spinlock; pss_pid still 0.
- Emitter bumps global_gen to global_gen+1, scans slots, sees pss_pid == 0, skips this slot.
- Newcomer writes pss_pid = MyProcPid, becomes visible.
- Waiter sees a live slot with stale pss_barrierGeneration and waits forever: the newcomer has no pending barrier flag and will never advance its generation.
The fundamental issue: the emitter's lock-free PID check can observe a slot as empty even though its owner has already snapshotted a generation, creating a window in which no one will ever notify the newcomer.
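The four steps above can be replayed deterministically in a single thread. The following is a minimal sketch, not PostgreSQL code: ToySlot, ToyShmem, the barrier_pending flag, and all function names are hypothetical stand-ins, and the real slot array, spinlock, and signalling are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the shared-memory state. */
typedef struct {
    uint32_t pss_pid;               /* 0 means the slot looks empty */
    uint64_t pss_barrierGeneration; /* generation the slot has absorbed */
    bool     barrier_pending;       /* set by an emitter that notifies the slot */
} ToySlot;

typedef struct {
    uint64_t global_gen;
    ToySlot  slot;
} ToyShmem;

/* Step 1: the newcomer snapshots the generation while pss_pid is still 0. */
static void newcomer_snapshot_generation(ToyShmem *shm) {
    shm->slot.pss_barrierGeneration = shm->global_gen;
}

/* Step 2: the emitter bumps the generation and notifies only occupied slots. */
static uint64_t emitter_emit(ToyShmem *shm) {
    uint64_t gen = ++shm->global_gen;
    if (shm->slot.pss_pid != 0)
        shm->slot.barrier_pending = true;
    return gen;
}

/* Step 3: the newcomer publishes its PID and becomes visible. */
static void newcomer_publish_pid(ToyShmem *shm, uint32_t pid) {
    shm->slot.pss_pid = pid;
}

/* Step 4: a waiter is stuck if the slot is live, stale, and was never notified. */
static bool waiter_is_stuck(const ToyShmem *shm, uint64_t target_gen) {
    bool live  = shm->slot.pss_pid != 0;
    bool stale = shm->slot.pss_barrierGeneration < target_gen;
    return live && stale && !shm->slot.barrier_pending;
}

/* Replays steps 1 to 4 in the problematic order. */
static bool replay_buggy_interleaving(void) {
    ToyShmem shm = { .global_gen = 7 };
    newcomer_snapshot_generation(&shm);
    uint64_t target = emitter_emit(&shm);
    newcomer_publish_pid(&shm, 4242);
    return waiter_is_stuck(&shm, target);
}
```

Calling replay_buggy_interleaving() returns true in this model: the slot is live, its generation is one behind, and nothing will ever advance it.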
Fix Approach
Sawada's patch reorders PID publication relative to generation capture in ProcSignalInit, so that the emitter's lock-free pss_pid == 0 check is safe. The key invariant enforced: any emitter that skips a slot (because pss_pid == 0) is guaranteed that the newcomer will later read a generation ≥ the emitted one.
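The invariant can be illustrated with a small model. This is a sketch under the assumption that the reordering amounts to "publish the PID first, then read the global generation"; the summary does not quote the patch itself, and FixedSlot, FixedShmem, and the function names are hypothetical. C11 sequentially consistent atomics stand in for PostgreSQL's atomics and barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one slot plus the global generation counter. */
typedef struct {
    atomic_uint_fast32_t pid;        /* 0 means the slot looks empty */
    atomic_uint_fast64_t generation; /* generation the slot has absorbed */
    atomic_bool          pending;    /* set when an emitter notifies the slot */
} FixedSlot;

typedef struct {
    atomic_uint_fast64_t global_gen;
    FixedSlot            slot;
} FixedShmem;

/* Assumed fixed order: publish the PID first, then capture the generation. */
static void fixed_newcomer_init(FixedShmem *shm, uint32_t pid) {
    atomic_store(&shm->slot.pid, pid);
    atomic_store(&shm->slot.generation, atomic_load(&shm->global_gen));
}

/* The emitter bumps the generation, then scans; it skips slots with pid 0. */
static uint64_t fixed_emit(FixedShmem *shm) {
    uint64_t gen = atomic_fetch_add(&shm->global_gen, 1) + 1;
    if (atomic_load(&shm->slot.pid) != 0)
        atomic_store(&shm->slot.pending, true);
    return gen;
}

/* A waiter can finish if the slot is empty, caught up, or has been notified. */
static bool fixed_waiter_can_finish(FixedShmem *shm, uint64_t target) {
    if (atomic_load(&shm->slot.pid) == 0)
        return true;
    if (atomic_load(&shm->slot.generation) >= target)
        return true;
    return atomic_load(&shm->slot.pending);
}

/* Replays both possible orders of the emitter relative to the newcomer. */
static bool replay_fixed(bool emitter_first) {
    FixedShmem shm;
    uint64_t target;

    atomic_init(&shm.global_gen, 7);
    atomic_init(&shm.slot.pid, 0);
    atomic_init(&shm.slot.generation, 0);
    atomic_init(&shm.slot.pending, false);

    if (emitter_first) {
        target = fixed_emit(&shm);        /* sees pid == 0, skips the slot */
        fixed_newcomer_init(&shm, 4242);  /* reads the already bumped value */
    } else {
        fixed_newcomer_init(&shm, 4242);  /* visible before the scan */
        target = fixed_emit(&shm);        /* sees the PID, notifies the slot */
    }
    return fixed_waiter_can_finish(&shm, target);
}
```

In both orders the waiter can finish: either the newcomer reads the bumped generation, or the emitter sees the PID and marks the slot as pending.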
Patch Refinement: pg_atomic_write_membarrier_u32()
A micro-optimization was applied in the revised patch — replacing:
pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
pg_memory_barrier();
with the combined:
pg_atomic_write_membarrier_u32(&slot->pss_pid, MyProcPid);
The combined call performs the store with full memory-barrier semantics as a single operation, so the writer no longer issues a separate barrier call while correctness stays identical. Updated for master and v18 backpatch only.
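The two forms can be compared in portable C11. This is only an analogy, assuming the combined call behaves like an atomic exchange with full-barrier ordering; it is not PostgreSQL's implementation, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Two-step form: a plain atomic store followed by a separate full fence. */
static void publish_store_then_fence(atomic_uint *pid_field, uint32_t pid) {
    atomic_store_explicit(pid_field, pid, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/* Combined form: one read-modify-write that carries full-barrier ordering. */
static void publish_with_exchange(atomic_uint *pid_field, uint32_t pid) {
    (void) atomic_exchange_explicit(pid_field, pid, memory_order_seq_cst);
}
```

Both leave the same value in the field. The difference is only in how the ordering guarantee is expressed.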
Secondary Fix: InitializeProcessXLogLogicalInfo Ordering
A related correctness bug was identified: InitializeProcessXLogLogicalInfo() was called before ProcSignalInit(), allowing a process to read stale logical-info state and then register as "caught up." The fix moves this initialization after ProcSignalInit, matching the pattern already used for InitLocalDataChecksumState.
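Why the order matters can be shown with a toy model in which a concurrent state change lands between a process's two initialization steps. ToyState, its fields, and the function names are hypothetical and only illustrate the ordering argument, not the actual logical-info machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one shared flag, one slot, one notification bit. */
typedef struct {
    bool     shared_logical_info; /* authoritative state in shared memory */
    uint32_t pid;                 /* 0 until the process registers */
    bool     barrier_pending;     /* set when an updater notifies the slot */
} ToyState;

/* A concurrent updater changes the state and notifies registered slots only. */
static void updater_flip_state(ToyState *st) {
    st->shared_logical_info = true;
    if (st->pid != 0)
        st->barrier_pending = true;
}

/*
 * Runs the two initialization steps with the update landing between them.
 * Returns true if the process ends up with a stale cache and no notification.
 */
static bool ends_up_stale(bool cache_before_register) {
    ToyState st = { false, 0, false };
    bool local_cache = false;

    if (cache_before_register)
        local_cache = st.shared_logical_info;  /* old order: read first */
    else
        st.pid = 4242;                         /* new order: register first */

    updater_flip_state(&st);

    if (cache_before_register)
        st.pid = 4242;                         /* registers too late */
    else
        local_cache = st.shared_logical_info;  /* reads the fresh value */

    return local_cache != st.shared_logical_info && !st.barrier_pending;
}
```

With the old order the process keeps the stale value and is never told to refresh it. With the new order it reads the updated value.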
Backpatching Scope
- v15–v18: Core race fix (barrier generation vs. PID publication ordering)
- Master only: Secondary InitializeProcessXLogLogicalInfo ordering fix (since 67c20979c is master-only)
- v14: Immune in practice (PSB code exists but has no callers)
Scheduling Decision
Sawada deferred pushing the fix until after the May minor releases, judging the bug "not very visible in practice" on stable branches (only triggered by rare DROP DATABASE/TABLESPACE smgr barriers). The plan: commit to master soon after minor releases, with backbranch variants getting longer soak time.
Rejected Hypothesis
Matthias initially hypothesized that the race was between slot registration and signal handler installation in AuxiliaryProcessMainCommon. Andres rebutted this: postmaster children fork with all signals blocked (BlockSig), unblocking only after handlers are installed — signals in the window are kernel-pended, not lost. Matthias conceded.
Reproduction
Alexander Lakhin confirmed the diagnosis by injecting pg_usleep(10000) between cancel-key initialization and PID publication in ProcSignalInit, turning a rare buildfarm flake into a deterministic failure across multiple test scenarios (DROP DATABASE/TABLESPACE redo paths).