Analysis: ProcKill lock-group vs procLatch recycle race
The Core Problem
This thread reports a genuine, long-standing concurrency bug in ProcKill() (the backend-exit cleanup routine in src/backend/storage/lmgr/proc.c) that has existed since 9.6 — essentially since lock groups were introduced to support parallel query. The bug manifests as a PANIC: latch already owned by PID ... when a newly-started backend inherits a PGPROC slot whose procLatch.owner_pid has not yet been cleared via DisownLatch().
Architectural context
Each backend owns a PGPROC slot drawn from a shared pool. On exit, ProcKill() must:
- Remove itself from its lock group (if any), under the leader's backendLock (leader_lwlock).
- SwitchBackToLocalLatch() — stop using the shared procLatch as MyLatch.
- DisownLatch(&MyProc->procLatch) — clear owner_pid so the slot is "clean".
- Push the PGPROC back onto the global freelist under ProcStructLock.
The critical invariant the author articulates is: a PGPROC must not appear on the freelist while its procLatch still has a non-zero owner_pid. InitProcess() calls OwnLatch() on whatever slot it pops, and OwnLatch() PANICs if the latch is already owned.
The race
Lock-group cleanup is special because when a non-leader member exits last, it is responsible for freeing the leader's PGPROC as well (the leader flips lockGroupLeader = NULL on itself so the follower knows it may push the leader). The existing PG15-era code structure is roughly:
LWLockAcquire(leader_lwlock);
dlist_delete(&MyProc->lockGroupLink);
if (dlist_is_empty(&leader->lockGroupMembers))
    leader->lockGroupLeader = NULL;
else if (leader != MyProc)
    /* not last; do nothing for leader */ ;
LWLockRelease(leader_lwlock);             <-- L5
SwitchBackToLocalLatch();                 <-- L6
pgstat_reset_wait_event_storage();
proc = MyProc;
MyProc = NULL;
DisownLatch(&proc->procLatch);            <-- L9 (clears owner_pid)
SpinLockAcquire(ProcStructLock);
if (proc->lockGroupLeader == NULL)
    /* push on freelist */ ;
SpinLockRelease(ProcStructLock);
The window between L5 (lwlock release) and L9 (DisownLatch) is where the leader is vulnerable. After L5 the leader has already exited the lock group but its latch is still owned. A concurrent last member can now:
- Acquire leader_lwlock,
- Observe an empty members list,
- Set leader->lockGroupLeader = NULL,
- Push the leader's PGPROC onto the freelist — all while the leader is still between L5 and L9, i.e. its latch is still owned.
A fresh backend running InitProcess() then pops that slot, calls OwnLatch() and PANICs.
There is a second, symmetric hazard: the leader and last-member can make inconsistent decisions about who pushes which PGPROC, leading to a slot being pushed twice or not at all.
The Proposed Fix
The fix has two components:
- Reorder: move SwitchBackToLocalLatch(), pgstat_reset_wait_event_storage(), and DisownLatch() to before the lock-group dissolution block. Then, by the time any other member can observe us outside the lock group, owner_pid is already 0.
- Single decision point under leader_lwlock: decide atomically (while still holding the leader lwlock) two booleans — push_self and push_leader — and perform a single consolidated freelist update at the end. This eliminates the double-push / inconsistent-ownership class of bugs.
Vlad Lesin notes that mainline parallel query "usually avoids the race": typical parallel workers rarely produce the interleaving in which the leader reaches ProcKill() while members are still active, whereas extension-driven workloads that use lock groups more flexibly hit it far more easily. This explains why the bug has survived since 9.6 despite being theoretically reachable.
Corroborating Evidence
Andrey Borodin (x4mmm) provides important validation that this is not a hypothetical bug:
- Buildfarm failures exist linked to this pattern ([0] — Thomas Munro's thread).
- Greenplum ("Deep from GP") observed it in production.
- Yugabyte tracked it as issue #20309.
This cross-fork confirmation substantially raises the priority: forks that exercise lock groups more aggressively (GP for MPP, YB for distributed) hit it regularly.
Testing Problem
The bug is difficult to test deterministically because ProcKill() tears down the very infrastructure (shared memory attachments, wait-event storage, latch) that injection points rely on. The author's 0003 patch is a "harness" that heavily refactors injection point infrastructure to allow injection inside ProcKill — explicitly not committable as-is.
Andrey's contribution here is practical: he splits the submission into a reviewable sequence (vAB1-0001 = the test as received, vAB1-0002 = canonicalization to make it actually runnable on current HEAD, vAB1-0003 = the rebased fix) and demonstrates that with the fix applied on top, the test passes.
A subtle point from Vlad's final reply: the placement of pgstat_reset_wait_event_storage() matters for the test itself. If it runs before the lock-group block, the test harness hangs; keeping it deferred alongside DisownLatch() is what lets the test run. This is a useful cross-check on the reorder: the wait-event storage reset is tied to the latch lifecycle and must move together with DisownLatch(), so the committed ordering should keep pgstat_reset_wait_event_storage() adjacent to the latch teardown rather than hoisting it earlier on its own.
Process Critique
Evgeny Voropaev (Tantor Labs) raises important process points that a first-time hackers contributor often misses:
- Patches targeting multiple branches (REL_15_STABLE and PG18) in a single thread break the CFBot, which applies everything to HEAD.
- A CommitFest entry is required for systematic testing and reviewer attention.
- The convention is a separate thread per target branch.
Andrey echoes this more bluntly: the 0001/0002/0003 numbering is misleading because they aren't sequential patch steps — they're alternative artifacts for different branches plus a separate test. His renamed vAB1-* sequence is the shape the series should take.
Key Design Decisions and Tradeoffs
- Reorder vs. holding the lwlock longer: An alternative would be to keep leader_lwlock held across DisownLatch(). The chosen approach (reorder so the latch is disowned first) is cleaner because it avoids holding an LWLock across operations that touch local state, and it preserves the invariant without widening critical sections.
- Single freelist update from a decision taken inside the lwlock: Rather than two separate freelist pushes (one in the follower for the leader, one in the leader for itself), compute the decision once and push atomically. This is a structural simplification that makes future reasoning easier.
- Backport scope: The bug affects 9.6+. This creates backporting work across many stable branches. The author submitted separate PG15 and PG18 patches for this reason.
Weight of Opinions
- Andrey Borodin (x4mmm) is a committer and his endorsement ("the problem seems real to me ... the invariant ... seems reasonable") is significant. His willingness to rebase and resubmit in canonical form signals he intends to shepherd this.
- Vlad Lesin (author) has deep knowledge of this area; the analysis of the interleaving is precise and matches the code paths.
- Evgeny Voropaev contributes process guidance rather than technical review at this stage.