[PATCH] Fix ProcKill lock-group vs procLatch recycle race

First seen: 2026-04-27 08:14:59+00:00 · Messages: 6 · Participants: 4

Latest Update

2026-05-07 · opus 4.7

What's new in this round

Two substantive developments, both about test infrastructure packaging rather than the fix itself:

1. Andrey's v2 test cleanup (and explicit "fix unchanged" statement)

Andrey posts a simplified test (step 0001) with concrete refactoring.

Critically, Andrey explicitly refuses to simplify the fix itself, enumerating the four invariants he believes each piece defends:

  1. Early DisownLatch
  2. Decisions taken under leader_lwlock
  3. Deferred freelist pushes under freeProcsLock
  4. Leader-skips-self-push case

This is a useful checklist for reviewers — it frames the fix as four orthogonal concerns rather than one monolithic reorder, and invites targeted review of each.

2. The injection_points header-exposure question — and Michael's answer

Andrey raises a real packaging question: prockill_race.c needs the InjectionPointCondition struct that injection_wait consumes (to tell it which PID to block). That struct is private to injection_points.c, a test module in the contrib/test_decoding style. The patch currently reaches in via a relative "../injection_points/injection_point_condition.h" include, which Andrey acknowledges is non-idiomatic, and he asks which of several packaging options to take.

Michael Paquier answers decisively with option (c): don't export the struct at all — instead fold prockill_race's callbacks and TAP test into the existing injection_points module. He cites Noah Misch's removable_cutoff() work as precedent and flags a concrete downside of Andrey's approach: exposing a separate test module that depends on injection_points internals complicates installcheck scenarios. Adding more callbacks to the existing module is fine; spawning a sibling module that links against it is not.

This is a clear directional steer from a committer with authority over the injection-point infrastructure. It means the next revision should dissolve prockill_race as a standalone module and merge its C helpers + TAP test into src/test/modules/injection_points/.

Not yet addressed

History (1 prior analysis)
2026-05-06 · opus 4.7

Analysis: ProcKill lock-group vs procLatch recycle race

The Core Problem

This thread reports a genuine, long-standing concurrency bug in ProcKill() (the backend-exit cleanup routine in src/backend/storage/lmgr/proc.c) that has existed since 9.6 — essentially since lock groups were introduced to support parallel query. The bug manifests as a PANIC: latch already owned by PID ... when a newly-started backend inherits a PGPROC slot whose procLatch.owner_pid has not yet been cleared via DisownLatch().

Architectural context

Each backend owns a PGPROC slot drawn from a shared pool. On exit, ProcKill() must:

  1. Remove itself from its lock group (if any), under leader_lwlock (the lock-manager hash partition lock covering the leader's PGPROC).
  2. SwitchBackToLocalLatch() — stop using shared procLatch as MyLatch.
  3. DisownLatch(&MyProc->procLatch) — clear owner_pid so the slot is "clean".
  4. Push the PGPROC back on the global freelist under ProcStructLock.

The critical invariant the author articulates is: a PGPROC must not appear on the freelist while its procLatch still has a non-zero owner_pid. InitProcess() calls OwnLatch() on whatever slot it pops, and OwnLatch() PANICs if the latch is already owned.

The race

Lock-group cleanup is special because when a non-leader member exits last, it is responsible for freeing the leader's PGPROC as well (the leader flips lockGroupLeader = NULL on itself so the follower knows it may push the leader). The existing PG15-era code structure is roughly:

LWLockAcquire(leader_lwlock, LW_EXCLUSIVE);
dlist_delete(&MyProc->lockGroupLink);
if (dlist_is_empty(&leader->lockGroupMembers))
    leader->lockGroupLeader = NULL;
else if (leader != MyProc)
    /* not last; do nothing for leader */ ;
LWLockRelease(leader_lwlock);             <-- L5

SwitchBackToLocalLatch();                  <-- L6
pgstat_reset_wait_event_storage();
proc = MyProc; MyProc = NULL;
DisownLatch(&proc->procLatch);             <-- L9 (clears owner_pid)

SpinLockAcquire(ProcStructLock);
if (proc->lockGroupLeader == NULL)
    /* push on freelist */ ;
SpinLockRelease(ProcStructLock);

The window between L5 (lwlock release) and L9 (DisownLatch) is where the leader is vulnerable. After L5 the leader has already exited the lock group but its latch is still owned. A concurrent last member can now:

  • Acquire leader_lwlock,
  • Observe an empty members list,
  • Set leader->lockGroupLeader = NULL,
  • Push the leader's PGPROC onto the freelist — all while the leader is still between L5 and L9, i.e. its latch is still owned.

A fresh backend running InitProcess() then pops that slot, calls OwnLatch() and PANICs.

There is a second, symmetric hazard: the leader and last-member can make inconsistent decisions about who pushes which PGPROC, leading to a slot being pushed twice or not at all.

The Proposed Fix

The fix has two components:

  1. Reorder: move SwitchBackToLocalLatch(), pgstat_reset_wait_event_storage(), and DisownLatch() to before the lock-group dissolution block. Then, by the time any other member can observe us outside the lock group, owner_pid is already 0.
  2. Single decision point under leader_lwlock: decide atomically (while still holding the leader lwlock) two booleans — push_self and push_leader — and perform a single consolidated freelist update at the end. This eliminates the double-push / inconsistent-ownership class of bugs.

Vlad Lesin notes that mainline parallel query "usually avoids the race": in a typical parallel-query shutdown, the leader rarely reaches ProcKill() in the vulnerable interleaving while members are still exiting, whereas extension-driven workloads that use lock groups more flexibly hit it far more easily. This explains why the bug has survived since 9.6 despite being theoretically reachable.

Corroborating Evidence

Andrey Borodin (x4mmm) provides important validation that this is not a hypothetical bug:

  • Buildfarm failures exist linked to this pattern ([0] — Thomas Munro's thread).
  • Greenplum ("Deep from GP") observed it in production.
  • Yugabyte tracked it as issue #20309.

This cross-fork confirmation substantially raises the priority: forks that exercise lock groups more aggressively (GP for MPP, YB for distributed) hit it regularly.

Testing Problem

The bug is difficult to test deterministically because ProcKill() tears down the very infrastructure (shared memory attachments, wait-event storage, latch) that injection points rely on. The author's 0003 patch is a "harness" that heavily refactors injection point infrastructure to allow injection inside ProcKill — explicitly not committable as-is.

Andrey's contribution here is practical: he splits the submission into a reviewable sequence (vAB1-0001 = the test as received, vAB1-0002 = canonicalization to make it actually runnable on current HEAD, vAB1-0003 = the rebased fix) and demonstrates that with the fix applied on top, the test passes.

A subtle point from Vlad's final reply: the placement of pgstat_reset_wait_event_storage() matters for the test itself — if it runs before the lock-group block, the test harness hangs, and deferring it is what makes the test runnable. That suggests the committed ordering must keep pgstat_reset_wait_event_storage() at (or after) the point where the wait-event machinery is safe to stop — a useful cross-check that the reset is tied to the latch lifecycle rather than freely movable.

Process Critique

Evgeny Voropaev (Tantor Labs) raises important process points that a first-time hackers contributor often misses:

  • Patches targeting multiple branches (REL_15_STABLE and PG18) in a single thread break the CFBot, which applies everything to HEAD.
  • A CommitFest entry is required for systematic testing and reviewer attention.
  • The convention is a separate thread per target branch.

Andrey echoes this more bluntly: the 0001/0002/0003 numbering is misleading because they aren't sequential patch steps — they're alternative artifacts for different branches plus a separate test. His renamed vAB1-* sequence is the shape the series should take.

Key Design Decisions and Tradeoffs

  1. Reorder vs. hold lwlock longer: An alternative would be to keep leader_lwlock held across DisownLatch(). The chosen approach (reorder so latch is disowned first) is cleaner because it avoids holding an LWLock across operations that touch local state, and it preserves the invariant without widening critical sections.

  2. Single freelist update from a decision taken under the lwlock: Rather than two separate freelist pushes (one in the follower for the leader, one in the leader for itself), compute the decision once and push atomically. This is a structural simplification that makes future reasoning easier.

  3. Backport scope: The bug affects 9.6+. This creates backporting work across many stable branches. The author submitted separate PG15 and PG18 patches for this reason.

Weight of Opinions

  • Andrey Borodin (x4mmm) is a committer and his endorsement ("the problem seems real to me ... the invariant ... seems reasonable") is significant. His willingness to rebase and resubmit in canonical form signals he intends to shepherd this.
  • Vlad Lesin (author) has deep knowledge of this area; the analysis of the interleaving is precise and matches the code paths.
  • Evgeny Voropaev contributes process guidance rather than technical review at this stage.