Analysis: Reducing pg_stat_statements LWLock Contention (GSoC 2026)
Architectural Context
pg_stat_statements is the de facto standard query-performance telemetry extension in PostgreSQL. Its instrumentation hooks run on every statement executed by every backend, which makes its shared-memory data structure one of the hottest concurrency points in the system outside the buffer pool and ProcArray. The extension maintains a fixed-size hash table of normalized query entries in shared memory, protected by a single LWLock in its own tranche (pgss->lock). The lock discipline is:
- Shared mode: counter updates to an existing entry (the common fast path; individual entry spinlocks/atomics protect the counters themselves).
- Exclusive mode: inserting a new entry, evicting entries (entry_dealloc()), and pg_stat_statements_reset().
The architectural problem is that the exclusive operations are not merely rare bookkeeping — on workloads with high query-text cardinality (ORMs generating many plan shapes, ad-hoc analytics, multi-tenant SaaS), the table saturates at pg_stat_statements.max and the system enters a steady-state churn where entry_dealloc() fires continuously. Each call acquires the cluster-wide exclusive lock and, critically, performs a qsort() over all entries (by usage) to pick eviction victims. That is O(n log n) work under the exclusive lock, during which every other backend — regardless of whether it wants to touch that entry — stalls on lock acquisition.
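The lock discipline above can be sketched as C-style pseudocode (paraphrased and heavily simplified from pg_stat_statements.c's pgss_store() path; not the verbatim source):

```c
/* Simplified sketch of the pgss_store() fast/slow path. */
LWLockAcquire(pgss->lock, LW_SHARED);
entry = hash_search(pgss_hash, &key, HASH_FIND, NULL);
if (!entry)
{
    /* Miss: must escalate to the cluster-wide exclusive lock. */
    LWLockRelease(pgss->lock);
    LWLockAcquire(pgss->lock, LW_EXCLUSIVE);
    entry = entry_alloc(&key, ...);   /* may call entry_dealloc(): qsort under the lock */
}
/* Per-entry spinlock protects the counters; the LWLock only pins the table. */
SpinLockAcquire(&entry->mutex);
entry->counters.calls += 1;
...
SpinLockRelease(&entry->mutex);
LWLockRelease(pgss->lock);
```

The fast path (entry exists) holds only the shared lock, so readers scale; every cache miss funnels through the single exclusive lock, which is where the churn described above becomes pathological.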
Borodin's 2022 Yandex incident report (referenced as [2]) is the canonical production postmortem: a database became effectively unavailable because backends piled up waiting on LWLock|pg_stat_statements. That prior patch attempt was rejected because it solved the contention by silently skipping entry creation when the lock was contended — causing invisible observability loss, which is unacceptable for a monitoring extension whose entire value proposition is completeness of data.
Why This Matters Architecturally
- Monotonically worsening problem: faster CPUs and more cores mean more backends hitting the same single lock. PG19 reportedly reduces exclusive hold time per acquisition by ~40% (likely from query text file I/O refactoring or memcpy reductions), but this is a constant-factor improvement against a fundamentally serializing design.
- Observability tools should not cause outages: pg_stat_statements is near-universally enabled in production. A latent DoS in a monitoring extension is a correctness-adjacent bug.
- The dealloc hot loop is pathological: at max=100 with 1000 distinct query shapes, the student measured ~30,000 entry_dealloc() invocations in 60 seconds — i.e., the table thrashes ~500 times per second, each time sorting 100 entries under the global write lock. This is death-by-a-thousand-cuts rather than a single long stall.
Proposed Design
The proposal splits into two independently-committable core patches plus two stretch explorations.
Core Patch 1: Pending-Entry Queue via LWLockConditionalAcquire
Instead of blocking when a new-entry insertion cannot immediately acquire the exclusive lock, the backend enqueues the entry into a bounded pending buffer and moves on. A subsequent lock holder (or a dedicated drain point) flushes the queue into the main hash table. This directly addresses the rejection reason for the 2022 patch: nothing is silently dropped — if the queue overflows, counters (queued, dropped) are exposed in pg_stat_statements_info so operators can see and alarm on observability loss. This is the critical design distinction, and it shows that the mentors steered the student away from the trap that sank the prior work.
Open questions the student flags:
- Queue-full behavior under sustained contention: what is the backpressure policy? Spin-wait, drop-and-count, or fall back to blocking acquisition? Each has different failure modes under pathological load.
- Concurrent dealloc during the snapshot-to-eviction window: there is a TOCTOU-style hazard where an entry selected for flush from the queue could race with an eviction that frees the hash slot it was going to be written into.
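One way to make the queue-full policy concrete is drop-and-count with visible counters. The single-threaded C11 model below is purely illustrative — the names (pending_queue, pq_try_enqueue, pq_drain), the capacity, and the layout are not from any patch; a real version would live in shared memory and need multi-producer-safe atomics or its own lightweight lock:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PQ_CAPACITY 8            /* illustrative; real sizing is an open question */

typedef struct
{
    uint64_t        keys[PQ_CAPACITY];
    atomic_uint     head;        /* next slot to write */
    atomic_uint     tail;        /* next slot to drain */
    atomic_ulong    queued;      /* successfully enqueued; surfaced to operators */
    atomic_ulong    dropped;     /* lost to overflow: visible observability loss */
} pending_queue;

/* Try to record a new-entry key without blocking; never waits. */
static bool
pq_try_enqueue(pending_queue *q, uint64_t key)
{
    unsigned    head = atomic_load(&q->head);

    if (head - atomic_load(&q->tail) >= PQ_CAPACITY)
    {
        /* Queue full: count the drop instead of silently losing it. */
        atomic_fetch_add(&q->dropped, 1);
        return false;
    }
    q->keys[head % PQ_CAPACITY] = key;
    atomic_store(&q->head, head + 1);
    atomic_fetch_add(&q->queued, 1);
    return true;
}

/* Flush pending keys into the main table; caller holds the exclusive lock. */
static unsigned
pq_drain(pending_queue *q)
{
    unsigned    n = 0;

    while (atomic_load(&q->tail) != atomic_load(&q->head))
    {
        /* ... insert q->keys[tail % PQ_CAPACITY] into the hash table here ... */
        atomic_fetch_add(&q->tail, 1);
        n++;
    }
    return n;
}
```

A multi-producer version would need a CAS loop on head (or per-backend sub-queues); if the queue instead acquires its own contended lock, the bottleneck has merely moved, which is exactly the risk flagged below.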
Core Patch 2: Sort Outside the Exclusive Lock
entry_dealloc() currently does:
LWLockAcquire(EXCLUSIVE);
qsort(entries by usage); // O(n log n) under lock
evict bottom 5%; // O(n)
decay usage counters; // O(n)
LWLockRelease();
Proposed:
LWLockAcquire(SHARED);
snapshot (entry_id, usage) pairs into local array;
LWLockRelease();
qsort locally; // O(n log n) NOT under lock
compute eviction set;
LWLockAcquire(EXCLUSIVE);
evict + decay; // O(n) under lock
LWLockRelease();
This reduces exclusive hold time from O(n log n) to O(n). The subtle correctness issue is that usage values may change between snapshot and eviction — so the eviction set must be re-validated, or the algorithm must accept that it will evict based on slightly stale usage rankings (which is fine semantically; LRU-ish approximations are already approximate).
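The snapshot-then-sort shape can be modeled in a self-contained, single-threaded sketch (entry IDs, the 5% ratio, and the helper names are illustrative, not from any patch; in the real code the snapshot happens under the shared lock and the eviction under the exclusive one):

```c
#include <stdlib.h>

#define NENTRIES 100

typedef struct { int id; double usage; } snap;

/* Ascending by usage: lowest-usage entries sort first and become victims. */
static int
snap_cmp(const void *a, const void *b)
{
    double ua = ((const snap *) a)->usage;
    double ub = ((const snap *) b)->usage;
    return (ua > ub) - (ua < ub);
}

/* usage[] stands in for the shared table; live values may drift after snapshot. */
static int
choose_victims(const double *usage, int n, double ratio, int *victims)
{
    snap   *local = malloc(n * sizeof(snap));
    int     nvict = (int) (n * ratio);

    /* Phase 1 (shared lock in the real code): cheap O(n) copy. */
    for (int i = 0; i < n; i++)
    {
        local[i].id = i;
        local[i].usage = usage[i];
    }
    /* Phase 2 (no lock held): the O(n log n) sort runs on private memory. */
    qsort(local, n, sizeof(snap), snap_cmp);
    for (int i = 0; i < nvict; i++)
        victims[i] = local[i].id;
    free(local);
    return nvict;
}
```

Phase 3 would re-acquire the exclusive lock and evict only those victims whose entries still exist, accepting that the usage ranking may be slightly stale — the re-validation cost is O(victims), so the exclusive section stays O(n) overall.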
Stretch Goals
- Lock separation: split the single LWLock into a structural lock (hash table layout) and counter locks (per-entry or striped). This is a larger surgery because it changes the invariants the entire file depends on. It is correctly scoped as post-core — it's the "real" fix but has a much higher review burden.
- Optimized reset path: pg_stat_statements_reset() is itself an exclusive-lock stall; the student reproduced the same freeze pattern via periodic resets, which are common in monitoring agents that snapshot-then-reset.
Evaluation of the Approach
Strengths:
- Incremental and committable: two independent patches, each small enough to review, each with measurable benefit. This is the correct posture for a GSoC contributor engaging with a conservative reviewer pool.
- Addresses prior rejection rationale head-on: the queued/dropped counters are the key concession that makes this politically viable where the 2022 patch was not.
- Sort-outside-lock is nearly uncontroversial: it's a pure latency-reduction refactor with obvious correctness arguments; likely to land first.
Risks:
- The pending queue introduces a new shared data structure that itself needs synchronization. If implemented with atomics/CAS it's fine; if it needs its own lock, you've just moved the bottleneck.
- "Ships independently" is aspirational — reviewers may insist on seeing the whole design before committing piece one.
- LWLockConditionalAcquire in a hot path can cause systematic starvation under sustained exclusive demand; the drain policy matters enormously.
Participant Dynamics
This is an introductory post, so there is no debate yet. Noteworthy context:
- Mentors named (Kirk Wolak, Nikolay Samokhvalov, Andreas "Ads" Karlsson / possibly Andres Freund, Andrei Lepikhov) represent a mix of extension authors, performance specialists, and committers/near-committers. The design's evolution from "conditional-skip" to "pending-queue" bears the fingerprints of mentors who remember why the 2022 patch was rejected.
- The student explicitly cites Borodin's incident report, signaling awareness of the prior art and its failure mode — exactly the framing that hackers-list reviewers respond to well.
Likely Trajectory
Expect the sort-outside-lock patch to attract quick, largely positive review and potentially land in PG20. The pending-queue patch will provoke a longer design discussion about queue-full semantics and whether pg_stat_statements_info is the right surface for reporting drops. The lock-separation stretch goal is a multi-cycle project and unlikely to complete within GSoC; the student's acknowledgement of this ("rarely lands in a single summer") is realistic and well-calibrated.