[GSoC 2026] - Reducing pg_stat_statements LWLock Contention - Introduction

First seen: 2026-05-09 03:27:32+00:00 · Messages: 1 · Participants: 1

Latest Update

2026-05-11 · opus 4.7

Analysis: Reducing pg_stat_statements LWLock Contention (GSoC 2026)

Architectural Context

pg_stat_statements is the de facto standard query-performance telemetry extension in PostgreSQL. Its instrumentation hook runs on every statement executed by every backend, which makes its shared-memory data structure one of the hottest concurrency points in the system outside of the buffer pool and the ProcArray. The extension maintains a fixed-size hash table of normalized query entries in shared memory, protected by a single LWLock in its own tranche (pgss->lock). The lock discipline is: shared acquisition to look up an existing entry and update its counters (the counters themselves are guarded by a per-entry spinlock), and exclusive acquisition to insert a new entry, deallocate eviction victims, or reset the statistics.

The architectural problem is that the exclusive operations are not merely rare bookkeeping — on workloads with high query-text cardinality (ORMs generating many plan shapes, ad-hoc analytics, multi-tenant SaaS), the table saturates at pg_stat_statements.max and the system enters a steady-state churn where entry_dealloc() fires continuously. Each call acquires the cluster-wide exclusive lock and, critically, performs a qsort() over all entries (by usage) to pick eviction victims. That is O(n log n) work under the exclusive lock, during which every other backend stalls on lock acquisition, regardless of which entries it needs.

Borodin's 2022 Yandex incident report (referenced as [2]) is the canonical production postmortem: a database became effectively unavailable because backends piled up waiting on LWLock|pg_stat_statements. That prior patch attempt was rejected because it solved the contention by silently skipping entry creation when the lock was contended — causing invisible observability loss, which is unacceptable for a monitoring extension whose entire value proposition is completeness of data.

Why This Matters Architecturally

  1. Monotonically worsening problem: faster CPUs and more cores mean more backends hitting the same single lock. PG19 reportedly reduces exclusive hold time per acquisition by ~40% (likely from query text file I/O refactoring or memcpy reductions), but this is a constant-factor improvement against a fundamentally serializing design.
  2. Observability tools should not cause outages: pg_stat_statements is near-universally enabled in production. A latent DoS in a monitoring extension is a correctness-adjacent bug.
  3. The dealloc hot loop is pathological: at max=100 with 1000 distinct query shapes, the student measured ~30,000 entry_dealloc() invocations in 60 seconds — i.e., the table thrashes ~500 times per second, each time sorting 100 entries under the global write lock. This is death-by-a-thousand-cuts rather than a single long stall.

Proposed Design

The proposal splits into two independently committable core patches plus two stretch explorations.

Core Patch 1: Pending-Entry Queue via LWLockConditionalAcquire

Instead of blocking when a new-entry insertion cannot immediately acquire the exclusive lock, the backend enqueues the entry into a bounded pending buffer and moves on. A subsequent lock holder (or a dedicated drain point) flushes the queue into the main hash table. This directly addresses the rejection reason for the 2022 patch: nothing is silently dropped — if the queue overflows, counters (queued, dropped) are exposed in pg_stat_statements_info so operators can see and alarm on observability loss. This is the critical design distinction, and it shows that the mentors steered the student away from the trap that sank the prior work.
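The pattern can be sketched in a self-contained way using pthread_mutex_trylock as a stand-in for PostgreSQL's LWLockConditionalAcquire(). This is a minimal illustration, not the actual patch: the names pending_enqueue-style helpers, PENDING_MAX, and the counter variables are all hypothetical, and the real code would operate on the shared hash table rather than a plain counter.

```c
#include <assert.h>
#include <pthread.h>

#define PENDING_MAX 4                  /* hypothetical bounded-queue size */

typedef struct { long queryid; } PendingEntry;

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;
static PendingEntry pending[PENDING_MAX];
static int  pending_count = 0;
static long queued = 0, dropped = 0;   /* would surface via pg_stat_statements_info */
static int  table_entries = 0;         /* stand-in for the real hash table */

/* Flush queued entries into the table; caller must hold hash_lock. */
static void drain_pending(void)
{
    for (int i = 0; i < pending_count; i++)
        table_entries++;               /* real code: hash_search(HASH_ENTER) */
    pending_count = 0;
}

/* Insert a new entry without ever blocking on the exclusive lock. */
static void store_entry(long queryid)
{
    if (pthread_mutex_trylock(&hash_lock) == 0)  /* ~ LWLockConditionalAcquire */
    {
        drain_pending();               /* opportunistically flush the queue */
        table_entries++;
        pthread_mutex_unlock(&hash_lock);
    }
    else if (pending_count < PENDING_MAX)
    {
        pending[pending_count++] = (PendingEntry){ queryid };
        queued++;                      /* visible to operators: nothing silent */
    }
    else
        dropped++;                     /* overflow is counted, never hidden */
}
```

The key property is that every outcome is accounted for: an entry either lands in the table, waits in the bounded queue (queued), or is dropped with an incremented counter (dropped) that operators can alarm on.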

Open questions the student flags:

Core Patch 2: Sort Outside the Exclusive Lock

entry_dealloc() currently does:

LWLockAcquire(EXCLUSIVE);
  qsort(entries by usage);          // O(n log n) under lock
  evict bottom 5%;                  // O(n)
  decay usage counters;             // O(n)
LWLockRelease();

Proposed:

LWLockAcquire(SHARED);
  snapshot (entry_id, usage) pairs into local array;
LWLockRelease();
qsort locally;                      // O(n log n) NOT under lock
compute eviction set;
LWLockAcquire(EXCLUSIVE);
  evict + decay;                    // O(n) under lock
LWLockRelease();

This reduces exclusive hold time from O(n log n) to O(n). The subtle correctness issue is that usage values may change between snapshot and eviction — so the eviction set must be re-validated, or the algorithm must accept that it evicts based on slightly stale usage rankings (which is acceptable: the usage-based ranking is itself only an LRU-ish approximation).
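The restructured flow can be sketched as a single-threaded C program. This is an illustrative sketch, not the patch itself: the Entry/Snap structs, the valid flag, the 0.99 decay factor, and the 5% victim fraction are stand-ins chosen to match the text; the comments mark where the shared and exclusive lock sections would sit, and the stale-ranking behavior described above is deliberately left un-revalidated.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int id; double usage; int valid; } Entry;
typedef struct { int id; double usage; } Snap;

/* Ascending by usage, so the lowest-usage entries sort to the front. */
static int snap_cmp(const void *a, const void *b)
{
    double ua = ((const Snap *) a)->usage;
    double ub = ((const Snap *) b)->usage;
    return (ua > ub) - (ua < ub);
}

/* Evict the lowest-usage ~5% of n entries, sorting OUTSIDE the lock. */
static int entry_dealloc(Entry *entries, int n)
{
    Snap *snap = malloc(n * sizeof(Snap));
    int   nvictims = (n / 20 > 0) ? n / 20 : 1;   /* bottom 5%, at least 1 */
    int   evicted = 0;

    /* --- shared lock held here: cheap O(n) snapshot --- */
    for (int i = 0; i < n; i++)
        snap[i] = (Snap){ entries[i].id, entries[i].usage };
    /* --- shared lock released --- */

    qsort(snap, n, sizeof(Snap), snap_cmp);       /* O(n log n), no lock held */

    /* --- exclusive lock reacquired: O(n) evict + decay --- */
    for (int v = 0; v < nvictims; v++)
        for (int i = 0; i < n; i++)
            if (entries[i].valid && entries[i].id == snap[v].id)
            {
                entries[i].valid = 0;  /* stale ranking accepted, per the text */
                evicted++;
            }
    for (int i = 0; i < n; i++)
        if (entries[i].valid)
            entries[i].usage *= 0.99;  /* illustrative decay factor */
    /* --- exclusive lock released --- */

    free(snap);
    return evicted;
}
```

A real implementation would match victims by hash key rather than this O(n · victims) linear scan, but the structural point survives: only the two O(n) passes run under the exclusive lock, while the qsort() happens on a private snapshot.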

Stretch Goals

Evaluation of the Approach

Strengths:

Risks:

Participant Dynamics

This is an introductory post, so there is no debate yet. Noteworthy context:

Likely Trajectory

Expect the sort-outside-lock patch to attract quick, largely positive review and potentially land in PG20. The pending-queue patch will provoke a longer design discussion about queue-full semantics and whether pg_stat_statements_info is the right surface for reporting drops. The lock-separation stretch goal is a multi-cycle project and unlikely to complete within GSoC; the student's acknowledgement of this ("rarely lands in a single summer") is realistic and well-calibrated.