Deep Technical Analysis: Improve pg_stat_statements Scalability
The Core Problem
pg_stat_statements is PostgreSQL's most widely deployed extension for query performance monitoring, yet it suffers from three interconnected scalability bottlenecks that become increasingly severe on modern high-core-count hardware and workloads with high query diversity.
Problem 1: Per-Entry SpinLock Contention
Every counter update (calls, total_time, rows, blocks, etc.) acquires a per-entry SpinLock. When a "hot" query is executed concurrently across many backends, all those backends contend on the same spinlock for the same hash table entry. The problem has two compounding factors:
- Hold time grows with struct size: As the community adds more counters to pgssEntry (JIT stats, WAL stats, I/O timing, etc.), the critical section under the spinlock lengthens, increasing contention duration.
- Core count amplification: More CPUs mean more concurrent spinlock acquirers. SpinLocks use busy-wait loops, so under heavy contention, CPUs burn cycles spinning rather than doing useful work. This is a classic non-scalable pattern on NUMA architectures.
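To make the contention point concrete, here is a minimal sketch of the write path, assuming a heavily abridged pgssEntry (the real struct carries many more counters, which is exactly what lengthens the critical section):

```c
#include "postgres.h"
#include "storage/spin.h"

/* Abridged stand-in for pgssEntry; the real struct has far more counters. */
typedef struct pgssEntrySketch
{
    slock_t     mutex;          /* per-entry spinlock protecting the counters */
    int64       calls;
    double      total_time;
    int64       rows;
    /* ... JIT, WAL, and block I/O counters extend this section further ... */
} pgssEntrySketch;

/* Executed once per statement: every backend running the same hot query
 * serializes on this one spinlock, busy-waiting under contention. */
static void
update_entry(pgssEntrySketch *entry, double elapsed_ms, int64 rows)
{
    SpinLockAcquire(&entry->mutex);
    entry->calls += 1;
    entry->total_time += elapsed_ms;
    entry->rows += rows;
    SpinLockRelease(&entry->mutex);
}
```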
The benchmark's "spinlock" test (single query from 256 clients) directly measures this: all backends hammer the same entry's spinlock. The -3% delta on a 16-CPU machine is modest, but the author notes this is expected to be far worse on higher core counts where cache-line bouncing across NUMA nodes dominates.
Problem 2: Exclusive LWLock During Deallocation
The hash table is fixed-size, bounded by pg_stat_statements.max. When full, a backend performing pgss_store() must evict entries:
- It acquires the pg_stat_statements LWLock in exclusive mode
- It scans all entries to compute usage scores
- It evicts the bottom 5% (least-frequently-used)
- It releases the lock
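A hedged sketch of that flow is below; record_usage() and remove_lowest_usage_entries() are hypothetical helpers standing in for the module's actual usage-scoring and removal code:

```c
#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/hsearch.h"

static void
entry_dealloc_sketch(LWLock *pgss_lock, HTAB *pgss_hash)
{
    HASH_SEQ_STATUS hash_seq;
    void       *entry;

    /* Every other backend reading or writing pg_stat_statements blocks here. */
    LWLockAcquire(pgss_lock, LW_EXCLUSIVE);

    /* Pass 1: scan all entries, computing a decayed usage score per entry. */
    hash_seq_init(&hash_seq, pgss_hash);
    while ((entry = hash_seq_search(&hash_seq)) != NULL)
        record_usage(entry);                          /* hypothetical helper */

    /* Pass 2: evict the ~5% of entries with the lowest scores. */
    remove_lowest_usage_entries(pgss_hash, 0.05);     /* hypothetical helper */

    LWLockRelease(pgss_lock);
}
```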
During this time, every other backend attempting to read or write pg_stat_statements is blocked. This happens inline during query execution — not in a background process. Workloads with many unique queries (ORMs generating dynamic SQL, multi-tenant SaaS) trigger eviction frequently.
Compounding the issue: pg_stat_statements.max is PGC_POSTMASTER, requiring a full server restart to change. In production, operators cannot dynamically tune the table size in response to workload changes.
The benchmark quantifies this starkly: in the 100k test (continuous eviction pressure), the unpatched version shows 20,416 wait event samples on the pg_stat_statements LWLock versus essentially zero with the patch. TPS improves by 33%.
Problem 3: Query Text File GC Under Exclusive Lock
Query texts are stored in an external flat file (pg_stat_statements.stat). Deallocated entries leave dead text behind. When dead text exceeds live text (file > 2× live data), garbage collection rewrites the entire file — again under the exclusive LWLock. This combines the worst properties: disk I/O under an exclusive lock, meaning all concurrent readers and writers are blocked for the duration of a potentially large sequential file rewrite.
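A hedged sketch of the trigger condition described above, with live data estimated as the mean text length times the table capacity (the thresholds in the actual module may differ slightly):

```c
#include "postgres.h"

/* Rewrite the query text file only once it exceeds roughly twice the
 * estimated volume of live text. */
static bool
need_gc_qtexts_sketch(Size file_extent, Size mean_query_len, int pgss_max)
{
    Size    live_estimate = mean_query_len * (Size) pgss_max;

    return file_extent > live_estimate * 2;
}
```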
Proposed Solution Architecture
The patch fundamentally re-architects pg_stat_statements to use PostgreSQL's built-in pgstat (cumulative statistics) subsystem, leveraging infrastructure that has been deliberately built over two release cycles (PG18-PG19).
Design: Local Accumulation + Periodic Flush
The pgstat subsystem's core design pattern is write-local, flush-periodic:
- Each backend accumulates counter updates in backend-local memory (no locks needed)
- Periodically (or on demand), pending stats are flushed to shared memory via dshash
- The dshash is partitioned, so flushes from different backends to different entries don't contend
This eliminates the per-entry spinlock entirely on the write path. The critical architectural insight is that query statistics are cumulative counters — they only grow — so local accumulation and periodic merging is semantically correct.
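A minimal sketch of that pattern under assumed names (PendingQueryStats, accumulate_local, and flush_pending are illustrative, not the patch's actual symbols):

```c
#include "postgres.h"

/* Backend-local pending counters: updated with plain stores, no locks. */
typedef struct PendingQueryStats
{
    int64   calls;
    double  total_time;
    int64   rows;
} PendingQueryStats;

static PendingQueryStats pending;       /* one pending record per tracked query */

/* Hot path: runs once per statement and touches only local memory. */
static void
accumulate_local(double elapsed_ms, int64 rows)
{
    pending.calls += 1;
    pending.total_time += elapsed_ms;
    pending.rows += rows;
}

/* Flush path: invoked periodically by pgstat; only the dshash partition
 * holding this entry is locked, and only for this short merge. */
static void
flush_pending(PendingQueryStats *shared_entry)
{
    shared_entry->calls += pending.calls;
    shared_entry->total_time += pending.total_time;
    shared_entry->rows += pending.rows;
    memset(&pending, 0, sizeof(pending));
}
```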
Welford-to-Chan Algorithm Transition
Standard deviation computation uses Welford's online algorithm for single-stream updates. But with local accumulation, each backend computes partial Welford state that must be merged. The patch uses Chan's parallel algorithm for combining partial variance computations, which is the mathematically correct way to merge independently accumulated Welford states. This is a subtle but critical correctness detail — naive merging of variance accumulators produces incorrect results.
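For reference, the merge step can be written compactly: each partial state carries a count, a mean, and the sum of squared deviations (M2), and Chan's formula combines two such states exactly. This is a sketch, not the patch's code:

```c
#include "postgres.h"

typedef struct WelfordState
{
    int64   n;      /* number of observations */
    double  mean;   /* running mean */
    double  m2;     /* sum of squared deviations from the mean */
} WelfordState;

/* Merge B into A using Chan et al.'s parallel combination:
 *   delta = mean_B - mean_A
 *   mean  = mean_A + delta * n_B / (n_A + n_B)
 *   M2    = M2_A + M2_B + delta^2 * n_A * n_B / (n_A + n_B)
 * Variance is then M2 / (n - 1) as usual.
 */
static void
welford_merge(WelfordState *a, const WelfordState *b)
{
    int64   n = a->n + b->n;
    double  delta;

    if (b->n == 0)
        return;
    if (a->n == 0)
    {
        *a = *b;
        return;
    }

    delta = b->mean - a->mean;
    a->mean += delta * ((double) b->n / (double) n);
    a->m2 += b->m2 + delta * delta * ((double) a->n * (double) b->n / (double) n);
    a->n = n;
}
```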
Design: dshash Replaces Fixed Hash Table
The pgstat subsystem stores entries in a dshash (dynamically-sized shared hash table backed by DSA memory). This provides:
- Dynamic resizing: The hash table grows as needed without restart. pg_stat_statements.max becomes PGC_SIGHUP.
- Partition-level locking: dshash uses per-partition locks rather than a single LWLock, dramatically reducing contention on lookups and insertions.
- Standard lifecycle management: Entry creation, deletion, and lookup follow pgstat conventions rather than bespoke logic.
The hash key design is notable: pgstat's PgStat_HashKey has a single uint64 objid field, so the patch must combine (userid, queryid, toplevel) into a single 64-bit hash. This is a potential collision concern, though with good hash combining it should be negligible in practice.
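The thread does not spell out the exact combination, but one plausible scheme folds the three fields through PostgreSQL's 64-bit hash mixer; pgss_objid() below is hypothetical:

```c
#include "postgres.h"
#include "common/hashfn.h"

/*
 * One plausible way to fold (userid, queryid, toplevel) into the single
 * uint64 objid that PgStat_HashKey offers; the patch's actual scheme may
 * differ. hash_combine64() is PostgreSQL's standard 64-bit hash mixer.
 */
static uint64
pgss_objid(Oid userid, uint64 queryid, bool toplevel)
{
    uint64  h = queryid;

    h = hash_combine64(h, (uint64) userid);
    h = hash_combine64(h, (uint64) toplevel);
    return h;
}
```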
Design: Throttled Inline Eviction
Instead of blocking all backends behind an exclusive lock for eviction, the patch implements a non-blocking throttled approach:
- When entry count reaches pgss_max, a backend attempts eviction using a conditional lock (try-lock, non-blocking)
- A shared timestamp ensures at most one eviction cycle per 10 seconds (EVICTION_INTERVAL_MS)
- If eviction is throttled or the lock isn't available, the backend skips entry creation rather than blocking
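A hedged sketch of the throttle-plus-try-lock combination; the timestamp lives in shared memory in the real patch (it is a plain static here for brevity) and the eviction body is elided:

```c
#include "postgres.h"
#include "storage/lwlock.h"
#include "utils/timestamp.h"

#define EVICTION_INTERVAL_MS    10000   /* per the description above */

/* Simplification: the real patch keeps this timestamp in shared state. */
static TimestampTz last_eviction = 0;

static bool
maybe_evict(LWLock *evict_lock)
{
    TimestampTz now = GetCurrentTimestamp();

    /* Throttle: at most one eviction cycle per interval. */
    if (!TimestampDifferenceExceeds(last_eviction, now, EVICTION_INTERVAL_MS))
        return false;

    /* Non-blocking: if another backend is already evicting, give up. */
    if (!LWLockConditionalAcquire(evict_lock, LW_EXCLUSIVE))
        return false;

    last_eviction = now;
    /* ... drop enough low-usage entries to get back under pgss_max ... */
    LWLockRelease(evict_lock);
    return true;
}
```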
This is a deliberate trade-off: some entries may be lost during high-churn periods. But the author correctly argues this is acceptable because the current design already loses entries (newly created entries are frequently evicted in the next deallocation cycle anyway). The throttled approach makes this data loss explicit and bounded rather than causing cascading performance degradation.
The benchmark validates this: under heavy churn (100k test), both designs retain all 1000 hot entries. The throttled design actually shows better hot entry call counts (52-58k vs 42-44k) because backends aren't blocked waiting for eviction. Deallocation events drop from 37,972 to just 30, confirming the throttling is effective.
The sticky-entry mechanism is simplified: entries with calls == 0 (created but not yet executed to completion) are simply never evicted, replacing the more complex sticky-usage logic.
Design: DSA-Backed Query Text with File Fallback
A new GUC pg_stat_statements.query_text_memory (default 64 MB) allocates DSA shared memory for query text storage:
- New texts go to DSA memory first (no file I/O on the write or read path)
- If DSA is full or disabled (query_text_memory = 0), texts fall back to the existing file
- Entry eviction and reset free DSA-allocated texts
- File GC skips DSA-backed entries
This is a pragmatic hybrid design. It eliminates file I/O for the common case while preserving the file path as a safety valve for very large text volumes. The 64 MB default is an open question — it needs to balance memory usage against the number of unique query texts in typical deployments.
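A sketch of the allocation-with-fallback step, assuming the standard DSA API; the function name and the caller's file fallback are illustrative:

```c
#include "postgres.h"
#include "utils/dsa.h"

/*
 * Try to place a query text in DSA memory. Returns InvalidDsaPointer when
 * the feature is disabled or DSA is exhausted, in which case the caller
 * falls back to the existing flat-file path.
 */
static dsa_pointer
store_query_text_dsa(dsa_area *text_area, const char *query, Size len)
{
    dsa_pointer ptr;
    char       *dst;

    /* Feature disabled (query_text_memory = 0): use the file path. */
    if (text_area == NULL)
        return InvalidDsaPointer;

    /* DSA_ALLOC_NO_OOM returns InvalidDsaPointer instead of erroring,
     * so exhaustion degrades to the file fallback rather than failing. */
    ptr = dsa_allocate_extended(text_area, len + 1, DSA_ALLOC_NO_OOM);
    if (!DsaPointerIsValid(ptr))
        return InvalidDsaPointer;

    dst = dsa_get_address(text_area, ptr);
    memcpy(dst, query, len);
    dst[len] = '\0';
    return ptr;
}
```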
Design: pgstat Serialization Callbacks
The patch uses to_serialized_data / from_serialized_data callbacks (added in PG19 via commit 4ba012a8ed9) to handle persistence across restarts. This replaces pg_stat_statements' bespoke shutdown/startup file logic with the standard pgstat serialization framework, reducing maintenance burden and ensuring consistency with how all other stats kinds handle persistence.
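Under those assumptions, registration might look roughly like the sketch below; the callback field names follow the commit cited above, but the surrounding PgStat_KindInfo fields are abbreviated and the handler functions are hypothetical:

```c
#include "postgres.h"
#include "utils/pgstat_internal.h"

/* Hypothetical handlers; the real patch supplies its own implementations. */
static void pgss_to_serialized(void);
static void pgss_from_serialized(void);
static bool pgss_flush_cb(bool nowait);

static const PgStat_KindInfo pgss_kind_info = {
    .name = "pg_stat_statements",
    .fixed_amount = false,              /* variable-numbered: one entry per query */
    .flush_pending_cb = pgss_flush_cb,  /* merge backend-local pending stats */
    .to_serialized_data = pgss_to_serialized,       /* persist entries at shutdown */
    .from_serialized_data = pgss_from_serialized,   /* reload entries at startup */
};

void
_PG_init(void)
{
    /* PGSTAT_KIND_EXPERIMENTAL is the placeholder ID the patch currently uses. */
    pgstat_register_kind(PGSTAT_KIND_EXPERIMENTAL, &pgss_kind_info);
}
```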
Key Open Design Questions
1. Stats Kind ID Assignment
The patch uses PGSTAT_KIND_EXPERIMENTAL. For commit, a proper kind number must be allocated. This is administrative but reflects the broader question of whether pg_stat_statements should be a "built-in custom kind" or a "true custom kind."
2. Eviction Throttle: GUC vs. Constant
The 10-second interval is hardcoded. Making it a GUC adds tunability but also complexity. Given that the throttle is a safety mechanism rather than a precision control, a fixed constant is likely sufficient. However, workloads with extremely high query churn (millions of unique queries per minute) might benefit from a shorter interval.
3. Soft Limit vs. Hard Reject
The current design hard-rejects new entries when at capacity and eviction is throttled. A soft limit allowing temporary overshoot would capture more entries but risks unbounded memory growth under pathological workloads. The author's argument against soft limits is sound: heavy churn could overshoot by orders of magnitude, and the pgstat dshash, while dynamic, still consumes shared memory.
4. Default Query Text Memory Size
64 MB is reasonable for most deployments but could be insufficient for workloads with many long query texts. This should probably be documented in terms of "approximately N query texts of average length M."
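For example, assuming an average stored text of roughly 1 kB, 64 MB holds on the order of 65,000 distinct query texts (64 MiB / 1 KiB ≈ 65,500); workloads dominated by multi-kilobyte ORM-generated statements would fit proportionally fewer.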
Architectural Significance
This patch represents a maturation of the pgstat extensibility infrastructure. The two-cycle investment in building pluggable stats kinds (PG18) and serialization callbacks (PG19) is paying off: pg_stat_statements — the most important consumer — can now be re-architected without modifying core pgstat code.
The broader implication is that other extensions (pg_stat_kcache, pg_wait_sampling, auto_explain aggregation) could follow the same pattern, moving from bespoke shared memory management to the pgstat framework. This establishes pg_stat_statements as the reference implementation for pgstat-based extension statistics.
The elimination of spinlock contention on the write path is particularly significant as PostgreSQL targets workloads on hardware with 100+ cores, where spinlock-based designs become actively harmful due to cache-line bouncing across NUMA domains.
Benchmark Analysis
The benchmark is well-designed with three workloads targeting different bottlenecks:
| Test | Target Bottleneck | TPS Delta | Analysis |
|---|---|---|---|
| 5k | Baseline (no contention) | -0.9% | Negligible regression; pgstat flush overhead is minimal |
| 100k | LWLock eviction contention | +33% | Dramatic improvement; eliminates exclusive lock blocking |
| spinlock | Per-entry spinlock | -3% | Slight regression on 16 cores; expected to improve on higher core counts |
The -3% spinlock regression on 16 cores is worth investigating. It likely reflects the overhead of pgstat's flush mechanism (periodic shared memory writes) versus the direct spinlock update. On 16 cores, spinlock contention isn't yet catastrophic enough for the flush overhead to be a net win. On 64+ core machines, the crossover point should favor the patch significantly.
The wait event data is the most compelling evidence: under eviction pressure, 20,416 LWLock wait samples are reduced to essentially zero. This means backends that were previously blocked for significant fractions of their execution time are now doing useful work.