2026-06-04 · claude-opus-4-6
Incremental Update: Review of v3-0001 (In-Transaction Flush) — Architectural Debate on pgstat Flush Design
What's New
The discussion has shifted focus to the foundational patch 0001 (pg_stat_report_anytime), with a detailed review from Kyotaro Horiguchi (NTT) raising architectural concerns about how mid-transaction flushes interact with the existing pgstat callback design. Michael Paquier engaged with these concerns, and the three participants (Horiguchi, Paquier, Sami) are converging on a redesign of the flush entry point.
Key Technical Issues Raised
1. Naming Confusion: pg_stat_report_anytime() → pg_stat_flush_anytime_stats()
Horiguchi argues the SQL-visible function name is unclear. "report" is implementation jargon; "anytime" doesn't convey semantics. He proposes pg_stat_flush_anytime_stats() to align with existing pg_stat_force_next_flush() naming convention.
2. Missing Throttling for Anytime Flushes
The current v3-0001 implementation applies no rate-limiting to the anytime flush API. Horiguchi notes that 1000 backends each calling it once per second would produce 1000 shared-stats updates/sec, whereas normal pgstat paths use PGSTAT_MIN_INTERVAL to throttle. Sami responds that throttling makes tests slower, proposing to disable PGSTAT_MIN_INTERVAL when injection points are enabled (similar to the approach in commit f1e251be80a). Horiguchi conditionally agrees: if the API is testing-only, throttling is unnecessary; if it's general-purpose, pg_stat_force_next_flush() semantics should apply to anytime flushes too.
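To make the throttling question concrete, here is a minimal sketch of an interval-based gate with a force override. The helper name `sketch_flush_allowed()`, the millisecond timestamps, and the 1000 ms interval in the usage note are assumptions for illustration; this is not code from the patch or from pgstat.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether an anytime flush may proceed.
 * 'min_interval_ms' stands in for a PGSTAT_MIN_INTERVAL-style limit and
 * 'force' models a pg_stat_force_next_flush()-style override.
 */
static bool
sketch_flush_allowed(int64_t now_ms, int64_t last_flush_ms,
                     int64_t min_interval_ms, bool force)
{
    assert(min_interval_ms >= 0);

    if (force)
        return true;            /* caller explicitly asked to bypass throttling */

    /* Otherwise allow at most one flush per interval. */
    return (now_ms - last_flush_ms) >= min_interval_ms;
}
```

With an interval of 1000 ms, a call 500 ms after the last flush is rejected and a call 1000 ms after it is accepted. A test-only build could pass an interval of 0, which is one way to read Sami's suggestion.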
3. Fundamental Design Problem: Conflation of "Flush" and "Free" Semantics in Callbacks
This is the most architecturally significant issue. Horiguchi points out that mid-transaction flushing splits apart two decisions that used to be a single one, while the patch still folds both into the callback's return value:
Before (transaction-boundary only): flush_pending_cb() returning true meant "flushed and entry can be freed" — these were always the same decision.
After (with anytime flush): An entry may be flushable (counters can be reported to shared stats) but NOT freeable (because AtEOXact_PgStat_Relations still needs it). The patch handles this by checking lstats->trans == NULL inside the callback, but this muddies the callback's return semantics.
Horiguchi's proposed solution: Pass an explicit flush context (anytime vs. transaction-boundary) to the callback. The callback reports only whether counters remain; the caller makes lifetime decisions based on context. This keeps the callback's return value meaning clean.
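One possible reading of this proposal, as a sketch: the callback receives the context and reports only whether counters remain, and the caller alone decides whether the entry may be freed. All names here (`SketchFlushContext`, `SketchPending`, `sketch_flush_cb`, `sketch_flush_entry`) and the split into transactional and non-transactional counters are hypothetical; the real callback signature may look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flush contexts. */
typedef enum SketchFlushContext
{
    SKETCH_FLUSH_XACT_BOUNDARY,
    SKETCH_FLUSH_ANYTIME
} SketchFlushContext;

/* Hypothetical pending entry: some counters depend on transaction outcome. */
typedef struct SketchPending
{
    long nontxn_count;  /* safe to report at any time */
    long txn_count;     /* must wait for the transaction boundary */
} SketchPending;

/* Callback: flush what the context allows; return true if counters remain. */
static bool
sketch_flush_cb(SketchPending *e, long *shared, SketchFlushContext ctx)
{
    assert(e != NULL && shared != NULL);

    *shared += e->nontxn_count;
    e->nontxn_count = 0;

    if (ctx == SKETCH_FLUSH_XACT_BOUNDARY)
    {
        *shared += e->txn_count;
        e->txn_count = 0;
    }
    return e->txn_count != 0;
}

/* Caller: owns the lifetime decision. Returns true if the entry may be freed. */
static bool
sketch_flush_entry(SketchPending *e, long *shared, SketchFlushContext ctx)
{
    bool remain = sketch_flush_cb(e, shared, ctx);

    /* Mid-transaction, the entry is kept even when nothing remains. */
    return ctx == SKETCH_FLUSH_XACT_BOUNDARY && !remain;
}
```

In this shape the return value of the callback never encodes "free me", so the lstats->trans == NULL check described above would not need to live inside the callback.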
4. Single Entry Point vs. Dual Entry Point for Flushing
A higher-level design question emerged: should pgstat_report_stat() remain the sole flush entry point (with an added context argument), or should there be a separate routine for anytime flushes?
Paquier's preference: Single entry point — add a context argument to pgstat_report_stat() and lift its !IsTransactionOrTransactionBlock() requirement. Simpler long-term.
Horiguchi agrees: The basic operation is common; anytime differs only in context. The flush-timing logic gets slightly more complex, but the existing path works with minor adjustments by carrying context down to callbacks.
5. Assertion Reformulation
Horiguchi suggests rewriting the relaxed assertion from:
Assert(did_flush || nowait || IsTransactionOrTransactionBlock());
to:
Assert(IsTransactionOrTransactionBlock() || (did_flush || nowait));
This better communicates the intent: "the existing requirement is relaxed during transactions."
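The two forms are logically equivalent, since || is commutative and associative and the operands are assumed to be side-effect free; only the reading order changes. A small sketch that checks all eight input combinations, with plain booleans standing in for the real expressions (the function names are made up for the illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Original ordering of the relaxed assertion's condition. */
static bool
sketch_old_form(bool did_flush, bool nowait, bool in_xact)
{
    return did_flush || nowait || in_xact;
}

/* Proposed ordering: transaction state first, old requirement grouped. */
static bool
sketch_new_form(bool did_flush, bool nowait, bool in_xact)
{
    return in_xact || (did_flush || nowait);
}

/* Count input combinations on which the two forms disagree. */
static int
sketch_count_mismatches(void)
{
    int mismatches = 0;

    for (int bits = 0; bits < 8; bits++)
    {
        bool did_flush = (bits & 1) != 0;
        bool nowait = (bits & 2) != 0;
        bool in_xact = (bits & 4) != 0;

        if (sketch_old_form(did_flush, nowait, in_xact) !=
            sketch_new_form(did_flush, nowait, in_xact))
            mismatches++;
    }
    return mismatches;
}

/* Convenience wrapper: aborts if the forms ever differ. */
static void
sketch_assert_equivalent(void)
{
    assert(sketch_count_mismatches() == 0);
}
```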
Emerging Consensus on 0001 Redesign
The three participants are converging on:
- Single flush entry point (pgstat_report_stat() with context parameter)
- Explicit context passed to callbacks (not inferred from transaction state)
- Throttling conditional on whether API is test-only or general-purpose
- Cleaner callback semantics where return value means "counters remain" independent of lifetime
History (2 prior analyses)
2026-06-01 · claude-opus-4-6
Monthly Summary: Improve pg_stat_statements Scalability — May 2026
Overview
May 2026 saw the pg_stat_statements scalability patch evolve from initial proposal through architectural refinement, CI fixes, and the first substantive community review. The patch fundamentally re-architects pg_stat_statements to use PostgreSQL's built-in pgstat cumulative statistics subsystem, eliminating per-entry spinlock contention, replacing the fixed-size hash table with a dynamically-sized dshash, and introducing DSA-backed query text storage. Benchmark results show a 33% TPS improvement under eviction pressure and elimination of ~20,000 LWLock wait events.
Core Problem Statement
pg_stat_statements suffers from three scalability bottlenecks on modern high-core-count hardware:
- Per-entry spinlock contention — All backends updating the same hot query contend on a single spinlock, with hold time growing as more counters are added to the struct.
- Exclusive LWLock during deallocation — When the fixed-size hash table fills, inline eviction under an exclusive lock blocks all concurrent readers/writers.
- Query text file GC under exclusive lock — Disk I/O for garbage collection occurs while holding the exclusive LWLock.
Proposed Architecture
The patch implements a write-local, flush-periodic design leveraging pgstat infrastructure built over PG18-PG19:
- Local accumulation: Backends accumulate counter updates in local memory (no locks), periodically flushing to shared dshash
- dshash replacement: Dynamic shared hash table with partition-level locking replaces the fixed-size table; pg_stat_statements.max becomes PGC_SIGHUP
- Throttled eviction: Non-blocking conditional-lock eviction with a 10-second interval; entries are skipped rather than blocking backends
- DSA-backed query text: New pg_stat_statements.query_text_memory GUC (default 64MB) stores texts in shared memory with file fallback
- Chan's parallel algorithm: Replaces Welford's for merging partial variance computations from independent backends
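The merge step in the last bullet can be sketched as follows. Each backend is assumed to hold a partial aggregate of count, mean, and sum of squared deviations (M2); the struct and function names are hypothetical and the patch's actual field layout may differ. The formulas are the standard pairwise combination: n = n_a + n_b, delta = mean_b - mean_a, mean = mean_a + delta * n_b / n, M2 = M2_a + M2_b + delta^2 * n_a * n_b / n.

```c
#include <assert.h>

/* Hypothetical partial aggregate kept per backend. */
typedef struct SketchVar
{
    long   n;       /* number of samples */
    double mean;    /* running mean */
    double m2;      /* sum of squared deviations from the mean */
} SketchVar;

/* Combine two independent partial aggregates into one. */
static SketchVar
sketch_var_merge(SketchVar a, SketchVar b)
{
    SketchVar out;
    double    delta;

    assert(a.n >= 0 && b.n >= 0);

    /* An empty side contributes nothing. */
    if (a.n == 0)
        return b;
    if (b.n == 0)
        return a;

    out.n = a.n + b.n;
    delta = b.mean - a.mean;
    out.mean = a.mean + delta * ((double) b.n / (double) out.n);
    out.m2 = a.m2 + b.m2 +
        delta * delta * ((double) a.n * (double) b.n / (double) out.n);
    return out;
}
```

For example, merging the aggregate of samples {1, 2} with that of {3, 4} yields n = 4, mean = 2.5, and M2 = 5, the same as computing over {1, 2, 3, 4} directly. Welford's update handles one new sample at a time, which is why a merge of two independently accumulated aggregates needs the pairwise form.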
Key Developments This Month
Patch Reorganization (4-part series)
| Patch | Content |
|---|---|
| 0001 | In-transaction flush for pgstat (separate CF entry #6781; hard dependency) |
| 0002 | Move hash to custom pgstat kind (core change) |
| 0003 | Query text DSA + file hybrid storage |
| 0004 | New columns for pg_stat_statements_info |
Parallel Query Safety Fix (FreeBSD CI)
A FreeBSD CI failure revealed that parallel workers cannot flush stats they didn't accumulate — a direct consequence of the write-local design. The fix marks pg_stat_statements() and pg_stat_statements_reset() as PARALLEL RESTRICTED (version bump to 1.14), a minor behavioral regression from the current PARALLEL SAFE marking that only affects developer/testing scenarios.
First Substantive Review (Lukas Fittl)
Lukas raised several significant issues:
- Rollback handling bug: tuples_hot_updated and tuples_newpage_updated not properly ignored on rollback in patch 0001
- Unnecessary eager normalization: generate_normalized_query called before checking if text storage is needed
- Background worker proposal: Eviction should trigger at 90% capacity and run in a background worker rather than inline — eliminates the "5001st entry always lost" problem
- Query text architecture alternative: DSA as write buffer with background flush to disk (avoiding performance cliff at memory-to-disk boundary)
- Memory default concerns: 64MB may be too high for small instances; suggests auto-scaling based on shared_buffers
PGConf.Dev Discussion
An in-person meeting produced a wiki page (wiki.postgresql.org/wiki/Scalability_of_pg_stat_statements) consolidating design direction and building consensus.
Benchmark Results (16-core machine)
| Test | Bottleneck | Improvement | Key Evidence |
|---|---|---|---|
| 5k (baseline) | None | -0.9% | Negligible pgstat flush overhead |
| 100k (eviction) | LWLock | +33% | 20,416 → ~0 LWLock wait samples |
| spinlock (hot query) | Per-entry | -3% | Expected to improve at higher core counts |
Core Design Tension
The fundamental disagreement emerging is:
- Sami's approach: Minimize complexity, use pgstat infrastructure as-is, accept bounded data loss during churn
- Lukas's approach: Background worker for eviction and text I/O, providing better worst-case behavior at the cost of additional infrastructure
This divergence will likely shape the next revision.
Status
The patch is functional with CI passing after fixes, but requires resolution of the design questions raised in review before it can progress toward commit. The hard dependency on the in-transaction pgstat flush mechanism (CF #6781) means both patches must advance together.
2026-06-01 · claude-opus-4-6
Incremental Update: v3 Patchset with Major Eviction Redesign + First Bug Reports
What's New
Two significant developments: (1) Sami posted a substantially redesigned v3 patchset incorporating feedback from the PGConf.Dev unconference, with a completely new eviction algorithm and simplified query text storage; (2) Zsolt Parragi identified correctness bugs in the new min/max timing reset logic and a potential integer overflow.
Major Design Changes in v3
1. Eviction: Clock-Sweep Replaces Throttled Skip
The v2 "conditional-lock + throttle + skip entry" eviction design has been completely replaced with a parallel clock-sweep algorithm. This directly addresses Lukas Fittl's criticism that the 5001st entry on a steady workload would always be lost.
New algorithm:
- Each entry carries a refcount (capped at 10) that increases proportionally to access frequency
- An atomic rotating hand cycles through dshash partitions
- On eviction, a backend sweeps one partition, decrementing refcounts
- Entries reaching refcount zero are evicted
- Eviction targets 5% headroom (entries are swept until count drops to 95% of max)
- Multiple backends can sweep different partitions concurrently (parallel eviction per Andres's suggestion)
This is architecturally superior to v2's single-threaded eviction because:
- New entries survive if they arrive with a fresh refcount during a sweep cycle
- Hot entries maintain high refcounts and are never evicted
- No backend ever blocks other backends — each sweeps its own partition independently
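The clock-sweep steps listed above can be sketched in simplified form. This is an illustration under several assumptions, not the patch's code: a partition is modeled as a plain array, all names are hypothetical, and an entry is evicted when the hand finds its refcount already at zero (the patch may instead evict as soon as a decrement reaches zero). Locking is omitted apart from the atomic hand.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_REFCOUNT_CAP 10      /* cap stated in the summary above */

typedef struct SketchEntry
{
    int refcount;   /* bumped on access, decremented by the sweep */
    int live;       /* 1 while the entry exists */
} SketchEntry;

/* Access path: bump the refcount, saturating at the cap. */
static void
sketch_touch(SketchEntry *e)
{
    if (e->refcount < SKETCH_REFCOUNT_CAP)
        e->refcount++;
}

/* Rotating hand: each caller atomically claims the next partition. */
static unsigned
sketch_next_partition(atomic_uint *hand, unsigned nparts)
{
    return atomic_fetch_add(hand, 1) % nparts;
}

/*
 * Sweep one partition until the global count drops to 'target'
 * (for example 95% of max). Returns the number of entries evicted.
 */
static size_t
sketch_sweep_partition(SketchEntry *part, size_t len,
                       size_t *nentries, size_t target)
{
    size_t evicted = 0;

    assert(part != NULL && nentries != NULL);

    for (size_t i = 0; i < len && *nentries > target; i++)
    {
        if (!part[i].live)
            continue;
        if (part[i].refcount > 0)
            part[i].refcount--;     /* give the entry another chance */
        else
        {
            part[i].live = 0;       /* cold entry: evict */
            (*nentries)--;
            evicted++;
        }
    }
    return evicted;
}
```

Because each sweeper works on the partition the hand gave it, two backends calling `sketch_next_partition()` at the same time get different partitions, which is the property the parallel-eviction bullet relies on.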
Key infrastructure requirement: Two new core patches were needed:
- pgstat_drop_entry() must tolerate already-dropped entries (race between concurrent sweepers)
- A new dshash_seq_init_partition() API restricts sequential scans to a single partition
2. Query Text: Pure DSA (File Eliminated Entirely)
The v2 "DSA primary + file fallback" hybrid has been simplified to DSA-only storage. The rationale from the unconference (attributed to Andres): since GC and reading already require loading all text into memory transiently, the machine must have that memory available anyway — so just keep it in DSA permanently.
Key changes:
- Default query_text_memory reduced from 64MB to 4MB (a 16× reduction)
- When DSA is exhausted, entries still exist but query text is NULL
- A backfill mechanism recovers text on subsequent executions when space becomes available
- The entire GC machinery for the query text file is eliminated
3. Background Worker Explicitly Rejected
The unconference consensus was that backpressure is important — a background worker is asynchronous and cannot provide backpressure when the system is under memory pressure. Additionally, failure to spawn a background worker creates operational complexity. This closes the design discussion Lukas opened.
4. Dealloc Counter Semantics Change
pg_stat_statements_info.dealloc now counts individual entries evicted rather than eviction invocations. Under clock-sweep, a single pass removes many entries across a partition, so counting passes is no longer meaningful. This is a semantic break between versions.
Benchmark Results (v3)
| Metric | Patch | Upstream | Analysis |
|---|---|---|---|
| select1 (no contention) | -1.0% | baseline | Pure pgstat overhead; pgstat_get_entry_ref() at top of profile |
| LWLock waits (churn) | 502 (PgStatsDSA) | 7,757 (pg_stat_statements) | 15× reduction |
| Hot entry retention | 1000/1000 | 1000/1000 | Parity |
| Cold entry calls | 805 | 4,458 | Entries evicted sooner under per-partition sweep |
| Deallocs (high churn) | 11.9M | 38.5K | Clock-sweep fires frequently in small batches vs. infrequent large batches |
The cold_calls reduction (805 vs 4,458) reveals a behavioral difference: clock-sweep evicts cold entries more aggressively because it sweeps per-partition frequently. The author notes USAGE_DEALLOC_PERCENT could be reduced to improve cold retention.
Correctness Bugs Found by Zsolt Parragi
Bug 1: minmax_only Reset Incorrectly Zeros mean/sum_var_time
The minmax-only reset path (which should only reset min/max timing, not affect mean/variance) is also zeroing mean_time and sum_var_time. When these are subsequently recalculated, they use data from before the reset, producing incorrect variance statistics. This is a clear bug — the minmax reset should leave mean/variance untouched.
Bug 2: min_time Cannot Recover After minmax Reset
After reset, min_time is 0. The update logic only writes a new min if pending->min_time < shared->min_time. Since 0 is less than any positive timing value, the condition is never satisfied, and min_time stays at 0 permanently (until a full reset). The fix should check whether shared->min_time == 0 (indicating post-reset state) and unconditionally accept the first new value.
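A minimal sketch of the suggested fix, assuming 0 is used as the post-reset marker as described. The struct and function names are hypothetical. Note that a zero sentinel cannot be told apart from a genuinely measured minimum of 0, so a separate flag might be a cleaner design; that is an observation here, not something proposed in the thread.

```c
#include <assert.h>

/* Hypothetical min/max timing pair. */
typedef struct SketchTiming
{
    double min_time;
    double max_time;
} SketchTiming;

/* Fold a backend's pending min/max into the shared entry. */
static void
sketch_merge_minmax(SketchTiming *shared, const SketchTiming *pending)
{
    assert(pending->min_time <= pending->max_time);

    /*
     * Assumption: min_time == 0 marks the post-reset state, so the
     * first value after a reset is accepted unconditionally.
     */
    if (shared->min_time == 0 || pending->min_time < shared->min_time)
        shared->min_time = pending->min_time;

    if (pending->max_time > shared->max_time)
        shared->max_time = pending->max_time;
}
```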
Bug 3: Potential Integer Overflow in Entry Count Check
if ((int) pg_atomic_read_u64(&pgss_shared->nentries) <= pgss_max * (100 - USAGE_DEALLOC_PERCENT) / 100)
The multiplication pgss_max * (100 - USAGE_DEALLOC_PERCENT) could overflow int for large values of pgss_max. The author noted that a similar pattern elsewhere uses explicit uint64 casts.
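One way to avoid the overflow is to widen before multiplying and to compare in 64-bit space rather than casting the counter down to int. The sketch below uses hypothetical names and an illustrative 5% constant standing in for USAGE_DEALLOC_PERCENT; it is not the author's fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_USAGE_DEALLOC_PERCENT 5  /* illustrative value */

/* True if the entry count is at or below the post-eviction target. */
static bool
sketch_at_or_below_target(uint64_t nentries, int pgss_max)
{
    uint64_t target;

    assert(pgss_max >= 0);

    /* Widen first so the product cannot overflow a 32-bit int. */
    target = (uint64_t) pgss_max *
        (100 - SKETCH_USAGE_DEALLOC_PERCENT) / 100;

    /* Compare as uint64; no narrowing cast of the atomic counter. */
    return nentries <= target;
}
```

With a 32-bit int, pgss_max * 95 exceeds INT_MAX once pgss_max passes roughly 22.6 million, so the unwidened expression is only safe below that.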
Patch Series Structure (v3)
| Patch | Content | Status |
|---|---|---|
| 0001 | pg_stat_report_anytime() - in-transaction flush | Unchanged from v2 |
| 0002 | pgstat_drop_entry() tolerate already-dropped | NEW - core infrastructure for parallel eviction |
| 0003 | dshash_seq_init_partition() - partition-scoped scan | NEW - core infrastructure for clock-sweep |
| 0004 | Main pg_stat_statements modernization | Substantially rewritten (clock-sweep, dshash, pgstat kind) |
| 0005 | Query text in DSA (file eliminated) | Simplified from v2 (no file fallback) |