pgstat: Flush Some Statistics Within Running Transactions, Take 2
Core Problem
PostgreSQL's statistics subsystem (pgstat) accumulates counters in backend-local memory and only flushes them to shared memory at transaction boundaries. This design creates a significant observability gap: for long-running transactions or workloads where real-time statistics visibility matters, the reported statistics can be stale by minutes or even hours. Users monitoring pg_stat_user_tables, pg_stat_io, pg_stat_wal, etc. cannot see activity from backends that are mid-transaction.
This is architecturally significant because:
- Monitoring blind spots: Long-running analytics queries or batch operations are invisible to monitoring tools until they commit/rollback.
- pg_stat_statements accuracy: Extensions that build on the pgstat infrastructure, as the pg_stat_statements work discussed later in this summary aims to, inherit the same delayed visibility, which makes real-time query performance analysis unreliable.
- Capacity planning: IO and WAL statistics being deferred means operators cannot react to resource pressure in real-time.
Previous Attempt and Design Evolution
The original thread (take 1) apparently proposed automatic periodic flushing of statistics mid-transaction. Michael Paquier's feedback redirected the approach toward an on-demand API rather than automatic flushing. This is a meaningful architectural choice:
- Automatic flushing would add overhead to every transaction and introduce timing-dependent behavior that's hard to reason about.
- On-demand API gives control to the user/extension, follows the principle of least surprise, and avoids adding unconditional overhead.
Proposed Solution
The patch introduces two APIs:
SQL API: pg_stat_flush_backend(pid)
- If pid matches the calling backend → flush occurs immediately (synchronous).
- If pid is a different backend → the target is signaled and flushes at:
  - Next CHECK_FOR_INTERRUPTS() for regular backends
  - Next main-loop iteration for auxiliary processes (bgwriter, walwriter, checkpointer)
This contrasts with the existing pg_stat_force_next_flush() which only marks that a flush should happen at the next transaction boundary — still deferred.
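The dispatch rule above (flush now for the caller's own pid, request a deferred flush for any other pid) can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: DemoBackend, demo_stat_flush_backend and demo_check_for_interrupts are hypothetical names invented for illustration, a plain array stands in for the set of backends, and a boolean field stands in for the signal. None of it is the patch's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a backend; not a PostgreSQL structure. */
typedef struct DemoBackend
{
    int  pid;
    bool flush_requested;   /* set when another backend asks for a flush */
    long pending;           /* counters still in backend-local memory */
    long shared;            /* counters already visible to other sessions */
} DemoBackend;

/* Move everything pending to the "shared" side and clear the request. */
static void
demo_flush(DemoBackend *b)
{
    assert(b != NULL);
    b->shared += b->pending;
    b->pending = 0;
    b->flush_requested = false;
}

/*
 * Model of the SQL-level call: flush synchronously when pid is the caller,
 * otherwise only record a request on the target. Returns false when no
 * backend with that pid exists in the array.
 */
static bool
demo_stat_flush_backend(DemoBackend *self, DemoBackend *all, size_t n, int pid)
{
    if (pid == self->pid)
    {
        demo_flush(self);
        return true;
    }
    for (size_t i = 0; i < n; i++)
    {
        if (all[i].pid == pid)
        {
            all[i].flush_requested = true;  /* stands in for the signal */
            return true;
        }
    }
    return false;
}

/*
 * Model of the target's next safe point: the interrupt check in a regular
 * backend, or the next main-loop iteration in an auxiliary process.
 */
static void
demo_check_for_interrupts(DemoBackend *b)
{
    if (b->flush_requested)
        demo_flush(b);
}
```

In this model, a request aimed at another backend leaves that backend's shared counters unchanged until it next calls demo_check_for_interrupts(), which mirrors the asynchronous behavior described for cross-backend calls.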
C API for Extensions
The patch also adds a C-level function that flushes only the calling backend (no cross-backend signaling). It is designed specifically to support the pg_stat_statements improvements being developed in parallel, where the extension needs to flush its own statistics at precise moments.
Key Technical Design: Transactional vs. Non-Transactional Counters
The most architecturally interesting aspect is the selective flush for relation statistics:
Deferred (transaction-dependent) counters:
- tuples_inserted, tuples_updated, tuples_deleted
- live_tuples, dead_tuples estimates
These must be deferred because how they are finally accounted depends on the transaction outcome. If a transaction inserts 1000 rows and then rolls back, those rows must never be reported as live; the n_live_tup / n_dead_tup estimates can only be settled once the commit or abort is known.
Immediately flushable counters:
- seq_scan, idx_scan (scan counts)
- tuples_fetched, tuples_returned
- blocks_hit, blocks_read (buffer access)
- n_tup_hot_upd (HOT update counts)
These reflect physical work already performed regardless of transaction outcome. A sequential scan happened whether or not the transaction commits. Block reads from disk are real IO that occurred.
Non-relation statistics (unconditionally flushed):
- Function execution statistics
- IO statistics (pg_stat_io)
- WAL statistics (pg_stat_wal)
- All other pending stats
These have no transactional semantics — WAL written is WAL written, IO performed is IO performed.
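The split between deferred and immediately flushable relation counters can be sketched as two flush paths over one pending-counter struct. This is a minimal illustration, not the patch's implementation: DemoTableCounts and both functions are hypothetical, the struct holds only a subset of the counters listed above, and the end-of-transaction path simply publishes the remaining counters without modeling the commit/abort adjustments to the live and dead tuple estimates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pending-counter struct; field names follow the lists above. */
typedef struct DemoTableCounts
{
    /* physical work, valid regardless of transaction outcome */
    long seq_scan;
    long idx_scan;
    long tuples_returned;
    long tuples_fetched;
    long blocks_hit;
    long blocks_read;
    /* outcome-dependent, held back until the transaction ends */
    long tuples_inserted;
    long tuples_updated;
    long tuples_deleted;
} DemoTableCounts;

/* Add one pending counter to its shared counterpart and reset it. */
static void
demo_move(long *from, long *to)
{
    *to += *from;
    *from = 0;
}

/* Mid-transaction flush: publish only the outcome-independent counters. */
static void
demo_flush_nontransactional(DemoTableCounts *pending, DemoTableCounts *shared)
{
    assert(pending != NULL && shared != NULL);
    demo_move(&pending->seq_scan, &shared->seq_scan);
    demo_move(&pending->idx_scan, &shared->idx_scan);
    demo_move(&pending->tuples_returned, &shared->tuples_returned);
    demo_move(&pending->tuples_fetched, &shared->tuples_fetched);
    demo_move(&pending->blocks_hit, &shared->blocks_hit);
    demo_move(&pending->blocks_read, &shared->blocks_read);
}

/* End-of-transaction flush: publish whatever is still pending. */
static void
demo_flush_at_xact_end(DemoTableCounts *pending, DemoTableCounts *shared)
{
    demo_flush_nontransactional(pending, shared);
    demo_move(&pending->tuples_inserted, &shared->tuples_inserted);
    demo_move(&pending->tuples_updated, &shared->tuples_updated);
    demo_move(&pending->tuples_deleted, &shared->tuples_deleted);
}
```

Resetting each counter as it is published means a mid-transaction flush followed by the normal end-of-transaction flush never counts the same scan or block access twice.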
Signal-Based Cross-Backend Flush Mechanism
The cross-backend flush via signaling raises several implementation considerations:
- Signal safety: The actual flush doesn't happen in the signal handler but at the next safe point (CHECK_FOR_INTERRUPTS), avoiding any reentrancy issues with shared memory access.
- Auxiliary process support: These processes don't call CHECK_FOR_INTERRUPTS in the same way, so the patch hooks into their main loop iteration, a pattern already used for other deferred work in these processes.
- Best-effort semantics: There's an inherent race between signaling and the target actually flushing. The flush is not synchronous for cross-backend calls, which is appropriate for statistics (eventual consistency is acceptable).
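The signal-safety point follows a common C pattern: the handler only sets a flag, and the real work runs later at a point the program chooses. The sketch below shows that pattern in isolation using only standard C. All demo_ names are hypothetical, the choice of SIGINT is arbitrary (it is one of the few signals ISO C guarantees), and the two long variables stand in for backend-local and shared counters. It is not how the patch wires its signaling.

```c
#include <assert.h>
#include <signal.h>

/* Set from the handler, consumed at the next safe point. */
static volatile sig_atomic_t demo_flush_pending = 0;

/* Stand-ins for backend-local and shared counters. */
static long demo_pending = 0;
static long demo_shared = 0;

/* Handler does the minimum: record that a flush was requested. */
static void
demo_flush_signal_handler(int signo)
{
    (void) signo;
    demo_flush_pending = 1;
}

/* Install the handler; SIGINT is an arbitrary choice for this sketch. */
static void
demo_install_handler(void)
{
    void (*prev)(int) = signal(SIGINT, demo_flush_signal_handler);

    assert(prev != SIG_ERR);
    (void) prev;
}

/*
 * Safe point: called from ordinary code, never from the handler, so it may
 * touch non-reentrant state. Plays the role of an interrupt check or a
 * main-loop iteration.
 */
static void
demo_process_interrupts(void)
{
    if (demo_flush_pending)
    {
        demo_flush_pending = 0;
        demo_shared += demo_pending;
        demo_pending = 0;
    }
}
```

Between the signal arriving and the next call to demo_process_interrupts(), the shared counter is still stale, which is the window that makes cross-backend flushes best-effort rather than synchronous.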
Relationship to pg_stat_statements Work
The C API is explicitly motivated by parallel work on pg_stat_statements improvements. This suggests a design where pg_stat_statements can flush its accumulated query statistics at statement completion rather than waiting for transaction end — critical for seeing individual statement costs within a multi-statement transaction.
Architectural Implications
This patch represents a shift in PostgreSQL's statistics philosophy from "strictly transaction-aligned reporting" to "report what you can as early as you can." The careful separation of transactional and non-transactional counters shows mature understanding of the consistency requirements — you don't want to report phantom writes that might be rolled back, but there's no reason to defer reporting physical IO that already happened.