Technical Analysis: Adding last_execution_start to pg_stat_statements
The Core Problem
pg_stat_statements is PostgreSQL's most widely-used monitoring extension, yet it lacks a fundamental piece of temporal metadata: when was a given statement last executed? Without this, monitoring systems must resort to polling all counters (e.g., calls, total_exec_time) and diffing them to detect activity — an expensive and indirect approach. A simple last_execution_start timestamp would allow queries like WHERE last_execution_start >= $1 to instantly identify recently-active statements, enabling efficient incremental monitoring.
This patch is the successor to a prior stats_last_updated effort that was rejected primarily on performance grounds: calling GetCurrentTimestamp() (which issues a gettimeofday() syscall) inside the hot stats-accumulation path (pgss_store()) caused a measured 5–6% TPS regression on short-transaction workloads.
Architectural Context: Why This Is Harder Than It Looks
The GetCurrentStatementStartTimestamp() Shortcut
The v1 patch's key insight is that PostgreSQL already tracks statement start time in a process-local variable (stmtStartTimestamp), set at the beginning of each statement cycle. Reading this variable is essentially free — no syscall, no lock, just a memory read. This eliminates the performance concern that killed the predecessor patch.
Benchmark validation: v1 showed ~0.1% TPS difference (within noise), compared to the 5–6% regression of the GetCurrentTimestamp() approach.
The Deferred ExecutorEnd Problem
However, this elegant approach has a critical correctness flaw exposed by the extended query protocol. In the extended protocol (used by JDBC, libpq in pipeline mode, etc.), ExecutorEnd for a statement can be deferred until the next Bind message arrives. This means:
- Statement A starts at time T1, executes, but ExecutorEnd is deferred.
- Statement B starts at time T2, its Bind triggers ExecutorEnd for A.
- When
pgss_store()runs for A (during its deferred ExecutorEnd),GetCurrentStatementStartTimestamp()now returns T2, not T1.
This is not a theoretical edge case — it's the normal code path for any client using the extended query protocol with multiple open portals or cursors with fetch sizes. Sami Seih demonstrated this concretely with a JDBC test showing two portals both getting the timestamp of the second portal's start time.
Evolution of Solutions
v1: Direct GetCurrentStatementStartTimestamp() in pgss_store()
- Approach: Read the process-local timestamp variable at stats accumulation time.
- Problem: Incorrect under deferred ExecutorEnd (extended query protocol).
- Status: Rejected.
v2: Bounded Static Array Indexed by Nesting Level
- Approach: Capture
GetCurrentStatementStartTimestamp()inpgss_ExecutorStartand store it in a static arraypgss_exec_start[nesting_level](max 64 levels). Read it back inpgss_ExecutorEnd. - Problems identified by Sami Seih:
- Nesting level is the wrong axis: The problem isn't nested statements — it's multiple concurrent portals at the same nesting level.
GetCurrentStatementStartTimestamp()always returns the top-level statement start time regardless of nesting, so the array doesn't help. - Arbitrary bound of 64: The hardcoded limit is architecturally unclean and the fallback to direct timestamp reading reintroduces the bug.
- Multiple portals break it: With two portals open at the same nesting level (e.g., JDBC fetch-size cursors), both slots would be overwritten and you'd get the wrong timestamp for at least one portal.
- Nesting level is the wrong axis: The problem isn't nested statements — it's multiple concurrent portals at the same nesting level.
- Status: Rejected.
v3: Store Timestamp in EState (queryDesc->estate)
- Approach: Add a
TimestampTzfield toEState(the executor state structure). Set it inpgss_ExecutorStartimmediately afterstandard_ExecutorStartcreates the estate. Read it back inpgss_ExecutorEnd. - Why this is correct: Each query execution gets its own
QueryDescwith its ownEState. Even with multiple concurrent portals, each portal has a distinctEState, so timestamps are correctly isolated per-execution. This follows the existing pattern used byes_processed,es_parallel_workers_to_launch, etc. - For utility statements: No
EStateexists, so these fall back to directGetCurrentStatementStartTimestamp(), which is correct because utility statement ExecutorEnd is not deferred. - Trade-off: This touches core infrastructure (
EStatestruct inexecnodes.h), adding a field that exists primarily for pg_stat_statements' benefit. Sami notes that other EState fields consumed by pgss (likees_processed) have use-cases beyond pgss, but argues this isn't a sufficient reason to avoid the pattern. - Status: Current approach, benchmarks show ~1.1% TPS improvement (within noise — effectively zero overhead).
v4: Rebase Only
- Mechanical rebase to resolve a conflict with commit 66366217822 (which introduced
saved_planOriginin nested utility execution). No functional changes.
Key Design Decisions and Trade-offs
1. Statement Start Time vs. Clock Timestamp
The patch deliberately records the statement start timestamp (as known to PostgreSQL's statement lifecycle), not the wall-clock time at stats recording. This means:
- For
toplevel = false(nested) statements, the timestamp reflects the top-level statement's start time, not the nested statement's own start. This is documented as acceptable behavior. - The timestamp is always <= the actual wall-clock time of execution.
2. Reset Semantics
last_execution_start is:
- Reset by
pg_stat_statements_reset()(full reset). - Preserved across minmax-only resets, consistent with
stats_sincesemantics. - Initialized to the entry allocation time when a new statement is first tracked.
3. Core vs. Extension Boundary
A recurring tension in this thread is whether to modify core PostgreSQL structures (EState) for the benefit of a contrib extension. Pavlo initially wanted to avoid touching core ("I don't want to touch core functionality in this patch. I'm afraid it could lead to even more problems with accepting."), but Sami convincingly argued that EState is the correct architectural location, following existing patterns, and that the bounded-array alternative is fundamentally broken.
What Remains Unresolved
- Committer review: No committer has weighed in on whether adding a field to
EStatepurely for pgss consumption is acceptable. This is the most likely point of contention during final review. - Nested statement documentation: The thread agrees that
toplevel = falsestatements showing the top-level start time needs documentation, but the adequacy of the current docs hasn't been reviewed by a docs specialist. - No discussion of shared memory / serialization implications: The timestamp must survive
pg_stat_statementsfile serialization across restarts — this should be verified but isn't discussed.
Impact Assessment
This is a high-value, low-risk change for the monitoring ecosystem. The zero-overhead approach (reading an already-available timestamp rather than issuing a syscall) elegantly solves the performance concern that blocked the predecessor. The v3 approach of storing in EState is architecturally sound and follows existing patterns. The primary risk is social/process: getting approval for the EState struct modification in core.