2026-05-14 · claude-opus-4-6

Technical Analysis: Adding `last_execution_start` to pg_stat_statements

The Core Problem

pg_stat_statements is PostgreSQL's most widely-used monitoring extension, yet it lacks a fundamental piece of temporal metadata: when was a given statement last executed? Without this, monitoring systems must resort to polling all counters (e.g., calls, total_exec_time) and diffing them to detect activity — an expensive and indirect approach. A simple last_execution_start timestamp would allow queries like WHERE last_execution_start >= $1 to instantly identify recently-active statements, enabling efficient incremental monitoring.

This patch is the successor to a prior stats_last_updated effort that was rejected primarily on performance grounds: calling GetCurrentTimestamp() (which issues a gettimeofday() syscall) inside the hot stats-accumulation path (pgss_store()) caused a measured 5–6% TPS regression on short-transaction workloads.

Architectural Context: Why This Is Harder Than It Looks

The `GetCurrentStatementStartTimestamp()` Shortcut

The v1 patch's key insight is that PostgreSQL already tracks statement start time in a process-local variable (stmtStartTimestamp), set at the beginning of each statement cycle. Reading this variable is essentially free — no syscall, no lock, just a memory read. This eliminates the performance concern that killed the predecessor patch.

Benchmark validation: v1 showed ~0.1% TPS difference (within noise), compared to the 5–6% regression of the GetCurrentTimestamp() approach.

The Deferred ExecutorEnd Problem

However, this elegant approach has a critical correctness flaw exposed by the extended query protocol. In the extended protocol (used by JDBC, libpq in pipeline mode, etc.), ExecutorEnd for a statement can be deferred until the next Bind message arrives. This means:

Statement A starts at time T1, executes, but ExecutorEnd is deferred.
Statement B starts at time T2, its Bind triggers ExecutorEnd for A.
When pgss_store() runs for A (during its deferred ExecutorEnd), GetCurrentStatementStartTimestamp() now returns T2, not T1.

This is not a theoretical edge case — it's the normal code path for any client using the extended query protocol with multiple open portals or cursors with fetch sizes. Sami Seih demonstrated this concretely with a JDBC test showing two portals both getting the timestamp of the second portal's start time.

Evolution of Solutions

v1: Direct `GetCurrentStatementStartTimestamp()` in `pgss_store()`

Approach: Read the process-local timestamp variable at stats accumulation time.
Problem: Incorrect under deferred ExecutorEnd (extended query protocol).
Status: Rejected.

v2: Bounded Static Array Indexed by Nesting Level

Approach: Capture GetCurrentStatementStartTimestamp() in pgss_ExecutorStart and store it in a static array pgss_exec_start[nesting_level] (max 64 levels). Read it back in pgss_ExecutorEnd.
Problems identified by Sami Seih:
1. Nesting level is the wrong axis: The problem isn't nested statements — it's multiple concurrent portals at the same nesting level. GetCurrentStatementStartTimestamp() always returns the top-level statement start time regardless of nesting, so the array doesn't help.
2. Arbitrary bound of 64: The hardcoded limit is architecturally unclean and the fallback to direct timestamp reading reintroduces the bug.
3. Multiple portals break it: With two portals open at the same nesting level (e.g., JDBC fetch-size cursors), both slots would be overwritten and you'd get the wrong timestamp for at least one portal.
Status: Rejected.

v3: Store Timestamp in `EState` (queryDesc->estate)

Approach: Add a TimestampTz field to EState (the executor state structure). Set it in pgss_ExecutorStart immediately after standard_ExecutorStart creates the estate. Read it back in pgss_ExecutorEnd.
Why this is correct: Each query execution gets its own QueryDesc with its own EState. Even with multiple concurrent portals, each portal has a distinct EState, so timestamps are correctly isolated per-execution. This follows the existing pattern used by es_processed, es_parallel_workers_to_launch, etc.
For utility statements: No EState exists, so these fall back to direct GetCurrentStatementStartTimestamp(), which is correct because utility statement ExecutorEnd is not deferred.
Trade-off: This touches core infrastructure (EState struct in execnodes.h), adding a field that exists primarily for pg_stat_statements' benefit. Sami notes that other EState fields consumed by pgss (like es_processed) have use-cases beyond pgss, but argues this isn't a sufficient reason to avoid the pattern.
Status: Current approach, benchmarks show ~1.1% TPS improvement (within noise — effectively zero overhead).

v4: Rebase Only

Mechanical rebase to resolve a conflict with commit 66366217822 (which introduced saved_planOrigin in nested utility execution). No functional changes.

Key Design Decisions and Trade-offs

1. Statement Start Time vs. Clock Timestamp

The patch deliberately records the statement start timestamp (as known to PostgreSQL's statement lifecycle), not the wall-clock time at stats recording. This means:

For toplevel = false (nested) statements, the timestamp reflects the top-level statement's start time, not the nested statement's own start. This is documented as acceptable behavior.
The timestamp is always <= the actual wall-clock time of execution.

2. Reset Semantics

last_execution_start is:

Reset by pg_stat_statements_reset() (full reset).
Preserved across minmax-only resets, consistent with stats_since semantics.
Initialized to the entry allocation time when a new statement is first tracked.

3. Core vs. Extension Boundary

A recurring tension in this thread is whether to modify core PostgreSQL structures (EState) for the benefit of a contrib extension. Pavlo initially wanted to avoid touching core ("I don't want to touch core functionality in this patch. I'm afraid it could lead to even more problems with accepting."), but Sami convincingly argued that EState is the correct architectural location, following existing patterns, and that the bounded-array alternative is fundamentally broken.

What Remains Unresolved

Committer review: No committer has weighed in on whether adding a field to EState purely for pgss consumption is acceptable. This is the most likely point of contention during final review.
Nested statement documentation: The thread agrees that toplevel = false statements showing the top-level start time needs documentation, but the adequacy of the current docs hasn't been reviewed by a docs specialist.
No discussion of shared memory / serialization implications: The timestamp must survive pg_stat_statements file serialization across restarts — this should be verified but isn't discussed.

Impact Assessment

This is a high-value, low-risk change for the monitoring ecosystem. The zero-overhead approach (reading an already-available timestamp rather than issuing a syscall) elegantly solves the performance concern that blocked the predecessor. The v3 approach of storing in EState is architecturally sound and follows existing patterns. The primary risk is social/process: getting approval for the EState struct modification in core.

[PATCH] pg_stat_statements: add last_execution_start column

Latest Update

Technical Analysis: Adding `last_execution_start` to pg_stat_statements

The Core Problem

Architectural Context: Why This Is Harder Than It Looks

The `GetCurrentStatementStartTimestamp()` Shortcut

The Deferred ExecutorEnd Problem

Evolution of Solutions

v1: Direct `GetCurrentStatementStartTimestamp()` in `pgss_store()`

v2: Bounded Static Array Indexed by Nesting Level

v3: Store Timestamp in `EState` (queryDesc->estate)

v4: Rebase Only

Key Design Decisions and Trade-offs

1. Statement Start Time vs. Clock Timestamp

2. Reset Semantics

3. Core vs. Extension Boundary

What Remains Unresolved

Impact Assessment

[PATCH] pg_stat_statements: add last_execution_start column

Latest Update

Technical Analysis: Adding last_execution_start to pg_stat_statements

The Core Problem

Architectural Context: Why This Is Harder Than It Looks

The GetCurrentStatementStartTimestamp() Shortcut

The Deferred ExecutorEnd Problem

Evolution of Solutions

v1: Direct GetCurrentStatementStartTimestamp() in pgss_store()

v2: Bounded Static Array Indexed by Nesting Level

v3: Store Timestamp in EState (queryDesc->estate)

v4: Rebase Only

Key Design Decisions and Trade-offs

1. Statement Start Time vs. Clock Timestamp

2. Reset Semantics

3. Core vs. Extension Boundary

What Remains Unresolved

Impact Assessment

Technical Analysis: Adding `last_execution_start` to pg_stat_statements

The `GetCurrentStatementStartTimestamp()` Shortcut

v1: Direct `GetCurrentStatementStartTimestamp()` in `pgss_store()`

v3: Store Timestamp in `EState` (queryDesc->estate)