EXPLAIN (ANALYZE, WAITS): Per-Statement and Per-Plan-Node Wait Event Attribution
Core Problem
PostgreSQL has long exposed a backend's current wait event through pg_stat_activity.wait_event_type / wait_event via pgstat_report_wait_start() / pgstat_report_wait_end(). This is a point-in-time sample: an observer polling pg_stat_activity sees only what the backend is blocked on right now. There is no first-class mechanism to answer the practical DBA question: "During execution of this particular query, where did time go — on I/O, on lwlocks, on client reads, on timeouts?" — much less to attribute that time to a specific plan node.
Existing workarounds include:
- Sampling `pg_stat_activity` externally (lossy, coarse).
- `pg_wait_sampling` / `pg_stat_statements` extensions (install-time decision, global scope).
- eBPF/DTrace/perf (out of band, not portable, not accessible to end users).
This RFC (tanswis42@gmail.com, attributed in the cover as Ilmar Yunusov) proposes making wait events a first-class EXPLAIN output, alongside BUFFERS, WAL, IO, and MEMORY, producing Statement Wait Events summaries and per-plan-node Wait Events blocks as a companion to the existing timing instrumentation. Architecturally, this converts the wait event subsystem from a state indicator into a time-integrated profile.
Proposed Design
Hook point and fast path
The design is to piggyback on the already-present pgstat_report_wait_start(info) / pgstat_report_wait_end() call sites, which are the single chokepoints for all wait events (IPC, IO, LWLock, Client, Timeout, etc.). A boolean pgstat_wait_event_usage_active is exported; the inline fast-path in utils/wait_event.h becomes:
```c
if (unlikely(pgstat_wait_event_usage_active))
    pgstat_count_wait_event_start(wait_event_info);
*(volatile uint32 *) my_wait_event_info = wait_event_info;
```
The `unlikely()` annotation and the use of a single-byte flag are deliberate: when no EXPLAIN (ANALYZE, WAITS) is in flight (the overwhelmingly common case), the cost is one predictable branch. The author's synthetic microbenchmark cites roughly 0.1–0.2 ns of disabled-path overhead per wait, though it is explicitly flagged as directional only.
Earlier patches in the series used pgstat_wait_event_usage_depth > 0 (an integer compare); v5 (patch 0005) refactored to a dedicated bool to keep the cold path semantics separate from the stack-depth bookkeeping, and added the unlikely() hint.
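A self-contained sketch of the gated fast path, using stand-in names for the flag, the slow-path counter, and the per-backend wait slot (the real declarations live in `utils/wait_event.h` and differ in detail):

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) (x)
#endif

/* Stand-ins for the patch's exported flag and slow-path entry point. */
static bool pgstat_wait_event_usage_active = false;
static uint64_t waits_counted = 0;

static void
pgstat_count_wait_event_start(uint32_t wait_event_info)
{
    (void) wait_event_info;     /* the real code snapshots the node stack here */
    waits_counted++;
}

/* Per-backend published wait state, as pg_stat_activity observers read it. */
static volatile uint32_t wait_event_slot;
static volatile uint32_t *my_wait_event_info = &wait_event_slot;

static inline void
pgstat_report_wait_start(uint32_t wait_event_info)
{
    /* Disabled case: one predictable branch, then the pre-existing store. */
    if (unlikely(pgstat_wait_event_usage_active))
        pgstat_count_wait_event_start(wait_event_info);
    *my_wait_event_info = wait_event_info;
}
```

When the flag is false the accounting call is never reached, which is what keeps the common case at a single well-predicted branch.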
Storage model: fixed 64-entry preallocated buckets + overflow
A central design decision, arrived at through patch evolution, is that wait-end accounting must be allocation-free. The rationale is explicit in the final patch comment:
> Wait-end accounting must not allocate: it can run in a critical section.
Waits end inside sync IO, WAL writes, and other paths where a palloc failure throwing ERROR would be catastrophic. The design progression shows the author iterating on this:
- Patch 0001: naive doubling reallocation in `usage->memcontext`.
- Patch 0003: wraps entries in a dedicated memory context created with `MemoryContextAllowInCriticalSection(..., true)`, acknowledging the critical-section problem.
- Patch 0004: adds `MCXT_ALLOC_NO_OOM` fallback paths and introduces an overflow bucket (`overflowed_calls` / `overflowed_time`) that records totals without per-event identity when growth fails.
- Patch 0005 (the final position in the series): abandons growth entirely. Each accumulator preallocates exactly `WAIT_EVENT_USAGE_MAX_EVENTS = 64` entries at creation; once full, additional distinct wait event identities flow into the overflow bucket. This makes the wait-end path branch-light and truly allocation-free.
The overflow bucket is an elegant compromise: it preserves total call count and total wait time (so summary math is not broken) while sacrificing per-identity resolution past 64 distinct events. Given PostgreSQL currently defines on the order of ~250 wait events across all classes but a single plan node typically touches only a handful, 64 is a reasonable bound.
Sorted arrays, binary search insertion
Patch 0004 replaces the linear scan in WaitEventUsageAdd with binary search via WaitEventUsageFind, keeping entries sorted by wait_event_info. This matters because: (a) the wait-end path runs inside every wait so lookup cost multiplies across the query, and (b) sorted order enables efficient merge operations when accumulating from parallel workers (merge-sort style, O(m+n)).
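The find-or-insert step can be sketched as follows. This is a minimal model of the mechanism described above, not the patch's actual code: the struct layout, field names, and helper name are assumptions mirroring the description (fixed 64-entry sorted array, binary search, overflow bucket).

```c
#include <stdint.h>
#include <string.h>

#define WAIT_EVENT_USAGE_MAX_EVENTS 64

/* Assumed per-entry shape: identity plus two counters. */
typedef struct WaitEventUsageEntry
{
    uint32_t wait_event_info;
    uint64_t calls;
    uint64_t elapsed_ns;
} WaitEventUsageEntry;

typedef struct WaitEventUsage
{
    int      nentries;
    uint64_t overflowed_calls;      /* identity lost past 64 entries */
    uint64_t overflowed_elapsed_ns;
    WaitEventUsageEntry entries[WAIT_EVENT_USAGE_MAX_EVENTS];
} WaitEventUsage;

/* Allocation-free wait-end accounting: binary-search the sorted array,
 * bump the matching entry, insert in place if room, else spill totals
 * into the overflow bucket. Never palloc()s, so it is critical-section
 * safe in the sense the patch requires. */
static void
wait_event_usage_add(WaitEventUsage *usage, uint32_t info, uint64_t elapsed_ns)
{
    int lo = 0, hi = usage->nentries;

    while (lo < hi)             /* lower-bound binary search */
    {
        int mid = (lo + hi) / 2;

        if (usage->entries[mid].wait_event_info < info)
            lo = mid + 1;
        else
            hi = mid;
    }

    if (lo < usage->nentries && usage->entries[lo].wait_event_info == info)
    {
        usage->entries[lo].calls++;
        usage->entries[lo].elapsed_ns += elapsed_ns;
        return;
    }

    if (usage->nentries >= WAIT_EVENT_USAGE_MAX_EVENTS)
    {
        /* Full: keep summary totals correct, drop per-identity resolution. */
        usage->overflowed_calls++;
        usage->overflowed_elapsed_ns += elapsed_ns;
        return;
    }

    /* Shift the tail one slot to keep the array sorted by wait_event_info. */
    memmove(&usage->entries[lo + 1], &usage->entries[lo],
            (usage->nentries - lo) * sizeof(WaitEventUsageEntry));
    usage->entries[lo] = (WaitEventUsageEntry){info, 1, elapsed_ns};
    usage->nentries++;
}
```

Keeping the array sorted costs an O(64) memmove on insertion of a new identity, but identities are few and repeat heavily, so the steady state is a binary search plus two increments.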
Attribution semantics: inclusive per-node
The key semantic choice, called out explicitly as a review question, is that per-node wait attribution is inclusive — mirroring how EXPLAIN ANALYZE node timing already works. When a wait begins, the current value of pgstat_wait_event_node_usage (and its active_parent chain via pgstat_wait_event_usage_node_stack) is snapshotted; when the wait ends, the elapsed time is added to every node in the active stack. This means a runtime-key pg_sleep wait under a Bitmap Index Scan is attributed to both the Bitmap Index Scan and its Bitmap Heap Scan parent (the test in patch 0004 verifies exactly this invariant).
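The inclusive rule reduces to a walk up the active-node chain. A minimal sketch, assuming a linked stack of active plan-node frames (the `active_parent` field name follows the description above; the struct itself is illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* One frame per plan node currently on the executor's active stack. */
typedef struct NodeWaitUsage
{
    uint64_t wait_elapsed_ns;
    struct NodeWaitUsage *active_parent; /* enclosing plan node, or NULL */
} NodeWaitUsage;

/* Inclusive attribution: a finished wait is credited to every node that
 * was active when it began, mirroring EXPLAIN ANALYZE's inclusive timing. */
static void
attribute_wait_inclusive(NodeWaitUsage *innermost, uint64_t elapsed_ns)
{
    for (NodeWaitUsage *n = innermost; n != NULL; n = n->active_parent)
        n->wait_elapsed_ns += elapsed_ns;
}
```

Under this rule a wait beneath a Bitmap Index Scan also lands on its Bitmap Heap Scan parent, matching the invariant the patch 0004 test verifies.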
The documentation warns:
> node-level wait event times should not be summed to compute the statement total.
This is analogous to the long-standing semantic of actual time in EXPLAIN ANALYZE and preserves consistency with the existing mental model.
The query-level stack (pgstat_wait_event_usage_query_stack) is a separate linked list walked via query_parent, used for nested EXPLAIN (ANALYZE, WAITS) — a wait is counted once in each active statement-level collector. The separation of these two stacks (node vs. query) is essential: nested EXPLAINs share executor nesting but are independent at the statement level.
Error recovery during nested EXPLAIN
Patch 0004 introduces saved_node_usage in the WaitEventUsage struct. The documented hazard:
> A nested EXPLAIN can error out while one of its plan nodes is active, skipping the usual node-level restore. Remember the outer node stack so ending this collector can discard any leaked inner node frames.
Without this, an exception thrown mid-node in an inner EXPLAIN would leave dangling active_parent pointers referencing freed memory from the inner query context. The regression test nested_explain_waits_error using division_by_zero in a plpgsql EXCEPTION block directly exercises this recovery path.
Executor integration and manual instrumentation sites
ExecProcNodeInstr gets pgstat_enter_wait_event_usage / pgstat_restore_wait_event_usage wrappers symmetric to InstrStartNode/InstrStopNode. Critically, several nodes do not route through ExecProcNodeInstr and need manual instrumentation:
- `MultiExecHash`, `MultiExecBitmapAnd`, `MultiExecBitmapOr`, `MultiExecBitmapIndexScan`: these use `MultiExecProcNode`, which explicitly skips `ExecProcNodeInstr` because it cannot count returned tuples.
- `execAsync.c`: `ExecAsyncRequest`, `ExecAsyncConfigureWait`, `ExecAsyncNotify` — all three async entry points.
- `ExecShutdownNode_walker`: if a node is still running at shutdown, the timer must be re-entered and restored.
Each site gets the same `previous = pgstat_enter_wait_event_usage(node->wait_event_usage); ... pgstat_restore_wait_event_usage(previous);` pattern. This is a notable maintenance tax: any future manually instrumented node must remember to do this, which is a concern for reviewability.
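The enter/restore pair behaves like a save-and-set of a backend-local pointer. A hypothetical sketch of that shape (the real functions also maintain the node stack and counters):

```c
#include <stddef.h>

typedef struct WaitEventUsage WaitEventUsage; /* opaque, as in patch 0006 */

/* Backend-local pointer naming the accumulator that new waits charge to. */
static WaitEventUsage *current_wait_event_usage = NULL;

/* Symmetric enter/restore pair: callers stash the previous value so the
 * pattern nests correctly across manual instrumentation sites. */
static WaitEventUsage *
pgstat_enter_wait_event_usage(WaitEventUsage *usage)
{
    WaitEventUsage *previous = current_wait_event_usage;

    current_wait_event_usage = usage;
    return previous;
}

static void
pgstat_restore_wait_event_usage(WaitEventUsage *previous)
{
    current_wait_event_usage = previous;
}
```

Because restore takes the saved value rather than popping a hidden stack, a site that forgets the restore leaks its frame silently, which is exactly the reviewability concern noted above.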
Parallel worker aggregation
Parallel integration is substantial and lives in execParallel.c:
- A new `PARALLEL_KEY_WAIT_EVENT_USAGE` TOC key carries a `SharedWaitEventUsage` metadata header with one `SharedWaitEventUsageWorker` per worker (for statement-level attribution).
- For per-node attribution, `SharedExecutorInstrumentation` gains a `wait_event_usage_offset` that points to `num_plan_nodes * num_workers` `SharedWaitEventUsageWorker` entries appended after the existing `NodeInstrumentation` array.
- Each worker's entries are written into DSA (`dsa_allocate`) with a `dsa_pointer` in the shared header; the leader reads via `dsa_get_address` and `dsa_free`s after accumulation.
- Patch 0004's `ExecParallelReportWaitEventUsageWorker` implements an in-place merge of worker reports on rescans. The test `parallel rescan waits accumulated` specifically guards against regression here: when a parallel node is rescanned, workers may report multiple times for the same plan node, and naive replacement would lose earlier measurements. The fix: allocate a new DSA buffer sized `old_nentries + usage_nentries`, merge the sorted arrays, and free the old buffer.
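The rescan merge itself reduces to a standard merge of two sorted arrays. A sketch under assumed entry layout (field and function names are illustrative; the real code also handles the DSA allocation and the overflow bucket):

```c
#include <stdint.h>

typedef struct WaitEventUsageEntry
{
    uint32_t wait_event_info;   /* sort key */
    uint64_t calls;
    uint64_t elapsed_ns;
} WaitEventUsageEntry;

/* Merge two reports sorted by wait_event_info into dst (sized na + nb),
 * summing counters for identities present in both; returns dst's length.
 * O(m + n), which is why the arrays are kept sorted in the first place. */
static int
merge_wait_event_usage(const WaitEventUsageEntry *a, int na,
                       const WaitEventUsageEntry *b, int nb,
                       WaitEventUsageEntry *dst)
{
    int i = 0, j = 0, n = 0;

    while (i < na && j < nb)
    {
        if (a[i].wait_event_info == b[j].wait_event_info)
        {
            dst[n] = a[i];
            dst[n].calls += b[j].calls;
            dst[n].elapsed_ns += b[j].elapsed_ns;
            i++;
            j++;
            n++;
        }
        else if (a[i].wait_event_info < b[j].wait_event_info)
            dst[n++] = a[i++];
        else
            dst[n++] = b[j++];
    }
    while (i < na)
        dst[n++] = a[i++];
    while (j < nb)
        dst[n++] = b[j++];
    return n;
}
```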
Public API and opaque struct (final polish, patch 0006)
The final refactor makes WaitEventUsage opaque (forward-declared in the public header, definition moved into wait_event.c). Consumers — explain.c and execParallel.c — access it through:
- `pgstat_create_wait_event_usage(memcontext)`
- `pgstat_begin_wait_event_usage(memcontext)` (now returns the allocated `WaitEventUsage *` rather than taking a caller-allocated stack struct)
- `pgstat_wait_event_usage_is_empty(usage)`
- `pgstat_get_wait_event_usage_entries(usage, &entries)`, returning the entry count
- `pgstat_get_wait_event_usage_overflow(usage, &calls, &elapsed)`
- `pgstat_accumulate_wait_event_usage_overflow(...)`
This is good hygiene: it decouples the execParallel merge code from the internal layout and would permit future changes (e.g. a hash map instead of a sorted array) without touching callers. It also reflects typical committer feedback patterns on such patches — "hide the internals behind an API."
Key Technical Insights and Tradeoffs
The allocation-free wait-end path is the linchpin
The shift from "grow on demand with NO_OOM fallback" to "fixed 64-entry preallocated with overflow" is the single most important design decision in the series. pgstat_report_wait_end is called from LWLockRelease, from inside smgrread completions, from WAL flushes — places where even a soft OOM could cascade. By making the data path strictly non-allocating, the author eliminates an entire class of latent bug, at the cost of per-event identity for truly pathological queries (>64 distinct waits). This is the correct tradeoff for EXPLAIN output where the overflow is visible and clearly labeled.
Cost of inclusive attribution on deep plans
The author acknowledges:
> The overhead of this option depends on the number of waits observed and, for node-level output, on the number of active plan nodes when each wait begins.
For a deep plan stack (say, 20 active nodes during an inner join under a sort under an aggregate under a window function), each wait event end traverses the full stack and performs binary search + maybe-insert-and-shift into each node's array. With memmove for insertion-sort ordering, worst-case insertion is O(64) per node per wait. This is unlikely to be measurable for IO-heavy queries (waits are already expensive) but could matter for LWLock-heavy workloads with many short waits.
Parallel worker per-node merge on rescan
The subtlety that worker wait usage must accumulate across parallel relaunches of the same node (not replace) is a real footgun. The test construction — forcing a parallel-aware Index Scan on the right side of a RIGHT JOIN (VALUES (1),(2),(3)) with parallel_leader_participation = off — is quite intricate and reflects that the author found this bug through careful scenario construction. A naive reviewer might well have missed this.
Naming: WAITS vs WAIT_EVENTS
The author explicitly flags the name as a review question. WAITS matches the terseness of BUFFERS/WAL/IO but loses the semantic specificity of "wait events" (a well-known PostgreSQL concept with documented vocabulary). WAIT_EVENTS would be more discoverable and aligns with pg_stat_activity.wait_event. This is the kind of decision that typically gets bikeshedded and settled by a committer.
Testing and determinism
The regression tests lean heavily on pg_sleep(0.01) producing a deterministic Timeout:PgSleep wait event. Durations are normalized by the existing explain_filter (same approach as buffers). The author explicitly acknowledges this as a review question. A concern: tests that depend on parallel_workers reloption and parallel leader participation settings could be fragile on slower buildfarm members.
Participant Analysis
This is a single-author RFC submission (tanswis42@gmail.com, identified as Ilmar Yunusov in the cover letter) with seven patches posted in a single burst. There are no responses in the captured thread, so no committer feedback, no disagreement, and no alternative designs have been introduced. The patch is at the RFC stage and the open review questions are unanswered.
The implementation quality (opaque types, critical-section awareness, parallel worker merge semantics, careful error-recovery for nested EXPLAINs) suggests the author has internalized PostgreSQL coding conventions and likely reviewed similar instrumentation patches (BUFFERS, WAL, IO) as references. The overall shape — adding an EXPLAIN option + hooking the pgstat wait reporter + DSA-backed per-worker accumulation — follows the well-trodden path of those prior features.
Likely Review Concerns
Based on PostgreSQL community patterns, expected pushback areas include:
- Hot-path overhead justification: one branch on a global bool per `pgstat_report_wait_start`/`end` call. The community will demand rigorous, Linux-pinned measurements, not just the author's macOS directional numbers; wait reporting sits in genuinely hot paths (LWLock acquire/release).
- The overflow bucket semantics: some may argue that all-or-nothing (refuse to track beyond 64 events) is cleaner than lossy aggregation.
- Name bikeshedding: `WAITS` vs `WAIT_EVENTS`.
- Duplication of manual instrumentation: seven sites now need both `InstrStartNode`/`InstrStopNode` and `pgstat_enter/restore_wait_event_usage`; this could be abstracted into a macro.
- Interaction with existing `pg_wait_sampling`-style extensions: does nested collection break assumptions those make?
- Per-plan-node storage overhead: 64 `WaitEventUsageEntry` × ~24 bytes ≈ 1.5 KB per node, multiplied by nodes × workers for parallel queries. For large plans run in parallel, this could inflate DSM requirements.
- Inclusive attribution semantics: reviewers may prefer exclusive attribution (only the innermost active node gets credit), which would make node-level times summable and match some other systems' profiling conventions.
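The ~1.5 KB per-node figure follows from the assumed entry layout (a `uint32` identity plus two 64-bit counters, padded to 24 bytes on typical LP64 platforms); the actual struct may differ:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed per-entry layout; real field names and sizes may differ. */
typedef struct WaitEventUsageEntry
{
    uint32_t wait_event_info;   /* padded to 8 bytes before the counters */
    uint64_t calls;
    uint64_t elapsed_ns;
} WaitEventUsageEntry;

/* 64 preallocated entries per plan node, before the nodes x workers
 * multiplication that applies for parallel per-node attribution. */
static size_t
wait_event_usage_bytes_per_node(void)
{
    return 64 * sizeof(WaitEventUsageEntry);    /* 1536 bytes on LP64 */
}
```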