EXPLAIN: Exposing ReadStream / Prefetch Statistics
Problem Statement
The ReadStream API (introduced in PG17) centralizes asynchronous buffer
prefetching for sequential-like access patterns. It is used by SeqScan,
BitmapHeapScan, and TidRangeScan, and is foundational for upcoming work
such as index prefetching and AIO-driven execution. Despite its centrality,
there was no user-visible way to inspect how well a read stream was
actually performing for a given query. Users could observe shared read=N
via BUFFERS, but could not tell:
- How far ahead the stream was prefetching (lookahead distance).
- Whether the stream had ramped up to its configured
effective_io_concurrency / io_combine_limit capacity.
- Whether the executor was actually stalling on I/O (WaitReadBuffers
blocking) or consuming already-completed reads.
- The average coalesced I/O size (related to io_combine_limit).
- How many I/Os were in-flight when each new I/O was issued.
This visibility gap made tuning effective_io_concurrency,
io_combine_limit, and understanding AIO behavior essentially a black-box
exercise. Andres (author of much of the AIO/read-stream machinery)
explicitly noted at the end of the thread that this patchset would have
made the PG18 AIO work much easier and that index prefetching development
suffered from its absence.
The Patch: What It Adds to EXPLAIN
A new IO option for EXPLAIN (ultimately kept off by default, unlike
BUFFERS) produces output such as:
Seq Scan on t (actual rows=999996.00 loops=1)
Prefetch: avg=262.629 max=271 capacity=272
I/O: count=31 waits=5 size=14.29 inprogress=1.77
Buffers: shared read=55556
Two conceptually distinct lines emerged:
- Prefetch — describes the ReadStream's internal lookahead queue
heuristic (avg, max, capacity). This is not strictly "I/O"; it's a
queue/controller metric that exists even when all buffers are already
in shared_buffers.
- I/O — describes actual I/O issued: number of I/Os, number of
WaitReadBuffers() calls that had to block, average coalesced request
size (in BLCKSZ units), and the running average of concurrent
in-progress I/Os at the moment of issuing a new one.
A matching log_io option was added to auto_explain, mirroring
log_buffers. A new INSTRUMENT_IO flag was introduced so that the
stats collection is only armed when explicitly requested.
Core Architectural Tension: Where Do the Stats Live?
Three competing designs surfaced, and this drove most of the thread:
Design A — TAM callback (Tomas's v1)
Add a new scan_stats callback to the table AM, plus a generic
TableScanStatsData struct that each TAM would fill from its internal
representation (for heap, trivially from the ReadStream).
Problems identified:
- Two near-identical structs (ReadStreamInstrumentation and
TableScanStatsData) with no real abstraction benefit (Melanie).
- Would require a second analogous callback for IndexScanDesc once
index prefetching landed.
- Heavy TAM-interface churn for what is effectively diagnostic plumbing.
Design B — Stack-based instrumentation (Lukas)
Leverage Lukas's in-progress stack-based instrumentation work (CF #6023),
where IOUsage would sit alongside BufferUsage / WALUsage in the
Instrumentation struct, completely bypassing the TAM interface.
Advantages: uniform with BufferUsage/WALUsage, trivially extends
to utility commands and pg_stat_statements, and avoids per-node
plumbing. Disadvantages: depends on a large, not-yet-committed
patchset just before feature freeze; the clean node↔stream association
is lost (though that becomes a feature when streams are used in
expression evaluation or detoasting).
Design C — Field in TableScanDesc (Andres, chosen)
Add an optional TableScanInstrumentation *rs_instrument pointer
directly in TableScanDesc. The AM populates it (for heap, the read
stream writes into it directly). EXPLAIN reads it without any callback.
Andres's justification: *"All the accesses already happen within AM code… we already combine different scan types (seq, bitmap, tid, sample) via TableScanDesc, so we really already are quite strongly associating the stats with the ScanDesc structures."* He further argued this handles multiple read streams per scan by letting the AM merge stats internally — an advantage over the stack-based approach, which was unclear about multi-stream semantics.
This is the design that shipped. A key refinement came via
read_stream_enable_stats(stream, stats) — instead of threading an
extra parameter through every read_stream_begin_relation() caller,
the stats pointer is installed on an existing stream post-creation.
Parallel Query Complication
Per-worker stats needed shared memory. BitmapHeapScan already had
shared instrumentation; SeqScan and TidRangeScan did not.
The harder issue was that parallel_aware=false nodes running under
a Gather (common when debug_parallel_query=regress is set, as on the
FreeBSD CI) did not get their instrumentation DSM allocated, because
execParallel.c gates allocation on plan->parallel_aware. This
produced flaky EXPLAIN output and was revealed to be a latent bug
in BitmapHeapScan (commit 5a1e6df3b84c) — visible stats could
silently disappear for the outer side of a parallel join.
The Key-Offset Trick (Melanie)
Tomas's initial fix threaded instrument/parallel_aware checks
through multiple functions, which he and reviewers agreed made the
code substantially less readable. Melanie proposed a cleaner
pattern: allocate the shared instrumentation and the
parallel-aware state as two separate DSM chunks keyed by:
- plan_node_id (existing scheme)
- plan_node_id + PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET
This gives each node effectively two keys, decouples the two
allocation lifetimes, and lets instrumentation be allocated
independently of whether the node is parallel-aware. An Assert
was added to shm_toc_insert() to catch key collisions — useful
given how magic-numbered the parallel executor keyspace is.
This refactor was first applied to the existing index / index-only scan nodes (as a no-op cleanup), then BitmapHeapScan (as a bug fix), then extended to SeqScan and TidRangeScan (new functionality). The BHS fix is a backpatch candidate for PG18.
Smaller Design Decisions Worth Noting
- Default: IO OFF. Despite Lukas's push to mirror BUFFERS (on by
default since PG18), Tomas ultimately kept it off because the
Prefetch/I/O lines add visual clutter and because BUFFERS is arguably
more universally useful. Avoiding default-on also dramatically reduced
regression-test churn.
- "Stalls" renamed to "waits" per Andres — clearer for non-experts.
- Fast-path I/Os were initially undercounted (Melanie): reads that
never enter read_stream_start_pending_read() because StartReadBuffer()
completes immediately were missed. Fixed.
- Synchronous-read waits subtlety (Andres): READ_BUFFERS_SYNCHRONOUSLY
reads performed during ramp-up should still count as waits — they
block, just during I/O initiation rather than in WaitReadBuffers().
Undercounting them would hide real latency in ramp-up/down scenarios.
- %.1f formatting — reviewers agreed %.3f precision was unnecessary
and noisy.
- Non-text formats (JSON/YAML/XML) always emit the I/O group, even
when io_count == 0, for regression-test stability and consistency with
existing EXPLAIN conventions.
- phs_len added to ParallelTableScanDesc so workers can locate their
per-worker instrumentation slot without recomputing via
table_parallelscan_estimate() — a symptom of the TableScanDesc-based
design's need to co-locate data with the parallel scan descriptor.
- Leader merges worker stats, following the show_indexscan_info
precedent. The leader's stats cannot be viewed in isolation; this was
debated but kept because worker stats are only shown under VERBOSE,
and never merging would hide worker I/O entirely from non-VERBOSE
output.
Why This Matters Architecturally
This work establishes the pattern by which future read-stream
users (including the eventual index prefetching work, AIO-driven
bulk insert index lookups, detoasting, possibly Sort/Materialize/
HashAgg spill) will expose their I/O behavior. The
TableScanInstrumentation struct is deliberately generic enough
that a non-heap TAM could populate it from something other than a
ReadStream.
The unresolved architectural question — whether per-scan or stack-based instrumentation is the right long-term home — is deferred. Andres's concession that detoasting or expression-evaluation streams would break the clean node↔stream association leaves the door open for a stack-based approach later, but the TableScanDesc-field design is the pragmatic choice for streams with an obvious node affiliation.