EXPLAIN: Exposing ReadStream / Prefetch Statistics
Problem Statement
The ReadStream API (introduced in PG17) centralizes asynchronous buffer
prefetching for sequential-like access patterns. It is used by SeqScan,
BitmapHeapScan, and TidRangeScan, and is foundational for upcoming work
such as index prefetching and AIO-driven execution. Despite its centrality,
there was no user-visible way to inspect how well a read stream was
actually performing for a given query. Users could observe shared read=N
via BUFFERS, but could not tell:
- How far ahead the stream was prefetching (lookahead distance).
- Whether the stream had ramped up to its configured
effective_io_concurrency / io_combine_limit capacity.
- Whether the executor was actually stalling on I/O (WaitReadBuffers
blocking) or consuming already-completed reads.
- The average coalesced I/O size (related to io_combine_limit).
- How many I/Os were in-flight when each new I/O was issued.
This visibility gap made tuning effective_io_concurrency,
io_combine_limit, and understanding AIO behavior essentially a black-box
exercise. Andres (author of much of the AIO/read-stream machinery)
explicitly noted at the end of the thread that this patchset would have
made the PG18 AIO work much easier and that index prefetching development
suffered from its absence.
The Patch: What It Adds to EXPLAIN
A new IO option for EXPLAIN (ultimately kept off by default, unlike
BUFFERS) produces output such as:
Seq Scan on t (actual rows=999996.00 loops=1)
Prefetch: avg=262.629 max=271 capacity=272
I/O: count=31 waits=5 size=14.29 inprogress=1.77
Buffers: shared read=55556
Two conceptually distinct lines emerged:
- Prefetch — describes the ReadStream's internal lookahead queue
heuristic (avg, max, capacity). This is not strictly "I/O"; it's a
queue/controller metric that exists even when all buffers are already
in shared_buffers.
- I/O — describes actual I/O issued: number of I/Os, number of
WaitReadBuffers() calls that had to block, average coalesced request
size (in BLCKSZ units), and the running average of concurrent
in-progress I/Os at the moment of issuing a new one.
A matching log_io option was added to auto_explain, mirroring
log_buffers. A new INSTRUMENT_IO flag was introduced so that the
stats collection is only armed when explicitly requested.
Core Architectural Tension: Where Do the Stats Live?
Three competing designs surfaced, and this drove most of the thread:
Design A — TAM callback (Tomas's v1)
Add a new scan_stats callback to the table AM, plus a generic
TableScanStatsData struct that each TAM would fill from its internal
representation (for heap, trivially from the ReadStream).
Problems identified:
- Two near-identical structs (ReadStreamInstrumentation and
TableScanStatsData) with no real abstraction benefit (Melanie).
- Would require a second analogous callback for IndexScanDesc once
index prefetching landed.
- Heavy TAM-interface churn for what is effectively diagnostic plumbing.
Design B — Stack-based instrumentation (Lukas)
Leverage Lukas's in-progress stack-based instrumentation work (CF #6023),
where IOUsage would sit alongside BufferUsage / WALUsage in the
Instrumentation struct, completely bypassing the TAM interface.
Advantages: uniform with BufferUsage/WALUsage, trivially extends
to utility commands and pg_stat_statements, and avoids per-node
plumbing. Disadvantages: depends on a large, not-yet-committed
patchset just before feature freeze; the clean node↔stream association
is lost (though that becomes a feature when streams are used in
expression evaluation or detoasting).
Design C — Field in TableScanDesc (Andres, chosen)
Add an optional TableScanInstrumentation *rs_instrument pointer
directly in TableScanDesc. The AM populates it (for heap, the read
stream writes into it directly). EXPLAIN reads it without any callback.
Andres's justification: *"All the accesses already happen within AM code… we already combine different scan types (seq, bitmap, tid, sample) via TableScanDesc, so we really already are quite strongly associating the stats with the ScanDesc structures."* He further argued this handles multiple read streams per scan by letting the AM merge stats internally — an advantage over the stack-based approach, which was unclear about multi-stream semantics.
This is the design that shipped. A key refinement came via
read_stream_enable_stats(stream, stats) — instead of threading an
extra parameter through every read_stream_begin_relation() caller,
the stats pointer is installed on an existing stream post-creation.
Parallel Query Complication
Per-worker stats needed shared memory. BitmapHeapScan already had
shared instrumentation; SeqScan and TidRangeScan did not.
The harder issue was that parallel_aware=false nodes running under
a Gather (common when debug_parallel_query=regress is set, as on the
FreeBSD CI) did not get their instrumentation DSM allocated, because
execParallel.c gates allocation on plan->parallel_aware. This
produced flaky EXPLAIN output and was revealed to be a latent bug
in BitmapHeapScan (commit 5a1e6df3b84c) — visible stats could
silently disappear for the outer side of a parallel join.
The Key-Offset Trick (Melanie)
Tomas's initial fix threaded instrument/parallel_aware checks
through multiple functions, which he and reviewers agreed made the
code substantially less readable. Melanie proposed a cleaner
pattern: allocate the shared instrumentation and the
parallel-aware state as two separate DSM chunks keyed by:
- plan_node_id (existing scheme)
- plan_node_id + PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET
This gives each node effectively two keys, decouples the two
allocation lifetimes, and lets instrumentation be allocated
independently of whether the node is parallel-aware. An Assert
was added to shm_toc_insert() to catch key collisions — useful
given how magic-numbered the parallel executor keyspace is.
This refactor was first applied to the existing index / index-only scan nodes (as a no-op cleanup), then BitmapHeapScan (as a bug fix), then extended to SeqScan and TidRangeScan (new functionality). The BHS fix is a backpatch candidate for PG18.
Smaller Design Decisions Worth Noting
- Default: IO OFF. Despite Lukas's push to mirror BUFFERS (on by
default since PG18), Tomas ultimately kept it off because the
Prefetch/I/O lines add visual clutter and because BUFFERS is arguably
more universally useful. Avoiding default-on also dramatically reduced
regression-test churn.
- "Stalls" renamed to "waits" per Andres — clearer for non-experts.
- Fast-path I/Os were initially undercounted (Melanie): reads that
never enter read_stream_start_pending_read() because StartReadBuffer()
completes immediately were missed. Fixed.
- Synchronous-read waits subtlety (Andres): READ_BUFFERS_SYNCHRONOUSLY
reads performed during ramp-up should still count as waits — they
block, just during I/O initiation rather than in WaitReadBuffers().
Undercounting them would hide real latency in ramp-up/down scenarios.
- %.1f formatting — reviewers agreed %.3f precision was unnecessary
and noisy.
- Non-text formats (JSON/YAML/XML) always emit the I/O group, even
when io_count == 0, for regression-test stability and consistency with
existing EXPLAIN conventions.
- phs_len added to ParallelTableScanDesc so workers can locate their
per-worker instrumentation slot without recomputing via
table_parallelscan_estimate() — a symptom of the TableScanDesc-based
design's need to co-locate data with the parallel scan descriptor.
- Leader merges worker stats, following the show_indexscan_info
precedent. The leader's stats cannot be viewed in isolation; this was
debated but kept because worker stats are only shown under VERBOSE,
and never merging would hide worker I/O entirely from non-VERBOSE
output.
Why This Matters Architecturally
This work establishes the pattern by which future read-stream
users (including the eventual index prefetching work, AIO-driven
bulk insert index lookups, detoasting, possibly Sort/Materialize/
HashAgg spill) will expose their I/O behavior. The
TableScanInstrumentation struct is deliberately generic enough
that a non-heap TAM could populate it from something other than a
ReadStream.
The unresolved architectural question — whether per-scan or stack-based instrumentation is the right long-term home — is deferred. Andres's concession that detoasting or expression-evaluation streams would break the clean node↔stream association leaves the door open for a stack-based approach later, but the TableScanDesc-field design is the pragmatic choice for streams with an obvious node affiliation.