RFC: Allow EXPLAIN to Output Page Fault / Storage I/O Information
Core Problem
When diagnosing slow queries, PostgreSQL's EXPLAIN (ANALYZE, BUFFERS) tells you whether pages were found in shared buffers vs. required kernel-level reads, but it cannot distinguish between reads served from the OS page cache (fast) and reads that actually hit physical storage (slow). This gap is critical for performance diagnosis: a query showing many "shared read" buffer misses might still be fast if the working set fits in OS page cache, or painfully slow if those reads hit spinning disk or saturated SSDs.
The fundamental question is: can we expose per-query physical storage I/O metrics through EXPLAIN?
Evolution of the Approach
Phase 1: Page Faults via getrusage() (Original Proposal)
The initial patch proposed tracking ru_minflt (minor page faults / OS cache hits) and ru_majflt (major page faults / disk access) from getrusage(2). This was quickly shown to be unreliable:
- Tom Lane raised instability concerns immediately
- Jelte Fennema-Nio experimentally confirmed that major page fault counts were always zero even after dropping caches
- The mapping between PostgreSQL read requests and kernel page faults is not 1:1 (code pages vs data pages, filesystem readahead, etc.)
Phase 2: ru_inblock/ru_oublock (Adopted Approach)
Jelte discovered that ru_inblock (filesystem-level read operations) and ru_oublock (filesystem-level write operations) from getrusage(2) provide meaningful, actionable data:
ru_inblockis 0 when everything is in page cache, high when hitting storage- On Linux, these map to
task_io_accounting'sread_bytes/write_bytesdivided into 512-byte block units - pg_stat_kcache has used these counters successfully for years without problem reports
The patch was renamed from PAGEFAULTS to "Storage I/O" to reflect the actual semantics.
Key Architectural Tensions
AIO Compatibility (The Central Blocker)
Andres Freund raised the most significant technical objection: this approach is fundamentally incompatible with worker-based AIO. The reasoning:
getrusage()tracks I/O per-process. If I/O is dispatched to background AIO worker processes, the issuing backend'sru_inblock/ru_oublockwon't reflect that I/O.- With worker-based AIO, you'd get non-zero but significantly undercounted values — worse than showing nothing because it creates a false sense of accuracy.
io_uring-based AIO does correctly attribute I/O to the submitting process (confirmed by Andres's testing).
This created a critical design decision: the patch must be AIO-method-aware. The final approach suppresses Storage I/O output when io_method=worker and only reports it for synchronous I/O or io_uring.
Why io_uring Isn't Universal Yet
Andres explained three barriers to making io_uring the default:
- Not enabled at kernel level everywhere (RHEL disabled it for a while)
- Requires liburing as an optional build dependency
- Requires tuning file descriptor ulimits (contentious with Tom Lane)
This means worker-based AIO will be the default for the foreseeable future, significantly limiting the patch's applicability.
Background Writer Interference
Tom Lane noted that bgwriter, walwriter, and checkpointer take I/O load away from foreground processes. Jelte countered that:
- These only take away write I/O — read I/O attribution remains accurate
- If background processes handle writes, the query runs faster anyway, so the absence of write I/O in the query's stats correctly indicates those writes aren't blocking execution
Per-Node vs. Per-Query Granularity
The author initially attempted per-plan-node tracking (like BUFFERS) but abandoned it due to performance overhead from frequent getrusage() calls. The final design reports only aggregate planning-phase and execution-phase totals, requiring only 2 getrusage() calls per phase boundary.
Technical Implementation Details
Measurement Mechanism
- Two
getrusage()calls bracket each phase (planning, execution) - Difference calculation yields per-phase I/O counts
- Parallel workers transmit their accumulated I/O to the leader via the existing parallel worker communication channel
- The accumulation is additive: leader's own I/O + sum of all workers' I/O
Unit Controversy
- Linux reports in 512-byte blocks (from
task_io_accounting) - Cross-platform consistency is uncertain (FreeBSD, macOS, etc. may differ)
- Final decision: report raw counts without unit conversion, labeled as "times" (later changed to no suffix)
- This avoids incorrect unit assumptions across platforms
Integration Point
- Made part of BUFFERS option (enabled by default with ANALYZE since PG17's c2a4078e)
- Conditional on
io_method != worker - Parallel worker results aggregated into leader's totals
Platform Support
- Works on: Linux, FreeBSD, macOS, OpenBSD, NetBSD, DragonFlyBSD, AIX, Solaris, illumos
- Not supported: Windows (getrusage port doesn't include these fields)
Current Status
The patch has gone through multiple rebases and review cycles. As of 2026, it's technically functional with the AIO-awareness guard, but adoption is limited by:
- Worker-based AIO being the default means most users won't see the output
- The feature's value proposition is strongest precisely for the buffered I/O case that worker-based AIO makes unmeasurable
- No committer has stepped up to push it through
The patch remains in "needs review" state, waiting for either io_uring to become more widely default, or for a consensus that partial availability is acceptable.