RFC: Allow EXPLAIN to Output Page Fault Information

First seen: 2024-12-24 08:53:06+00:00 · Messages: 42 · Participants: 9

Latest Update

2026-05-22 · claude-opus-4-6

RFC: Allow EXPLAIN to Output Page Fault / Storage I/O Information

Core Problem

When diagnosing slow queries, PostgreSQL's EXPLAIN (ANALYZE, BUFFERS) tells you whether pages were found in shared buffers vs. required kernel-level reads, but it cannot distinguish between reads served from the OS page cache (fast) and reads that actually hit physical storage (slow). This gap is critical for performance diagnosis: a query showing many "shared read" buffer misses might still be fast if the working set fits in OS page cache, or painfully slow if those reads hit spinning disk or saturated SSDs.

The fundamental question is: can we expose per-query physical storage I/O metrics through EXPLAIN?

Evolution of the Approach

Phase 1: Page Faults via getrusage() (Original Proposal)

The initial patch proposed tracking ru_minflt (minor page faults / OS cache hits) and ru_majflt (major page faults / disk access) from getrusage(2). This was quickly shown to be unreliable:

Phase 2: ru_inblock/ru_oublock (Adopted Approach)

Jelte discovered that ru_inblock (filesystem-level read operations) and ru_oublock (filesystem-level write operations) from getrusage(2) provide meaningful, actionable data:

The patch was renamed from PAGEFAULTS to "Storage I/O" to reflect the actual semantics.

Key Architectural Tensions

AIO Compatibility (The Central Blocker)

Andres Freund raised the most significant technical objection: this approach is fundamentally incompatible with worker-based AIO. The reasoning:

  1. getrusage() tracks I/O per-process. If I/O is dispatched to background AIO worker processes, the issuing backend's ru_inblock/ru_oublock won't reflect that I/O.
  2. With worker-based AIO, you'd get non-zero but significantly undercounted values — worse than showing nothing because it creates a false sense of accuracy.
  3. io_uring-based AIO does correctly attribute I/O to the submitting process (confirmed by Andres's testing).

This created a critical design decision: the patch must be AIO-method-aware. The final approach suppresses Storage I/O output when io_method=worker and only reports it for synchronous I/O or io_uring.

Why io_uring Isn't Universal Yet

Andres explained three barriers to making io_uring the default:

  1. Not enabled at kernel level everywhere (RHEL disabled it for a while)
  2. Requires liburing as an optional build dependency
  3. Requires tuning file descriptor ulimits (contentious with Tom Lane)

This means worker-based AIO will be the default for the foreseeable future, significantly limiting the patch's applicability.

Background Writer Interference

Tom Lane noted that bgwriter, walwriter, and checkpointer take I/O load away from foreground processes. Jelte countered that:

Per-Node vs. Per-Query Granularity

The author initially attempted per-plan-node tracking (like BUFFERS) but abandoned it due to performance overhead from frequent getrusage() calls. The final design reports only aggregate planning-phase and execution-phase totals, requiring only 2 getrusage() calls per phase boundary.

Technical Implementation Details

Measurement Mechanism

Unit Controversy

Integration Point

Platform Support

Current Status

The patch has gone through multiple rebases and review cycles. As of 2026, it's technically functional with the AIO-awareness guard, but adoption is limited by:

  1. Worker-based AIO being the default means most users won't see the output
  2. The feature's value proposition is strongest precisely for the buffered I/O case that worker-based AIO makes unmeasurable
  3. No committer has stepped up to push it through

The patch remains in "needs review" state, waiting for either io_uring to become more widely default, or for a consensus that partial availability is acceptable.