pg_stat_io_histogram

First seen: 2026-01-26 09:40:52+00:00 · Messages: 31 · Participants: 6

Latest Update

2026-05-09 · opus 4.7

pg_stat_io_histogram: Per-Bucket I/O Latency Distribution Tracking

The Core Problem

PostgreSQL's existing observability around I/O latency is coarse. pg_stat_io (introduced in PG16) tracks cumulative counts and—with track_io_timing=on—cumulative total times per (backend_type, object, context, io_op) combination. A cumulative total tells you the mean, but nothing about the distribution. When a user reports "PostgreSQL feels stuck" (often a slow COMMIT), the real cause is frequently a tail-latency event in the I/O stack: a stuck multipath device, a hypervisor pause, a controller reset, or a flaky network-attached volume. Diagnosing this today requires cross-correlating iostat -x, wait events sampled at high frequency, and kernel tracing—a painful, time-sensitive exercise.
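
To make the mean-vs-distribution gap concrete (illustrative numbers, not from the thread): a million I/Os averaging 100µs plus ten 2-second stalls moves the cumulative mean only from 100µs to roughly 120µs, while those ten stalls are exactly the events the user is complaining about.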

Jakub Wartak's proposal is to complement pg_stat_io with a second shape of the same data: a fixed-bucket histogram per dimension tuple, exposing p99/p99.9 tail behavior directly from SQL. The target use case is explicitly outlier hunting ("I/O stuck for seconds"), not fine-grained performance tuning, though the design tries to serve both.

Architectural Shape

The patch plugs into the existing pgstat_count_io_op_time() path. For each I/O that gets timed (i.e. when track_io_timing / track_wal_io_timing is on), in addition to incrementing the count and accumulating total time, a bucket index is computed from the measured latency and the corresponding counter in a histogram array is incremented. The histogram is attached to PgStat_BktypeIO, so it fans out over the full [backend_type][object][context][op] cross product.
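
A hedged sketch of that accounting step, assuming the dense layout; the hist member and the helper names are illustrative, not the patch's own identifiers (the bucket computation itself is sketched in the next paragraph):

    /* Sketch only: the extra work added to the timed-I/O accounting path.
     * The real pgstat_count_io_op_time() accumulates pending instr_time
     * values and flushes them later; this collapses that for clarity. */
    static void
    count_one_timed_io(PgStat_BktypeIO *io, IOObject obj, IOContext ctx,
                       IOOp op, instr_time elapsed)
    {
        uint64      us = INSTR_TIME_GET_MICROSEC(elapsed);

        io->counts[obj][ctx][op]++;                         /* existing */
        io->times[obj][ctx][op] += us;                      /* existing */
        io->hist[obj][ctx][op][io_latency_to_bucket(us)]++; /* new: one
                                                             * array store */
    }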

Bucketing is logarithmic base-2: bucket i covers roughly [2^(i+min_shift), 2^(i+1+min_shift)) microseconds. The index is computed via pg_leftmost_one_pos* / __builtin_clzl rather than a loop of compares—a single bsr/lzcnt instruction on x86—so the per-I/O hot-path cost is nominally a handful of cycles plus one cache-line touch. The initial shape was sixteen buckets covering ~8µs up to ~128ms, deliberately coarse at the low end so that "page cache hit" and "actual device I/O" land in separate buckets.
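
A self-contained sketch of that computation; IO_HIST_MIN_SHIFT and IO_HIST_BUCKETS are assumed values chosen to match the ~8µs floor and sixteen-bucket shape described above:

    #include "postgres.h"
    #include "port/pg_bitutils.h"

    #define IO_HIST_BUCKETS   16
    #define IO_HIST_MIN_SHIFT 3     /* assumed: bucket 0 starts at 2^3 = 8µs */

    static inline int
    io_latency_to_bucket(uint64 usec)
    {
        int         pos;

        /* sub-8µs completions (page-cache territory) collapse into bucket 0 */
        if (usec < (UINT64CONST(1) << IO_HIST_MIN_SHIFT))
            return 0;

        /* position of the most significant set bit: a single bsr/lzcnt */
        pos = pg_leftmost_one_pos64(usec);

        /* bucket i covers [2^(i+MIN_SHIFT), 2^(i+1+MIN_SHIFT)) µs;
         * anything past the top boundary clamps into the last bucket */
        return Min(pos - IO_HIST_MIN_SHIFT, IO_HIST_BUCKETS - 1);
    }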

The Three Hard Tradeoffs

1. Measurement fidelity vs. what we can actually measure

Andres Freund raised two fundamental epistemological issues early.

There is no good user-space fix for either: getting true device latency requires kernel cooperation (e.g. pulling completion timestamps out of io_uring_cqe or block-layer tracepoints). Jakub validated the concern with a bpftrace script and found kernel-vs-user discrepancies of only a few microseconds on an idle system, but acknowledged that on a loaded machine this will not hold. The consensus was: document the limitation, ship it anyway, because the outlier use case (I/Os taking 30 seconds) completely swamps any scheduler noise and is exactly what the tool is for.

2. Hot-path CPU cost

The pgstat_count_io_op_time() call happens on every timed I/O, including page-cache hits that return in sub-microsecond time. Andres correctly pointed out that Jakub's initial in-memory pgbench benchmark was useless for measuring this: a read-only pgbench with hot data is bottlenecked by per-statement overhead and does at most one I/O per transaction. He prescribed a specific adversarial benchmark: pg_prewarm with io_combine_limit=1 on a modern client CPU (high per-core memory bandwidth), where the bucket computation is actually on the critical path.

Jakub's results on an Intel Ultra 7 (pinned P-core, turbo disabled, etc.) showed ~2% overhead with the initial version, reduced once the bucket computation switched to __builtin_clzl(). Ants Aasma later ran cleaner benchmarks on a Ryzen 9 9900X and saw pgstat_count_io_op_time at 0.4–0.6% of samples in the worst case (a seqscan of a table sized to do one I/O per page with track_io_timing=on). pgstat_io_flush_cb was essentially invisible at 0.00%, because stats flushing is rate-limited to roughly once per second per backend. Andres explicitly rejected pursuing vectorization of the flush path—the flush frequency is low enough that it doesn't matter; only the accounting hot path does.

The overhead of converting instr_time to nanoseconds/microseconds also came up. Andres noted that his pending rdtsc patch makes this conversion non-trivial (cycles→ns involves a multiply). Jakub settled on keeping INSTR_TIME_GET_NANOSEC, since that conversion is already cheap in the current implementation.
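
For context, the conversion in question, using the existing instr_time macros:

    instr_time  start, duration;

    INSTR_TIME_SET_CURRENT(start);
    /* ... issue and wait for the I/O ... */
    INSTR_TIME_SET_CURRENT(duration);
    INSTR_TIME_SUBTRACT(duration, start);

    /* Cheap today: instr_time is already nanosecond-based on Linux. Under
     * a cycle-counter (rdtsc) instr_time this becomes a cycles-to-ns
     * multiply, which is why the conversion cost was worth discussing. */
    int64       elapsed_ns = INSTR_TIME_GET_NANOSEC(duration);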

3. Memory footprint — the fight that drove most of the thread

This is where the bulk of the design effort went, and where Andres's objections were firmest.

The natural layout—attaching a 16-bucket uint64[16] histogram to every (object, context, op) tuple in PgStat_BktypeIO—balloons the struct from 2880 bytes to 18240 bytes. Multiplied by BACKEND_NUM_TYPES (18), the snapshot PgStat_IO grew to 328KB. Because pgStatLocal lives in BSS and embeds the snapshot area, every backend that takes a stats snapshot faults those pages into resident memory. PendingIOStats and PendingBackendStats similarly grew about 6x, and shared-memory stats roughly doubled (~300KB → ~580KB).
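
The arithmetic, for reference: 2880 bytes corresponds to three 8-byte counters across 120 (object, context, op) tuples; one 16-bucket uint64 histogram per tuple adds 120 × 16 × 8 = 15,360 bytes, giving 18,240, and 18,240 × 18 backend types ≈ 328KB. A sketch of that dense layout, with the existing field set assumed from current sources and hist as an illustrative name:

    #define IO_HIST_BUCKETS 16      /* assumed initial shape */

    typedef struct PgStat_BktypeIO
    {
        PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
        uint64         bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
        PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
        /* the dense addition: 120 tuples x 16 buckets x 8 bytes each */
        PgStat_Counter hist[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][IO_HIST_BUCKETS];
    } PgStat_BktypeIO;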

Andres's position, stated multiple times and with increasing firmness: storing mostly-zero data is not acceptable. The issue isn't raw bytes; it's that most (object, context, op) combinations for most backend types are structurally impossible (e.g. a WAL writer doesn't do bulkread-context relation reads). Per pgstat_tracks_io_op() / _bktype() / _object(), the set of legal tuples is a small fraction of the full cross product. Allocating histograms for the illegal tuples is pure waste that most backends then fault into resident memory without ever using.

Jakub explored a ladder of mitigations:

  1. Condensing backend-type IDs (v7-0005, -0006): remap sparse B_* IDs into contiguous ones, skipping the 4 types already filtered by pgstat_tracks_io_bktype(). Modest gain.
  2. Different struct for per-backend pending vs. global (v7-0002): per-backend PendingBackendStats doesn't actually need histograms, so clone the struct without them. Clean win, restores PendingBackendStats to its original 2880 bytes.
  3. Lazy allocation of snapshot (v7-0004, v6-0002): only allocate the snapshot PgStat_IO when pgstat_io_snapshot_cb is actually called.
  4. Conditional flush (suggested by Ants): skip the histogram aggregation loop entirely when counts[...] == 0, with a matching conditional memset after lock release.
  5. Indirect offset table (Andres's preferred design, implemented in v9/v10): maintain a sparse [object][context][op] → slot-index table and only allocate histograms for the tuples that can actually be populated (sketched just below). One extra indirection at count time, but no added branches. Shared-memory usage dropped from 578KB to 361KB.
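
A hedged sketch of that indirection, with illustrative names; the real patch presumably derives the table from the pgstat_tracks_io_*() predicates:

    /* Maps each (object, context, op) tuple to a compact slot index, or -1
     * for tuples that can never be populated. Built once at startup. */
    static int16 io_hist_slot[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];

    /* Histograms exist only for legal tuples: nslots x IO_HIST_BUCKETS. */
    static PgStat_Counter (*io_hist)[IO_HIST_BUCKETS];

    static inline void
    io_hist_count(IOObject obj, IOContext ctx, IOOp op, int bucket)
    {
        /* Callers only get here for tuples pgstat_tracks_io_op() allows,
         * so the slot is always valid: one extra load, no extra branch. */
        io_hist[io_hist_slot[obj][ctx][op]][bucket]++;
    }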

The v10 patch finally bites the bullet and has the postmaster dynamically size the shared-memory allocation for the histogram slots, based on the answers to pgstat_tracks_io_*() queried at startup. This required restructuring the pgstat kind registration (which assumes fixed-size kinds) and was the piece Jakub most resisted ("I was afraid to touch that shm code, it looks complex"), but it is what ultimately satisfied Andres's memory objection.
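
A sketch of the sizing step under that design, assuming the existing pgstat_tracks_io_*() predicates; the helper name is hypothetical:

    #include "postgres.h"
    #include "miscadmin.h"
    #include "pgstat.h"

    /* Count the (backend_type, object, context, op) tuples that can ever
     * be populated, so the postmaster can size the shared histogram area. */
    static int
    count_trackable_io_slots(void)
    {
        int         nslots = 0;

        for (int bt = 0; bt < BACKEND_NUM_TYPES; bt++)
        {
            if (!pgstat_tracks_io_bktype((BackendType) bt))
                continue;       /* skips 4 backend types outright */

            for (int obj = 0; obj < IOOBJECT_NUM_TYPES; obj++)
                for (int ctx = 0; ctx < IOCONTEXT_NUM_TYPES; ctx++)
                    for (int op = 0; op < IOOP_NUM_TYPES; op++)
                        if (pgstat_tracks_io_op((BackendType) bt,
                                                (IOObject) obj,
                                                (IOContext) ctx,
                                                (IOOp) op))
                            nslots++;
        }

        return nslots;
    }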

Bucket Design Subtleties

Several bucketing questions surfaced. Alternative designs were considered and rejected: uint32 buckets with dynamic downshifting on overflow (the complexity wasn't worth the 2x memory savings), variable-width buckets (a bucket at bit N receives roughly half the population of bucket N+1, so it could get by with one fewer bit—rejected as over-engineering), and ddsketch-style dynamic range (too complex for v1).

Secondary Issues

Where the Thread Landed

By v10 (May 2026), the open items were largely resolved: memory footprint was back within ~60KB of master for shared stats (vs. +270KB in v5), CPU overhead was demonstrated to be under 1% even in adversarial benchmarks, and the bucket-range question had a path to expansion (to 24 buckets) if desired. The target release was PG20.

Who Carries What Weight

Andres Freund is the decisive technical authority in this thread—he is a committer, the author of both the AIO subsystem and much of the pgstat infrastructure this patch extends, and his objections on memory footprint shaped the entire back half of the design process. His willingness to tolerate some memory growth ("I think some increase here doesn't have to be fatal") paired with hard rejection of "mostly-zero" allocation defined the acceptance criteria.

Ants Aasma provided the most rigorous performance measurements in the thread (the Ryzen 9900X benchmarks with proper isolation) and operational perspective on why the upper bucket range matters. His suggestion to make histogram aggregation conditional on non-zero counts is in the final design.

Tomáš Vondra came in late with a code review and a thoughtful summary of the tradeoff space; as a committer working in adjacent territory (EXPLAIN I/O stats), his +1 signaled that the feature had cross-reviewer buy-in.