Skip prefetch for block references that follow a FPW or WILL_INIT of the same block

First seen: 2026-03-24 16:18:09+00:00 · Messages: 2 · Participants: 1

Latest Update

2026-05-08 · opus 4.7

Core Problem: Redundant fadvise64() Calls in WAL Recovery Prefetcher

PostgreSQL's recovery prefetcher (xlogprefetcher.c, introduced in PG15 via commit 3985b600 by Thomas Munro) scans ahead in the WAL stream during crash recovery / standby replay and issues posix_fadvise(WILLNEED) for referenced blocks so the kernel can begin async I/O before the redo routine actually needs the buffer. This dramatically reduces I/O stalls during recovery on latency-bound storage.
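The advisory call at the heart of this is ordinary `posix_fadvise(WILLNEED)`. A minimal sketch of what one per-block hint looks like (the helper name `prefetch_block` and the 8192-byte block size constant are illustrative, not PostgreSQL code):

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>

#define BLCKSZ_DEMO 8192        /* PostgreSQL's default page size */

/*
 * Hint the kernel to start reading one block asynchronously, as the
 * recovery prefetcher does per referenced block. Returns 0 on success.
 */
static int
prefetch_block(int fd, long blockno)
{
    return posix_fadvise(fd, (off_t) blockno * BLCKSZ_DEMO,
                         BLCKSZ_DEMO, POSIX_FADV_WILLNEED);
}
```

The call never blocks on the data itself; it only schedules readahead, which is exactly why issuing it redundantly is pure overhead rather than a correctness problem.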

The prefetcher already handles two important cases where prefetch is pointless:

  1. Full Page Writes (FPW / BKPIMAGE_APPLY) — the record itself contains the entire 8KB page image, so the pre-replay on-disk contents are irrelevant. The redo routine will overwrite the page wholesale.
  2. WILL_INIT (BKPBLOCK_WILL_INIT) — the redo routine will zero-initialize the block (e.g., new heap page extension, new index page). Again, on-disk contents don't matter.

In both cases, XLogPrefetcherNextBlock() skips issuing the fadvise and bumps skip_fpw / skip_init counters.
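The two skip cases above can be sketched as a single predicate (a simplified model; the flag names follow xlogrecord.h, but the values and the `BlockRef` struct here are invented for illustration):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; not the server's actual encodings. */
#define BKPBLOCK_WILL_INIT 0x40
#define BKPIMAGE_APPLY     0x02

typedef struct BlockRef
{
    bool    has_image;      /* record carries a full-page image */
    uint8_t image_flags;    /* BKPIMAGE_* */
    uint8_t block_flags;    /* BKPBLOCK_* */
} BlockRef;

/*
 * True when prefetching the on-disk page would be wasted work: either a
 * restorable FPW or a WILL_INIT redo will overwrite the page wholesale.
 */
static bool
prefetch_is_pointless(const BlockRef *ref)
{
    if (ref->has_image && (ref->image_flags & BKPIMAGE_APPLY))
        return true;        /* counted as skip_fpw */
    if (ref->block_flags & BKPBLOCK_WILL_INIT)
        return true;        /* counted as skip_init */
    return false;
}
```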

The Bug / Missed Optimization

The prefetcher also maintains a small LRU window of recently-prefetched blocks, sized by XLOGPREFETCHER_SEQ_WINDOW_SIZE (currently 4 entries), so that when the same block is referenced again a few records later, the second reference hits skip_rep rather than issuing a redundant fadvise. The window is tiny because it is scanned linearly on every block reference, and expanding it would burn CPU in the hot redo path.
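The window's mechanics amount to a tiny ring buffer with a linear membership scan. A simplified model (the real structure in xlogprefetcher.c keys on relfilenode, fork, and block number; the types here are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_WINDOW_SIZE 4   /* mirrors XLOGPREFETCHER_SEQ_WINDOW_SIZE */

/* Illustrative key; a real key also distinguishes the relation fork. */
typedef struct BlockKey
{
    uint32_t rel;
    uint32_t blockno;
} BlockKey;

typedef struct RecentWindow
{
    BlockKey slots[SEQ_WINDOW_SIZE];    /* zeroed slots stand in for "empty" */
    int      next;                      /* ring-buffer insertion point */
} RecentWindow;

/* Linear scan: four comparisons at most, which is why enlarging the
 * window is not free in the redo hot path. */
static bool
window_contains(const RecentWindow *w, BlockKey key)
{
    for (int i = 0; i < SEQ_WINDOW_SIZE; i++)
        if (w->slots[i].rel == key.rel && w->slots[i].blockno == key.blockno)
            return true;
    return false;
}

static void
window_insert(RecentWindow *w, BlockKey key)
{
    w->slots[w->next] = key;
    w->next = (w->next + 1) % SEQ_WINDOW_SIZE;
}
```

Note the eviction behavior this implies: four insertions of other blocks are enough to push an entry out, which is exactly the many-indexes weakness discussed below under tradeoffs.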

The defect: when a block was skipped via the FPW or WILL_INIT path, it was not inserted into the recent-block window. Consequently, a subsequent WAL record touching the same block (with no FPW attached — which is the common case within a single checkpoint cycle) would fall through to PrefetchSharedBuffer() and issue an fadvise. That fadvise is strictly wasted work: either the page is already in shared_buffers (because the FPW redo just wrote it there), or it's about to be overwritten anyway.

This pattern is endemic to high-volume append workloads. After a checkpoint, the first touch of each heap page emits an FPW; the following hundreds of tuple-insert records target the same page until it fills up. Every one of those subsequent inserts was issuing a pointless fadvise64 syscall.

The Proposed Fix

The patch is conceptually trivial: in the FPW-skip and WILL_INIT-skip branches of XLogPrefetcherNextBlock(), record the block into the recent-block window as if a prefetch had been issued. The existing duplicate-detection loop then naturally suppresses the follow-on references, reclassifying them as skip_rep.

No new data structures are introduced; the existing 4-slot sequential window is reused. This keeps the per-block CPU cost of the prefetcher unchanged — critical, because the prefetcher sits directly in the redo-apply critical path and any regression here slows down every recovery.
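Putting it together, the fix is one extra insertion in each skip branch. A self-contained sketch (function, struct, and counter names are illustrative, modeled loosely on xlogprefetcher.c and the pg_stat_recovery_prefetch columns, not the actual patch):

```c
#include <stdbool.h>
#include <stdint.h>

#define SEQ_WINDOW_SIZE 4   /* mirrors XLOGPREFETCHER_SEQ_WINDOW_SIZE */

typedef struct BlockKey { uint32_t rel; uint32_t blockno; } BlockKey;

typedef struct RecentWindow
{
    BlockKey slots[SEQ_WINDOW_SIZE];
    int      next;
} RecentWindow;

/* Counters modeled on pg_stat_recovery_prefetch's skip_* columns. */
typedef struct Stats
{
    uint64_t skip_rep, skip_fpw, skip_init, prefetched;
} Stats;

static bool
window_contains(const RecentWindow *w, BlockKey k)
{
    for (int i = 0; i < SEQ_WINDOW_SIZE; i++)
        if (w->slots[i].rel == k.rel && w->slots[i].blockno == k.blockno)
            return true;
    return false;
}

static void
window_insert(RecentWindow *w, BlockKey k)
{
    w->slots[w->next] = k;
    w->next = (w->next + 1) % SEQ_WINDOW_SIZE;
}

/*
 * Per-block decision, post-fix: the FPW and WILL_INIT branches now also
 * record the block, so follow-on references land in skip_rep instead of
 * reaching the fadvise path.
 */
static void
next_block(RecentWindow *w, Stats *st, BlockKey key, bool fpw, bool will_init)
{
    if (window_contains(w, key))
    {
        st->skip_rep++;
        return;
    }
    if (fpw)
    {
        st->skip_fpw++;
        window_insert(w, key);      /* the fix */
        return;
    }
    if (will_init)
    {
        st->skip_init++;
        window_insert(w, key);      /* the fix */
        return;
    }
    st->prefetched++;               /* would issue posix_fadvise(WILLNEED) */
    window_insert(w, key);
}
```

Replaying the post-checkpoint append pattern through this model (one FPW first touch, then many plain inserts to the same page) produces one skip_fpw, all follow-ons as skip_rep, and zero fadvise calls, which is the entire effect of the patch.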

Measured Impact

On an insert-only, no-index workload generating 10 GB of WAL:

Metric                          Baseline     Patched    Delta
fadvise syscalls                1,204,992    122,753    −89.8%
skip_rep count                  80.02M       81.11M     +1.09M
Redo time (NVMe)                37.3s        25.8s      −31%
Redo time (2ms-latency disk)    188.0s       60.0s      −68% (3.1×)
System CPU                      9.38s        3.39s      −64%

The ~1.09M additional skip_rep hits correspond almost exactly to the ~1.08M eliminated fadvise calls — confirming the mechanism. The dramatic win on the high-latency disk is noteworthy: even though fadvise is "just" a syscall, at scale the kernel work (radix-tree lookups, readahead window adjustments, potential I/O submission for pages already in pagecache) becomes a first-order cost.

Key Technical Tradeoffs and Limitations

Why the Window is Only 4 Entries

XLOGPREFETCHER_SEQ_WINDOW_SIZE = 4 is hardcoded and scanned linearly. Satya explicitly notes this is the patch's soft spot: in workloads with >4 indexes on a table, each index touch pushes older entries out of the window before the next reference to the same heap page arrives. In such cases, duplicates within the lookahead distance can still leak through to fadvise.

Satya considered two alternative improvements but deliberately deferred them:

  1. Enlarging the window proportional to maintenance_io_concurrency — linear scan cost grows.
  2. Replacing the linear window with a hash table — better asymptotic behavior but adds allocation, hashing cost, and complexity to the critical path.

The decision to ship the minimal fix first is the right engineering call: it captures the common case (append-heavy OLTP, bulk loads, vacuum-generated WAL on pages without many indexes) with zero added CPU cost, and leaves the door open for a more aggressive follow-up.

Correctness Argument

The change is safe because the recent-block window is purely an optimization — it only causes the prefetcher to not call fadvise. fadvise(WILLNEED) is itself only advisory; skipping it cannot cause incorrect recovery. The actual block read happens later via the normal ReadBuffer() path in the redo routine. Moreover, after an FPW or WILL_INIT, the block is guaranteed to be resident in shared_buffers once that record is applied, so a later reference won't even hit the OS page cache — making the prefetch doubly useless.

One subtlety: the prefetcher runs ahead of the replayer. Between the moment the FPW reference is observed by the prefetcher and the moment a later reference to the same block is observed, the FPW itself may not yet have been applied. But that's irrelevant — the later reference will, by construction, also not need the old on-disk contents (because the FPW will be applied before the later record is). So suppressing the prefetch is still correct.

Thread Status

This appears to be a single-author patch with a very small scope, posted in March 2026 and rebased in May 2026 with no recorded feedback in the two messages shown. The benchmark numbers are compelling enough that absent objections on the window-sizing question, this is the kind of focused optimization that tends to get committed after a modest review pass. The cross-reference to thread [1] suggests Satya encountered this while reviewing a related WAL-prefetch patch from another author, which is how such micro-optimizations typically surface.