index prefetching

First seen: 2023-06-08 15:40:12+00:00 · Messages: 488 · Participants: 19

Latest Update

2026-06-01 · claude-opus-4-6

Index Prefetching for PostgreSQL: Deep Technical Analysis

The Core Problem

PostgreSQL's index scan execution model has a fundamental I/O inefficiency: when performing a regular index scan (btree, hash, GiST, SP-GiST), the executor reads TIDs from index leaf pages and then issues synchronous, one-at-a-time heap page fetches. On datasets larger than RAM, this means each heap tuple fetch potentially incurs a random I/O wait, completely failing to leverage modern storage devices that require deep I/O queues (16-256+ concurrent requests) to deliver their full bandwidth.

Bitmap heap scans already solve this for their use case by sorting TIDs and prefetching heap pages. But plain index scans—which are required for ordered output, distance queries, and LIMIT operations—have no such optimization. The performance gap between a bitmap scan and an equivalent index scan can easily be 10x on cold data.

Evolution of the Design (Three Major Approaches)

Approach 1: AM-Level Prefetching (June 2023 - Early 2024)

The initial PoC did prefetching directly within each index AM (e.g., btree's _bt_readpage). Each AM would look ahead in its current leaf page's item array and issue PrefetchBuffer() calls for upcoming heap blocks.

Advantages: Simple, self-contained per AM. Fatal flaws:

Approach 2: Executor-Level Prefetching (Late 2023 - Mid 2024)

Moved prefetching entirely to the executor/indexam.c layer, using a queue of TIDs read ahead via index_getnext_tid(). A "prefetch queue" accumulated TIDs, issued prefetches, and returned them to the caller.

Advantages: AM-agnostic, worked with any index type. Fatal flaw: Completely broke kill_prior_tuple. By reading TIDs ahead, the executor desynchronized the index scan position from the heap scan position. When the executor determined a tuple was dead (and should be LP_DEAD-marked in the index), the relevant index leaf page might have already been unpinned. This violated the fundamental index AM locking protocol described in "Index Locking Considerations."

Approach 3: The "Complex" Batch Interface (August 2024 - Present)

The final design introduces a new amgetbatch() index AM callback that returns a "batch" of TIDs corresponding to a single leaf page. This is paired with a restructured table AM interface where heapam directly controls the read stream for heap prefetching.

Key insight: The batch boundary (one leaf page) provides a natural synchronization point for kill_prior_tuple—dead items are accumulated per-batch and processed when the batch is freed (before reading the next leaf page). The index AM never needs to hold pins beyond what the table AM instructs.

Architecture:

Executor (nodeIndexscan.c)
  → table_index_getnext_slot() [new slot-based table AM interface]
    → heapam_index_getnext_slot() [heapam implementation]
      → read_stream [for heap page prefetching]
        → heapam_getnext_stream() [callback returns next block from batch]
      → amgetbatch() [loads next leaf page into batch ring buffer]
      → amfreebatch() / amreleasebatch() [releases batch resources]

Critical Technical Decisions and Tradeoffs

1. Batch Ring Buffer and Resource Management

The patch maintains a ring buffer of loaded batches (currently max 64-128 slots). The read stream callback can look ahead across multiple loaded batches. When the ring buffer fills, the stream pauses (using Thomas Munro's read_stream_pause/resume mechanism) rather than resetting—avoiding the catastrophic distance collapse that plagued earlier versions.

Tradeoff: More batches = deeper prefetch possible, but more memory and more index leaf page pins held simultaneously. The current limit of ~128 batches balances memory usage against the need to maintain adequate I/O queue depth on high-latency storage (cloud SSDs with 0.5-3.5ms latency need queue depths of 50-100+).

2. Dropping Index Page Pins Eagerly (The "dropPin" Generalization)

For plain index scans with MVCC snapshots, nbtree has dropped leaf page pins since 2015 (commit 2ed5b87f). The patch generalizes this to all amgetbatch index AMs, including during index-only scans. This is critical because held index page pins would interfere with the read stream's buffer pin management.

For index-only scans, the trick is: cache visibility map information for all items in a batch while still holding the leaf page pin, then drop the pin. This prevents races where VACUUM concurrently sets a page all-visible while recycling TIDs that the scan hasn't yet processed. The patch uses "fake LSNs" (borrowed from GiST) for unlogged relations where real WAL LSNs aren't available.

3. Read Stream Distance Heuristics

A major source of regressions was the read stream's distance (look-ahead window) collapsing to 1-2 for workloads with high cache hit ratios. The pattern (miss, hit, miss, hit, ...) causes distance to oscillate between 1 and 2 indefinitely with the naive distance * 2 / distance - 1 algorithm.

Multiple fixes were explored:

4. The "Priorbatch" Optimization

To avoid regressing ultra-short queries (like pgbench SELECT returning 1 row), the patch delays creating a read stream until after the scan has already consumed one full batch. This means point queries never pay the cost of read stream initialization (~3-5% overhead for tiny queries).

5. Yielding (Controversial, Eventually Removed)

Peter introduced "yielding" in the read stream callback to keep the scan responsive—preventing it from reading too far ahead for LIMIT/merge join queries. This was ultimately removed because it interfered with I/O combining and prevented adequate prefetch distance on sequential workloads. The replacement relies on better distance heuristics and planner hints (ExecSetTupleBound).

6. Table AM Layering Revolution

The patch fundamentally restructures the table AM interface:

Performance Results

On cold data (larger than RAM) with random access patterns:

The gains scale with storage latency—cloud storage with 1-3ms latency sees even larger improvements.

Key Unresolved Issues (as of the thread's end)

  1. GiST/SP-GiST ordered scans don't fit the leaf-page-at-a-time batch model (they use pairing heaps). These will likely continue using amgettuple for now.
  2. Optimal batch count for high-latency storage remains empirically determined rather than theoretically grounded.
  3. Backward scan performance is inherently worse due to SSDs not optimizing reverse-sequential access and lack of OS readahead for descending patterns.
  4. The "simple" vs "complex" debate is resolved in favor of the complex approach, which can read ahead across leaf pages—critical for indexes with few distinct heap blocks per leaf (like pgbench_accounts_pkey with only 6 distinct heap blocks per leaf page).