ReadStream Look-ahead Exhausts Local Buffers with High effective_io_concurrency
The Core Problem
PostgreSQL 18 introduced the ReadStream API (src/backend/storage/aio/read_stream.c) as the unified abstraction for prefetched/asynchronous buffer reads, replacing many ad-hoc PrefetchBuffer call sites. ReadStream sizes its look-ahead window (max_pinned_buffers) as a function of effective_io_concurrency: roughly max(io_combine_limit, effective_io_concurrency * some_factor), clamped by buffer-pool availability. For the shared buffer pool this is fine — shared buffers number in the thousands-to-millions and a single backend's stream cannot monopolize them.
For local buffers (backing temporary tables), the situation is categorically different:
- Local buffers are a per-backend, fixed-size pool governed by temp_buffers (default 1024 × 8 kB = 8 MB).
- The same backend that runs the seqscan on a temp table must also use local buffers for the TOAST heap and TOAST index of that table — there is no shared/local separation for temp relations' TOAST, because a temp table's TOAST table is itself temp.
- ReadStream's look-ahead happily pins up to max_pinned_buffers local buffers on the main heap seqscan. With effective_io_concurrency >= 64 and default temp_buffers, the clamp max_pinned_buffers = Min(..., num_temp_buffers) can still allow the stream to pin nearly all of them.
- When the executor then detoasts a wide VARCHAR/TEXT datum, index_getnext_slot on the TOAST index tries to pin a local buffer and fails with ERROR: no empty local buffer available (raised from GetLocalVictimBuffer in localbuf.c).
The repro in the original report is diabolically effective: a temp table with several multi-kilobyte TEXT/VARCHAR columns guarantees nearly every main-heap tuple triggers TOAST fetches, and with effective_io_concurrency=128 the seqscan's stream pins enough local buffers to starve the TOAST access path during the very first returned row.
Why This Is Architecturally Interesting
This is a regression specific to v18's ReadStream rollout. Under v17 and earlier, sequential scans did not aggressively prefetch via ReadStream, so local-buffer pinning was driven only by what the executor actively needed. The ReadStream design's implicit assumption — "I can safely pin max_pinned_buffers without starving other consumers" — holds for shared buffers (where the consumer is one of thousands of clients sharing a large pool) but is false for local buffers (where the consumer shares a tiny pool with itself, including re-entrant access paths like TOAST).
Feike Steenbergen's report that switching io_method from io_uring back to worker masks the problem is consistent with this analysis: different io_methods have slightly different pin-holding dynamics (worker completes I/Os off the critical path and releases pins faster), but the root cause — look-ahead sizing ignoring the re-entrant TOAST demand — is the same. It is not an io_method bug.
Proposed Solutions
Patch 1 — Induja Sreekanthan (Google): static caps for temp relations
Modifies ReadStream construction when the target relation uses local buffers:
- max_pinned_buffers is capped at 75% of num_temp_buffers, reserving 25% headroom for TOAST and index access paths.
- max_ios is capped at DEFAULT_EFFECTIVE_IO_CONCURRENCY (=1 prior to v18, now a small constant) to account for the possibility of multiple concurrent ReadStreams in the same backend (e.g. nested-loop with inner seqscan on another temp table, or bitmap heap scan plus TOAST fetch).
Strengths: simple, localized to stream construction, no per-iteration overhead. Weaknesses: 75% is a magic number; it does not adapt to the actual number of streams in flight; TOAST itself could use a ReadStream in the future and would also be capped to 75%, possibly leaving the reserve unused while the TOAST path itself starves.
Patch 2 — Eduard Stepanov (Tantor Labs): dynamic budget check via GetAdditionalLocalPinLimit()
Rather than a static percentage, the patch consults GetAdditionalLocalPinLimit() inside read_stream_look_ahead() itself. This function (already present in localbuf.c for analogous purposes elsewhere) reports the remaining pin budget given currently held local pins. The changes:
- Add a third predicate to the look-ahead loop's while condition that aborts further pinning when the dynamic limit is exhausted.
- Re-check after each pin (since each pin reduces the remaining budget).
- When the budget runs out mid-accumulation, flush the pending read immediately instead of waiting to reach io_combine_limit. This prevents deadlock-like stalls where the stream is holding pins but cannot progress because it cannot acquire one more to round out a combined read.
Strengths: self-adjusting; composes correctly with any other consumer of local buffers in the same backend (TOAST, another concurrently-open ReadStream, etc.); no arbitrary constant. Weaknesses: adds a function call inside the hot look-ahead loop; the early-flush behavior produces smaller I/Os under pressure, which is arguably correct but changes I/O shape.
Architecturally the second approach is more PostgreSQL-idiomatic: it mirrors how buffer-pool pressure is already handled elsewhere and it keeps the policy where the pinning happens rather than at stream-construction time. Committers reviewing this are likely to favor Stepanov's direction, possibly combined with a sanity floor (always allow at least io_combine_limit pins so streams can make forward progress).
Key Technical Insights
- Local buffers are re-entrant per backend. Any API that pins "up to N" local buffers without consulting GetAdditionalLocalPinLimit() is buggy by construction when N approaches temp_buffers.
- TOAST access paths are invisible to the planner's prefetch sizing. The ReadStream for the main heap has no idea the tuples it returns will trigger index+heap lookups into sibling relations sharing the same local pool.
- io_method is a red herring for this bug. The pin exhaustion is deterministic given enough look-ahead; io_uring vs worker only shifts the timing.
- The 75% heuristic ignores multi-stream scenarios. A merge join between two temp tables with ReadStreams on both sides would each take 75% — impossible. Stepanov's dynamic check handles this naturally.
- This is a v18 regression, not a latent issue, because ReadStream adoption is what raised effective look-ahead on seqscans of temp tables from ~0 to effective_io_concurrency-scaled values.
Status
As of the last message (Steenbergen, May 2026), the thread has a confirmed production-impact report, but no committer has yet picked up either patch. The fix is clearly required for the v18 stable branch; the remaining design question is the choice between the static-cap and dynamic-budget approaches (or a hybrid). Given Thomas Munro's authorship of ReadStream, his review would be decisive.