ReadStream Look-ahead Exhausts Local Buffers with High effective_io_concurrency
The Core Problem
PostgreSQL 18 introduced the ReadStream API (src/backend/storage/aio/read_stream.c) as the unified abstraction for prefetched/asynchronous buffer reads, replacing many ad-hoc PrefetchBuffer call sites. ReadStream sizes its look-ahead window (max_pinned_buffers) as a function of effective_io_concurrency: roughly max(io_combine_limit, effective_io_concurrency * some_factor), clamped by buffer-pool availability. For the shared buffer pool this is fine — shared buffers number in the thousands-to-millions and a single backend's stream cannot monopolize them.
For local buffers (backing temporary tables), the situation is categorically different:
- Local buffers are a per-backend, fixed-size pool governed by temp_buffers (default 1024 × 8 kB = 8 MB).
- The same backend that runs the seqscan on a temp table must also use local buffers for the TOAST heap and TOAST index of that table — there is no shared/local separation for temp relations' TOAST, because a temp table's TOAST table is itself temp.
- ReadStream's look-ahead happily pins up to max_pinned_buffers local buffers on the main heap seqscan. With effective_io_concurrency >= 64 and default temp_buffers, the clamp max_pinned_buffers = Min(..., num_temp_buffers) can still allow the stream to pin nearly all of them.
- When the executor then detoasts a wide VARCHAR/TEXT datum, index_getnext_slot on the TOAST index tries to pin a local buffer and fails with ERROR: no empty local buffer available (raised from GetLocalVictimBuffer in localbuf.c).
The repro in the original report is diabolically effective: a temp table with several multi-kilobyte TEXT/VARCHAR columns guarantees nearly every main-heap tuple triggers TOAST fetches, and with effective_io_concurrency=128 the seqscan's stream pins enough local buffers to starve the TOAST access path during the very first returned row.
Why This Is Architecturally Interesting
This is a regression specific to v18's ReadStream rollout. Under v17 and earlier, sequential scans did not aggressively prefetch via ReadStream, so local-buffer pinning was driven only by what the executor actively needed. The ReadStream design's implicit assumption — "I can safely pin max_pinned_buffers without starving other consumers" — holds for shared buffers (where the consumer is one of thousands of clients sharing a large pool) but is false for local buffers (where the consumer shares a tiny pool with itself, including re-entrant access paths like TOAST).
Feike Steenbergen's report that switching io_method from io_uring back to worker masks the problem is consistent with this analysis: different io_methods have slightly different pin-holding dynamics (worker completes I/Os off the critical path and releases pins faster), but the root cause — look-ahead sizing ignoring the re-entrant TOAST demand — is the same. It is not an io_method bug.
Proposed Solutions
Patch 1 — Induja Sreekanthan (Google): static caps for temp relations
Modifies ReadStream construction when the target relation uses local buffers:
- max_pinned_buffers is capped at 75% of num_temp_buffers, reserving 25% headroom for TOAST and index access paths.
- max_ios is capped at DEFAULT_EFFECTIVE_IO_CONCURRENCY (=1 prior to v18, now a small constant) to account for the possibility of multiple concurrent ReadStreams in the same backend (e.g. nested-loop with inner seqscan on another temp table, or bitmap heap scan plus TOAST fetch).
Strengths: simple, localized to stream construction, no per-iteration overhead. Weaknesses: 75% is a magic number; it does not adapt to the actual number of streams in flight; TOAST itself could use a ReadStream in the future and would also be capped to 75%, possibly leaving the reserve unused while the TOAST path itself starves.
Patch 2 — Eduard Stepanov (Tantor Labs): dynamic budget check via GetAdditionalLocalPinLimit()
Rather than a static percentage, the patch consults GetAdditionalLocalPinLimit() inside read_stream_look_ahead() itself. This function (already present in localbuf.c for analogous purposes elsewhere) reports the remaining pin budget given currently held local pins. The changes:
- Add a third predicate to the look-ahead loop's while condition that aborts further pinning when the dynamic limit is exhausted.
- Re-check after each pin (since each pin reduces the remaining budget).
- When the budget runs out mid-accumulation, flush the pending read immediately instead of waiting to reach io_combine_limit. This prevents deadlock-like stalls where the stream is holding pins but cannot progress because it cannot acquire one more to round out a combined read.
Strengths: self-adjusting; composes correctly with any other consumer of local buffers in the same backend (TOAST, another concurrently-open ReadStream, etc.); no arbitrary constant. Weaknesses: adds a function call inside the hot look-ahead loop; the early-flush behavior produces smaller I/Os under pressure, which is arguably correct but changes I/O shape.
Architecturally the second approach is more PostgreSQL-idiomatic: it mirrors how buffer-pool pressure is already handled elsewhere and it keeps the policy where the pinning happens rather than at stream-construction time. Committers reviewing this are likely to favor Stepanov's direction, possibly combined with a sanity floor (always allow at least io_combine_limit pins so streams can make forward progress).
Key Technical Insights
- Local buffers are re-entrant per backend. Any API that pins "up to N" local buffers without consulting GetAdditionalLocalPinLimit() is buggy by construction when N approaches temp_buffers.
- TOAST access paths are invisible to the planner's prefetch sizing. The ReadStream for the main heap has no idea the tuples it returns will trigger index+heap lookups into sibling relations sharing the same local pool.
- io_method is a red herring for this bug. The pin exhaustion is deterministic given enough look-ahead; io_uring vs worker only shifts the timing.
- The 75% heuristic ignores multi-stream scenarios. A merge join between two temp tables with ReadStreams on both sides would each take 75% — impossible. Stepanov's dynamic check handles this naturally.
- This is a v18 regression, not a latent issue, because ReadStream adoption is what raised effective look-ahead on seqscans of temp tables from ~0 to effective_io_concurrency-scaled values.
Status
As of the last message (Steenbergen, May 2026), the thread has a confirmed production-impact report, but no committer has yet picked up either patch. The fix is clearly required for the v18 stable branch; the remaining design question is the choice between the static-cap and dynamic-budget approaches (or a hybrid). Given Thomas Munro's authorship of ReadStream, his review would be decisive.