Bypassing cursors in postgres_fdw to enable parallel plans

First seen: 2025-01-06 08:52:10+00:00 · Messages: 24 · Participants: 5

Latest Update

2026-05-14 · claude-opus-4-6

Bypassing Cursors in postgres_fdw to Enable Parallel Plans

The Core Problem

PostgreSQL's postgres_fdw uses SQL-level cursors (DECLARE CURSOR / FETCH) to retrieve data from remote servers. This is a deliberate design choice that enables incremental fetching (controlled by fetch_size), preventing the local server from having to buffer entire remote result sets in memory. However, this cursor-based approach has a critical limitation: it prevents parallel query execution on the remote server.

The root cause lies in ExecutePlan() in the executor:

/* In ExecutePlan(): parallelism must be disabled when the plan has already
 * been (partially) executed or when only a limited number of tuples is
 * requested; both happen when fetching from a cursor. */
if (queryDesc->already_executed || numberTuples != 0)
    use_parallel_mode = false;

When a cursor is used, the executor is started and stopped for each FETCH operation. Parallel workers cannot be kept running across these stop/start cycles because:

  1. It would be inefficient to maintain idle workers between fetches
  2. It would be incorrect — state changes in the leader process between fetches would not be reflected in workers, leading to data corruption

This means any query pushed down to a remote server via postgres_fdw is forced into sequential execution, even when the remote table is large enough to benefit enormously from parallel scans.

The Proposed Solution: Streaming Fetch Mode

The patch introduces an alternative fetch mechanism that bypasses cursors entirely. Instead of DECLARE CURSOR + FETCH, it uses:

  1. PQsendQueryParams() — sends the query directly (no cursor declaration)
  2. PQsetChunkedRowsMode(fetch_size) — a libpq feature that delivers results in chunks of fetch_size rows, providing similar flow-control benefits to cursor-based fetching without requiring a SQL cursor

This allows the remote server to execute the full query in one shot, enabling its planner to choose parallel plans when beneficial.
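
To make the mechanism concrete, here is a minimal standalone libpq sketch of the same idea (a sketch, not the patch's code): the query is sent without declaring a cursor and the result is read back in fetch_size-row chunks. It assumes libpq from PostgreSQL 17 or later, where PQsetChunkedRowsMode() and the PGRES_TUPLES_CHUNK result status exist; the stream_query() helper is hypothetical.

#include <stdio.h>
#include <libpq-fe.h>

/* Hypothetical helper: run `query` on an open connection, consuming the
 * result in chunks of fetch_size rows instead of via a remote cursor. */
static void
stream_query(PGconn *conn, const char *query, int fetch_size)
{
    PGresult *res;

    /* Send the whole query at once; the remote planner sees it in full
     * and may choose a parallel plan. */
    if (!PQsendQueryParams(conn, query, 0, NULL, NULL, NULL, NULL, 0))
    {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        return;
    }

    /* Ask libpq to deliver the result in fetch_size-row chunks, giving
     * flow control similar to FETCH without a SQL-level cursor. */
    if (!PQsetChunkedRowsMode(conn, fetch_size))
    {
        fprintf(stderr, "could not enable chunked rows mode\n");
        return;
    }

    /* Each chunk arrives as a PGRES_TUPLES_CHUNK result; a final
     * PGRES_TUPLES_OK (with zero rows) marks the end of the result set. */
    while ((res = PQgetResult(conn)) != NULL)
    {
        if (PQresultStatus(res) == PGRES_TUPLES_CHUNK)
            printf("chunk of %d rows\n", PQntuples(res));
        else if (PQresultStatus(res) != PGRES_TUPLES_OK)
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);           /* always free PGresult objects */
    }
}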

The Key Architectural Challenge: Overlapping Scans

The fundamental difficulty is handling overlapping scans on the same connection. In cursor mode, multiple cursors can coexist on a single connection — for example, during a nested-loop join where both the outer and inner relations are foreign tables on the same server. Each cursor maintains independent state.

Without cursors, there's only one "active query" per connection at a time. If scan A is streaming results and scan B needs to use the same connection, scan A's remaining results must be drained and buffered locally before scan B can proceed.

Evolution of the Buffering Strategy

The approach evolved significantly through review:

  1. v1: Global only_query flag (session-lifetime, applied to all connections) — rejected as too coarse
  2. v2: Per-connection num_queries counter with increment/decrement — flawed on error paths
  3. v3: Always buffer first query in tuplestore — simpler but wasteful for single-scan cases
  4. v4: Linked list of PgFdwScanState structs in PgFdwConnState — over-engineered
  5. v5/v6 (current): Single active_scan pointer in PgFdwConnState — correct minimal design

The final design maintains a pointer to the currently-active scan (the one whose results are on the wire). When another scan needs the connection, save_to_tuplestore() drains remaining results into a Tuplestorestate in the active scan's PgFdwScanState. The pointer is then cleared, allowing the new scan to proceed.
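
A compressed sketch of that handshake follows. It is illustrative only: the real PgFdwScanState and PgFdwConnState carry many more fields, the drain loop stands in for the patch's save_to_tuplestore(), and yield_connection() is a hypothetical name.

#include "postgres.h"
#include "miscadmin.h"              /* work_mem */
#include "utils/tuplestore.h"
#include "libpq-fe.h"

typedef struct ScanSketch
{
    PGconn          *conn;          /* connection this scan streams from */
    Tuplestorestate *buffered;      /* set only if the scan had to yield */
} ScanSketch;

typedef struct ConnStateSketch
{
    ScanSketch      *active_scan;   /* scan whose results are still on the wire */
} ConnStateSketch;

/* Called before a different scan may use the connection: drain whatever the
 * active scan still has on the wire into its tuplestore, then clear the
 * pointer so the connection is free for the new scan. */
static void
yield_connection(ConnStateSketch *connstate)
{
    ScanSketch *old = connstate->active_scan;
    PGresult   *res;

    if (old == NULL)
        return;                     /* nothing on the wire, nothing to drain */

    if (old->buffered == NULL)
        old->buffered = tuplestore_begin_heap(true, false, work_mem);

    /* Stand-in for save_to_tuplestore(): pull the remaining chunks and
     * materialize them locally. */
    while ((res = PQgetResult(old->conn)) != NULL)
    {
        /* ... convert each row and tuplestore_puttupleslot() it ... */
        PQclear(res);
    }

    connstate->active_scan = NULL;  /* connection is now free */
}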

Critical Design Decisions and Tradeoffs

1. Tuplestore Buffering vs. Memory Usage

When overlapping scans exist, one scan's results must be fully materialized. This trades memory (spilling to disk once the tuplestore exceeds work_mem) for the ability to use the connection for other scans. Performance testing nevertheless showed this is often a net win: the remote query can still run under a parallel plan, the per-FETCH round trips of cursor mode disappear, and rescans can often be satisfied by rewinding the buffered tuplestore instead of re-querying the remote server.

2. GUC vs. Server/Table Option

The feature evolved from a GUC (postgres_fdw.use_cursor) to a server/table-level option (streaming_fetch), aligning with the existing fetch_size option. This is architecturally correct because different foreign servers may have different characteristics.

3. Rescan Handling

In cursor mode, rescans have three strategies: CLOSE+re-execute, MOVE BACKWARD ALL, or rewind if all tuples are cached. In streaming mode, only two are possible: discard tuplestore and re-execute, or rewind the tuplestore. The MOVE BACKWARD ALL optimization is cursor-specific.
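
As a rough illustration of the two remaining strategies (a sketch only: the struct and resend_remote_query() are hypothetical, while tuplestore_rescan() and tuplestore_end() are the actual tuplestore API):

#include "postgres.h"
#include "utils/tuplestore.h"

typedef struct StreamingScanSketch
{
    Tuplestorestate *buffered;      /* non-NULL once rows were materialized */
    bool             eof_reached;   /* true if every remote row was pulled */
} StreamingScanSketch;

/* Hypothetical stand-in for re-issuing the query without a cursor
 * (PQsendQueryParams + PQsetChunkedRowsMode, as sketched earlier). */
static void
resend_remote_query(StreamingScanSketch *scan)
{
    (void) scan;
}

static void
streaming_mode_rescan(StreamingScanSketch *scan)
{
    if (scan->buffered != NULL && scan->eof_reached)
    {
        /* Everything is already in the tuplestore: rewind and replay it. */
        tuplestore_rescan(scan->buffered);
    }
    else
    {
        /* Otherwise drop any partial buffer and re-execute on the remote. */
        if (scan->buffered != NULL)
            tuplestore_end(scan->buffered);
        scan->buffered = NULL;
        scan->eof_reached = false;
        resend_remote_query(scan);
    }
}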

4. Active Scan Pointer Semantics

Robert Haas emphasized that active_scan must be non-NULL only when there are results actually on the wire that need draining. If a rescan has already consumed all pending results, the pointer must be cleared. Failure to maintain this invariant leads to unnecessary save_to_tuplestore() calls and the cascading bug of needing to resend queries (which requires an ExprContext that isn't available).

5. GUC Mutability

The GUC/option can change mid-session. Each scan must remember whether it was started with or without a cursor and handle rescans/end-scans based on that remembered state, not the current option value. This prevents catastrophic state confusion.
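
A minimal sketch of that latching rule, with hypothetical field and function names (not the patch's):

#include <stdbool.h>

typedef struct FetchModeSketch
{
    bool started_with_cursor;       /* latched exactly once at scan start */
} FetchModeSketch;

/* At scan start: consult the option once and remember the answer. */
static void
begin_scan(FetchModeSketch *scan, bool streaming_fetch_option)
{
    scan->started_with_cursor = !streaming_fetch_option;
}

/* At rescan/end-scan: branch on the remembered state, never on the current
 * option value, which may have changed since the scan began. */
static void
end_scan(FetchModeSketch *scan)
{
    if (scan->started_with_cursor)
    {
        /* CLOSE the remote cursor. */
    }
    else
    {
        /* Drain or discard the streamed results / tuplestore. */
    }
}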

Unresolved Issues

  1. ctid handling: Alexander Pyhalov noted that TTSOpsMinimalTuple tuplestores cannot store tuple ctids, which affects certain postgres_fdw operations. The patch doesn't address this.

  2. Async mode interaction: The patch disables async execution (async_capable = false) when streaming fetch is used, as the two features are incompatible in the current design.

  3. Error handling: Proper cleanup of connection state on errors remains tricky — save_to_tuplestore() must handle partial result sets gracefully, and PGresult objects must always be freed.
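
One defensive pattern for the last point, sketched here under the assumption that draining happens in backend code where PG_TRY/PG_FINALLY (available since PostgreSQL 13) are usable; drain_connection_safely() is a hypothetical name, not the patch's:

#include "postgres.h"
#include "libpq-fe.h"

static void
drain_connection_safely(PGconn *conn)
{
    PGresult *volatile res = NULL;  /* volatile: modified across PG_TRY */

    PG_TRY();
    {
        while ((res = PQgetResult(conn)) != NULL)
        {
            /* ... stash or discard the chunk's rows; this may error out ... */
            PQclear(res);
            res = NULL;
        }
    }
    PG_FINALLY();
    {
        if (res != NULL)
            PQclear(res);           /* never leak a PGresult on the error path */
    }
    PG_END_TRY();
}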

Performance Results

Scenario                                             Cursor Mode   Streaming Mode   Speedup
COUNT(*) on 990K rows (parallel enabled)                   62 ms            24 ms   2.6x
Nested-loop join (990K × 990K, rescan-heavy)          112,825 ms           389 ms   290x
Self-join merge (990K rows, disk spill)                 4,661 ms         3,527 ms   1.3x
Simple sequential scan (20M rows, no parallelism)      20,090 ms        24,103 ms   0.83x (regression)

The worst case (simple large sequential scan without parallel benefit) shows ~20% regression due to tuplestore overhead. The best case (rescan-heavy workloads) shows dramatic improvement from eliminating cursor management round-trips.