Bypassing Cursors in postgres_fdw to Enable Parallel Plans
The Core Problem
PostgreSQL's postgres_fdw uses SQL-level cursors (DECLARE CURSOR / FETCH) to retrieve data from remote servers. This is a deliberate design choice that enables incremental fetching (controlled by fetch_size), preventing the local server from having to buffer entire remote result sets in memory. However, this cursor-based approach has a critical limitation: it prevents parallel query execution on the remote server.
The root cause lies in ExecutePlan() in the executor:
```c
/* In ExecutePlan(): parallel mode is forced off when the executor has already
 * been run once, or when a nonzero row-count limit means it may exit early. */
if (queryDesc->already_executed || numberTuples != 0)
    use_parallel_mode = false;
```
When a cursor is used, the executor is started and stopped for each FETCH operation. Parallel workers cannot be kept running across these stop/start cycles because:
- It would be inefficient to maintain idle workers between fetches
- It would be incorrect — state changes in the leader process between fetches would not be reflected in workers, leading to data corruption
This means any query pushed down to a remote server via postgres_fdw is forced into sequential execution, even when the remote table is large enough to benefit enormously from parallel scans.
The Proposed Solution: Streaming Fetch Mode
The patch introduces an alternative fetch mechanism that bypasses cursors entirely. Instead of DECLARE CURSOR + FETCH, it uses:
- PQsendQueryParams() — sends the query directly (no cursor declaration)
- PQsetChunkedRowsMode(fetch_size) — a libpq feature that delivers results in chunks of fetch_size rows, providing similar flow-control benefits to cursor-based fetching without requiring a SQL cursor
This allows the remote server to execute the full query in one shot, enabling its planner to choose parallel plans when beneficial.
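For illustration, a minimal standalone libpq client using this fetch pattern might look like the sketch below. The table name, chunk size, and error handling are placeholders, and PQsetChunkedRowsMode() requires libpq from PostgreSQL 17 or later; this is not code from the patch:

```c
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    /* Connect using libpq environment defaults (PGHOST, PGDATABASE, ...). */
    PGconn     *conn = PQconnectdb("");
    PGresult   *res;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* Send the query directly; no DECLARE CURSOR / FETCH round-trips. */
    if (!PQsendQueryParams(conn, "SELECT id FROM big_table",
                           0, NULL, NULL, NULL, NULL, 0))
    {
        fprintf(stderr, "send failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* Ask libpq to hand results back in chunks of (at most) 100 rows,
     * giving flow control similar to FETCH 100 from a cursor. */
    if (!PQsetChunkedRowsMode(conn, 100))
        fprintf(stderr, "could not enable chunked rows mode\n");

    /* Each chunk arrives as a PGRES_TUPLES_CHUNK result; a final zero-row
     * PGRES_TUPLES_OK result marks the end of the result set. */
    while ((res = PQgetResult(conn)) != NULL)
    {
        ExecStatusType st = PQresultStatus(res);

        if (st == PGRES_TUPLES_CHUNK)
        {
            for (int i = 0; i < PQntuples(res); i++)
                printf("%s\n", PQgetvalue(res, i, 0));
        }
        else if (st != PGRES_TUPLES_OK)
            fprintf(stderr, "query failed: %s", PQerrorMessage(conn));

        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}
```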
The Key Architectural Challenge: Overlapping Scans
The fundamental difficulty is handling overlapping scans on the same connection. In cursor mode, multiple cursors can coexist on a single connection — for example, during a nested-loop join where both the outer and inner relations are foreign tables on the same server. Each cursor maintains independent state.
Without cursors, there's only one "active query" per connection at a time. If scan A is streaming results and scan B needs to use the same connection, scan A's remaining results must be drained and buffered locally before scan B can proceed.
Evolution of the Buffering Strategy
The approach evolved significantly through review:
- v1: Global only_query flag (session-lifetime, applied to all connections) — rejected as too coarse
- v2: Per-connection num_queries counter with increment/decrement — flawed on error paths
- v3: Always buffer the first query in a tuplestore — simpler but wasteful for single-scan cases
- v4: Linked list of PgFdwScanState structs in PgFdwConnState — over-engineered
- v5/v6 (current): Single active_scan pointer in PgFdwConnState — correct minimal design
The final design maintains a pointer to the currently-active scan (the one whose results are on the wire). When another scan needs the connection, save_to_tuplestore() drains remaining results into a Tuplestorestate in the active scan's PgFdwScanState. The pointer is then cleared, allowing the new scan to proceed.
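A rough sketch of that bookkeeping, reusing the names from the discussion; the struct layout, the claim_connection helper, and the save_to_tuplestore() signature are assumptions, not the patch's actual code:

```c
#include <stddef.h>

/* Simplified: the real PgFdwScanState / PgFdwConnState carry many more fields. */
typedef struct PgFdwScanState PgFdwScanState;

typedef struct PgFdwConnState
{
    PgFdwScanState *active_scan;    /* scan whose results are still on the wire,
                                     * or NULL if the connection is free */
} PgFdwConnState;

/* Assumed signature: drains the in-flight results into the scan's tuplestore. */
extern void save_to_tuplestore(PgFdwScanState *fsstate);

/*
 * Hypothetical helper, called before a scan issues a new query: if some other
 * scan still owns the connection, buffer its remaining rows locally first.
 */
static void
claim_connection(PgFdwConnState *connstate, PgFdwScanState *newscan)
{
    if (connstate->active_scan != NULL && connstate->active_scan != newscan)
        save_to_tuplestore(connstate->active_scan);

    connstate->active_scan = newscan;
}
```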
Critical Design Decisions and Tradeoffs
1. Tuplestore Buffering vs. Memory Usage
When overlapping scans exist, one scan's results must be fully materialized. This trades memory (potentially spilling to disk once work_mem is exceeded) for the ability to use the connection for other scans. However, performance testing showed this is often a net win because:
- Cursor management overhead (DECLARE/FETCH/CLOSE round-trips) is eliminated
- The remote server can use parallel execution
- Even without parallelism, eliminating cursor rescan overhead (CLOSE + re-DECLARE) can yield 100x speedups in nested-loop scenarios
2. GUC vs. Server/Table Option
The feature evolved from a GUC (postgres_fdw.use_cursor) to a server/table-level option (streaming_fetch), aligning with the existing fetch_size option. This is architecturally correct because different foreign servers may have different characteristics.
3. Rescan Handling
In cursor mode, rescans have three strategies: CLOSE+re-execute, MOVE BACKWARD ALL, or rewind if all tuples are cached. In streaming mode, only two are possible: discard tuplestore and re-execute, or rewind the tuplestore. The MOVE BACKWARD ALL optimization is cursor-specific.
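A hedged sketch of the two streaming-mode paths, using the real tuplestore_rescan()/tuplestore_end() calls but with an assumed decision flag and a placeholder for the resend step:

```c
#include "postgres.h"
#include "utils/tuplestore.h"

/*
 * Illustrative rescan logic for streaming mode; not the patch's actual code.
 */
static void
streaming_rescan(Tuplestorestate **tstore, bool params_changed)
{
    if (!params_changed && *tstore != NULL)
    {
        /* Everything is already buffered locally: rewind and replay. */
        tuplestore_rescan(*tstore);
    }
    else
    {
        /* Discard the buffer and re-issue the remote query from scratch. */
        if (*tstore != NULL)
        {
            tuplestore_end(*tstore);
            *tstore = NULL;
        }
        /* ... re-send the query, e.g. via PQsendQueryParams() ... */
    }
}
```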
4. Active Scan Pointer Semantics
Robert Haas emphasized that active_scan must be non-NULL only when there are results actually on the wire that need draining. If a rescan has already consumed all pending results, the pointer must be cleared. Failure to maintain this invariant leads to unnecessary save_to_tuplestore() calls and the cascading bug of needing to resend queries (which requires an ExprContext that isn't available).
5. GUC Mutability
The GUC/option can change mid-session. Each scan must remember whether it was started with or without a cursor and handle rescans/end-scans based on that remembered state, not the current option value. This prevents catastrophic state confusion.
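One way to express that rule, with hypothetical state and function names:

```c
#include <stdbool.h>

/* Hypothetical per-scan state: remember the mode chosen when the scan began. */
typedef struct
{
    bool        use_streaming;      /* option value latched at scan start */
} ScanModeState;

/* At scan start: latch the current option value into the scan state. */
static void
begin_scan(ScanModeState *state, bool streaming_fetch_option)
{
    state->use_streaming = streaming_fetch_option;
}

/* At rescan/end-scan: branch on the remembered mode, never on the live
 * option, which may have been changed mid-session. */
static void
end_scan(ScanModeState *state)
{
    if (state->use_streaming)
    {
        /* drain or discard any in-flight streamed results */
    }
    else
    {
        /* close the cursor opened for this scan */
    }
}
```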
Unresolved Issues
- ctid handling: Alexander Pyhalov noted that TTSOpsMinimalTuple tuplestores cannot store tuple ctids, which affects certain postgres_fdw operations. The patch doesn't address this.
- Async mode interaction: The patch disables async execution (async_capable = false) when streaming fetch is used, as the two features are incompatible in the current design.
- Error handling: Proper cleanup of connection state on errors remains tricky — save_to_tuplestore() must handle partial result sets gracefully, and PGresult objects must always be freed.
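For the PGresult-leak part, one conventional backend pattern is to wrap row processing in PG_TRY/PG_FINALLY so the result is always freed; the sketch below is illustrative and not taken from the patch:

```c
#include "postgres.h"
#include "libpq-fe.h"

/*
 * Illustrative pattern only: make sure the PGresult is freed even if
 * processing its rows raises an error, e.g. while copying them into a
 * tuplestore.
 */
static void
process_result_safely(PGconn *conn)
{
    PGresult   *res = PQgetResult(conn);

    if (res == NULL)
        return;

    PG_TRY();
    {
        /* ... copy rows into the scan's tuplestore; may elog(ERROR) ... */
    }
    PG_FINALLY();
    {
        PQclear(res);
    }
    PG_END_TRY();
}
```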
Performance Results
| Scenario | Cursor Mode | Streaming Mode | Speedup |
|---|---|---|---|
| COUNT(*) on 990K rows (parallel enabled) | 62ms | 24ms | 2.6x |
| Nested-loop join (990K × 990K, rescan-heavy) | 112,825ms | 389ms | 290x |
| Self-join merge (990K rows, disk spill) | 4,661ms | 3,527ms | 1.3x |
| Simple sequential scan (20M rows, no parallelism) | 20,090ms | 24,103ms | 0.83x (regression) |
The worst case (simple large sequential scan without parallel benefit) shows ~20% regression due to tuplestore overhead. The best case (rescan-heavy workloads) shows dramatic improvement from eliminating cursor management round-trips.