postgres_fdw: Use COPY to speed up batch inserts

First seen: 2025-10-15 15:02:01+00:00 · Messages: 32 · Participants: 8

Latest Update

2026-05-09 · opus 4.7

postgres_fdw: Use COPY to speed up batch inserts — Technical Analysis

The Core Problem

postgres_fdw currently sends batched inserts to the remote server using a prepared INSERT ... VALUES ($1,$2,...), ($3,$4,...), ... statement, where the number of parameter placeholders scales with batch_size * nattrs. This has two well-known architectural drawbacks:

  1. Parameter limit pressure: PostgreSQL has a 65535-parameter cap on the extended query protocol. Large batch_size values combined with wide tables force users to lower batch_size, limiting throughput.
  2. Per-statement overhead: Each INSERT batch incurs parser/planner/executor setup on the remote. COPY FROM STDIN bypasses most of that, streaming tuples through a much tighter ingest path (see CopyFrom() in src/backend/commands/copyfrom.c).
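To make the first drawback concrete, here is a minimal sketch of the arithmetic (Python, not PostgreSQL source; the constant name mirrors libpq's internal 65535 limit, and the helper is purely illustrative):

```python
# Sketch: the extended-query protocol caps the number of bind parameters
# at 65535, so the effective batch size for a parameterized multi-row
# INSERT is bounded by the table's column count.
PQ_QUERY_PARAM_MAX_LIMIT = 65535  # protocol-level cap on $n parameters

def max_insert_batch(nattrs: int) -> int:
    """Largest batch_size usable with INSERT ... VALUES ($1,...) batching."""
    return PQ_QUERY_PARAM_MAX_LIMIT // nattrs

# A 30-column table caps parameterized batching at 2184 rows per statement;
# COPY has no such limit because it streams data instead of binding parameters.
print(max_insert_batch(30))  # -> 2184
```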

Tomas Vondra had previously blogged about the idea; Matheus Alcantara's patch operationalizes it: when batching, push data to the remote via a COPY ... FROM STDIN stream instead of parameterized INSERT.

Benchmarks across the thread consistently show ~2x TPS improvements at batch_size >= 100, with larger wins at wider rows. At batch_size=10, the patch can actually regress slightly (especially on UNLOGGED tables) because the per-COPY setup cost isn't amortized.

Design Evolution: Four Major Pivots

The thread is notable for how many times the scope had to be narrowed because of subtle executor/FDW API interactions. Each pivot was forced by a concrete semantic problem:

Pivot 1: Unconditional COPY → opt-in via use_copy_for_insert

The initial v1 triggered COPY whenever numSlots > 1. Jian He produced a crash repro with ALTER ... DROP COLUMN/ADD COLUMN sequences that exposed an attisdropped indexing bug (the code was doing p_flinfo[attnum-1] rather than tracking p_nums as create_foreign_modify() does). More importantly, Jian He pointed out that COPY regresses when the remote table has BEFORE ROW / INSTEAD OF ROW triggers, because CopyFrom() is forced into CIM_SINGLE mode (see the comment block in copyfrom.c — multi-insert can't be used when BEFORE triggers might observe in-flight tuples). Matheus confirmed via benchmarks (TPS 11.6 vs 13.6 with triggers present) and concluded a table-level opt-in was required.

Pivot 2: Opt-in boolean → threshold-based + batch_size interaction

Masahiko Sawada (committer, heavy postgres_fdw reviewer) flagged that users setting use_copy_for_insert=true without raising batch_size from its default of 1 would hit a sharp regression (88 TPS vs 133 TPS for 1000-row inserts with batch_size=1). This birthed copy_for_batch_insert_threshold (later renamed batch_with_copy_threshold) — a minimum row count before switching to COPY. The boolean option was eventually dropped entirely; setting the threshold > 0 became the enable switch.
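The resulting interaction can be sketched as a tiny decision function (hypothetical names; the real logic lives in postgres_fdw's C code, and the exact comparison is an assumption based on the semantics described above):

```python
def use_copy_for_batch(num_slots: int, batch_with_copy_threshold: int) -> bool:
    """Sketch of the runtime choice: COPY is used only when the option is
    enabled (threshold > 0) and the batch actually reaches the threshold;
    smaller batches keep the prepared-INSERT path."""
    return batch_with_copy_threshold > 0 and num_slots >= batch_with_copy_threshold

# Threshold unset (0): always INSERT, which avoids the batch_size=1 regression
# that the dropped boolean option would have allowed.
assert not use_copy_for_batch(1000, 0)
```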

Pivot 3: EXPLAIN consistency problem

Because the decision to use COPY depends on runtime row counts (known only in execute_foreign_modify()), postgresExplainForeignModify() cannot reliably display the remote SQL. The deparse happens at plan time in postgresPlanForeignModify(), where batch_size and row counts are not available. Matheus initially worked around this by showing COPY in EXPLAIN ANALYZE (runtime) but INSERT in plain EXPLAIN. Sawada noted this inconsistency as the primary blocker to a general-purpose solution.

Pivot 4: Restriction to COPY FROM (and tuple routing)

The final design narrows COPY usage to the two paths that go through BeginForeignInsert(): (a) a user-issued COPY FROM on a foreign table, and (b) tuple routing into a foreign-table partition via INSERT INTO parent. A direct INSERT INTO foreign_table keeps using the old prepared-statement path. This dodges the EXPLAIN problem: neither BeginForeignInsert() path relies on the plan-time deparse in postgresPlanForeignModify(), so EXPLAIN never displays remote SQL that the runtime COPY decision could contradict.

Sawada's suggestion was to detect this case via resultRelInfo->ri_RootResultRelInfo == NULL (true only for COPY — partition routing sets it). Matheus adopted this; the ri_returningList == NIL check became an Assert.

Key Technical Insights

Why COPY must be in BeginForeignInsert, not BeginForeignModify

CopyFrom() in core drives this path through BeginForeignInsert() and, when flushing batches, ExecForeignBatchInsert(). The PgFdwModifyState built here is a separate instance from the INSERT path's state. Matheus initially tried caching the COPY SQL in fmstate->query but discovered that a user-issued COPY FROM foreign_table can dispatch both full and partial batches through CopyMultiInsertBufferFlush(), with the trailing partial batch sometimes having numSlots == 1, which falls back to INSERT and overwrites the cached SQL. A dedicated copy_query field was introduced.
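A toy model of why the dedicated field is needed (Python; the class and function are hypothetical stand-ins for PgFdwModifyState and the batch-flush logic, not the patch's actual code):

```python
class FdwState:
    """Toy stand-in for PgFdwModifyState: one cached SQL string per path."""
    def __init__(self):
        self.query = None       # INSERT SQL, rebuilt by the fallback path
        self.copy_query = None  # dedicated COPY SQL (the fix)

def flush_batch(state: FdwState, num_slots: int) -> str:
    # A trailing partial batch of one row falls back to INSERT; if the
    # COPY SQL lived in state.query, this overwrite would clobber it.
    if num_slots > 1:
        if state.copy_query is None:
            state.copy_query = "COPY tbl FROM STDIN (FORMAT csv)"
        return "COPY"
    state.query = "INSERT INTO tbl VALUES ($1)"
    return "INSERT"

s = FdwState()
flush_batch(s, 100)  # full batch -> COPY, SQL cached in copy_query
flush_batch(s, 1)    # trailing row -> INSERT; copy_query survives intact
```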

Escaping and transmission modes (late-stage bug)

Sawada's v12 review caught two data-corruption bugs:

  1. No COPY escaping: raw text appended to the buffer meant tabs, newlines, and backslashes in data would corrupt the stream (reproduced with a type-mismatch error exposing malformed COPY data on the wire). Fixed by adding a CopyAttributeOutText()-equivalent, then later reworked to use FORMAT CSV because CSV escaping is simpler (just quote and double-up embedded quotes).
  2. Missing set_transmission_modes(): postgres_fdw normally wraps output-function calls with set_transmission_modes() to force extra_float_digits = 3, DateStyle = ISO, etc., guaranteeing lossless round-trips. The COPY path bypassed this, causing silent precision loss for floats when the client had extra_float_digits = 0. This is a classic FDW correctness landmine — the same mistake has bitten other patches historically.
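The two escaping schemes can be sketched side by side (a simplified Python model of the rules; the real COPY text format also escapes carriage returns and other control characters, so this is illustrative, not exhaustive):

```python
def csv_escape(field: str) -> str:
    """CSV-style quoting as in COPY ... (FORMAT csv): wrap the field in
    quotes and double any embedded quote characters. Tabs, newlines, and
    backslashes need no special treatment inside a quoted field."""
    return '"' + field.replace('"', '""') + '"'

def text_escape(field: str) -> str:
    """COPY text-format escaping (partial sketch): backslash, tab, and
    newline must each be escaped individually, which is the complexity
    the switch to CSV avoided."""
    return (field.replace('\\', '\\\\')
                 .replace('\t', '\\t')
                 .replace('\n', '\\n'))

assert csv_escape('a"b\tc') == '"a""b\tc"'   # only quotes change
assert text_escape('a\tb\\c') == 'a\\tb\\\\c'  # two characters rewritten
```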

Format choice: TEXT vs CSV vs BINARY

Buffer management

Jakub Wartak found an ugly failure mode: very wide rows (e.g., repeat(md5(x), 10000000)) could blow past MaxAllocSize (1 GB) in the StringInfo buffer, and the backend would then hang in PQ cleanup. Fixed by flushing to the wire at COPYBUFSIZ (8192 bytes) boundaries rather than accumulating the whole batch in a single StringInfo.
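The shape of the fix, as a hedged sketch (Python; send stands in for PQputCopyData, and flushing only after the buffer crosses COPYBUFSIZ is an assumption about the patch's exact granularity):

```python
COPYBUFSIZ = 8192  # flush granularity; keeps any one buffer far below 1 GB

def stream_rows(rows, send):
    """Append each serialized row to a small buffer and hand it to the
    wire (send ~ PQputCopyData) whenever it crosses COPYBUFSIZ, so memory
    stays bounded regardless of row width or batch size."""
    buf, size = [], 0
    for row in rows:
        buf.append(row)
        size += len(row)
        if size >= COPYBUFSIZ:
            send(b"".join(buf))
            buf, size = [], 0
    if buf:  # final partial buffer
        send(b"".join(buf))

chunks = []
stream_rows([b"x" * 5000] * 4, chunks.append)
# Two ~10 kB sends instead of one 20 kB accumulation in a single buffer.
```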

The trigger interaction (subtle)

Matheus documented why even the narrowed scope still needs !trig_insert_after_row on the local foreign table: the normal INSERT path uses INSERT ... RETURNING * to synchronize the TupleTableSlot with remote-side effects (e.g., remote DEFAULTs or remote BEFORE triggers modifying the row). COPY has no RETURNING, so local AFTER ROW triggers would see stale values. This constraint is encoded as a fallback to the INSERT path whenever ri_TrigDesc->trig_insert_after_row is set.

Design Tradeoffs and Unresolved Tensions

  1. Asymmetry between direct INSERT and partition-routed INSERT: Sawada flagged that users will find it confusing that INSERT INTO foreign_partition (direct) uses INSERT-batching, but INSERT INTO parent (routed to the same partition) uses COPY. This is an FDW API artifact — BeginForeignInsert is only called in the routing case. The thread accepted this as a tolerable limitation for v1.

  2. Partial-batch handling: With batch_with_copy_threshold=50 and 140 rows, the patch sends the first 100-row batch via COPY and the trailing 40 via INSERT. This is workable but adds cognitive load.

  3. Testing COPY was actually used: Sawada proposed an elegant test technique — a remote trigger calling current_query() and RAISE NOTICE of "COPY command" vs "INSERT command". client_min_messages=log propagates the notice through postgres_fdw so it appears in regression output. For per-batch (not per-row) verification, Matheus added elog(DEBUG1) because current_query()-based notices fire per row.
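The partial-batch behavior from point 2 can be sketched as follows (hypothetical helper; batch_size and threshold semantics as described in the thread):

```python
def plan_batches(total_rows, batch_size, threshold):
    """Split an insert of total_rows into per-batch (method, count) pairs:
    batches reaching the threshold go via COPY, while a trailing partial
    batch below it falls back to INSERT."""
    plan, remaining = [], total_rows
    while remaining > 0:
        n = min(batch_size, remaining)
        method = "COPY" if threshold > 0 and n >= threshold else "INSERT"
        plan.append((method, n))
        remaining -= n
    return plan

# 140 rows, batch_size=100, threshold=50: first 100 via COPY, last 40 via INSERT
assert plan_batches(140, 100, 50) == [("COPY", 100), ("INSERT", 40)]
```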

Participants of Note

Architectural Significance

This patch is a good case study in FDW API seams: the optimization is conceptually trivial ("use COPY instead of INSERT") but the FDW routines (BeginForeignInsert vs BeginForeignModify), the executor's tuple-routing machinery, the EXPLAIN infrastructure's plan-time deparse assumption, and postgres_fdw's transmission-mode invariants all interact in ways that kept narrowing the safe scope. The final shape — threshold-gated, opt-in, CSV-formatted, restricted to COPY FROM and partition routing, with local-AFTER-trigger fallback — is substantially less ambitious than the original v1 but corresponds to the slice where correctness is provable and EXPLAIN output isn't lying to users.