postgres_fdw: Use COPY to speed up batch inserts — Technical Analysis
The Core Problem
postgres_fdw currently sends batched inserts to the remote server using a prepared INSERT ... VALUES ($1,$2,...), ($3,$4,...), ... statement, where the number of parameter placeholders scales with batch_size * nattrs. This has two well-known architectural drawbacks:
- Parameter limit pressure: the extended query protocol caps a statement at 65535 parameters, so large `batch_size` values combined with wide tables force users to lower `batch_size`, limiting throughput.
- Per-statement overhead: each `INSERT` batch incurs parser/planner/executor setup on the remote. `COPY FROM STDIN` bypasses most of that, streaming tuples through a much tighter ingest path (see `CopyFrom()` in `src/backend/commands/copyfrom.c`).
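To make the parameter-limit pressure concrete, here is a small Python sketch (illustrative arithmetic only, not postgres_fdw code) of how the 65535-parameter cap bounds the usable batch size:

```python
# The extended query protocol encodes the parameter count as a 16-bit
# integer, so one prepared statement can carry at most 65535 parameters.
# Each batched row consumes one placeholder per column.
PROTOCOL_PARAM_LIMIT = 65535

def max_insert_batch_size(nattrs: int) -> int:
    """Largest batch_size a parameterized multi-row INSERT can use."""
    return PROTOCOL_PARAM_LIMIT // nattrs

# A 30-column table caps the effective batch at 2184 rows; COPY has no
# such ceiling because it streams data instead of binding parameters.
print(max_insert_batch_size(30))  # -> 2184
```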
Tomas Vondra had previously blogged about the idea; Matheus Alcantara's patch operationalizes it: when batching, push data to the remote via a COPY ... FROM STDIN stream instead of parameterized INSERT.
Benchmarks across the thread consistently show ~2x TPS improvements at batch_size >= 100, with larger wins at wider rows. At batch_size=10, the patch can actually regress slightly (especially on UNLOGGED tables) because the per-COPY setup cost isn't amortized.
Design Evolution: Four Major Pivots
The thread is notable for how many times the scope had to be narrowed because of subtle executor/FDW API interactions. Each pivot was forced by a concrete semantic problem:
Pivot 1: Unconditional COPY → opt-in via use_copy_for_insert
The initial v1 triggered COPY whenever numSlots > 1. Jian He produced a crash repro with ALTER ... DROP COLUMN/ADD COLUMN sequences that exposed an attisdropped indexing bug (the code was doing p_flinfo[attnum-1] rather than tracking p_nums as create_foreign_modify() does). More importantly, Jian He pointed out that COPY regresses when the remote table has BEFORE ROW / INSTEAD OF ROW triggers, because CopyFrom() is forced into CIM_SINGLE mode (see the comment block in copyfrom.c — multi-insert can't be used when BEFORE triggers might observe in-flight tuples). Matheus confirmed via benchmarks (TPS 11.6 vs 13.6 with triggers present) and concluded a table-level opt-in was required.
Pivot 2: Opt-in boolean → threshold-based + batch_size interaction
Masahiko Sawada (committer, heavy postgres_fdw reviewer) flagged that users setting use_copy_for_insert=true without raising batch_size from its default of 1 would hit a sharp regression (88 TPS vs 133 TPS for 1000-row inserts with batch_size=1). This birthed copy_for_batch_insert_threshold (later renamed batch_with_copy_threshold) — a minimum row count before switching to COPY. The boolean option was eventually dropped entirely; setting the threshold > 0 became the enable switch.
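The interaction between `batch_size` and the threshold can be modeled with a short Python sketch (the function name and exact decision rule are illustrative assumptions, not the patch's code; the real choice happens per batch inside postgres_fdw at execution time):

```python
def plan_batches(nrows, batch_size, threshold):
    """Split nrows into batches of at most batch_size, choosing COPY
    only when the threshold is enabled (> 0) and the batch is large
    enough. Illustrative model of the patch's per-batch decision."""
    plan = []
    while nrows > 0:
        n = min(nrows, batch_size)
        method = "COPY" if threshold > 0 and n >= threshold else "INSERT"
        plan.append((n, method))
        nrows -= n
    return plan

# threshold=0 disables COPY entirely; with batch_size=1 (the default)
# no batch can ever reach a useful threshold, which is the regression
# Sawada flagged.
print(plan_batches(140, 100, 50))  # -> [(100, 'COPY'), (40, 'INSERT')]
```

Note how a trailing partial batch below the threshold silently falls back to INSERT, a behavior the thread later accepted as a documented tradeoff.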
Pivot 3: EXPLAIN consistency problem
Because the decision to use COPY depends on runtime row counts (known only in execute_foreign_modify()), postgresExplainForeignModify() cannot reliably display the remote SQL. The deparse happens at plan time in postgresPlanForeignModify(), where batch_size and row counts are not available. Matheus initially worked around this by showing COPY in EXPLAIN ANALYZE (runtime) but INSERT in plain EXPLAIN. Sawada noted this inconsistency as the primary blocker to a general-purpose solution.
Pivot 4: Restriction to COPY FROM (and tuple routing)
The final design narrows COPY usage to the two paths that go through BeginForeignInsert(): (a) user-issued COPY FROM on a foreign table, and (b) tuple routing into a foreign-table partition via INSERT INTO parent. Direct INSERT INTO foreign_table keeps using the old prepared-statement path. This dodges the EXPLAIN problem because:
- For `COPY FROM`, there is no meaningful EXPLAIN output to worry about.
- For partition routing, per-partition remote SQL is not currently exposed in EXPLAIN anyway.
Sawada's suggestion was to detect this case via resultRelInfo->ri_RootResultRelInfo == NULL (true only for COPY — partition routing sets it). Matheus adopted this; the ri_returningList == NIL check became an Assert.
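As a rough Python model of that detection (field names mirror the executor's C structs, but the class and predicate are illustrative stand-ins, not the actual check):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResultRelInfo:
    """Minimal stand-in for the executor's ResultRelInfo."""
    ri_RootResultRelInfo: Optional["ResultRelInfo"] = None
    ri_returningList: list = None

def is_plain_copy_from(rel: ResultRelInfo) -> bool:
    """True for a user-issued COPY FROM on the foreign table; tuple
    routing sets ri_RootResultRelInfo to the parent's ResultRelInfo."""
    # COPY FROM has no RETURNING clause, so this holds by construction
    # (the patch turned the runtime check into an Assert).
    assert not rel.ri_returningList
    return rel.ri_RootResultRelInfo is None

parent = ResultRelInfo()
routed_child = ResultRelInfo(ri_RootResultRelInfo=parent)
print(is_plain_copy_from(ResultRelInfo()))  # -> True  (COPY FROM)
print(is_plain_copy_from(routed_child))     # -> False (tuple routing)
```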
Key Technical Insights
Why COPY must be in BeginForeignInsert, not BeginForeignModify
CopyFrom() in core calls BeginForeignInsert → ExecForeignBatchInsert. The PgFdwModifyState built here is a separate instance from the INSERT path's state. Matheus initially tried caching the COPY SQL in fmstate->query but discovered that a user-issued COPY FROM foreign_table can dispatch both full and partial batches through CopyMultiInsertBufferFlush(), with the trailing partial batch sometimes having numSlots == 1 — which falls back to INSERT and overwrites the cached SQL. A dedicated copy_query field was introduced.
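A toy Python model of why the dedicated field matters (the names echo the patch's `PgFdwModifyState` and `copy_query`, but the class and logic here are illustrative assumptions):

```python
class PgFdwModifyState:
    """Toy stand-in for postgres_fdw's per-relation modify state.
    Keeping the COPY SQL in its own field means a trailing partial
    batch with numSlots == 1 (which falls back to INSERT) can no
    longer overwrite the SQL that the next full batch needs."""
    def __init__(self, table, columns):
        cols = ", ".join(columns)
        self.query = f"INSERT INTO {table} ({cols}) VALUES (...)"
        self.copy_query = f"COPY {table} ({cols}) FROM STDIN WITH (FORMAT csv)"

    def statement_for(self, num_slots, threshold):
        # Per-batch choice; falling back never mutates copy_query.
        if threshold > 0 and num_slots >= threshold:
            return self.copy_query
        return self.query

st = PgFdwModifyState("remote_tab", ["a", "b"])
print(st.statement_for(1, 50))    # trailing partial batch -> INSERT
print(st.statement_for(100, 50))  # next full batch still gets COPY
```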
Escaping and transmission modes (late-stage bug)
Sawada's v12 review caught two data-corruption bugs:
- No COPY escaping: raw text appended to the buffer meant tabs, newlines, and backslashes in data would corrupt the stream (reproduced via a type-mismatch error that exposed malformed COPY data on the wire). Fixed by adding a `CopyAttributeOutText()` equivalent, later reworked to use `FORMAT CSV` because CSV escaping is simpler (just quote the field and double up embedded quotes).
- Missing `set_transmission_modes()`: postgres_fdw normally wraps output-function calls with `set_transmission_modes()` to force `extra_float_digits = 3`, `DateStyle = ISO`, etc., guaranteeing lossless round-trips. The COPY path bypassed this, causing silent precision loss for floats when the client had `extra_float_digits = 0`. This is a classic FDW correctness landmine; the same mistake has bitten other patches historically.
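The CSV quoting rule is easy to demonstrate with Python's `csv` module (standing in for a CSV serializer; this is not the patch's escaping code):

```python
import csv
import io

# CSV needs only two rules: wrap a field in double quotes when it
# contains a special character (delimiter, quote, or newline), and
# double up any embedded quote characters.
buf = io.StringIO()
csv.writer(buf).writerow(['plain', 'has "quotes"', 'line\nbreak', 'tab\there'])
print(buf.getvalue())
# The quoted fields show doubled quotes and a protected newline; the
# tab is left alone, since with a comma delimiter it is not special.
```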
Format choice: TEXT vs CSV vs BINARY
- BINARY was rejected as non-portable across versions/architectures.
- TEXT required reimplementing `CopyAttributeOutText`-style escaping (backslash sequences for tab/newline/backslash, `\N` for NULL).
- CSV won in the end: simpler escaping (RFC 4180-ish quoting), and benchmarks showed no regression vs TEXT (still ~2x faster than INSERT at batch_size=100).
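For comparison, the TEXT-format escaping that would have had to be reimplemented looks roughly like this (a minimal Python sketch of the core `CopyAttributeOutText`-style rules; the real function handles more control characters):

```python
def copy_text_escape(value):
    """Escape one attribute for COPY's TEXT format: backslash-escape
    the delimiter and line-control characters, and emit \\N for NULL.
    Sketch only -- not the actual CopyAttributeOutText logic."""
    if value is None:
        return r"\N"
    return (value.replace("\\", "\\\\")  # backslash first, or it re-escapes
                 .replace("\t", "\\t")   # tab is the column delimiter
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))

print(copy_text_escape("a\tb\\c"))  # -> a\tb\\c
```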
Buffer management
Jakub Wartak found an ugly failure mode: very wide rows (e.g., repeat(md5(x), 10000000)) could blow past MaxAllocSize (1 GB) in the StringInfo buffer, and the backend would then hang in PQ cleanup. Fixed by flushing to the wire at COPYBUFSIZ (8192 bytes) boundaries rather than accumulating the whole batch in a single StringInfo.
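The fix can be sketched as flushing at a fixed boundary instead of accumulating (Python illustration; `COPYBUFSIZ` and the `send` callback model libpq's `PQputCopyData`, and the code is not the patch's):

```python
COPYBUFSIZ = 8192  # libpq's historical COPY buffer size

def stream_batch(rows, send):
    """Append formatted rows to a buffer, flushing to the wire whenever
    it reaches COPYBUFSIZ instead of growing one StringInfo toward
    MaxAllocSize for very wide rows."""
    buf, buflen = [], 0
    for row in rows:
        buf.append(row)
        buflen += len(row)
        if buflen >= COPYBUFSIZ:
            send("".join(buf))
            buf, buflen = [], 0
    if buf:  # trailing partial buffer
        send("".join(buf))

chunks = []
stream_batch(["x" * 1000 + "\n"] * 20, chunks.append)
# Every flushed chunk stays close to COPYBUFSIZ, regardless of how
# many rows (or how wide a row) the batch contains.
print(len(chunks), max(map(len, chunks)))
```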
The trigger interaction (subtle)
Matheus documented why even the narrowed scope still needs !trig_insert_after_row on the local foreign table: normal INSERT path uses INSERT ... RETURNING * to synchronize the TupleTableSlot with remote-side effects (e.g., remote DEFAULTs or remote BEFORE triggers modifying the row). COPY has no RETURNING, so local AFTER ROW triggers would see stale values. This constraint is encoded as a fallback to the INSERT path when ri_TrigDesc->trig_insert_after_row is set.
Design Tradeoffs and Unresolved Tensions
- Asymmetry between direct INSERT and partition-routed INSERT: Sawada flagged that users will find it confusing that `INSERT INTO foreign_partition` (direct) uses INSERT batching, but `INSERT INTO parent` (routed to the same partition) uses COPY. This is an FDW API artifact — `BeginForeignInsert` is only called in the routing case. The thread accepted this as a tolerable limitation for v1.
- Partial-batch handling: with `batch_with_copy_threshold=50` and 140 rows, the patch sends the first 100-row batch via COPY and the trailing 40 via INSERT. This is workable but adds cognitive load.
- Testing that COPY was actually used: Sawada proposed an elegant test technique — a remote trigger calling `current_query()` and issuing `RAISE NOTICE` with "COPY command" vs "INSERT command". `client_min_messages=log` propagates the notice through postgres_fdw so it appears in regression output. For per-batch (not per-row) verification, Matheus added `elog(DEBUG1)` because `current_query()`-based notices fire per row.
Participants of Note
- Matheus Alcantara — patch author, drove the design through 12+ revisions over ~6 months.
- Tomas Vondra (committer) — originated the idea, provided the first substantive review with performance data, and pointed to `copyto.c`/`CopyAttributeOutText` as the right reference for escaping. His benchmark PDF framing (COPY vs master vs resourceowner patch) shaped the performance discussion.
- Masahiko Sawada (committer, postgres_fdw domain expert) — the de facto reviewer-of-record. Every major scope pivot (threshold, EXPLAIN, COPY-only restriction, `ri_RootResultRelInfo` detection, `set_transmission_modes`, escaping, column_name option) came from his reviews.
- Jakub Wartak — exploratory testing; found the wide-row buffer explosion.
- Jian He — found the `attisdropped`/`p_flinfo` indexing crash and raised the trigger-slowdown concern that forced the opt-in design.
- Andrew Dunstan (committer) — brief but pointed intervention: "find out why COPY is worse with triggers." This pushed Matheus to read `CopyFrom()` and surface the `CIM_SINGLE` fallback mechanism, which in turn justified the opt-in.
- Dewei Dai — mechanical cleanup review (PQclear, pfree, typos).
- Solaimurugan V — end-to-end verification of the remote COPY path and the trigger edge case.
Architectural Significance
This patch is a good case study in FDW API seams: the optimization is conceptually trivial ("use COPY instead of INSERT") but the FDW routines (BeginForeignInsert vs BeginForeignModify), the executor's tuple-routing machinery, the EXPLAIN infrastructure's plan-time deparse assumption, and postgres_fdw's transmission-mode invariants all interact in ways that kept narrowing the safe scope. The final shape — threshold-gated, opt-in, CSV-formatted, restricted to COPY FROM and partition routing, with local-AFTER-trigger fallback — is substantially less ambitious than the original v1 but corresponds to the slice where correctness is provable and EXPLAIN output isn't lying to users.