Parallel INSERT SELECT take 2 — Deep Technical Analysis
The Core Architectural Problem
PostgreSQL's parallel query infrastructure was extended to handle the SELECT side of INSERT ... SELECT in commit 05c8482f7f, but the feature was subsequently reverted. This thread ("take 2") is Fujitsu's attempt to revive it by addressing two fundamental obstacles that blocked the original work:
Problem 1: Parallel-Safety Checking Overhead
Determining whether an INSERT target relation is parallel-safe requires inspecting every object whose execution is triggered by a tuple insertion:
- Column default expressions
- DOMAIN CHECK expressions
- CHECK constraints
- Partition key and its support functions
- Index expressions, predicates, and AM functions
- Operator functions for indexes/constraints
- Trigger functions (BEFORE/AFTER ROW, statement-level)
For a partitioned table with many leaf partitions this check is O(partitions × objects-per-partition), must be done during planning (before the planner even knows which partitions will be touched), and — crucially — requires acquiring locks on every partition. For OLTP-ish INSERTs that only touch one partition, this planning overhead can easily exceed the execution savings. This overhead was the primary cause of the earlier revert.
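To make that cost concrete, the planning-time check can be modeled as a recursive walk over partitions and their dependent objects, keeping the worst hazard seen. This is an illustrative Python sketch, not PostgreSQL code; the function names and the relation representation are invented for the example, though the hazard levels mirror `proparallel`'s `s`/`r`/`u` values.

```python
# Illustrative model of the planning-time parallel-safety walk.
# Hazard levels mirror PostgreSQL's proparallel values:
# 's' (safe) < 'r' (restricted) < 'u' (unsafe).
HAZARD_ORDER = {"s": 0, "r": 1, "u": 2}

def max_hazard(a, b):
    """Return the worse (more restrictive) of two hazard levels."""
    return a if HAZARD_ORDER[a] >= HAZARD_ORDER[b] else b

def relation_max_hazard(rel):
    """Walk every object fired by an INSERT into rel, plus all partitions.

    rel is a dict with 'objects' (hazard chars for defaults, CHECK
    constraints, index expressions, triggers, ...) and 'partitions'
    (child relations of the same shape). Cost is
    O(partitions x objects-per-partition), and in PostgreSQL each
    partition must also be locked before it can be inspected.
    """
    hazard = "s"
    for obj_hazard in rel["objects"]:
        hazard = max_hazard(hazard, obj_hazard)
        if hazard == "u":
            return hazard          # cannot get any worse; stop early
    for child in rel["partitions"]:
        hazard = max_hazard(hazard, relation_max_hazard(child))
        if hazard == "u":
            return hazard
    return hazard

# A partitioned table whose third leaf has an unsafe trigger function:
table = {
    "objects": ["s"],
    "partitions": [
        {"objects": ["s", "s"], "partitions": []},
        {"objects": ["r"], "partitions": []},
        {"objects": ["s", "u"], "partitions": []},  # unsafe trigger
    ],
}
print(relation_max_hazard(table))  # -> u
```

Note that a single unsafe trigger anywhere in the partition tree poisons the whole INSERT, yet discovering it requires visiting (and locking) every partition up to that point.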
Problem 2: XID Assignment in Parallel Workers
heap_insert() requires a current transaction XID. Parallel workers were historically read-only, so they never needed to allocate XIDs. If workers were to perform writes (the more ambitious "parallel INSERT" variant), the infrastructure would have to cope with:
- Visibility correctness: TransactionIdIsCurrentTransactionId() relies on ParallelCurrentXids, an array populated from the leader. If a worker allocates the XID first and another worker reads a tuple written under that XID before the XID propagates, the second worker misclassifies "my own transaction's writes" as belonging to a foreign transaction — breaking MVCC within the parallel group.
- Nested subtransactions: an indeterminate stack of parent TransactionStates may need XIDs, all of which must be communicated through shared memory.
The original patch sidestepped this by assigning XID in the leader before entering parallel mode, at the cost of wasting an XID when the SELECT returns no rows.
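The visibility hazard can be modeled in a few lines. In PostgreSQL the leader serializes a sorted list of its transaction's XIDs at parallel start, and workers binary-search it; the sketch below follows that shape, but the function names and the scenario are illustrative only.

```python
from bisect import bisect_left

def make_parallel_current_xids(leader_xids):
    """Leader serializes its transaction's XIDs (sorted) when workers
    launch. Models ParallelCurrentXids; workers never update this list."""
    return sorted(leader_xids)

def is_current_transaction_id(parallel_current_xids, xid):
    """Worker-side membership test, modeled on the binary search that
    TransactionIdIsCurrentTransactionId() does over ParallelCurrentXids."""
    i = bisect_left(parallel_current_xids, xid)
    return i < len(parallel_current_xids) and parallel_current_xids[i] == xid

# Leader held XID 100 (plus subtransaction XID 101) when workers launched.
xids = make_parallel_current_xids([100, 101])

print(is_current_transaction_id(xids, 100))  # True: leader's writes are "ours"

# Now suppose worker A were allowed to allocate a fresh XID 102 and write
# a tuple under it. Worker B's serialized list predates that allocation:
print(is_current_transaction_id(xids, 102))  # False: worker B treats its own
# transaction's write as a foreign, in-progress transaction
```

This is exactly the misclassification described above: correctness would require propagating every newly allocated XID (including nested subtransaction XIDs) to all workers before any of them can see the tuples, which is what the leader-assigns-the-XID-up-front workaround avoids.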
Design Decisions and the Declarative Approach
After weighing two alternatives — (a) caching the result of parallel-safety checks in the relcache or a shared hash table, versus (b) a user-declared parallel-safety marking — the team chose declarative labeling via a new pg_class.relparallel column (later renamed relparalleldml), settable with CREATE/ALTER TABLE ... PARALLEL DML { UNSAFE | RESTRICTED | SAFE } and defaulting to UNSAFE.
Key implications of this choice:
- Zero planning-time cost for the partitioned-table case. The planner simply reads a single flag from the root relation's pg_class entry and trusts it.
- User-visible contract mirroring proparallel: the precedent of making users responsible for function parallel labels is extended to tables.
- A pg_get_parallel_safety(regclass) SRF (later split into pg_get_table_parallel_dml_safety and pg_get_table_max_parallel_dml_hazard) is provided as a development-time tool that scans all dependent objects and lists the problematic ones. This deliberately does not run at plan time.
- Runtime guardrail: an attempt was made to detect unsafe function execution inside a parallel worker by injecting a check at FunctionCallInvoke(). This had to grapple with the fact that built-in functions don't have their proparallel readily accessible from low-level fmgr paths — leading to a proposal to pack parallel-safety into FmgrBuiltin by replacing the bool strict, retset pair with a bitflag char plus a new proparallel char. This runtime check was ultimately factored out into a separate thread and removed from later patch versions.
- Foreign and temporary tables are forcibly treated as (at most) parallel-restricted; attempts to mark them SAFE error out. Foreign tables are unsafe because there is no parallel-INSERT FDW API.
Design Disagreements
Should the DDL verify the declaration?
Bharath repeatedly pushed for verification at ALTER TABLE ... PARALLEL DML SAFE time (walk partitions + dependent objects, reject if any are unsafe). Greg Nancarrow and Amit Kapila argued for the pure declarative semantics — because any such DDL-time check is immediately invalidated the moment someone does ALTER FUNCTION ... PARALLEL UNSAFE on a function referenced in a constraint/trigger/default, and cascading invalidation from pg_proc back to every affected pg_class row is expensive and requires walking pg_depend, pg_inherits, pg_index, etc. The declarative camp prevailed: the executor catches any violation at runtime.
Should non-partitioned tables also be declarative?
Bharath argued the declarative approach only needs to cover partitioned tables, because for plain tables the planning-time safety check is cheap. Greg and Hou argued for consistency: mixing declarative (partitioned) and computed (plain) creates two code paths and confuses users. Later (Aug 2021), this evolved toward a hybrid: user declaration wins if provided, otherwise auto-compute for non-partitioned relations — see Hou's post pointing to the follow-up thread.
Index expression double-parsing
Index expressions/predicates are stored as serialized node trees and must be stringToNode()-ed to inspect. The planner currently doesn't open index info for INSERT (that's done in the executor), so a safety check at plan time would re-parse these, duplicating executor work. Passing parsed info down was acknowledged as a possible refactor but deferred.
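The duplication, and the deferred fix, can be shown with a call counter. Everything here is a stand-in: string_to_node() models stringToNode(), and the abbreviated expression string is illustrative, not a real stored node tree.

```python
PARSE_CALLS = 0

def string_to_node(serialized):
    """Stand-in for PostgreSQL's stringToNode(): deserialize a stored
    expression tree. Counting calls exposes the duplicated work."""
    global PARSE_CALLS
    PARSE_CALLS += 1
    return ("parsed", serialized)

# Today: a plan-time safety check and the executor would each parse.
expr = "{OPEXPR ...}"                 # abbreviated nodeToString() form
planner_tree = string_to_node(expr)   # hypothetical plan-time safety check
executor_tree = string_to_node(expr)  # executor opens index info later
print(PARSE_CALLS)  # -> 2

# The deferred refactor: parse once, pass the tree down.
PARSE_CALLS = 0
cache = {}

def string_to_node_cached(serialized):
    if serialized not in cache:
        cache[serialized] = string_to_node(serialized)
    return cache[serialized]

string_to_node_cached(expr)
string_to_node_cached(expr)
print(PARSE_CALLS)  # -> 1
```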
Tomas Vondra's 2026 Reassessment
Five years after the thread went dormant, Tomas revived it with two substantive critiques and an alternative:
Critique: Declarative is the wrong default philosophy
Echoing Robert Haas's earlier position, Tomas argues that table parallel-safety is fundamentally different from function parallel-safety. Function safety is undecidable (halting problem-ish: what does arbitrary PL/pgSQL do?), so we must ask the author. Table safety is fully decidable from the catalog — we choose not to compute it only because of cost. Shifting responsibility to users invites bug reports from people who mislabel their tables.
His counter-proposal: maintain pg_class.relparalleldml automatically. When a trigger is added, a default expression changes, or a referenced function's proparallel is altered, cascade the update through all affected parent/partition-root relations. The deadlock-hazard objection (raised by Robert) is acknowledged but considered solvable via ordered locking.
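A toy model of that cascade clarifies both its appeal and its cost. All the mappings below are invented for illustration; in PostgreSQL the equivalent edges would have to be derived from pg_depend, pg_inherits, and friends, which is exactly the expense the declarative camp objected to.

```python
# Model of automatically maintained pg_class.relparalleldml.
func_proparallel = {"audit_trig": "s", "gen_default": "s"}

# relation -> functions its defaults/constraints/triggers call
rel_uses = {"orders_p1": {"audit_trig"}, "orders_p2": {"gen_default"}}
# partition -> parent, for cascading up to the partition root
rel_parent = {"orders_p1": "orders", "orders_p2": "orders"}

relparalleldml = {"orders": "s", "orders_p1": "s", "orders_p2": "s"}
ORDER = {"s": 0, "r": 1, "u": 2}

def alter_function_parallel(fname, new_label):
    """On ALTER FUNCTION ... PARALLEL, recompute every dependent
    relation's flag and cascade the change up to partition roots."""
    func_proparallel[fname] = new_label
    for rel, funcs in rel_uses.items():
        if fname in funcs:
            recompute(rel)

def recompute(rel):
    labels = [func_proparallel[f] for f in rel_uses.get(rel, ())]
    relparalleldml[rel] = max(labels, key=lambda l: ORDER[l], default="s")
    parent = rel_parent.get(rel)
    if parent is not None:
        # A parent is only as safe as its least-safe partition.
        kids = [r for r, p in rel_parent.items() if p == parent]
        relparalleldml[parent] = max(
            (relparalleldml[k] for k in kids), key=lambda l: ORDER[l])

alter_function_parallel("audit_trig", "u")
print(relparalleldml["orders_p1"], relparalleldml["orders"])  # -> u u
```

In the real system each recompute would also need a lock on the affected relation, which is where the deadlock hazard (and the ordered-locking answer to it) comes from.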
Alternative: Materialize, don't interleave
Tomas proposes sidestepping the entire design by running the SELECT to completion in a tuplestore, leaving parallel mode, then doing the INSERT serially. This would:
- Eliminate the XID-in-workers problem entirely (no writes in parallel mode).
- Eliminate the target-relation parallel-safety checks (only the SELECT's parallel-safety matters, which is already handled).
- Introduce a tuplestore cost, but that's cost-model-able.
This is essentially what users do manually today (SELECT into a staging table, then INSERT). It matches the observation that the reverted 05c8482f7f only ever parallelized SELECT anyway — the INSERT always ran in the leader. If the INSERT must be in the leader, why be in parallel mode during it at all?
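The shape of the materialize-then-insert plan can be sketched as two phases. The thread pool and Python list below merely stand in for parallel workers and PostgreSQL's tuplestore; none of this is executor code.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_select(chunks, scan_chunk):
    """Phase 1 (parallel mode): workers scan their chunks; the leader
    gathers all rows into a spool standing in for a tuplestore."""
    tuplestore = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for rows in pool.map(scan_chunk, chunks):
            tuplestore.extend(rows)
    return tuplestore            # parallel mode ends here

def serial_insert(tuplestore, target):
    """Phase 2 (serial, leader only): no XIDs are ever needed in
    workers, and the target relation's parallel-safety is irrelevant
    because triggers/defaults/constraints all run serially."""
    for row in tuplestore:
        target.append(row)

chunks = [range(0, 5), range(5, 10)]
store = parallel_select(chunks, lambda c: [x * x for x in c])
table = []
serial_insert(store, table)
print(table[:3])  # -> [0, 1, 4]
```

The trade is visible even in the toy: every row is touched twice (once into the spool, once into the table), which is the tuplestore cost the planner would have to weigh against the parallel scan's savings.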
Patch Evolution
Through v1 → v10, the patch set stabilized into:
- 0001 — CREATE/ALTER TABLE ... PARALLEL DML syntax, the pg_class.relparalleldml column, \dt+ support, rejection for foreign/temp tables.
- 0002 — planner changes: consult relparalleldml in is_parallel_allowed_for_modify() and max_parallel_hazard(); leader-assigned XID before entering parallel mode.
- 0003 — pg_get_table_parallel_dml_safety() / pg_get_table_max_parallel_dml_hazard() SRFs; a walker over constraints/defaults/triggers/index expressions/domains; uses the typcache for already-parsed domain constraints.
- 0004 — regression tests; fixes for mislabeled functions in existing tests exposed by force_parallel_mode=regress.
The runtime-function-check piece (originally 0003) was split out entirely to another thread because of the FmgrBuiltin redesign implications.
Why This Matters
This feature is the primary parallelism gap vs. Oracle for ETL/DW workloads. The architectural tension — fast planning vs. correctness without user burden — is a recurring PostgreSQL theme (cf. selectivity estimation, JIT decisions). The declarative approach is expedient but philosophically inconsistent with PostgreSQL's usual "the server figures it out" posture, which is why the thread never reached committer consensus and died for five years.