Parallel INSERT SELECT take 2

First seen: 2021-04-12 01:21:57+00:00 · Messages: 42 · Participants: 7

Latest Update

2026-05-09 · opus 4.7

Parallel INSERT SELECT take 2 — Deep Technical Analysis

The Core Architectural Problem

PostgreSQL's parallel query infrastructure was extended to handle the SELECT side of INSERT ... SELECT in commit 05c8482f7f, but the feature was subsequently reverted. This thread ("take 2") is Fujitsu's attempt to revive it by addressing two fundamental obstacles that blocked the original work:

Problem 1: Parallel-Safety Checking Overhead

Determining whether an INSERT target relation is parallel-safe requires inspecting every object whose execution is triggered by a tuple insertion: column default expressions, CHECK constraints, triggers, and index expressions and predicates, along with any functions those objects call.

For a partitioned table with many leaf partitions this check is O(partitions × objects-per-partition), must be done during planning (before the planner even knows which partitions will be touched), and — crucially — requires acquiring locks on every partition. For OLTP-ish INSERTs that only touch one partition, this planning overhead can easily exceed the execution savings. This overhead was the primary cause of the earlier revert.
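To make the scope of that check concrete, here is a sketch of the dependent objects it must walk for a single partition (all table and function names here are hypothetical; `audit_fn` is assumed to be a user-defined trigger function whose `proparallel` label must be looked up in `pg_proc`):

```sql
-- Hypothetical partition: every object below must be inspected before
-- the planner can call an INSERT into this table parallel-safe.
CREATE TABLE orders_2021 (
    id      bigint,
    total   numeric,
    note    text DEFAULT upper('none'),          -- column default expression
    CONSTRAINT positive_total CHECK (total > 0)  -- CHECK constraint expression
);

-- Expression index: the stored node tree must be deserialized to inspect it.
CREATE INDEX ON orders_2021 (lower(note));

-- Trigger: audit_fn()'s proparallel label must be fetched from pg_proc.
CREATE TRIGGER orders_audit
    AFTER INSERT ON orders_2021
    FOR EACH ROW EXECUTE FUNCTION audit_fn();
```

Multiply this inspection by every leaf partition, all locked at plan time, and the overhead that triggered the revert becomes clear.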

Problem 2: XID Assignment in Parallel Workers

heap_insert() requires a current transaction XID. Parallel workers were historically read-only, so they never needed to allocate XIDs; if workers were to perform writes (the more ambitious "parallel INSERT" variant), the XID-assignment and transaction-state machinery would have to work inside parallel mode.

The original patch sidestepped this by assigning XID in the leader before entering parallel mode, at the cost of wasting an XID when the SELECT returns no rows.

Design Decisions and the Declarative Approach

After consideration of two alternatives — (a) caching the result of parallel-safety checks in relcache/shared hash, vs. (b) a user-declared parallel-safety marking — the team chose declarative labeling via a new pg_class.relparallel column (later called relparalleldml) settable with CREATE/ALTER TABLE ... PARALLEL DML { UNSAFE | RESTRICTED | SAFE }, defaulting to UNSAFE.

Key implications of this choice:

  1. Zero planning-time cost for the partitioned-table case. The planner simply reads a single flag from the root relation's pg_class entry and trusts it.
  2. User-visible contract mirroring proparallel: the precedent of making users responsible for function parallel labels is extended to tables.
  3. A pg_get_parallel_safety(regclass) SRF (later split into pg_get_table_parallel_dml_safety and pg_get_table_max_parallel_dml_hazard) is provided as a development-time tool that scans all dependent objects and lists the problematic ones. This deliberately does not run at plan time.
  4. Runtime guardrail: an attempt was made to detect unsafe function execution inside a parallel worker by injecting a check at FunctionCallInvoke(). This had to grapple with the fact that built-in functions don't have their proparallel readily accessible from low-level fmgr paths — leading to a proposal to pack parallel-safety into FmgrBuiltin by replacing the bool strict, retset pair with a bitflag char plus a new proparallel char. This runtime check was ultimately factored out into a separate thread and removed from later patch versions.
  5. Foreign and temporary tables are forcibly treated as (at most) parallel-restricted; attempts to mark them SAFE error out. Foreign tables are unsafe because there is no parallel-INSERT FDW API.
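In concrete terms, the proposed interface would look roughly like this (syntax taken from the thread's patches; none of it exists in released PostgreSQL, and the exact spellings shifted between patch versions):

```sql
-- Proposed, never-committed syntax: declare the target table's DML
-- parallel-safety once, instead of computing it at every plan.
CREATE TABLE sales (id bigint, sold_at date, amount numeric)
    PARTITION BY RANGE (sold_at)
    PARALLEL DML SAFE;

-- Development-time helper: walks all dependent objects and reports
-- hazards.  Deliberately not run at plan time.
SELECT * FROM pg_get_table_parallel_dml_safety('sales'::regclass);

-- Relabel if a later schema change (say, an unsafe trigger) breaks the claim.
ALTER TABLE sales PARALLEL DML UNSAFE;
```

The planner then reads the single flag from the root relation and trusts it, which is the entire point: no partition walk, no extra locks.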

Design Disagreements

Should the DDL verify the declaration?

Bharath repeatedly pushed for verification at ALTER TABLE ... PARALLEL DML SAFE time (walk partitions + dependent objects, reject if any are unsafe). Greg Nancarrow and Amit Kapila argued for the pure declarative semantics — because any such DDL-time check is immediately invalidated the moment someone does ALTER FUNCTION ... PARALLEL UNSAFE on a function referenced in a constraint/trigger/default, and cascading invalidation from pg_proc back to every affected pg_class row is expensive and requires walking pg_depend, pg_inherits, pg_index, etc. The declarative camp prevailed: the executor catches any violation at runtime.
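A small example of the staleness problem the declarative camp pointed to (hypothetical names; the `PARALLEL DML SAFE` clause is the thread's proposed syntax, and `LANGUAGE sql ... RETURN` is the PostgreSQL 14+ SQL-body form). A DDL-time verification passes, then a later ALTER FUNCTION silently invalidates it:

```sql
-- A parallel-safe function used in a CHECK constraint.
CREATE FUNCTION valid_amount(numeric) RETURNS boolean
    LANGUAGE sql PARALLEL SAFE
    RETURN $1 > 0;

CREATE TABLE payments (
    amount numeric CHECK (valid_amount(amount))
) PARALLEL DML SAFE;   -- a DDL-time verification would succeed here

-- Later, independently: the verified declaration is now stale, and
-- nothing cascades from pg_proc back to pg_class to catch it.
ALTER FUNCTION valid_amount(numeric) PARALLEL UNSAFE;
```

This is why the declarative camp insisted the only reliable enforcement point is the executor, at the moment the unsafe function actually runs.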

Should non-partitioned tables also be declarative?

Bharath argued the declarative approach only needs to cover partitioned tables, because for plain tables the planning-time safety check is cheap. Greg and Hou argued for consistency: mixing declarative (partitioned) and computed (plain) creates two code paths and confuses users. Later (Aug 2021), this evolved toward a hybrid: user declaration wins if provided, otherwise auto-compute for non-partitioned relations — see Hou's post pointing to the follow-up thread.

Index expression double-parsing

Index expressions/predicates are stored as serialized node trees and must be stringToNode()-ed to inspect. The planner currently doesn't open index info for INSERT (that's done in the executor), so a safety check at plan time would re-parse these, duplicating executor work. Passing parsed info down was acknowledged as a possible refactor but deferred.

Tomas Vondra's 2026 Reassessment

Five years after the thread went dormant, Tomas revived it with two substantive critiques and an alternative:

Critique: Declarative is the wrong default philosophy

Echoing Robert Haas's earlier position, Tomas argues that table parallel-safety is fundamentally different from function parallel-safety. Function safety is undecidable (halting problem-ish: what does arbitrary PL/pgSQL do?), so we must ask the author. Table safety is fully decidable from the catalog — we choose not to compute it only because of cost. Shifting responsibility to users invites bug reports from people who mislabel their tables.

His counter-proposal: maintain pg_class.relparalleldml automatically. When a trigger is added, a default expression changes, or a referenced function's proparallel is altered, cascade the update through all affected parent/partition-root relations. The deadlock-hazard objection (raised by Robert) is acknowledged but considered solvable via ordered locking.

Alternative: Materialize, don't interleave

Tomas proposes sidestepping the entire design by running the SELECT to completion into a tuplestore, leaving parallel mode, then doing the INSERT serially. This would eliminate both original obstacles: the target table needs no parallel-safety declaration or check, since only the SELECT runs in parallel mode, and no XID is ever needed inside a worker. The trade-off is buffering the full result set, spilling to disk when it is large.

This is essentially what users do manually today (SELECT into a staging table, then INSERT). It matches the observation that the reverted 05c8482f7f only ever parallelized SELECT anyway — the INSERT always ran in the leader. If the INSERT must be in the leader, why be in parallel mode during it at all?
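The manual equivalent works on released PostgreSQL today because CREATE TABLE ... AS is one of the few data-writing commands allowed to use a parallel plan for its underlying SELECT, while a plain INSERT ... SELECT is not (table names here are illustrative):

```sql
-- Step 1: materialize the SELECT.  The aggregation below is eligible
-- for a parallel plan even though CTAS writes data.
CREATE TEMP TABLE staging AS
SELECT o.id, sum(o.total) AS total
FROM   orders o
GROUP  BY o.id;

-- Step 2: load the target serially, outside any parallel context, so
-- triggers, defaults, and constraints run with no new restrictions.
INSERT INTO target_summary SELECT * FROM staging;
```

Tomas's tuplestore proposal is essentially this pattern internalized, minus the user-visible staging table.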

Patch Evolution

Through v1 → v10, the patch set stabilized into: the pg_class.relparalleldml column with its CREATE/ALTER TABLE ... PARALLEL DML grammar, the pg_get_table_parallel_dml_safety / pg_get_table_max_parallel_dml_hazard inspection functions, and planner support for generating a parallel SELECT plan under an INSERT into a declared-safe target.

The runtime-function-check piece (originally 0003) was split out entirely to another thread because of the FmgrBuiltin redesign implications.

Why This Matters

The absence of this feature is the primary parallelism gap vs. Oracle for ETL/DW workloads. The architectural tension — fast planning vs. correctness without user burden — is a recurring PostgreSQL theme (cf. selectivity estimation, JIT decisions). The declarative approach is expedient but philosophically inconsistent with PostgreSQL's usual "the server figures it out" posture, which is why the thread never reached committer consensus and lay dormant for five years.