Parallel Data Loading for pgbench -i: Technical Analysis
Core Problem
pgbench -i (initialization mode) is the canonical benchmark-setup workflow, and it has been single-threaded since inception. On modern hardware with dozens of cores and fast storage, the client-side data generation phase is CPU-bound in a single backend's COPY stream, leaving most of the machine idle. For large scale factors (-s 500, -s 2000), initialization can dominate benchmark setup time, which is a real friction point for anyone running repeated pgbench experiments.
The patch, originating from a blog-post idea by Tomas Vondra, proposes to reuse pgbench's existing -j (threads) infrastructure — currently only meaningful in client/benchmark mode — to parallelize the initialization path. This is attractive because pgbench already has thread plumbing, per-thread libpq connections, and the random-data generation is trivially partitionable by row range.
Architectural Challenges Surfaced
The thread's most interesting content is not the speedup numbers (which are real) but the subtle correctness/performance interactions that parallelism exposes in what looked like a simple change.
1. COPY FREEZE is incompatible with concurrent writers
COPY FREEZE requires that the target table was created or truncated in the same transaction as the COPY, AND that no other transaction can see the table. This is the key optimization that avoids a second pass by VACUUM to set all-frozen/all-visible bits. In a parallel, non-partitioned load this is fundamentally unachievable: only one worker can hold the AccessExclusiveLock from the TRUNCATE, and any other worker touching the same relation would invalidate FREEZE's preconditions.
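Concretely, the only pattern under which the server accepts FREEZE looks like this (illustrative SQL, single session):

```sql
BEGIN;
TRUNCATE pgbench_accounts;                  -- or CREATE TABLE in this transaction
COPY pgbench_accounts FROM STDIN (FREEZE);  -- legal: same transaction as the TRUNCATE
COMMIT;
```

A COPY from any other session would both block on the AccessExclusiveLock and, even after commit, fail the FREEZE precondition with an error, since its transaction did not create or truncate the table.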
This is why Lakshmi's early measurements consistently showed the non-partitioned parallel case losing on total runtime despite faster data generation: skipping FREEZE shifted the cost, and then some, to the subsequent VACUUM phase. Heikki Linnakangas (committer) diagnosed this definitively by instrumenting the workers.

2. Serialization bug in the original design
Heikki's debug trace uncovered a more embarrassing problem: the v2 patch had the main thread run BEGIN; TRUNCATE pgbench_accounts; on connection 0, then hand that connection to worker 0. Workers 1..N opened fresh connections and issued COPY, but they immediately blocked on the AccessExclusiveLock held by worker 0's still-open transaction. The output shows all workers "sending COPY" at t=0.00s but only worker 0 actually progressing until t=6.19s, at which point the other nine workers start in lockstep. The intended parallelism was effectively serialized for the first batch — the patch was masking this because total wall time still improved versus master for partitioned cases (where each worker has its own relation).
The fix is either to commit the TRUNCATE before dispatching workers (losing FREEZE entirely), or to restrict parallelism to the partitioned case where each worker owns a distinct relation and can legitimately TRUNCATE+FREEZE within its own transaction.
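The first option can be sketched as the following statement sequence (illustrative, not the patch's code):

```sql
-- Main thread, before any workers are dispatched:
BEGIN;
TRUNCATE pgbench_accounts;
COMMIT;                           -- AccessExclusiveLock released here

-- Each worker, on its own connection, loads its row range:
COPY pgbench_accounts FROM STDIN; -- no FREEZE: the table was not
                                  -- truncated in this transaction
```

The workers no longer serialize on the lock, but every row must later be frozen by VACUUM, which is exactly the cost shift seen in the non-partitioned measurements.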
3. The partitioned case is the natural fit
Heikki's recommendation — adopted in v3 — is to parallelize only when --partitions is used, with one thread per partition. This resolves both problems at once:
- Each worker operates on an independent partition with its own lock scope.
- Each worker can legitimately use `COPY FREEZE`.
- No cross-worker coordination or lock contention on the parent (attach-partition is deferred until all loads complete, or partitions are built as standalone tables and then attached).
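Under this scheme each worker's transaction independently satisfies the FREEZE preconditions. Sketched in SQL, using pgbench's `pgbench_accounts_N` partition naming (partition number illustrative):

```sql
-- Worker i, on its own connection, loading only its own partition:
BEGIN;
TRUNCATE pgbench_accounts_3;                 -- lock scoped to this partition only
COPY pgbench_accounts_3 FROM STDIN (FREEZE); -- valid: truncated in this transaction
COMMIT;
```

Because no two workers ever touch the same relation, the locks never collide and FREEZE stays valid everywhere.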
The v3 numbers validate this: scale 100 partitioned drops from 29.3s (master) to 12.6s, with VACUUM collapsing from 7.8s to 1.6s — the FREEZE effect is visible in the VACUUM column.
4. API design tension
Hayato Kuroda (Fujitsu) raised legitimate usability concerns about the v3 "partitioned-only" design:
- Forced parallelism: You can no longer run a partitioned init serially, because `--partitions` implicitly triggers one thread per partition.
- Memory pressure: With many partitions (hundreds or thousands, which is a realistic pgbench workload), one connection plus one thread per partition risks OOM and connection exhaustion.
- Redundant TRUNCATE: v3 has workers TRUNCATE their assigned partition even though `initTruncateTables()` has already truncated everything before the threads launch; this is a leftover from the v2 design, where the TRUNCATE was needed to enable FREEZE within the worker's transaction.
Hayato's counter-proposal is to reintroduce an explicit -j in init mode with the constraint partitions >= nthreads or partitions % nthreads == 0, so each worker owns a contiguous block of partitions. This decouples parallelism degree from partition count and preserves the FREEZE property (each worker still operates on disjoint relations).
Patch Evolution
| Version | Strategy | Key Issue |
|---|---|---|
| v1 | `-j` parallelizes all phases; workers split row ranges in the unpartitioned case and own whole partitions in the partitioned case | Non-partitioned gains limited by lost FREEZE; partitioned case creates standalone tables, then attaches |
| v2 | Split into 0001 (parallel load) + 0002 (parallel partition create/attach); added COPY batching | Hidden serialization via the TRUNCATE lock, discovered by Heikki |
| v3 | Partitioned-only, one thread per partition, no explicit `-j` | API rigidity; redundant TRUNCATE; potential OOM with many partitions |
| v4 | Addresses Hayato/Lakshmi feedback (details not in thread excerpt; likely restores `-j` with a worker-to-partition mapping) | — |
Secondary Observation: COPY Batching
In v2, Mircea included a COPY_BATCH_SIZE chunking optimization discovered via profiling. Hayato correctly pointed out this is orthogonal to parallelism and should be proposed independently — it would benefit serial pgbench -i too. This is a good example of reviewer discipline: keeping patches atomic improves reviewability and git-bisect behavior.
Connection to Other Work
The non-partitioned case's VACUUM bottleneck would be substantially mitigated by Sawada-san's ongoing parallel heap vacuum work (referenced by Mircea). If that lands, the case for non-partitioned parallel load strengthens because the post-load VACUUM cost would itself be parallelized. This is a genuine cross-feature dependency worth tracking.
Design Verdict
The thread converges on a clear architectural principle: parallelism in pgbench init is only a clean win when each worker owns an independent relation, because that's the condition under which COPY FREEZE remains valid. The partitioned case satisfies this naturally; the unpartitioned case does not, and no amount of clever transaction juggling changes that. The remaining design question is purely ergonomic: how to expose the worker count without either (a) forcing parallelism or (b) conflating it with --partitions.