Parallel Apply for Non-Streaming Transactions: Deep Technical Analysis
1. The Core Architectural Problem
Logical replication's apply pipeline has a fundamental scalability asymmetry with the publisher: while hundreds of backend sessions can concurrently generate WAL on the publisher, a single apply worker per subscription serializes all those transactions on the subscriber. With synchronous_commit=remote_apply, Nisha Moond's benchmarks showed this produces a ~94% TPS collapse on the publisher (16x slower than async, and 6.6x slower than physical synchronous replication). Even asynchronously, the subscriber cannot keep up with a moderately busy publisher — a 10-minute pgbench run took ~46 minutes to replicate on HEAD.
PostgreSQL already supports streaming=parallel, but that only parallelizes large in-progress transactions; the common case of many small concurrent OLTP transactions remains single-threaded. Extending parallelism to non-streaming transactions is the goal, but doing so safely requires solving two hard sub-problems:
- Transaction dependency detection — to avoid out-of-order failures, unique-key violations, FK violations, and deadlocks.
- Replication progress tracking under out-of-order commits — the existing `replication_origin` model assumes a monotonically increasing `remote_lsn` watermark, which breaks if the transaction at LSN 1300 commits before the transaction at LSN 1000.
2. Why Physical-Replication Pipelining Doesn't Apply
Bruce Momjian suggested the Cybertec "end of the road for streaming replication" pipelining idea (read-ahead process pinning pages, replay loop consumes ready-to-apply records). Amit Kapila firmly rebutted this: in logical apply, the replay IS the work. There is no WAL decoding, no page lookup by LSN, no buffer pinning to overlap — the apply worker essentially re-executes INSERT/UPDATE/DELETE through the executor stack exactly as a user backend would. Pipelining the parse/decode/pin stages (which dominate physical redo) buys nothing when those stages are already done on the publisher. This is an important conceptual distinction: the only remaining parallelism axis is across transactions, which mandates dependency tracking.
3. The Dependency Detection Design
3.1 The Writeset / Replica-Identity Hash Table
The core mechanism is a hash table keyed by (RelationId, ReplicaIdentity columns) with value = last remote XID to touch that key. For every change the leader reads from the network:
- Hash the RI of the old tuple (for UPDATE/DELETE) and new tuple (for INSERT/UPDATE).
- Probe the hash. On hit, emit a `PA_MSG_XACT_DEPENDENCY` internal message telling the parallel worker to wait for the conflicting XID. On miss, insert the entry.
- On UPDATE, the new tuple's RI must also be probed, because a prior INSERT of that key in another in-flight transaction would conflict on the unique index. This is a subtle point Amit highlighted in the opening design; a minimal sketch of the probe follows this list.
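To make the probe concrete, here is a minimal, self-contained C sketch of the leader-side lookup, assuming a fixed-size open-addressing table and hypothetical names (`WsEntry`, `ws_probe_and_record`); the actual patch uses a dynamically growing simplehash keyed on (RelationId, replica-identity column values).

```c
/* Minimal sketch of the leader-local writeset probe.  Illustrative only:
 * names, sizing, and collision handling are simplified stand-ins. */
#include <stdint.h>
#include <stdio.h>

#define WS_SLOTS 4096            /* illustrative capacity, not the patch's */

typedef struct
{
    uint32_t relid;              /* which table */
    uint64_t ri_hash;            /* hash of replica-identity column values */
    uint32_t last_xid;           /* last in-flight remote XID to touch key */
    int      used;
} WsEntry;

static WsEntry ws_table[WS_SLOTS];

/* Probe for a conflicting in-flight XID and record cur_xid as the new
 * owner of the key.  Returns the XID the parallel worker must wait on,
 * or 0 if no dependency exists. */
static uint32_t
ws_probe_and_record(uint32_t relid, uint64_t ri_hash, uint32_t cur_xid)
{
    uint64_t h = (ri_hash ^ relid) % WS_SLOTS;

    for (int i = 0; i < WS_SLOTS; i++)
    {
        WsEntry *e = &ws_table[(h + i) % WS_SLOTS];

        if (!e->used)
        {
            /* miss: remember that cur_xid now owns this key */
            e->relid = relid;
            e->ri_hash = ri_hash;
            e->last_xid = cur_xid;
            e->used = 1;
            return 0;
        }
        if (e->relid == relid && e->ri_hash == ri_hash)
        {
            uint32_t dep = (e->last_xid == cur_xid) ? 0 : e->last_xid;

            e->last_xid = cur_xid;   /* cur_xid becomes the new owner */
            return dep;
        }
    }
    return 0;                        /* table full; real code grows instead */
}

int main(void)
{
    /* txn 101 updates key (rel 16384, hash 42); txn 102 then touches the
     * same key and must wait on 101 before applying its change. */
    printf("dep=%u\n", ws_probe_and_record(16384, 42, 101)); /* dep=0   */
    printf("dep=%u\n", ws_probe_and_record(16384, 42, 102)); /* dep=101 */
    return 0;
}
```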
Qiuwenhui pointed out this is essentially MySQL 8.4's writeset-based parallel replication, and Tomas Vondra suggested adopting that terminology.
3.2 Extensions beyond RI
- Unique keys: without tracking all unique keys (not just PK/RI), you get silent inconsistency. Amit's canonical example: Txn1 INSERT(1,1), Txn2 INSERT(2,2), Txn3 DELETE(2,2), Txn4 UPDATE(1,1)→(1,2). Applying Txn4 before Txn2+Txn3 violates the unique constraint on column `b`.
- Foreign keys: applying a child-row INSERT before the parent INSERT fails. The leader probes FK values in new tuples against the RI/unique hash; a hit means dependency. Implemented in v4 patch 0009.
- Triggers, non-immutable CHECK constraints, expression indexes: these are escape hatches — the initial design simply forces serialization (commit-order wait) for tables with user-defined triggers or expression indexes, because the leader cannot statically predict their effects. Hou's v7 adds handling of non-immutable triggers by falling back to waiting for the previous commit.
- TRUNCATE / schema change: must wait for all in-flight XIDs touching that relation; implemented by a sequential scan of the hash table. (The combined per-operation probing rules are sketched after this list.)
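The rules from 3.1 and 3.2 can be summarized in one per-change dispatch routine. This is an illustrative C sketch, not the patch's code; `probe()` stands in for the writeset lookup shown earlier.

```c
/* Sketch of which keys the leader probes per change type, following the
 * rules above.  All names are illustrative.  Tables with user-defined
 * triggers or expression indexes bypass this entirely and fall back to a
 * commit-order wait. */
#include <stdio.h>

typedef enum { CH_INSERT, CH_UPDATE, CH_DELETE, CH_TRUNCATE } ChangeType;

static void probe(const char *what) { printf("  probe %s\n", what); }

static void
collect_dependencies(ChangeType type)
{
    switch (type)
    {
        case CH_INSERT:
            probe("new tuple: every unique key");  /* unique-index clash */
            probe("new tuple: FK values");         /* parent must exist  */
            break;
        case CH_UPDATE:
            probe("old tuple: replica identity");  /* row being replaced */
            probe("new tuple: every unique key");  /* prior INSERT clash */
            probe("new tuple: FK values");
            break;
        case CH_DELETE:
            probe("old tuple: replica identity");
            break;
        case CH_TRUNCATE:
            /* no single key: wait for ALL in-flight XIDs on the relation,
             * found by a sequential scan of the writeset hash */
            probe("sequential scan: all entries for this relation");
            break;
    }
}

int main(void)
{
    const char *names[] = { "INSERT", "UPDATE", "DELETE", "TRUNCATE" };

    for (int t = CH_INSERT; t <= CH_TRUNCATE; t++)
    {
        printf("%s:\n", names[t]);
        collect_dependencies((ChangeType) t);
    }
    return 0;
}
```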
3.3 Cost Analysis and Alternative Data Structures
Konstantin Knizhnik raised the concern that a full (relid, RI) hash might be prohibitive with hundreds of concurrent transactions each updating millions of rows, and suggested Bloom filters. Tomas Vondra echoed this and asked for CPU/memory measurements. Kuroda's profiling answered this empirically: `check_dependency_on_replica_identity` consumes ~1.39% of CPU in simple-update workloads, and `cleanup_replica_identity_table` only 0.84%, with a per-cleanup cost of ~1.2 microseconds against a per-txn budget of ~74 microseconds. The hash approach was judged fast enough; a shared DSA-based hash table (Kuroda's 0002 experiment) actually showed a 1–2% regression because the shared-memory synchronization cost exceeded the maintenance savings. Decision: keep a leader-local HTAB for dependency tracking, with three distinct hash table types in the final design (rough shapes are sketched after this list):
- `replica_identity_table` (simplehash): the per-change hot path; needs maximum speed.
- `parallelized_txns` (dshash in DSA): shared across leader and workers for cross-process lookup; must grow dynamically.
- `ParallelApplyTxnHash` (HTAB): per-large-transaction state with a simple XID key.
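For orientation, here are rough C shapes for the three tables. The fields beyond the keys are guesses for illustration; the real definitions are generated from simplehash/dshash/HTAB templates in the patch and differ in detail.

```c
/* Simplified stand-in entry types for the three hash tables.  Field sets
 * beyond the keys are assumptions, not the patch's definitions. */
#include <stdint.h>

/* replica_identity_table (simplehash in the patch): leader-local,
 * per-change hot path.  Key = (relation, RI-column hash). */
typedef struct
{
    uint32_t relid;
    uint64_t ri_hash;
    uint32_t last_xid;       /* last in-flight XID to touch this key */
} ReplicaIdentityEntry;

/* parallelized_txns (dshash in a DSA in the patch): shared between the
 * leader and parallel workers so any process can look up the state of a
 * remote XID being applied elsewhere. */
typedef struct
{
    uint32_t remote_xid;     /* key */
    int      worker_slot;    /* hypothetical: which worker owns it */
    uint64_t commit_lsn;     /* hypothetical: its remote commit LSN */
} ParallelizedTxnEntry;

/* ParallelApplyTxnHash (plain HTAB in the patch): per-large-transaction
 * bookkeeping, keyed by XID alone. */
typedef struct
{
    uint32_t remote_xid;     /* key; per-txn apply state would follow */
} ParallelApplyTxnEntry;

int main(void) { return 0; }
```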
4. Replication Progress Tracking Under Out-of-Order Commit
Amit's original proposal introduced three new fields in `ReplicationState`:
- `lowest_remote_lsn` — the safe restart point; everything below it is durably applied in order.
- `highest_remote_lsn` — the furthest commit applied.
- `list_remote_lsn` — a gap list of commit LSNs applied between lowest and highest.

Plus an in-memory `list_in_progress_xacts`. On commit of an XID: if it is the head of the in-progress list, advance `lowest_remote_lsn` (the watermark) and garbage-collect `list_remote_lsn` entries below it; otherwise, append the commit LSN to `list_remote_lsn`. On restart, replay starts at `lowest_remote_lsn`, skipping any commit whose LSN appears in the gap list. This is isomorphic to tracking "which transactions in a commit-ordered range have already been durably applied" and extends `CheckPointReplicationOrigin`.
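A toy model of this bookkeeping, assuming small fixed arrays and hypothetical helper names; the real fields live in shared memory inside `ReplicationState` and are persisted at checkpoint.

```c
/* Toy model of out-of-order progress tracking as described above.
 * Illustrative only: fixed arrays, no locking, no persistence. */
#include <stdint.h>
#include <stdio.h>

#define MAX_TXNS 16

typedef struct { uint32_t xid; uint64_t commit_lsn; int committed; } Txn;

static Txn      in_progress[MAX_TXNS]; /* list_in_progress_xacts, dispatch order */
static int      n_in_progress;
static uint64_t gap_list[MAX_TXNS];    /* list_remote_lsn */
static int      n_gaps;
static uint64_t lowest_lsn;            /* lowest_remote_lsn: safe restart point */

static void
dispatch(uint32_t xid, uint64_t commit_lsn)
{
    in_progress[n_in_progress++] = (Txn){ xid, commit_lsn, 0 };
}

static void
on_commit(uint32_t xid)
{
    for (int i = 0; i < n_in_progress; i++)
        if (in_progress[i].xid == xid)
            in_progress[i].committed = 1;

    if (in_progress[0].xid == xid)
    {
        /* head committed: advance the watermark over every already-
         * committed transaction at the front of the list ... */
        int i = 0;

        while (i < n_in_progress && in_progress[i].committed)
            lowest_lsn = in_progress[i++].commit_lsn;
        for (int j = i; j < n_in_progress; j++)
            in_progress[j - i] = in_progress[j];
        n_in_progress -= i;

        /* ... and garbage-collect gap entries at or below the watermark */
        int k = 0;

        for (int j = 0; j < n_gaps; j++)
            if (gap_list[j] > lowest_lsn)
                gap_list[k++] = gap_list[j];
        n_gaps = k;
    }
    else
    {
        /* out-of-order commit: remember it in the gap list so a restart
         * replaying from lowest_remote_lsn can skip it */
        for (int i = 0; i < n_in_progress; i++)
            if (in_progress[i].xid == xid)
                gap_list[n_gaps++] = in_progress[i].commit_lsn;
    }
}

int main(void)
{
    dispatch(101, 1000);
    dispatch(102, 1300);
    on_commit(102);    /* LSN 1300 commits first: goes to the gap list */
    on_commit(101);    /* head commits: watermark jumps past both */
    printf("lowest=%llu gaps=%d\n", (unsigned long long) lowest_lsn, n_gaps);
    return 0;
}
```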
Mikhail Nikalayeu and the thread consensus pushed for preserving commit order by default (safer for causality, crash semantics, user expectations). Out-of-order commit becomes an opt-in feature for users whose integrity is enforced at the application layer.
5. The Commit-Order-Preserving Variant
Konstantin's empirical surprise was that enforcing commit order costs very little: his random-update test went from 74s (no order) to 88s (preserve order) vs 488s on master — still a ~5.5x speedup. Kuroda later confirmed: pgbench simple-update with 4 workers gave HEAD=6098 TPS, out-of-order parallel=16221, preserve-order parallel=12926 (still 2.1x).
The mechanism is elegant: each parallel worker, before committing, waits on the `pa_lock_transaction` advisory lock of the previously dispatched transaction's XID. The leader sends a `PA_MSG_XACT_DEPENDENCY` with the last committed XID before dispatching the COMMIT. This reuses the same dependency-wait infrastructure already needed for row-level conflicts — no separate mechanism.
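A thread-based toy of the commit-order wait, using a condition variable where the real design uses the `pa_lock_transaction` advisory lock on the predecessor's XID; names and structure are illustrative only.

```c
/* Toy: workers apply in parallel but commit strictly in dispatch order,
 * each waiting for its predecessor before "committing". */
#include <pthread.h>
#include <stdio.h>

#define NTXNS 4

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int committed[NTXNS];        /* committed[i] = txn i has committed */

static void *
apply_txn(void *arg)
{
    int slot = (int) (long) arg;

    /* ... the transaction's changes would be applied here; this part
     * runs concurrently across workers ... */

    pthread_mutex_lock(&mtx);
    /* wait for the previously dispatched txn to commit (the advisory-
     * lock wait in the real design); slot 0 has no predecessor */
    while (slot > 0 && !committed[slot - 1])
        pthread_cond_wait(&cv, &mtx);
    committed[slot] = 1;            /* "commit" in publisher order */
    printf("txn %d committed\n", slot);
    pthread_cond_broadcast(&cv);    /* wake the successor */
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int main(void)
{
    pthread_t workers[NTXNS];

    /* dispatch in publisher commit order; threads may run in any order
     * but always print 0, 1, 2, 3 */
    for (long i = 0; i < NTXNS; i++)
        pthread_create(&workers[i], NULL, apply_txn, (void *) i);
    for (int i = 0; i < NTXNS; i++)
        pthread_join(workers[i], NULL);
    return 0;
}
```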
Konstantin also raised a subtle commit-order deadlock risk: if T1 commits before T2 on publisher but T2 acquires row locks first on subscriber, T2 cannot commit (waiting for T1's commit) but T1 cannot proceed (waiting for T2's row locks). The dependency hash should serialize these, but Konstantin worried about index-internal locks from INSERTs that aren't visible as writeset conflicts. This remains an open verification item.
6. The Leader's Role: Producer vs Producer+Consumer
Dilip Kumar pushed back on the POC's default behavior where the leader applies a transaction itself if no worker is free. His argument: this is a single-producer / multi-consumer queue; making the producer join as a consumer caps the leader's dispatch rate and creates head-of-line blocking that idle workers can't fill. He proposed the leader should default to pure dispatch, with an opt-in GUC for resource-constrained environments.
Amit countered with a deadlock scenario: pa-1 blocked on a publisher backend's row lock, pa-2 blocked on pa-1 (dependency wait) — if the leader doesn't apply the next independent transaction itself, the sender's TCP buffer fills and the connection times out. Amit's compromise: queue up to a threshold (5–10 transactions) at each worker before the leader applies locally. Unresolved at the time of this analysis, but tagged as a later optimization.
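A sketch of what that compromise could look like as a dispatch policy, with hypothetical names and a fixed worker count; the actual threshold handling in the patch may differ.

```c
/* Sketch of the compromise: dispatch to the least-loaded worker until
 * every worker already holds `QUEUE_THRESHOLD` pending transactions,
 * and only then have the leader apply one itself. */
#include <stdio.h>

#define N_WORKERS       4
#define QUEUE_THRESHOLD 5       /* Amit suggested something in the 5-10 range */

static int pending[N_WORKERS];  /* txns queued per parallel worker */

/* Returns the worker slot to dispatch to, or -1 meaning "leader applies
 * this transaction itself" because all workers are saturated. */
static int
choose_worker(void)
{
    int best = -1;
    int best_depth = QUEUE_THRESHOLD;

    for (int i = 0; i < N_WORKERS; i++)
        if (pending[i] < best_depth)
        {
            best = i;
            best_depth = pending[i];
        }
    return best;                /* -1 only when every queue is full */
}

int main(void)
{
    for (int txn = 0; txn < 25; txn++)
    {
        int w = choose_worker();

        if (w >= 0)
            pending[w]++;       /* hand the txn to the least-loaded worker */
        else
            printf("txn %d: leader applies locally\n", txn);
    }
    return 0;
}
```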
Ashutosh Bapat added a smaller concern: if the leader errors while applying, all parallel workers must restart to keep dependency state consistent — a larger blast radius than pure dispatch.
7. The Internal Message Protocol
A non-trivial protocol design emerged around how the leader signals the parallel worker. Three iterations:
- Option 1 (v15): Two wrapper bytes — `PARALLEL_APPLY_INTERNAL_MESSAGE` + `LOGICAL_REP_MSG_INTERNAL_MESSAGE` + `PAWorkerMsgType` + data. Asymmetric between wire and spool.
- Option 2 (chosen): A single byte, `LOGICAL_REP_MSG_INTERNAL_MESSAGE`, doubles as both "distinguishes from `PqReplMsg_WALData` on the MQ" and "`apply_dispatch` action". An inner `PAWorkerMsgType` discriminates dependency vs relation-map messages (the framing is sketched below).
- Option 3 (v12, rejected): Stuff internal codes directly into `LogicalRepMsgType`. Rejected because that enum represents the external wire protocol; contaminating it with internal subscriber messages is a layering violation.
The remaining internal messages are:
- `PA_MSG_XACT_DEPENDENCY` ('d'): "wait for these XIDs to commit before proceeding."
- `PA_MSG_RELMAP` ('r'): leader-to-worker relation metadata sync, needed because non-streaming transactions don't redundantly ship `LOGICAL_REP_MSG_RELATION` per txn the way streamed transactions do.
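A byte-level sketch of the Option 2 framing, using a hypothetical value for the `LOGICAL_REP_MSG_INTERNAL_MESSAGE` kind byte; the real byte values and payload layout come from the patch, not from here.

```c
/* Illustrative framing: kind byte, inner PAWorkerMsgType byte, payload.
 * MSG_INTERNAL is a stand-in; the actual byte is defined by the patch. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MSG_INTERNAL           'I'   /* hypothetical kind byte */
#define PA_MSG_XACT_DEPENDENCY 'd'
#define PA_MSG_RELMAP          'r'

/* Frame a dependency message: kind byte, inner type byte, XID count,
 * then the XIDs the worker must wait on.  Returns bytes written. */
static size_t
frame_dependency(uint8_t *buf, const uint32_t *xids, uint32_t nxids)
{
    size_t off = 0;

    buf[off++] = MSG_INTERNAL;            /* distinguishes from WAL data */
    buf[off++] = PA_MSG_XACT_DEPENDENCY;  /* inner discriminator */
    memcpy(buf + off, &nxids, sizeof(nxids));
    off += sizeof(nxids);
    memcpy(buf + off, xids, nxids * sizeof(*xids));
    return off + nxids * sizeof(*xids);
}

/* Worker-side dispatch on the two leading bytes. */
static void
dispatch(const uint8_t *buf)
{
    if (buf[0] != MSG_INTERNAL)
    {
        printf("regular replication message\n");
        return;
    }
    switch (buf[1])
    {
        case PA_MSG_XACT_DEPENDENCY: printf("wait for listed XIDs\n"); break;
        case PA_MSG_RELMAP:          printf("refresh relation map\n"); break;
    }
}

int main(void)
{
    uint8_t  buf[64];
    uint32_t deps[] = { 101, 102 };

    frame_dependency(buf, deps, 2);
    dispatch(buf);                  /* prints: wait for listed XIDs */
    return 0;
}
```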
8. Performance Results Summary
| Test | HEAD | Parallel Apply | Speedup |
|---|---|---|---|
| pgbench sync repl, 40 workers | 5435 TPS | 30046 TPS | 5.53x |
| 10-min pgbench catch-up, 16 workers | 46 min | 12.3 min | 3.73x |
| Random update catch-up, 8 workers (Konstantin) | 8:30 | 1:30 | 5.7x |
| 1M update (v4, w=4, with RI) | 17.2s | 8.9s | 1.93x |
| 1M update, REPLICA IDENTITY FULL, w=16 | ~7200s (extrapolated) | 718s | ~10x |
Notably, at w=0 (parallel disabled but dependency tracking still running) there's a ~7% regression versus HEAD, which is a known issue to optimize.
9. Key Open/Resolved Design Points
- Shared vs local hash table (RESOLVED): local HTAB; shared showed 1-2% regression.
- Commit order by default (RESOLVED): yes, preserve by default; out-of-order is opt-in.
- Partitioned tables with triggers/constraints (RESOLVED): use routing to identify leaf partition at leader, then check safety of that leaf; fallback to serialization if too expensive.
- Two-phase transactions (ADDRESSED in v6 patch 0006): parallel worker handles PREPARE; leader handles COMMIT/ROLLBACK PREPARED.
- Streamed transactions interaction (ADDRESSED in v6 patch 0007): streamed txns don't get tracked as dependency sources but do wait for previously dispatched non-streamed txns.
- Non-immutable triggers (ADDRESSED in v7): target relations with mutable triggers force commit-order wait.
10. Authority and Influence Map
Amit Kapila (committer, logical replication area owner) is the design lead and final arbiter; his architectural judgments (e.g., rejecting the pipelining analogy, structuring the patch split, accepting Option 2 message format) consistently close debates. Hou Zhijie and Kuroda Hayato (Fujitsu) are the implementation drivers across 17+ patch versions. Konstantin Knizhnik provides sharp independent validation and alternative-approach benchmarking (prefetch vs parallel apply), and his deadlock-risk observations carry significant weight. Dilip Kumar (committer) supplies architectural critique on the producer/consumer pattern. Tomas Vondra (committer) pushes for smaller patches, better instrumentation (writeset statistics view), TAP-level crash tests, and draws the explicit MySQL writeset analogy. Shveta Malik, Peter Smith provide detailed code-level review. Álvaro Herrera's question about the expected speedup magnitude is the kind of framing question that reveals whether this is incremental or transformational work — the answer (3.5x–5.5x) puts it squarely in the transformational category.