Parallel Apply

First seen: 2025-08-11 04:45:41+00:00 · Messages: 103 · Participants: 17

Latest Update

2026-05-06 · opus 4.7

Parallel Apply for Non-Streaming Transactions: Deep Technical Analysis

1. The Core Architectural Problem

Logical replication's apply pipeline has a fundamental scalability asymmetry with the publisher: while hundreds of backend sessions can concurrently generate WAL on the publisher, a single apply worker per subscription serializes all those transactions on the subscriber. With synchronous_commit=remote_apply, Nisha Moond's benchmarks showed this produces a ~94% TPS collapse on the publisher (16x slower than async, and 6.6x slower than physical synchronous replication). Even asynchronously, the subscriber cannot keep up with a moderately busy publisher — a 10-minute pgbench run took ~46 minutes to replicate on HEAD.

PostgreSQL already supports streaming=parallel, but that only parallelizes large in-progress transactions; the common case of many small concurrent OLTP transactions remains single-threaded. Extending parallelism to non-streaming transactions is the goal, but doing so safely requires solving two hard sub-problems:

  1. Transaction dependency detection — to avoid out-of-order failures, unique-key violations, FK violations, and deadlocks.
  2. Replication progress tracking under out-of-order commits — the existing replication_origin model assumes a monotonically increasing remote_lsn watermark, which breaks if Txn at LSN 1300 commits before Txn at LSN 1000.

2. Why Physical-Replication Pipelining Doesn't Apply

Bruce Momjian suggested the Cybertec "end of the road for streaming replication" pipelining idea (read-ahead process pinning pages, replay loop consumes ready-to-apply records). Amit Kapila firmly rebutted this: in logical apply, the replay IS the work. There is no WAL decoding, no page lookup by LSN, no buffer pinning to overlap — the apply worker essentially re-executes INSERT/UPDATE/DELETE through the executor stack exactly as a user backend would. Pipelining the parse/decode/pin stages (which dominate physical redo) buys nothing when those stages are already done on the publisher. This is an important conceptual distinction: the only remaining parallelism axis is across transactions, which mandates dependency tracking.

3. The Dependency Detection Design

3.1 The Writeset / Replica-Identity Hash Table

The core mechanism is a hash table keyed by (RelationId, ReplicaIdentity columns) with value = the last remote XID to touch that key. For every change the leader reads from the network, it probes this table: a hit means the incoming transaction must wait on the transaction that last wrote the same key, and the entry is then overwritten with the current XID.

Qiuwenhui pointed out this is essentially MySQL 8.4's writeset-based parallel replication, and Tomas Vondra suggested adopting that terminology.
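The lookup-then-update cycle can be pictured with the following sketch. This is a simplified simulation, not the patch's code: WritesetTracker, record_change, and the string keys are invented names, and the real implementation is a leader-local HTAB keyed by hashed replica-identity values.

```python
# Hypothetical sketch of writeset-based dependency detection: a dict keyed by
# (relation, replica-identity key) maps to the last remote XID that touched
# that key. Each incoming change either reports a dependency or starts fresh.

class WritesetTracker:
    def __init__(self):
        self.last_writer = {}  # (relid, ri_key) -> last remote XID

    def record_change(self, xid, relid, ri_key):
        """Return the XID this change depends on, or None if independent."""
        key = (relid, ri_key)
        dep = self.last_writer.get(key)
        self.last_writer[key] = xid
        # A transaction never depends on itself.
        return dep if dep is not None and dep != xid else None


t = WritesetTracker()
t.record_change(100, "tbl", "pk=1")   # None: first writer of this key
t.record_change(101, "tbl", "pk=1")   # 100: must wait for XID 100 to commit
t.record_change(101, "tbl", "pk=2")   # None: untouched key, no dependency
```

Dispatching a change only after its reported dependency has committed is what prevents the out-of-order and unique-key failures listed in section 1.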

3.2 Extensions beyond RI

3.3 Cost Analysis and Alternative Data Structures

Konstantin Knizhnik raised the concern that a full (relid, RI) hash might be prohibitive with hundreds of concurrent transactions each updating millions of rows, and suggested Bloom filters. Tomas Vondra echoed this and asked for CPU/memory measurements. Kuroda's profiling answered this empirically: check_dependency_on_replica_identity consumes ~1.39% CPU in simple-update workloads, and cleanup_replica_identity_table only 0.84%, with a per-cleanup cost of ~1.2 microseconds against a per-transaction budget of ~74 microseconds. The hash approach was judged fast enough; a shared DSA-based hash table (Kuroda's 0002 experiment) actually showed a 1–2% regression because the shared-memory synchronization cost exceeded the maintenance savings. Decision: keep a leader-local HTAB for dependency tracking; the final design uses three distinct hash table types.

4. Replication Progress Tracking Under Out-of-Order Commit

Amit's original proposal introduced three new fields in ReplicationState, chief among them a lowest_remote_lsn watermark and a list_remote_lsn gap list, together with an in-memory list_in_progress_xacts. On commit of an XID: if it is the head of list_in_progress_xacts, advance lowest_remote_lsn (the watermark) and garbage-collect list_remote_lsn entries below it; otherwise, append the commit LSN to list_remote_lsn. On restart, replay begins at lowest_remote_lsn, skipping any commit whose LSN appears in the gap list. This is isomorphic to tracking "which transactions in a commit-ordered range have already been durably applied" and extends CheckPointReplicationOrigin.
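One way to picture the watermark/gap-list bookkeeping is the sketch below. ProgressTracker and its method names are invented for illustration, and real LSNs are 64-bit WAL positions rather than small integers; the logic mirrors the commit/restart rules just described.

```python
# Hedged sketch of out-of-order progress tracking: a watermark below which
# everything is durably applied, plus a gap list of commits above it.

class ProgressTracker:
    def __init__(self, start_lsn):
        self.lowest_remote_lsn = start_lsn  # restart point (the watermark)
        self.gap_lsns = set()               # committed above the watermark
        self.in_progress = []               # dispatched LSNs, commit order

    def dispatch(self, lsn):
        self.in_progress.append(lsn)

    def commit(self, lsn):
        if self.in_progress and self.in_progress[0] == lsn:
            # Head of the in-progress list: advance the watermark past every
            # already-committed successor and garbage-collect the gap list.
            self.in_progress.pop(0)
            self.lowest_remote_lsn = lsn
            while self.in_progress and self.in_progress[0] in self.gap_lsns:
                nxt = self.in_progress.pop(0)
                self.gap_lsns.discard(nxt)
                self.lowest_remote_lsn = nxt
        else:
            # Out-of-order commit: remember it so restart can skip it.
            self.gap_lsns.add(lsn)

    def must_skip_on_restart(self, lsn):
        """Replay restarts at the watermark; gap-list commits are already applied."""
        return lsn <= self.lowest_remote_lsn or lsn in self.gap_lsns
```

With transactions at LSN 1000 and 1300 dispatched in order but 1300 committing first, the tracker records 1300 in the gap list; once 1000 commits, the watermark jumps straight to 1300 and the gap entry is reclaimed.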

Mikhail Nikalayeu and the thread consensus pushed for preserving commit order by default (safer for causality, crash semantics, user expectations). Out-of-order commit becomes an opt-in feature for users whose integrity is enforced at the application layer.

5. The Commit-Order-Preserving Variant

Konstantin's empirical surprise was that enforcing commit order costs very little: his random-update test went from 74s (no order) to 88s (preserve order) vs 488s on master — still a ~5.5x speedup. Kuroda later confirmed: pgbench simple-update with 4 workers gave HEAD=6098 TPS, out-of-order parallel=16221, preserve-order parallel=12926 (still 2.1x).

The mechanism is elegant: each parallel worker, before committing, waits on the pa_lock_transaction advisory lock of the previously dispatched transaction's XID. The leader sends a PA_MSG_XACT_DEPENDENCY with the last committed XID before dispatching the COMMIT. This reuses the same dependency-wait infrastructure already needed for row-level conflicts — no separate mechanism.
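A minimal simulation of that ordering rule is sketched below, assuming nothing beyond the description above: each worker's apply phase runs fully in parallel, but its commit blocks until the previously dispatched transaction has committed (standing in for the wait on the predecessor's pa_lock_transaction advisory lock).

```python
# Simulation of commit-order preservation: apply in parallel, commit in
# dispatch order via a chain of events (one per transaction).
import threading
import time

commit_log = []
log_lock = threading.Lock()

def worker(xid, apply_seconds, prev_committed, committed):
    time.sleep(apply_seconds)        # the apply phase runs in parallel
    if prev_committed is not None:
        prev_committed.wait()        # block until the predecessor commits
    with log_lock:
        commit_log.append(xid)       # "commit"
    committed.set()                  # release the successor

prev = None
threads = []
# Dispatch XIDs 1..4 with inverted apply costs: later transactions finish
# applying first, yet must still commit in publisher order.
for xid, cost in [(1, 0.2), (2, 0.1), (3, 0.05), (4, 0.0)]:
    done = threading.Event()
    t = threading.Thread(target=worker, args=(xid, cost, prev, done))
    t.start()
    threads.append(t)
    prev = done

for t in threads:
    t.join()

print(commit_log)  # -> [1, 2, 3, 4], regardless of apply timing
```

The chain structure is why this costs so little in Konstantin's and Kuroda's numbers: only the commit step serializes, not the apply work.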

Konstantin also raised a subtle commit-order deadlock risk: if T1 commits before T2 on publisher but T2 acquires row locks first on subscriber, T2 cannot commit (waiting for T1's commit) but T1 cannot proceed (waiting for T2's row locks). The dependency hash should serialize these, but Konstantin worried about index-internal locks from INSERTs that aren't visible as writeset conflicts. This remains an open verification item.

6. The Leader's Role: Producer vs Producer+Consumer

Dilip Kumar pushed back on the POC's default behavior where the leader applies a transaction itself if no worker is free. His argument: this is a single-producer / multi-consumer queue; making the producer join as a consumer caps the leader's dispatch rate and creates head-of-line blocking that idle workers can't fill. He proposed the leader should default to pure dispatch, with an opt-in GUC for resource-constrained environments.

Amit countered with a deadlock scenario: pa-1 blocked on a publisher backend's row lock, pa-2 blocked on pa-1 (dependency wait) — if the leader doesn't apply the next independent transaction itself, the sender's TCP buffer fills and times out. Amit's compromise: distribute a threshold (5–10) to each worker before the leader applies locally. Unresolved at time of analysis, but tagged as a later optimization.
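Amit's compromise can be sketched as a routing rule; PA_QUEUE_THRESHOLD, route_transaction, and the plain-list queues are invented for illustration (the real patch would use shared-memory queues), but the decision logic matches the description above.

```python
# Sketch: the leader keeps dispatching while any worker has spare queue
# capacity; only when every worker is saturated does it apply locally.

PA_QUEUE_THRESHOLD = 5  # assumed value from the 5-10 range discussed

def route_transaction(txn, worker_queues, apply_locally):
    """worker_queues: list of lists acting as per-worker FIFO queues."""
    # Prefer the least-loaded worker with spare capacity.
    best = min(worker_queues, key=len)
    if len(best) < PA_QUEUE_THRESHOLD:
        best.append(txn)
        return "dispatched"
    # All workers already hold >= threshold transactions: the leader applies
    # this one itself, stalling dispatch for its duration (Dilip's concern)
    # but keeping the pipeline draining (Amit's deadlock scenario).
    apply_locally(txn)
    return "applied-by-leader"
```

The threshold trades Dilip's head-of-line-blocking worry against Amit's TCP-timeout scenario: the leader only becomes a consumer once every queue is deep enough that dispatch stalls are unavoidable anyway.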

Ashutosh Bapat added a smaller concern: if the leader errors while applying, all parallel workers must restart to keep dependency state consistent — a larger blast radius than pure dispatch.

7. The Internal Message Protocol

A non-trivial protocol design emerged around how the leader signals the parallel worker; the message format went through three iterations before settling on the Option 2 format that Amit accepted. Beyond the dependency notification (PA_MSG_XACT_DEPENDENCY), a small set of additional internal messages rounds out the protocol.

8. Performance Results Summary

| Test | HEAD | Parallel Apply | Speedup |
| --- | --- | --- | --- |
| pgbench sync repl, 40 workers | 5435 TPS | 30046 TPS | 5.53x |
| 10-min pgbench catch-up, 16 workers | 46 min | 12.3 min | 3.73x |
| Random update catch-up, 8 workers (Konstantin) | 8:30 | 1:30 | 5.7x |
| 1M update (v4, w=4, with RI) | 17.2 s | 8.9 s | 1.93x |
| 1M update, REPLICA IDENTITY FULL, w=16 | ~7200 s (extrapolated) | 718 s | ~10x |

Notably, at w=0 (parallel apply disabled but dependency tracking still running) there is a ~7% regression versus HEAD; this is a known issue slated for optimization.
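As a quick sanity check, the speedup column follows directly from the raw numbers (8:30 and 1:30 read as minutes:seconds; small rounding differences such as 3.74 here versus the reported 3.73x are expected):

```python
# Arithmetic cross-check of the speedup column in the table above.
rows = [
    ("pgbench sync repl",       30046 / 5435),           # TPS: higher is better
    ("10-min catch-up",         46 / 12.3),              # minutes: lower is better
    ("random update catch-up",  (8 * 60 + 30) / (1 * 60 + 30)),
    ("1M update, v4 w=4",       17.2 / 8.9),
    ("1M update, RI FULL w=16", 7200 / 718),
]
for name, ratio in rows:
    print(f"{name}: {ratio:.2f}x")
```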

9. Key Open/Resolved Design Points

10. Authority and Influence Map

Amit Kapila (committer, logical replication area owner) is the design lead and final arbiter; his architectural judgments (e.g., rejecting the pipelining analogy, structuring the patch split, accepting Option 2 message format) consistently close debates. Hou Zhijie and Kuroda Hayato (Fujitsu) are the implementation drivers across 17+ patch versions. Konstantin Knizhnik provides sharp independent validation and alternative-approach benchmarking (prefetch vs parallel apply), and his deadlock-risk observations carry significant weight. Dilip Kumar (committer) supplies architectural critique on the producer/consumer pattern. Tomas Vondra (committer) pushes for smaller patches, better instrumentation (writeset statistics view), TAP-level crash tests, and draws the explicit MySQL writeset analogy. Shveta Malik, Peter Smith provide detailed code-level review. Álvaro Herrera's question about the expected speedup magnitude is the kind of framing question that reveals whether this is incremental or transformational work — the answer (3.5x–5.5x) puts it squarely in the transformational category.