Streaming replication and WAL archive interactions

First seen: 2026-05-03 14:04:09+00:00 · Messages: 3 · Participants: 1

Latest Update

2026-05-06 · opus 4.7

Core Problem: Coordinating WAL Archiving Across Primary and Standbys

Architectural Context

PostgreSQL's WAL archiving (archive_mode/archive_command) and streaming replication were designed as largely independent subsystems. When archive_mode=always is set on a standby, both the primary and the standby(s) will independently attempt to archive WAL segments they receive. This produces several well-known operational hazards:

  1. Redundant archival work: The same WAL segment is archived N times in a cluster with N servers, wasting CPU, network, and archive storage I/O (idempotency is the archive_command author's problem).
  2. Race conditions: Two nodes racing to archive the same segment name can corrupt the archive if the archive_command is not carefully written to be atomic.
  3. Divergent .ready/.done state: archive_status/ flag files on the standby are maintained independently of what the primary has actually archived. A standby promoted after the primary has already archived segment X still holds a .ready for X, causing re-archival.
  4. Operator burden: Installations (as the OP describes, "tens of thousands of clusters") resort to external daemons to synchronize archiving state between upstream and downstream, which is fragile and duplicates logic that belongs in the server.
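The atomicity hazard in point 2 is conventionally defused with the temp-file-plus-rename pattern, since rename(2) is atomic within a filesystem. A minimal C sketch of that pattern (the function and path handling are illustrative, not PostgreSQL code; a production archive_command must also fsync the file and its directory before reporting success):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Copy src into archive_dir under final_name, writing to a unique
 * temporary name first and publishing via rename(2), so that two
 * nodes racing on the same segment never expose a half-written file.
 */
static int
archive_segment(const char *src, const char *archive_dir,
                const char *final_name)
{
    char    tmp_path[1024];
    char    final_path[1024];
    char    buf[8192];
    ssize_t n = 0;
    int     in, out;

    snprintf(final_path, sizeof(final_path), "%s/%s",
             archive_dir, final_name);

    /* If the segment is already archived, succeed idempotently. */
    if (access(final_path, F_OK) == 0)
        return 0;

    /* Unique temp name: pid separates local racers; a real command
     * would also embed a node identifier for cross-node uniqueness. */
    snprintf(tmp_path, sizeof(tmp_path), "%s/%s.tmp.%d",
             archive_dir, final_name, (int) getpid());

    if ((in = open(src, O_RDONLY)) < 0)
        return -1;
    if ((out = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0600)) < 0)
    {
        close(in);
        return -1;
    }
    while ((n = read(in, buf, sizeof(buf))) > 0)
        if (write(out, buf, n) != n)
        {
            n = -1;
            break;
        }
    close(in);
    if (close(out) != 0)
        n = -1;
    if (n < 0 || rename(tmp_path, final_path) != 0)
    {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

The rename step is what makes concurrent archivers safe: the file appears under its final name only once fully written, and a pre-existing final name is treated as success rather than overwritten.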

The feature being discussed — informally archive_mode=shared — aims to let the primary tell the standby which segments it has already archived, so the standby can mark them .done without re-archiving, while still allowing the standby to take over archiving duties on promotion or when the primary is unable to archive.

Patch Evolution and What v5 Changes

The OP (smallkeen@gmail.com) is iterating on an existing v4 patch from earlier in this (longer) thread. The v5 revision makes four concrete architectural changes:

  1. Wire format: TLI+Segno instead of full WAL filename. The previous patch apparently transmitted a 24-character WAL filename over the replication protocol. Sending (TimeLineID, XLogSegNo) — effectively 12 bytes — is both cheaper and more type-safe: the standby can locally render the filename using its own segment size, eliminating any risk of protocol/representation drift if segment-size assumptions diverge. This is the correct layering: the wire protocol speaks in logical identifiers, not filesystem artifacts.

  2. Shifting work from walreceiver into the archiver. In v4, walreceiver itself was calling XLogArchiveForceDone() on receipt of upstream notifications. v5 moves this toward the archiver process. This is architecturally cleaner because:

    • walreceiver is latency-sensitive (it's on the WAL receive path); filesystem operations to rename .ready to .done don't belong there.
    • The archiver already owns the archive_status/ directory conceptually; centralizing writes there avoids two processes racing on the same flag files.
  3. Shared memory channel between walreceiver and archiver. Previously, the only IPC between these was implicit (via the filesystem). A dedicated shmem structure lets walreceiver hand off (TLI, segno) notifications to the archiver cheaply, and allows the archiver to publish state (e.g., for pg_stat_archiver).

  4. Force-done semantics for segments outside current history. v5 proposes that any segment not on the standby's timeline history should simply be XLogArchiveForceDone()'d, replacing the "awkward switchpoint calculation" in the timeline-switch branch.
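To make point 1 concrete: the standby can reconstruct the 24-character file name locally from the two logical identifiers plus its own segment size. The sketch below mirrors the layout of PostgreSQL's XLogFileName() macro (TLI, then the high and low 32-bit halves of the segment position, each as 8 hex digits); the standalone helper name and MAXFNAMELEN constant are illustrative:

```c
#include <stdint.h>
#include <stdio.h>

#define MAXFNAMELEN 64

/*
 * Render a WAL segment file name from (TimeLineID, segment number).
 * wal_segsz_bytes is the node-local segment size (16 MB by default);
 * because it is supplied locally, only (tli, segno) need cross the wire.
 */
static void
wal_file_name(char *fname, uint32_t tli, uint64_t segno,
              uint64_t wal_segsz_bytes)
{
    uint64_t segs_per_id = UINT64_C(0x100000000) / wal_segsz_bytes;

    snprintf(fname, MAXFNAMELEN, "%08X%08X%08X",
             tli,
             (uint32_t) (segno / segs_per_id),
             (uint32_t) (segno % segs_per_id));
}
```

With the default 16 MB segments there are 256 segments per 4 GB "xlogid", so segment number 256 on timeline 1 renders as 000000010000000100000000.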
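The shmem channel in point 3 could be as simple as a single-producer/single-consumer ring: walreceiver pushes (TLI, segno) notifications, the archiver drains them. The sketch below is a guess at the shape under that assumption; all names are hypothetical, and the real patch would live in PostgreSQL shared memory behind spinlocks or atomics with barriers rather than bare fields:

```c
#include <stdbool.h>
#include <stdint.h>

#define ARCHIVE_NOTIFY_QUEUE_SIZE 64

typedef struct ArchivedSegment
{
    uint32_t tli;               /* timeline the segment belongs to */
    uint64_t segno;             /* logical segment number */
} ArchivedSegment;

typedef struct ArchiveNotifyQueue
{
    uint32_t head;              /* next slot to write (walreceiver) */
    uint32_t tail;              /* next slot to read (archiver) */
    ArchivedSegment slots[ARCHIVE_NOTIFY_QUEUE_SIZE];
} ArchiveNotifyQueue;

/* walreceiver side: returns false when the queue is full. */
static bool
notify_archived(ArchiveNotifyQueue *q, uint32_t tli, uint64_t segno)
{
    uint32_t next = (q->head + 1) % ARCHIVE_NOTIFY_QUEUE_SIZE;

    if (next == q->tail)
        return false;           /* full: caller retries later */
    q->slots[q->head].tli = tli;
    q->slots[q->head].segno = segno;
    q->head = next;
    return true;
}

/* archiver side: returns false when no notification is pending. */
static bool
consume_archived(ArchiveNotifyQueue *q, ArchivedSegment *out)
{
    if (q->tail == q->head)
        return false;
    *out = q->slots[q->tail];
    q->tail = (q->tail + 1) % ARCHIVE_NOTIFY_QUEUE_SIZE;
    return true;
}
```

A bounded queue also gives the design a natural fallback: if the queue overflows, the standby simply archives the segment itself, which is always safe, just redundant.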

The Four Open Design Questions

The OP explicitly calls out four unresolved decisions. Each has real architectural weight:

(1) pg_stat_archiver on standby: locally compute vs. mirror upstream. If the standby merely re-broadcasts the primary's stats, cascading replication breaks cleanly: a cascaded standby reports its grandparent's stats, which is meaningless. Computing locally (the OP's preference) makes pg_stat_archiver describe "what this node did," which is the semantically clean answer and is what monitoring tooling (Prometheus exporters, check_postgres, etc.) will expect. The "less if-else" argument is secondary but real: special-casing stat propagation based on archive_mode value invites bugs.

(2) Handling *.backup.ready and *.partial.ready on the standby. These files correspond to:

    • *.backup — backup history files, small text files written at the end of a base backup recording its start/stop LSNs and label.
    • *.partial — the last, incompletely filled WAL segment of an old timeline, renamed with a .partial suffix at promotion so it cannot be confused with the complete segment of the same name on the new timeline.

On a standby in archive_mode=shared, neither of these was produced by this node's own activity — they're artifacts of a primary-side operation. XLogArchiveForceDone() on them is safe iff the upstream is known to have archived them; otherwise the standby risks silently dropping archival responsibility. This is the subtle correctness question: force-done must be predicated on positive confirmation from upstream, not mere assumption.
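That correctness rule can be stated as a predicate: mark a segment .done only on positive confirmation. The function below is an illustrative restatement, not the patch's interface; the flat high-water-mark model of "latest confirmed (TLI, segno)" is an assumption for the sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * May this standby force-done (seg_tli, segno)?  Only if upstream has
 * positively confirmed archiving at least that far on the same
 * timeline.  Absent confirmation, the answer is always no: assuming
 * the primary archived it risks silently losing the segment.
 */
static bool
may_force_done(uint32_t seg_tli, uint64_t segno,
               bool have_confirmation,
               uint32_t confirmed_tli, uint64_t confirmed_segno)
{
    if (!have_confirmation)
        return false;           /* assumption alone is not enough */
    if (seg_tli != confirmed_tli)
        return false;           /* cross-timeline cases need history checks */
    return segno <= confirmed_segno;
}
```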

(3) Switchpoint calculation at timeline switch. The current code apparently computes the exact LSN at which a timeline diverged and uses that to decide archive responsibility for segments spanning the switch. The OP argues this can be replaced with a simpler rule: "if a segment is not on my timeline history, force-done it." This is attractive — timeline history files are authoritative and already present on the standby — but it needs careful thought about the segment that contains the switchpoint (which legitimately belongs to both timelines under different names).
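The proposed simpler rule can be sketched as a membership test against the timeline history. The model below, with explicit (tli, first_segno, last_segno) ranges, is an illustrative simplification: real code works with LSN switchpoints from the history file (cf. tliOfPointInHistory()). The boundary caveat shows up directly: the last segment of one timeline equals the first segment of the next, because the segment containing the switchpoint appears on both.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct HistoryRange
{
    uint32_t tli;
    uint64_t first_segno;
    uint64_t last_segno;        /* segment containing the switchpoint */
} HistoryRange;

/*
 * Is (tli, segno) on this node's timeline history?  If not, the
 * proposal is to XLogArchiveForceDone() it rather than compute an
 * exact switchpoint.
 */
static bool
segment_on_history(uint32_t tli, uint64_t segno,
                   const HistoryRange *hist, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (hist[i].tli == tli &&
            segno >= hist[i].first_segno &&
            segno <= hist[i].last_segno)
            return true;
    return false;               /* candidate for force-done */
}
```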

(4) Centralizing XLogArchiveForceDone in the archiver. Currently split between walreceiver and archiver. Consolidation in the archiver is the right direction (per point 2 above) — it makes the archiver the single writer of archive_status/.

Thread State

This thread fragment shows only the OP's contributions (with two near-duplicate resends of the v5 presentation, likely due to a mail delivery retry). No committer has responded in the provided messages. The questions posed are substantive design questions that require input from people familiar with the archiver/walreceiver code — historically Fujii Masao, Michael Paquier, Kyotaro Horiguchi, and Robert Haas have owned pieces of this area.

Implications

If accepted, this feature addresses a long-standing operational pain point that large fleet operators and cloud providers (Yandex being the OP's likely employer, given the operational scale described) have been working around externally. The design is sound in direction, but the four open questions are exactly the ones that will determine whether the patch is committable: each touches the invariant that no WAL segment is ever marked .done unless it has been successfully archived somewhere. Violating that invariant silently breaks PITR, which is why this feature has historically been difficult to land.