Streaming Replication and WAL Archive Interactions: The Shared Archive Mode Revival
Core Problem: WAL Archival Gaps in HA Clusters
In PostgreSQL HA deployments, there is a well-known race between streaming replication and WAL archival. The primary streams WAL to standbys asynchronously with respect to archival: a WAL segment can be fully replicated (and even replayed on the standby) before archive_command/archive_library on the primary has successfully copied it to the archive. If the primary crashes permanently at that moment, the cluster is promoted to a standby, but the unarchived segments on the lost primary are gone. The surviving standby has the data (in pg_wal/), but from the archive's perspective there is a gap in the PITR timeline.
Andrey Borodin reports this affects 1–2% of clusters during large datacenter outages at Yandex — a concrete, quantified operational pain point at very large fleet scale (tens of thousands of clusters, per Mikhail Zhilin's later message). Harinath Kanchu confirms the same pattern at Apple. This is not a theoretical concern: it is a recurring, measurable data-loss risk for PITR/backup guarantees.
Why Existing Workarounds Are Insufficient
- archive_mode=always — Each standby independently archives every segment. This works but is wasteful: in cloud object-store backup tools (WAL-G in particular), the archive path involves a GET to check whether the segment is already archived, then decrypt-and-compare, then potentially a PUT. Andrey notes that switching to HEAD would be cheaper but still costs per-segment API calls across every standby. At fleet scale this is a significant infrastructure cost.
- External coordination (Patroni, PGConsul, custom logic) — HA tooling tries to re-archive from the standby after failover, but the WAL may already have been recycled from pg_wal/ because the standby has no notion that "this segment is not yet in the archive, don't recycle it." This is precisely the invariant the server must maintain; it cannot be bolted on externally without races.
- Distributed coordination in the archive tool — Would require consensus infrastructure inside archive_command implementations, duplicating what the server already knows.
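The fleet-scale cost of archive_mode=always is simple arithmetic. A minimal sketch, with an illustrative function name and illustrative parameters (the thread gives fleet scale only as "tens of thousands of clusters"):

```c
/* Back-of-envelope sketch: under archive_mode=always, every standby issues
 * at least one object-store request per segment just to discover that the
 * segment is already archived. Under shared mode, none of these presence
 * checks happen: only the primary archives, standbys merely track state.
 * All numbers and names here are illustrative, not from the thread. */
unsigned long long redundant_checks(unsigned long long segments_per_day,
                                    unsigned standbys_per_cluster,
                                    unsigned long long clusters)
{
    return segments_per_day * standbys_per_cluster * clusters;
}
```

Even at one HEAD per segment, a hypothetical 1,000 segments/day across two standbys on 10,000 clusters is 20 million redundant API calls a day.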
The architectural insight is that only the server knows which WAL segments are safely archived, so only the server can correctly gate WAL recycling on that fact. Hence the revival of Heikki Linnakangas' ~2015 patch introducing archive_mode=shared.
The archive_mode=shared Design
The semantic contract of shared mode:
- The primary performs archiving as usual.
- The primary communicates archival progress to standbys over the replication stream (initially as WAL filenames; v5 shrinks this to TLI + segno for efficiency).
- Standbys use this information to maintain their own archive_status/ directory — marking segments .done only when the upstream confirms archival. XLogArchiveCheckDone() on the standby then correctly refuses to allow recycling of segments that are not yet confirmed archived upstream.
- On failover, the new primary already has an accurate archive_status/ and can resume archiving from exactly the right point — no gap.
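Concretely, the standby's bookkeeping reduces to creating .done markers named after WAL segments. A minimal sketch of the naming, assuming the default 16 MB segment size; status_done_name is a hypothetical helper, not the patch's code (PostgreSQL itself uses the XLogFileName macro and XLogArchiveForceDone()):

```c
#include <stdio.h>
#include <stddef.h>

/* Default WAL segment size and segments per 4 GB "xlogid" unit. */
#define WAL_SEGSZ (16 * 1024 * 1024)
#define SEGS_PER_XLOGID (0x100000000ULL / WAL_SEGSZ)

/* Illustrative sketch: compute the archive_status/ marker filename a standby
 * would create once the upstream reports (tli, segno) as archived. The name
 * mirrors the WAL segment file name: TLI, then high and low 32-bit halves of
 * the segment position. */
void status_done_name(char *buf, size_t len,
                      unsigned tli, unsigned long long segno)
{
    snprintf(buf, len, "archive_status/%08X%08X%08X.done",
             tli,
             (unsigned) (segno / SEGS_PER_XLOGID),
             (unsigned) (segno % SEGS_PER_XLOGID));
}
```

Once such a marker exists, XLogArchiveCheckDone() on the standby treats the segment as recyclable; until then, recycling is deferred.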
This is architecturally cleaner than archive_mode=always because archival work is not duplicated; the standby only tracks state, not bytes.
Key Technical Issues Raised
1. The XLogArchiveCheckDone() Early-Return Bug (Jaroslav Novikov)
Jaroslav identifies a subtle but critical correctness issue. The current logic:
if (!XLogArchivingAlways() &&
GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
return true;
says: if we are in archive recovery and not using archive_mode=always, treat WAL as deletable without consulting .ready/.done markers. This is correct for archive_mode=on (nobody archives during recovery, so blocking on markers would deadlock recycling). But for archive_mode=shared it is exactly wrong: shared mode depends on the .ready/.done files as the mechanism that defers recycling until the upstream confirms archival. The proposed fix:
if (!XLogArchivingAlways() &&
XLogArchiveMode != ARCHIVE_MODE_SHARED &&
GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
return true;
This touches a fundamental invariant: the early return existed because archive status files were meaningless on standbys; shared mode changes them to be authoritative on standbys. The fix is straightforward but illustrates how introducing a new archive mode ripples through assumptions scattered across xlog.c.
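The corrected predicate can be isolated as pure logic. A sketch, with hypothetical names standing in for the GUC and recovery-state checks in xlogarchive.c:

```c
#include <stdbool.h>

/* Illustrative enum mirroring the archive_mode GUC values, with the
 * proposed 'shared' added. */
typedef enum
{
    ARCHIVE_MODE_OFF,
    ARCHIVE_MODE_ON,
    ARCHIVE_MODE_ALWAYS,
    ARCHIVE_MODE_SHARED
} ArchiveMode;

/* Sketch of Jaroslav's corrected early return: true means "deletable without
 * consulting .ready/.done markers". During archive recovery this must fire
 * for mode=on (nobody archives, markers are meaningless) but must NOT fire
 * for always or shared, where the markers are authoritative. */
bool early_return_deletable(ArchiveMode mode, bool in_archive_recovery)
{
    return mode != ARCHIVE_MODE_ALWAYS &&
           mode != ARCHIVE_MODE_SHARED &&
           in_archive_recovery;
}
```

The asymmetry is the whole bug: before shared mode, "not always, in recovery" safely implied "markers don't matter here".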
2. Timeline Switch Handling
Timeline switches are the traditional minefield for any cross-node WAL bookkeeping. Segments on the old timeline that the standby never fully received — or segments that diverge at the switch point — need careful treatment. Mikhail Zhilin (v5) questions whether the awkward switchpoint calculation is needed at all, proposing that segments not in the server's timeline history should simply be XLogArchiveForceDone()'d. This is a reasonable simplification: if a segment is off our history, we are not responsible for archiving it; forcing .done prevents those files from wedging recycling forever.
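Mikhail's simplification amounts to a membership test against the node's timeline history. A sketch under the assumption that the parsed .history file is available as an array of TLIs; both function names are illustrative:

```c
#include <stdbool.h>
#include <stddef.h>

/* Is this timeline one of our ancestors (or our current timeline)?
 * 'history' stands in for the parsed timeline-history file. */
bool tli_in_history(unsigned tli, const unsigned *history, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (history[i] == tli)
            return true;
    return false;
}

/* Sketch of the proposed rule: a segment whose timeline is off our history
 * is not ours to archive, so force it .done rather than let a stale .ready
 * marker wedge recycling forever. */
bool should_force_done(unsigned seg_tli, const unsigned *history, size_t n)
{
    return !tli_in_history(seg_tli, history, n);
}
```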
Andrey's second patch in the original split specifically addresses timeline switches in archive-status propagation, suggesting the area is known to be delicate.
3. Directory-Scan Cost
The third patch avoids scanning archive_status/ when unnecessary. In WAL-G-style setups where archival is fast and segments churn quickly, repeatedly readdir()'ing archive_status/ on every checkpoint becomes expensive. This is pure optimization but matters at scale.
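The shape of such an optimization is a watermark comparison: rescan the directory only when the confirmed-archived horizon has actually advanced. A sketch with illustrative field and function names, not the patch's:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical progress tracker: compare the upstream-confirmed horizon
 * against the horizon as of our last archive_status/ scan, and skip the
 * readdir() entirely when nothing has changed. */
typedef struct
{
    uint64_t confirmed_segno;   /* latest segno the upstream reports archived */
    uint64_t scanned_segno;     /* horizon as of our last directory scan */
} ArchiveProgress;

bool need_status_rescan(const ArchiveProgress *p)
{
    return p->confirmed_segno > p->scanned_segno;
}

void note_scan_complete(ArchiveProgress *p)
{
    p->scanned_segno = p->confirmed_segno;
}
```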
4. v5 Structural Changes (Mikhail Zhilin)
Mikhail's v5 moves the design in several directions:
- Compact wire format: send (TLI, segno) instead of full filenames. Cheaper and version-stable.
- Shared memory between walreceiver and archiver: decouples receiving the upstream archival report from acting on it. The walreceiver's hot path does not need to do filesystem work.
- Archiver-centric XLogArchiveForceDone: consolidates responsibility for marking segments done into one process rather than splitting it between walreceiver and archiver.
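The wire-format saving is easy to see: a fixed 12-byte (TLI, segno) pair versus a 24-character filename. A sketch of one plausible encoding in network byte order — the actual v5 format may differ:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative encoder for a compact archival-progress report:
 * 4-byte TLI followed by 8-byte segment number, big-endian, 12 bytes total
 * (versus 24 bytes for a WAL file name like 000000010000000000000042). */
size_t encode_archive_report(uint8_t *buf, uint32_t tli, uint64_t segno)
{
    for (int i = 0; i < 4; i++)
        buf[i] = (uint8_t) (tli >> (8 * (3 - i)));
    for (int i = 0; i < 8; i++)
        buf[4 + i] = (uint8_t) (segno >> (8 * (7 - i)));
    return 12;
}
```

Besides size, this form is version-stable: it does not bake the segment-size-dependent filename layout into the replication protocol.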
The open questions Mikhail raises are all about where responsibility lives:
- pg_stat_archiver on the standby: should it reflect the standby's own view (own rows, cascading-friendly) or be a pass-through of the upstream's stats? Mikhail prefers native per-node stats for monitoring sanity and to avoid if-else branches. This is the right call — monitoring tools already expect pg_stat_archiver to describe the local node.
- .backup.ready/.partial.ready on the standby: these can be force-done'd, since they are artifacts only the primary meaningfully archives.
- Centralizing force-done in the archiver: a cleaner locus of responsibility; the walreceiver just publishes upstream state into shared memory and the archiver reacts.
Related Proposal: Expose last_archived_wal on Standby
Harinath Kanchu's request is pragmatic and lower-risk than full shared mode: just surface the primary's last_archived_wal on the standby (e.g., via keep-alive piggybacking, visible in pg_stat_wal_receiver). This would let external HA tools make correct decisions without the server-side archive_status/ plumbing. It is essentially the observability half of shared mode without the enforcement half. If shared mode stalls, this alone would close a significant fraction of the operational gap.
Architectural Significance
This thread represents the convergence of three things:
- A 10+ year-old Heikki patch that was correct in concept but never finished, now reopened with production evidence from multiple large operators (Yandex, Apple, and by extension Greenplum's downstream fork).
- Cross-pollination with Greenplum: the open-gpdb fork has carried a working variant. Upstreaming the refined version is a realistic path.
- A trend toward making PostgreSQL's HA story self-sufficient: reducing the surface area that external tooling must implement (and implement incorrectly).
The real committer-level design question is whether shared mode's added complexity in xlogarchive.c/xlog.c — with its invariant changes to archive_status/ semantics on standbys — is justified vs. the simpler "expose archival LSN on the standby" route. Shared mode is strictly more powerful (it prevents gaps rather than merely reporting that they will happen), but it touches delicate code paths around timeline switches and recycling.
Participant Weight
- Heikki Linnakangas (original patch author, committer) — has not yet re-engaged in this revival; his historical design is the baseline.
- Andrey Borodin — well-known contributor, WAL-G maintainer, Yandex. Brings both code and large-fleet operational data.
- Mikhail Zhilin — drove v5 with substantive architectural refactoring (shared memory, compact wire format). The most technically consequential post in the thread.
- Jaroslav Novikov — identified the XLogArchiveCheckDone() correctness issue; Yandex colleague of Andrey.
- Harinath Kanchu (Apple) — corroborating operator, advocate for the minimal variant.