Streaming replication and WAL archive interactions

First seen: 2026-02-12 06:56:10+00:00 · Messages: 6 · Participants: 4

Latest Update

2026-05-06 · opus 4.7

Streaming Replication and WAL Archive Interactions: The Shared Archive Mode Revival

Core Problem: WAL Archival Gaps in HA Clusters

In PostgreSQL HA deployments, there is a well-known race between streaming replication and WAL archival. The primary streams WAL to standbys asynchronously with respect to archival: a WAL segment can be fully replicated (and even replayed on the standby) before archive_command/archive_library on the primary has successfully copied it to the archive. If the primary is lost permanently at that moment, a standby is promoted, but the unarchived segments on the lost primary are gone. The surviving standby still has the data (in pg_wal/), but from the archive's perspective there is a gap in the WAL sequence needed for PITR.
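The size of that gap can be pictured with a minimal sketch. This is an illustrative model only, not PostgreSQL code: the struct, its fields, and the function name are hypothetical, and progress is counted in whole WAL segments.

```c
#include <stdint.h>

/* Hypothetical snapshot of one primary/standby pair, in whole WAL segments. */
typedef struct ClusterSnapshot
{
    uint64_t last_archived_seg;  /* newest segment the primary's archiver finished */
    uint64_t last_received_seg;  /* newest complete segment held by the standby */
} ClusterSnapshot;

/*
 * Number of segments that would be missing from the archive if the primary
 * were lost right now: everything the standby holds that the archive lacks.
 */
static uint64_t
archive_gap_segments(const ClusterSnapshot *s)
{
    if (s->last_received_seg <= s->last_archived_seg)
        return 0;               /* archive is caught up with replication */
    return s->last_received_seg - s->last_archived_seg;
}
```

With the archiver at segment 100 and the standby at segment 103, the function reports a gap of three segments that exist only in the standby's pg_wal/ and may be recycled at any checkpoint.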

Andrey Borodin reports this affects 1–2% of clusters during large datacenter outages at Yandex — a concrete, quantified operational pain point at very large fleet scale (tens of thousands of clusters, per Mikhail Zhilin's later message). Harinath Kanchu confirms the same pattern at Apple. This is not a theoretical concern: it is a recurring, measurable data-loss risk for PITR/backup guarantees.

Why Existing Workarounds Are Insufficient

  1. archive_mode=always — Each standby independently archives every segment. This works but is wasteful: in cloud object-store backup tools (WAL-G in particular), the archive path involves GET to check whether the segment is already archived, then decrypt-and-compare, then potentially PUT. Andrey notes that switching to HEAD would be cheaper but still costs per-segment API calls across every standby. At fleet scale this is a significant infrastructure cost.

  2. External coordination (Patroni, PGConsul, custom logic) — HA tooling tries to re-archive from the standby after failover, but the WAL may already have been recycled from pg_wal/ because the standby has no notion that "this segment is not yet in the archive, don't recycle it." This is precisely the invariant the server must maintain; it cannot be bolted on externally without races.

  3. Distributed coordination in the archive tool — Would require consensus infrastructure inside archive_command implementations, duplicating what the server already knows.

The architectural insight is that only the server knows which WAL segments are safely archived, so only the server can correctly gate WAL recycling on that fact. Hence the revival of Heikki Linnakangas' ~2015 patch introducing archive_mode=shared.

The archive_mode=shared Design

The semantic contract of shared mode:

This is architecturally cleaner than always because archival work is not duplicated; the standby only tracks state, not bytes.

Key Technical Issues Raised

1. The XLogArchiveCheckDone() Early-Return Bug (Jaroslav Novikov)

Jaroslav identifies a subtle but critical correctness issue. The current logic:

if (!XLogArchivingAlways() &&
    GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
    return true;

says: if we are in archive recovery and not using archive_mode=always, treat WAL as deletable without consulting .ready/.done markers. This is correct for archive_mode=on (nobody archives during recovery, so blocking on markers would deadlock recycling). But for archive_mode=shared it is exactly wrong: shared mode depends on the .ready/.done files as the mechanism that defers recycling until the upstream confirms archival. The proposed fix:

if (!XLogArchivingAlways() &&
    XLogArchiveMode != ARCHIVE_MODE_SHARED &&
    GetRecoveryState() == RECOVERY_STATE_ARCHIVE)
    return true;

This touches a fundamental invariant: the early return existed because archive status files were meaningless on standbys; shared mode changes them to be authoritative on standbys. The fix is straightforward but illustrates how introducing a new archive mode ripples through assumptions scattered across xlog.c.
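The effect of the extra condition can be seen in a small standalone model. This is a sketch, not the real XLogArchiveCheckDone(): the enum and function names are hypothetical, the .ready/.done markers are reduced to a single enum value, and the with_fix flag exists only to compare the two behaviours side by side.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for archive_mode and the archive_status/ marker. */
typedef enum { MODE_OFF, MODE_ON, MODE_ALWAYS, MODE_SHARED } ModelArchiveMode;
typedef enum { MARKER_NONE, MARKER_READY, MARKER_DONE } ModelMarker;

/*
 * Returns true if the segment may be recycled.
 * with_fix selects between the current logic and the proposed one.
 */
static bool
model_check_done(ModelArchiveMode mode, bool in_archive_recovery,
                 ModelMarker marker, bool with_fix)
{
    bool skip_markers;

    /* With archiving off, every segment is deletable. */
    if (mode == MODE_OFF)
        return true;

    /* Current early return: ignore markers during archive recovery. */
    skip_markers = (mode != MODE_ALWAYS) && in_archive_recovery;

    /* Proposed fix: shared mode must always consult the markers. */
    if (with_fix && mode == MODE_SHARED)
        skip_markers = false;

    if (skip_markers)
        return true;

    /* Only a .done marker makes the segment safe to recycle. */
    return marker == MARKER_DONE;
}
```

In this model, a standby in shared mode holding a segment marked .ready is told it may recycle the segment without the fix, and is told to keep it with the fix. Behaviour for archive_mode=on during recovery is unchanged.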

2. Timeline Switch Handling

Timeline switches are the traditional minefield for any cross-node WAL bookkeeping. Segments on the old timeline that the standby never fully received, and segments that diverge at the switch point, need careful treatment. Mikhail Zhilin (v5) questions whether the awkward switchpoint calculation is needed at all, proposing that segments not in the server's timeline history simply be marked done via XLogArchiveForceDone(). This is a reasonable simplification: if a segment is off our history, we are not responsible for archiving it, and forcing .done prevents those files from wedging recycling forever.

Andrey's second patch in the original split specifically addresses timeline switches in archive-status propagation, suggesting the area is known to be delicate.
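Mikhail's simplification can be sketched as a membership test on the timeline history. This is a deliberately reduced model with hypothetical names: it checks only the timeline ID and ignores switch-point positions, which a real implementation would also have to consider for segments on an ancestor timeline past the switch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t ModelTimelineID;

/* What the standby does with a segment's archive status marker. */
typedef enum { ACTION_KEEP_READY, ACTION_FORCE_DONE } ModelAction;

/* Linear search: timeline histories are short, so this is cheap. */
static bool
model_tli_in_history(ModelTimelineID tli,
                     const ModelTimelineID *history, size_t n)
{
    for (size_t i = 0; i < n; i++)
    {
        if (history[i] == tli)
            return true;
    }
    return false;
}

/*
 * Segments on a timeline we descend from stay .ready until archival is
 * confirmed; segments on any other timeline are forced to .done so they
 * cannot block recycling.
 */
static ModelAction
model_classify_segment(ModelTimelineID seg_tli,
                       const ModelTimelineID *history, size_t n)
{
    return model_tli_in_history(seg_tli, history, n)
        ? ACTION_KEEP_READY
        : ACTION_FORCE_DONE;
}
```

For a server whose history is timelines 1, 2, 4, a leftover segment from timeline 3 is classified as ACTION_FORCE_DONE, while a segment from timeline 2 keeps its .ready marker.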

3. Directory-Scan Cost

The third patch avoids scanning archive_status/ when unnecessary. In WAL-G-style setups where archival is fast and segments churn quickly, repeatedly calling readdir() on archive_status/ at every checkpoint becomes expensive. This is purely an optimization, but it matters at scale.
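The thread summary does not say how the patch decides that a scan is unnecessary. One possible approach, shown here purely as an assumption with hypothetical names, is to track in memory how many markers are still pending and skip the directory scan when that count is zero.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory bookkeeping; not taken from the patch. */
typedef struct ModelStatusTracker
{
    size_t pending_ready;    /* markers created but not yet flipped to .done */
    size_t scans_performed;  /* directory scans we actually paid for */
} ModelStatusTracker;

/* Called when a .ready marker is created for a newly received segment. */
static void
model_note_ready(ModelStatusTracker *t)
{
    t->pending_ready++;
}

/* Called when a marker is flipped from .ready to .done. */
static void
model_note_done(ModelStatusTracker *t)
{
    if (t->pending_ready > 0)
        t->pending_ready--;
}

/*
 * Returns true if a scan of archive_status/ was needed (and counts it);
 * returns false when nothing is pending and the scan can be skipped.
 */
static bool
model_maybe_scan(ModelStatusTracker *t)
{
    if (t->pending_ready == 0)
        return false;
    t->scans_performed++;
    return true;
}
```

A counter like this is lost on restart, so a real implementation would need one full scan at startup to rebuild it. That trade-off is part of the assumption, not something the thread states.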

4. v5 Structural Changes (Mikhail Zhilin)

Mikhail's v5 moves the design in several directions:

The open questions Mikhail raises are all about where responsibility lives:

Related Proposal: Expose last_archived_wal on Standby

Harinath Kanchu's request is pragmatic and lower-risk than full shared mode: just surface the primary's last_archived_wal on the standby (e.g., via keep-alive piggybacking, visible in pg_stat_wal_receiver). This would let external HA tools make correct decisions without the server-side archive_status/ plumbing. It is essentially the observability half of shared mode without the enforcement half. If shared mode stalls, this alone would close a significant fraction of the operational gap.
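How an external HA tool might use such a value can be sketched as follows. The assumptions are loud ones: the standby-side last_archived_wal field is only proposed, the function names are hypothetical, the segment size is taken to be the default 16 MiB (256 segments per 4 GiB of WAL), and cross-timeline comparison is treated as out of scope.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumes the default 16 MiB wal_segment_size: 4 GiB / 16 MiB = 256. */
#define MODEL_SEGS_PER_XLOGID 256u

/* Parse a 24-character WAL file name into timeline and segment number. */
static bool
model_parse_wal_name(const char *name, uint32_t *tli, uint64_t *segno)
{
    unsigned int t, log, seg;

    if (strlen(name) != 24 || strspn(name, "0123456789ABCDEF") != 24)
        return false;
    if (sscanf(name, "%8X%8X%8X", &t, &log, &seg) != 3)
        return false;

    *tli = t;
    *segno = (uint64_t) log * MODEL_SEGS_PER_XLOGID + seg;
    return true;
}

/*
 * Number of segments a promoted standby would still have to archive,
 * or -1 if the names are invalid or lie on different timelines.
 */
static int64_t
model_segments_to_rearchive(const char *last_archived, const char *last_local)
{
    uint32_t tli_a, tli_l;
    uint64_t seg_a, seg_l;

    if (!model_parse_wal_name(last_archived, &tli_a, &seg_a) ||
        !model_parse_wal_name(last_local, &tli_l, &seg_l))
        return -1;
    if (tli_a != tli_l)
        return -1;              /* needs timeline history; not modelled */

    return seg_l > seg_a ? (int64_t) (seg_l - seg_a) : 0;
}
```

Given a last archived file of 000000010000000000000010 and a newest local segment of 000000010000000000000013, the tool learns that three segments must be archived from the standby. Nothing in this sketch stops the standby from recycling them first, which is exactly the enforcement half that shared mode adds.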

Architectural Significance

This thread represents the convergence of three things:

  1. A 10+ year-old Heikki patch that was correct in concept but never finished, now reopened with production evidence from multiple large operators (Yandex, Apple, and by extension Greenplum's downstream fork).
  2. Cross-pollination with Greenplum: the open-gpdb fork has carried a working variant. Upstreaming the refined version is a realistic path.
  3. A trend toward making PostgreSQL's HA story self-sufficient: reducing the surface area that external tooling must implement (and implement incorrectly).

The real committer-level design question is whether shared mode's added complexity in xlogarchive.c/xlog.c — with its invariant changes to archive_status/ semantics on standbys — is justified vs. the simpler "expose archival LSN on the standby" route. Shared mode is strictly more powerful (it prevents gaps rather than merely reporting that they will happen), but it touches delicate code paths around timeline switches and recycling.

Participant Weight