Bug? pg_rewind produces unusable but starting database with standby recovery

First seen: 2026-03-18 11:59:35+00:00 · Messages: 2 · Participants: 1

Latest Update

2026-05-18 · claude-opus-4-6

Bug Analysis: pg_rewind Produces Unusable but Starting Database with Standby Recovery

Core Problem

This thread identifies a subtle and dangerous bug in PostgreSQL's recovery infrastructure when pg_rewind is used: a rewound standby can complete startup and accept connections despite having incomplete WAL, only to fail later when users attempt to query data. This is a silent data availability failure — the server appears healthy but is actually in an inconsistent state.

Why This Matters Architecturally

PostgreSQL's recovery system has an invariant: a server should not complete recovery and accept connections unless it has reached a consistent recovery point, the LSN up to which WAL must be replayed before the data files are guaranteed to be mutually consistent (tracked as minRecoveryPoint in the control file). The existing code checks for "WAL ends before consistent recovery point", but the conditions that trigger this error are too narrow, so certain edge cases slip through.
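The invariant boils down to a comparison of two LSNs. The following is a minimal sketch in Python, assuming the textual X/Y LSN format that PostgreSQL prints (two hexadecimal halves of a 64-bit position). The function names parse_lsn and reached_consistency are illustrative only and are not PostgreSQL APIs.

```python
def parse_lsn(text: str) -> int:
    """Convert a textual LSN such as '0/3000060' to a 64-bit integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def reached_consistency(end_of_wal: str, min_recovery_point: str) -> bool:
    """True if replay can get at least as far as the consistent recovery point."""
    return parse_lsn(end_of_wal) >= parse_lsn(min_recovery_point)


# WAL ends at 0/3000060 but consistency requires 0/5000000: startup should fail.
print(reached_consistency("0/3000060", "0/5000000"))  # False
print(reached_consistency("1/0", "0/FFFFFFFF"))       # True
```

A server that finishes recovery while this comparison is false is in the state the thread describes: running, but short of its consistent recovery point.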

The failure mode is particularly insidious because:

  1. The server starts normally — monitoring systems see a healthy instance
  2. Errors only surface when specific tuples/pages are accessed that require WAL replay that never happened
  3. This violates the principle of fail-fast, turning a recoverable operational issue into a potential data integrity problem

Technical Root Cause

How the Bug Manifests

The scenario requires WAL segments to be present on the target but absent from what pg_rewind copies. This happens when:

  1. The source server has recycled certain WAL segments (normal operation)
  2. The target server still has those segments (due to WAL summarizer delaying recycling, or asymmetric wal_keep_size / max_wal_size settings)
  3. pg_rewind doesn't detect the missing WAL because the segments still exist on the target — it doesn't exit early with an error
  4. During recovery, the target has gaps in its WAL stream but the existing recovery checks don't catch this particular gap configuration
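A gap of the kind described in step 4 can be illustrated by checking a WAL directory listing for missing segment numbers. This is only a sketch: it assumes the standard 24-hex-digit WAL file names (timeline, log id, segment) and the default 16 MB segment size, ignores timeline switches, and uses the hypothetical helper names segment_number and find_gaps. It is not what pg_rewind or the startup process actually does.

```python
def segment_number(filename: str, segs_per_xlogid: int = 0x100) -> tuple[int, int]:
    """Return (timeline, segment number) for a 24-hex-digit WAL file name.

    segs_per_xlogid is 0x100 for the default 16 MB segment size (assumption).
    """
    timeline = int(filename[0:8], 16)
    log_id = int(filename[8:16], 16)
    seg = int(filename[16:24], 16)
    return timeline, log_id * segs_per_xlogid + seg


def find_gaps(filenames: list[str]) -> list[int]:
    """List segment numbers missing between the lowest and highest one present.

    Timelines are ignored here for simplicity; a real check would have to
    follow the timeline history.
    """
    present = {segment_number(name)[1] for name in filenames}
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


wal_dir = [
    "000000010000000000000003",
    "000000010000000000000004",
    "000000010000000000000007",  # segments 5 and 6 were never copied
]
print(find_gaps(wal_dir))  # [5, 6]
```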

The WAL summarizer (summarize_wal, available since PG17) exacerbates this by keeping WAL segments around longer on the target, making the asymmetry more likely without any explicit configuration difference between source and target.

The "Backup-from-Replica" Connection

The author notes this could also affect the "backup-from-replica" scenario, where similar conditions of WAL segment availability asymmetry can occur.

Proposed Solutions

Patch 0001: Relax Recovery Consistency Check Conditions

The fix modifies the conditions under which PostgreSQL emits the "WAL ends before consistent recovery point" error. The current preconditions for raising the error are too strict: it fires in only a subset of the cases where WAL is insufficient. By relaxing those preconditions, the patch makes recovery fail (rather than silently succeed) whenever the available WAL doesn't reach the consistent recovery point.

This is described as a simple fix: the infrastructure for detecting and reporting this error already exists and only needs broader triggering conditions. The comment in the existing code apparently already documents the intent to catch these cases; the implementation was simply incomplete.

Patch 0002: Fix pg_rewind's minRecoveryPoint Race Condition

The second patch addresses a race condition in pg_rewind itself where:

  1. pg_rewind traverses WAL files to build a file list
  2. pg_rewind then queries the minRecoveryPoint LSN
  3. Between steps 1 and 2, new WAL segments can be created
  4. The resulting minRecoveryPoint may reference WAL that wasn't captured in step 1

The fix reverses the order: capture minRecoveryPoint before traversing WAL files. This ensures the stated recovery point is always achievable with the copied WAL data.

This is a TOCTOU (time-of-check-to-time-of-use) bug — classic in systems that query state non-atomically. The fix establishes proper ordering to maintain the invariant that minRecoveryPoint ≤ max(available WAL).
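The effect of the ordering can be shown with a toy model. Everything here is hypothetical: FakeSource stands in for a live source server, LSNs are reduced to segment numbers, and the two functions only mimic the order of the steps, not pg_rewind's real code.

```python
class FakeSource:
    """Toy stand-in for a running source server whose WAL keeps growing."""

    def __init__(self) -> None:
        self.segments = [1, 2, 3]  # segment numbers currently on disk

    def advance(self) -> None:
        """Simulate concurrent activity that creates one more segment."""
        self.segments.append(self.segments[-1] + 1)

    def list_wal(self) -> list[int]:
        return list(self.segments)

    def min_recovery_point(self) -> int:
        # Expressed as a segment number to keep the model small.
        return self.segments[-1]


def list_then_query(src: FakeSource) -> tuple[list[int], int]:
    """Original order: build the file list, then read the recovery point."""
    files = src.list_wal()
    src.advance()  # activity lands between the two steps
    return files, src.min_recovery_point()


def query_then_list(src: FakeSource) -> tuple[list[int], int]:
    """Reversed order: read the recovery point, then build the file list."""
    point = src.min_recovery_point()
    src.advance()  # the same activity is now harmless
    return src.list_wal(), point


def achievable(files: list[int], point: int) -> bool:
    """The invariant from the text: minRecoveryPoint <= max(available WAL)."""
    return point <= max(files)


print(achievable(*list_then_query(FakeSource())))  # False
print(achievable(*query_then_list(FakeSource())))  # True
```

With the reversed order, any WAL created in between can only extend the file list past the recorded point, so the recorded point stays reachable.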

Interaction Between Patches

The patches are independent (either can be applied alone), but they complement each other.

Patch 0001 actually exposed 0002 — once the recovery check was broadened, it caught cases where pg_rewind itself was producing inconsistent output. This is a classic example of how better error detection reveals upstream bugs.

Version Impact

Open Questions

As of the last message, this thread has received no responses from other hackers. Key questions that would need community input: