Bug Analysis: pg_rewind Produces Unusable but Starting Database with Standby Recovery
Core Problem
This thread identifies a subtle and dangerous bug in PostgreSQL's recovery infrastructure when pg_rewind is used: a rewound standby can complete startup and accept connections despite having incomplete WAL, only to fail later when users attempt to query data. This is a silent data availability failure — the server appears healthy but is actually in an inconsistent state.
Why This Matters Architecturally
PostgreSQL's recovery system has an invariant: a server should not complete recovery and accept connections unless it has reached a consistent recovery point, that is, the LSN up to which WAL must be replayed before the data files can be considered consistent. The existing code checks for "WAL ends before consistent recovery point", but the conditions for triggering this error are too narrow, allowing certain edge cases to slip through.
The failure mode is particularly insidious because:
- The server starts normally — monitoring systems see a healthy instance
- Errors only surface when specific tuples/pages are accessed that require WAL replay that never happened
- This violates the fail-fast principle, turning a recoverable operational issue into a potential data integrity problem
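The invariant described in this section can be sketched as a toy model. This is only an illustration, not PostgreSQL's actual code: `finish_recovery`, `RecoveryError`, and `parse_lsn` are hypothetical names, and the real check involves more state than two LSNs. The only detail taken from PostgreSQL is the `hi/lo` hexadecimal LSN notation and the error message quoted above.

```python
class RecoveryError(Exception):
    """Raised when recovery must not be allowed to finish (hypothetical)."""


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' in hex into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def finish_recovery(end_of_wal: str, min_recovery_point: str) -> str:
    """Toy model of the invariant: fail fast if replay stopped too early.

    end_of_wal is the last LSN that could be replayed; min_recovery_point
    is the LSN that must be reached before the server may accept
    connections.
    """
    if parse_lsn(end_of_wal) < parse_lsn(min_recovery_point):
        raise RecoveryError("WAL ends before consistent recovery point")
    return "consistent"
```

In this model the bug corresponds to a caller that skips `finish_recovery` in some situations, so the server comes up even though the comparison would have failed.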
Technical Root Cause
How the Bug Manifests
The scenario requires WAL segments to be present on the target but absent from what pg_rewind copies. This happens when:
- The source server has recycled certain WAL segments (normal operation)
- The target server still has those segments (due to the WAL summarizer delaying recycling, or asymmetric wal_keep_size/max_wal_size settings)
- pg_rewind doesn't detect the missing WAL because the segments still exist on the target, so it doesn't exit early with an error
- During recovery, the target has gaps in its WAL stream but the existing recovery checks don't catch this particular gap configuration
The WAL summarizer (summarize_wal, available since PG17) exacerbates this by keeping WAL segments around longer on the target, making the asymmetry more likely without any explicit configuration difference between source and target.
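The segment asymmetry described above can be made concrete with a small sketch. It assumes the default 16 MB WAL segment size and the standard 24-hex-digit segment file naming; the helper names (`required_segments`, `missing_segments`) and the example LSNs are hypothetical, and this is not how pg_rewind itself inspects WAL.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024          # assumed: default segment size
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'hi/lo' in hex into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def segment_name(timeline: int, segno: int) -> str:
    """Build the WAL file name for a segment number on a timeline."""
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


def required_segments(timeline: int, start_lsn: str, end_lsn: str) -> list:
    """List the segments that contain any WAL between the two LSNs."""
    first = parse_lsn(start_lsn) // WAL_SEGMENT_SIZE
    last = parse_lsn(end_lsn) // WAL_SEGMENT_SIZE
    return [segment_name(timeline, n) for n in range(first, last + 1)]


def missing_segments(required: list, available: set) -> list:
    """Return the required segments that are not in the available set."""
    return [name for name in required if name not in available]


# Hypothetical example: the source has recycled the two oldest segments,
# while the target still holds all four.
needed = required_segments(1, "0/3000028", "0/6000100")
source_has = set(needed[2:])
target_has = set(needed)
only_on_target = [s for s in missing_segments(needed, source_has)
                  if s in target_has]
print(only_on_target)
```

Segments that end up in `only_on_target` are the ones the scenario depends on: they are gone from the source, yet their presence on the target keeps pg_rewind from reporting a problem.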
The "Backup-from-Replica" Connection
The author notes this could also affect the "backup-from-replica" scenario, where similar conditions of WAL segment availability asymmetry can occur.
Proposed Solutions
Patch 0001: Relax Recovery Consistency Check Conditions
The fix modifies the conditions under which PostgreSQL emits the "WAL ends before consistent recovery point" error. The current conditions are too strict — they only trigger in a subset of cases where WAL is insufficient. By relaxing these conditions, the patch ensures that recovery properly fails (rather than silently succeeding) when the available WAL doesn't reach the consistent recovery point.
This is described as a simple fix: the infrastructure for detecting and reporting this error already exists, it just needs broader triggering conditions. The comment in the existing code apparently already documents the intent to catch these cases — the implementation was simply incomplete.
Patch 0002: Fix pg_rewind's minRecoveryPoint Race Condition
The second patch addresses a race condition in pg_rewind itself where:
- pg_rewind traverses WAL files to build a file list
- pg_rewind then queries the minRecoveryPoint LSN
- Between steps 1 and 2, new WAL segments can be created
- The resulting minRecoveryPoint may reference WAL that wasn't captured in step 1
The fix reverses the order: capture minRecoveryPoint before traversing WAL files. This ensures the stated recovery point is always achievable with the copied WAL data.
This is a TOCTOU (time-of-check-to-time-of-use) bug — classic in systems that query state non-atomically. The fix establishes proper ordering to maintain the invariant that minRecoveryPoint ≤ max(available WAL).
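The effect of the ordering can be shown with a small simulation. Everything here is hypothetical: `FakeSource` stands in for a live source server that keeps generating WAL, LSNs are plain integers, and the two functions model only the order of the two steps, not pg_rewind's real logic.

```python
class FakeSource:
    """Hypothetical stand-in for a source server that keeps writing WAL."""

    def __init__(self, wal_end: int) -> None:
        self.wal_end = wal_end

    def list_wal_upto(self) -> int:
        """Highest LSN covered by a WAL file listing taken right now."""
        return self.wal_end

    def current_recovery_point(self) -> int:
        """LSN that would be recorded as minRecoveryPoint right now."""
        return self.wal_end

    def advance(self, nbytes: int) -> None:
        """Simulate concurrent activity that generates more WAL."""
        self.wal_end += nbytes


def list_then_query(source: FakeSource, concurrent_write: int) -> tuple:
    """Original order: build the file list, then read the recovery point."""
    copied_upto = source.list_wal_upto()
    source.advance(concurrent_write)      # WAL written between the steps
    recovery_point = source.current_recovery_point()
    return recovery_point, copied_upto


def query_then_list(source: FakeSource, concurrent_write: int) -> tuple:
    """Reordered: read the recovery point, then build the file list."""
    recovery_point = source.current_recovery_point()
    source.advance(concurrent_write)      # WAL written between the steps
    copied_upto = source.list_wal_upto()
    return recovery_point, copied_upto


racy_point, racy_copied = list_then_query(FakeSource(1000), 500)
safe_point, safe_copied = query_then_list(FakeSource(1000), 500)
print("racy order holds invariant:", racy_point <= racy_copied)
print("safe order holds invariant:", safe_point <= safe_copied)
```

With the reordered steps, any WAL generated in between only extends what gets copied, so the recorded recovery point can never exceed the copied WAL in this model.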
Interaction Between Patches
The two patches are independent (either can be applied alone), but they complement each other:
- 0001 makes the server properly detect and reject incomplete recovery states (defense in depth)
- 0002 prevents pg_rewind from creating those incomplete states in the first place (root cause fix)
Patch 0001 actually exposed 0002 — once the recovery check was broadened, it caught cases where pg_rewind itself was producing inconsistent output. This is a classic example of how better error detection reveals upstream bugs.
Version Impact
- The WAL summarizer path affects PG17+
- The asymmetric WAL configuration path can affect earlier versions (pre-17)
- The pg_rewind race condition (0002) is likely present in all versions with pg_rewind
Open Questions
As of the last message, this thread has received no responses from other hackers. Key questions that would need community input:
- Are there other scenarios where the relaxed conditions in 0001 might produce false positives (rejecting valid recoveries)?
- Does the reordering in 0002 introduce any new edge cases (e.g., capturing a minRecoveryPoint that's too old)?
- Should there be additional WAL integrity validation in pg_rewind beyond ordering fixes?