Technical Analysis: Unbounded Wait to Reach Consistent Recovery State After pg_rewind
Core Problem
After pg_rewind is used to resynchronize a former primary with a new primary, the rewound node can get stuck indefinitely waiting to reach a "consistent recovery state" — the point at which it can accept read-only connections. This is a liveness bug with potentially unbounded wait time when the new primary is idle.
Architectural Context
When PostgreSQL starts recovery (whether as a standby or after pg_rewind), it must replay WAL until it reaches the minRecoveryPoint — the LSN beyond which the data directory is guaranteed to be consistent. Only after lastReplayedEndRecPtr >= minRecoveryPoint will the system declare consistency and accept read-only queries (checked in CheckRecoveryConsistency()).
The minRecoveryPoint is a critical invariant: it must represent the end LSN of an actual WAL record, because lastReplayedEndRecPtr can only advance to record-end positions. This is the fundamental contract that pg_rewind violates.
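The consistency condition above can be illustrated with a small model. This is a sketch, not PostgreSQL code: the function names and the sample record-end LSNs are hypothetical, and only the comparison lastReplayedEndRecPtr >= minRecoveryPoint is modeled.

```python
from typing import Iterable, Optional


def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's 'hi/lo' hexadecimal LSN notation to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def first_consistent_lsn(record_end_lsns: Iterable[int],
                         min_recovery_point: int) -> Optional[int]:
    """Return the replay position at which consistency is first declared.

    Replay can only stop at record-end LSNs, so the check is evaluated once
    per replayed record. Returns None if the available records never
    satisfy it.
    """
    for last_replayed_end in record_end_lsns:
        if last_replayed_end >= min_recovery_point:
            return last_replayed_end
    return None


# Hypothetical record-end LSNs; the last record ends exactly on a segment boundary.
available = [parse_lsn("0/03FFFF80"), parse_lsn("0/04000000")]

# A minRecoveryPoint just past the long page header is never reached.
stuck = first_consistent_lsn(available, parse_lsn("0/04000028"))

# Once the primary writes one more record, replay passes the threshold.
resumed = first_consistent_lsn(available + [parse_lsn("0/04000060")],
                               parse_lsn("0/04000028"))
```

With only the existing records, `stuck` is `None`. Consistency is declared only after the extra record ending at 0/04000060 is replayed.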
The Violation
pg_rewind calls pg_current_wal_insert_lsn() on the source (new primary) to determine how far ahead the source's WAL has progressed. This function calls GetXLogInsertRecPtr(), which returns the current WAL insert pointer — the position where the next record will be written. When the source is idle and happens to be at the beginning of a new WAL segment (or page), this pointer points just past the page header (e.g., 0/04000028 for a segment-start long page header of size 40 bytes).
This value is then written into the rewound node's control file as the minRecoveryPoint. The problem: no WAL record ends at this position. It is the first byte after the page header, where the next record will start, not the end of any record. A record that finishes exactly at the page boundary has the boundary itself (0/04000000) as its end LSN. The recovery code compares lastReplayedEndRecPtr against minRecoveryPoint, and since replay of the existing WAL stops short of 0/04000028, the standby must wait until the primary writes another WAL record, whose end will lie beyond this position.
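The gap between "where the next record starts" and "where the last record ended" can be sketched as follows. The model is loosely based on the byte-position conversion helpers in PostgreSQL's xlog.c (XLogBytePosToRecPtr and XLogBytePosToEndRecPtr). The Python names are hypothetical, and the constants assume default 8 kB WAL pages, 16 MB segments, and the header sizes quoted in this analysis.

```python
XLOG_BLCKSZ = 8192                   # assumed default WAL page size
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size
SHORT_PHD = 24                       # SizeOfXLogShortPHD, as quoted above
LONG_PHD = 40                        # SizeOfXLogLongPHD, as quoted above

# Payload bytes per page and per segment once headers are excluded.
USABLE_PER_PAGE = XLOG_BLCKSZ - SHORT_PHD
USABLE_PER_SEGMENT = ((WAL_SEGMENT_SIZE // XLOG_BLCKSZ) * USABLE_PER_PAGE
                      - (LONG_PHD - SHORT_PHD))


def _to_lsn(bytepos: int, end_of_record: bool) -> int:
    """Map a count of usable WAL bytes to an LSN.

    With end_of_record=False the result is where the next record would start
    (page headers are skipped). With end_of_record=True a position exactly on
    a page boundary stays on the boundary, as a record-end LSN does.
    """
    fullsegs, bytesleft = divmod(bytepos, USABLE_PER_SEGMENT)
    first_page_usable = XLOG_BLCKSZ - LONG_PHD
    if end_of_record and bytesleft == 0:
        seg_offset = 0
    elif bytesleft < first_page_usable:
        seg_offset = LONG_PHD + bytesleft
    else:
        bytesleft -= first_page_usable
        fullpages, bytesleft = divmod(bytesleft, USABLE_PER_PAGE)
        seg_offset = XLOG_BLCKSZ + fullpages * XLOG_BLCKSZ + bytesleft
        if not (end_of_record and bytesleft == 0):
            seg_offset += SHORT_PHD
    return fullsegs * WAL_SEGMENT_SIZE + seg_offset


def insert_lsn(bytepos: int) -> int:
    """Position at which the next record would be written."""
    return _to_lsn(bytepos, end_of_record=False)


def record_end_lsn(bytepos: int) -> int:
    """End position of the record that finished at this byte position."""
    return _to_lsn(bytepos, end_of_record=True)


# An idle source whose WAL is filled exactly up to the start of segment 4.
at_boundary = 4 * USABLE_PER_SEGMENT
insert_at_boundary = insert_lsn(at_boundary)    # 0x04000028
end_at_boundary = record_end_lsn(at_boundary)   # 0x04000000
```

Away from page boundaries the two conversions agree. They differ only when the byte position falls exactly on a boundary, which is the case pg_rewind hits on an idle source.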
Why This Is Subtle
The bug manifests only when:
- The source primary's insert position sits exactly at the start of a WAL page/segment when pg_rewind reads it
- No new WAL activity arrives (idle primary)
- The archive_timeout or wal_sender_timeout hasn't triggered new WAL yet
In the reporter's reproduction, a 4+ minute wait occurred until a background RUNNING_XACTS record was eventually emitted. With the primary fully stopped, the wait is unbounded.
Proposed Solutions
Solution 1: Fix the Producer (pg_rewind) — Preferred
Replace GetXLogInsertRecPtr() with GetXLogInsertEndRecPtr() in pg_rewind's source-side LSN acquisition. GetXLogInsertEndRecPtr() returns the end position of the last completed WAL record, which by definition satisfies the invariant that minRecoveryPoint must be a record-end LSN.
Advantages:
- Fixes the root cause at the source
- Maintains the architectural invariant globally
- No downstream code needs to understand page-header edge cases
- Simple, surgical fix
Tradeoff: Older pg_rewind binaries running against newer servers would still exhibit the bug.
Solution 2: Defense-in-Depth in Recovery (v1 Patch)
Adjust minRecoveryPoint in CheckRecoveryConsistency() when it detects that the value is exactly SizeOfXLogShortPHD or SizeOfXLogLongPHD past a page boundary, and move it back to the page start. This is safe because no record can end just past a header: the two positions are separated only by header bytes, so reaching the page start means all WAL up to the adjusted value has been replayed.
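One possible reading of this adjustment is sketched below. It does not reproduce the v1 patch. The function name is hypothetical, the constants assume default 8 kB pages and 16 MB segments, and choosing the header size by page type (long only on the first page of a segment) is an assumption made here so that a valid record end at page offset 40 on an ordinary page is left alone.

```python
XLOG_BLCKSZ = 8192                   # assumed default WAL page size
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size
SHORT_PHD = 24                       # SizeOfXLogShortPHD, as quoted in this analysis
LONG_PHD = 40                        # SizeOfXLogLongPHD, as quoted in this analysis


def adjust_min_recovery_point(min_recovery_point: int) -> int:
    """Move a just-past-the-header minRecoveryPoint back to the page start.

    Any other value is returned unchanged.
    """
    seg_offset = min_recovery_point % WAL_SEGMENT_SIZE
    page_offset = min_recovery_point % XLOG_BLCKSZ

    # Only the first page of a segment carries the long header.
    header_size = LONG_PHD if seg_offset < XLOG_BLCKSZ else SHORT_PHD

    if page_offset == header_size:
        return min_recovery_point - page_offset
    return min_recovery_point


# The value seen in the report is pulled back to the segment boundary,
# which replay of the existing WAL can actually reach.
adjusted = adjust_min_recovery_point(0x04000028)
```

After the adjustment, a standby whose last replayed record ends at 0/04000000 satisfies the consistency check without waiting for new WAL.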
Advantages:
- Protects against older pg_rewind versions
- Defense-in-depth
Disadvantages:
- Treats the symptom, not the cause
- Adds complexity to recovery code that must be understood by future maintainers
- Needs careful audit of all other minRecoveryPoint comparison sites
Recommended Approach (per thread consensus)
Both fixes together: Solution 1 as the primary fix, Solution 2 as a backward-compatibility guard, with appropriate documentation explaining why page-header positions are the only "non-record-end" values that can appear in minRecoveryPoint.
Key Technical Details
Who Else Sets minRecoveryPoint?
- pg_basebackup: Uses backup-end record's EndRecPtr, always a valid record-end LSN ✓
- UpdateMinRecoveryPoint (during recovery): Uses buffer LSNs, which are set from record-end LSNs by construction ✓
- pg_rewind: Uses pg_current_wal_insert_lsn(), which can return a non-record-end position ✗
The WAL Page Header Size Issue
- Long page headers (at segment start): SizeOfXLogLongPHD = 40 bytes → positions ending in 0x28
- Short page headers (other pages): SizeOfXLogShortPHD = 24 bytes → positions ending in 0x18
The reporter's logs consistently show 0/02000028 and 0/04000028 — the segment-start long header offset, confirming the diagnosis.
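The reported values can be decoded to check this. The helper below is a hypothetical sketch that assumes default 8 kB WAL pages and 16 MB segments. It splits an LSN into its segment and page offsets and reports whether it sits immediately after the header of its page.

```python
XLOG_BLCKSZ = 8192                   # assumed default WAL page size
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size
SHORT_PHD = 24                       # SizeOfXLogShortPHD
LONG_PHD = 40                        # SizeOfXLogLongPHD


def parse_lsn(text: str) -> int:
    """Convert PostgreSQL's 'hi/lo' hexadecimal LSN notation to an integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def describe_lsn(text: str) -> dict:
    """Break an LSN into offsets and flag the just-past-the-header case."""
    lsn = parse_lsn(text)
    seg_offset = lsn % WAL_SEGMENT_SIZE
    page_offset = lsn % XLOG_BLCKSZ
    # The first page of each segment has the long header; others the short one.
    header_size = LONG_PHD if seg_offset < XLOG_BLCKSZ else SHORT_PHD
    return {
        "segment_offset": seg_offset,
        "page_offset": page_offset,
        "header_size": header_size,
        "just_past_header": page_offset == header_size,
    }


# The two values from the reporter's logs.
for value in ("0/02000028", "0/04000028"):
    print(value, describe_lsn(value))
```

Both values decode to segment offset 40 with a 40-byte header, that is, the first byte after a segment-start long page header.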
Back-Patching Considerations
This bug affects all supported versions where pg_rewind exists (9.5+). Since it's a liveness bug that can cause unbounded unavailability of read-only connections on a rewound standby, back-patching is warranted. The fix (using GetXLogInsertEndRecPtr()) is minimal and low-risk.