2026-05-20 · claude-opus-4-6

Technical Analysis: Unbounded Wait to Reach Consistent Recovery State After pg_rewind

Core Problem

After pg_rewind is used to resynchronize a former primary with a new primary, the rewound node can get stuck indefinitely waiting to reach a "consistent recovery state" — the point at which it can accept read-only connections. This is a liveness bug with potentially unbounded wait time when the new primary is idle.

Architectural Context

When PostgreSQL starts recovery (whether as a standby or after pg_rewind), it must replay WAL until it reaches the minRecoveryPoint — the LSN beyond which the data directory is guaranteed to be consistent. Only after lastReplayedEndRecPtr >= minRecoveryPoint will the system declare consistency and accept read-only queries (checked in CheckRecoveryConsistency()).

The minRecoveryPoint is a critical invariant: it must represent the end LSN of an actual WAL record, because lastReplayedEndRecPtr can only advance to record-end positions. This is the fundamental contract that pg_rewind violates.

The Violation

pg_rewind calls pg_current_wal_insert_lsn() on the source (new primary) to determine how far ahead the source's WAL has progressed. This function calls GetXLogInsertRecPtr(), which returns the current WAL insert pointer — the position where the next record will be written. When the source is idle and happens to be at the beginning of a new WAL segment (or page), this pointer points just past the page header (e.g., 0/04000028 for a segment-start long page header of size 40 bytes).

This value is then written into the rewound node's control file as the minRecoveryPoint. The problem: no WAL record ends at this position. It is the first byte after the page header, where the next record would start, not the end of any record already written. The last record the standby can replay ends at or before the page boundary (0/04000000 in this example), so lastReplayedEndRecPtr stays below minRecoveryPoint until the primary writes a new WAL record that ends beyond 0/04000028.
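The stall can be illustrated with a small model of the comparison. This is only a sketch: the function names and the record-end LSNs are hypothetical, and the real consistency check in the server considers more than this single comparison.

```python
from typing import Iterable, Optional


def parse_lsn(text: str) -> int:
    """Convert the 'hi/lo' hexadecimal LSN notation to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def first_consistent_record(record_ends: Iterable[str],
                            min_recovery_point: str) -> Optional[str]:
    """Return the end LSN of the first replayed record that satisfies
    lastReplayedEndRecPtr >= minRecoveryPoint, or None if replay runs out
    of WAL first (the standby then keeps waiting)."""
    target = parse_lsn(min_recovery_point)
    for end in record_ends:
        if parse_lsn(end) >= target:
            return end
    return None


# Hypothetical record-end LSNs available from an idle primary: the last
# record ends exactly on the segment boundary.
available = ["0/03FFFF90", "0/04000000"]
print(first_consistent_record(available, "0/04000028"))  # None

# Once the primary writes one more record, consistency can be reached.
print(first_consistent_record(available + ["0/04000060"], "0/04000028"))  # 0/04000060
```

The first call returns None because every replayable record ends below the target, which is the state the rewound node is stuck in.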

Why This Is Subtle

The bug manifests only when:

The source primary's insert pointer sits exactly at the start of a WAL page or segment when pg_rewind queries it
No new WAL activity arrives afterwards (idle primary)
Neither archive_timeout nor background activity (such as a periodic RUNNING_XACTS record) has produced new WAL yet

In the reporter's reproduction, a 4+ minute wait occurred until a background RUNNING_XACTS record was eventually emitted. With the primary fully stopped, the wait is unbounded.

Proposed Solutions

Solution 1: Fix the Producer (pg_rewind) — Preferred

Replace GetXLogInsertRecPtr() with GetXLogInsertEndRecPtr() in pg_rewind's source-side LSN acquisition. GetXLogInsertEndRecPtr() returns the end position of the last completed WAL record, which by definition satisfies the invariant that minRecoveryPoint must be a record-end LSN.
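The difference between an insert pointer and a record-end pointer can be sketched as two ways of mapping the same amount of written WAL to an LSN. The model below assumes the default 16 MB segment size and 8 kB WAL pages; the helper names are illustrative and are not the server's functions.

```python
XLOG_BLCKSZ = 8192                    # WAL page size (default build, assumed)
WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # default segment size (assumed)
SHORT_PHD = 24                        # SizeOfXLogShortPHD
LONG_PHD = 40                         # SizeOfXLogLongPHD

USABLE_PER_PAGE = XLOG_BLCKSZ - SHORT_PHD
USABLE_PER_SEGMENT = ((WAL_SEGMENT_SIZE // XLOG_BLCKSZ) * USABLE_PER_PAGE
                      - (LONG_PHD - SHORT_PHD))


def byte_pos_to_lsn(byte_pos: int, as_record_end: bool) -> int:
    """Map a count of usable WAL bytes (page headers excluded) to an LSN.

    as_record_end=False: where the next record would start (after a header).
    as_record_end=True:  where the previous record ended (before a header).
    The two differ only when byte_pos falls exactly on a page boundary.
    """
    full_segs, left = divmod(byte_pos, USABLE_PER_SEGMENT)
    first_page_usable = XLOG_BLCKSZ - LONG_PHD
    if left < first_page_usable:
        # First page of the segment, which carries the long header.
        seg_offset = 0 if (as_record_end and left == 0) else left + LONG_PHD
    else:
        left -= first_page_usable
        full_pages, left = divmod(left, USABLE_PER_PAGE)
        seg_offset = XLOG_BLCKSZ + full_pages * XLOG_BLCKSZ
        if not (as_record_end and left == 0):
            seg_offset += left + SHORT_PHD
    return full_segs * WAL_SEGMENT_SIZE + seg_offset


def format_lsn(lsn: int) -> str:
    return f"{lsn >> 32:X}/{lsn & 0xFFFFFFFF:08X}"


# An idle primary whose WAL ends exactly at the start of segment 4.
idle_pos = 4 * USABLE_PER_SEGMENT
print(format_lsn(byte_pos_to_lsn(idle_pos, as_record_end=False)))  # 0/04000028
print(format_lsn(byte_pos_to_lsn(idle_pos, as_record_end=True)))   # 0/04000000
```

Away from page boundaries both conventions give the same LSN, which is why the problem appears only in the idle, boundary-aligned case.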

Advantages:

Fixes the root cause at the source
Maintains the architectural invariant globally
No downstream code needs to understand page-header edge cases
Simple, surgical fix

Tradeoff: Older pg_rewind binaries running against newer servers would still exhibit the bug.

Solution 2: Defense-in-Depth in Recovery (v1 Patch)

Adjust minRecoveryPoint in CheckRecoveryConsistency() when the value is exactly SizeOfXLogShortPHD or SizeOfXLogLongPHD past a page boundary, moving it back to the page start. No record can end immediately after a page header, whereas a record that exactly fills the preceding page ends at the page start, so the adjusted value is one that replay can actually reach.
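A minimal sketch of that adjustment follows. It is not the v1 patch itself: the function name is hypothetical, and it assumes 8 kB WAL pages, 16 MB segments, and that the long header appears only on the first page of each segment.

```python
XLOG_BLCKSZ = 8192                    # WAL page size (assumed default)
WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # segment size (assumed default)
SHORT_PHD = 24                        # SizeOfXLogShortPHD
LONG_PHD = 40                         # SizeOfXLogLongPHD


def adjust_min_recovery_point(lsn: int) -> int:
    """If lsn sits exactly at the end of a page header, move it back to the
    start of that page; otherwise return it unchanged."""
    page_offset = lsn % XLOG_BLCKSZ
    on_first_page_of_segment = (lsn % WAL_SEGMENT_SIZE) < XLOG_BLCKSZ
    header_size = LONG_PHD if on_first_page_of_segment else SHORT_PHD
    if page_offset == header_size:
        return lsn - page_offset
    return lsn


print(hex(adjust_min_recovery_point(0x04000028)))  # 0x4000000 (long header)
print(hex(adjust_min_recovery_point(0x04002018)))  # 0x4002000 (short header)
print(hex(adjust_min_recovery_point(0x04000060)))  # 0x4000060 (unchanged)
```

Checking which header applies to the page avoids rewriting a value such as 0x04002028, which is 40 bytes into a short-header page and may be a legitimate record boundary.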

Advantages:

Protects against older pg_rewind versions
Defense-in-depth

Disadvantages:

Treats the symptom, not the cause
Adds complexity to recovery code that must be understood by future maintainers
Needs careful audit of all other minRecoveryPoint comparison sites

Recommended Approach (per thread consensus)

Both fixes together: Solution 1 as the primary fix, Solution 2 as a backward-compatibility guard, with appropriate documentation explaining why page-header positions are the only "non-record-end" values that can appear in minRecoveryPoint.

Key Technical Details

Who Else Sets minRecoveryPoint?

pg_basebackup: Uses backup-end record's EndRecPtr — always a valid record-end LSN ✓
UpdateMinRecoveryPoint (during recovery): Uses buffer LSNs, which are set from record-end LSNs by construction ✓
pg_rewind: Uses pg_current_wal_insert_lsn() — can return non-record-end position ✗

The WAL Page Header Size Issue

Long page headers (at segment start): SizeOfXLogLongPHD = 40 bytes → positions ending in 0x28
Short page headers (other pages): SizeOfXLogShortPHD = 24 bytes → positions ending in 0x18

The reporter's logs consistently show 0/02000028 and 0/04000028 — the segment-start long header offset, confirming the diagnosis.
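The arithmetic behind that reading can be checked by splitting each LSN into its segment number and offsets. The helper below is an illustrative sketch that assumes 16 MB segments and 8 kB pages; its name is not taken from PostgreSQL.

```python
XLOG_BLCKSZ = 8192                    # WAL page size (assumed default)
WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # segment size (assumed default)


def locate(lsn_text: str):
    """Return (segment number, offset within segment, offset within page)."""
    hi, lo = lsn_text.split("/")
    lsn = (int(hi, 16) << 32) | int(lo, 16)
    return (lsn // WAL_SEGMENT_SIZE,
            lsn % WAL_SEGMENT_SIZE,
            lsn % XLOG_BLCKSZ)


for text in ("0/02000028", "0/04000028"):
    segment, seg_offset, page_offset = locate(text)
    print(text, segment, seg_offset, page_offset)
# 0/02000028 2 40 40
# 0/04000028 4 40 40
```

Both values lie 40 bytes (0x28) into the first page of a segment, which matches the size of the long page header.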

Back-Patching Considerations

This bug affects every release that ships pg_rewind (9.5 and later), which includes all supported versions. Since it is a liveness bug that can leave a rewound standby unable to accept read-only connections for an unbounded time, back-patching is warranted. The fix (using GetXLogInsertEndRecPtr()) is minimal and low-risk.

[BUG] Take a long time to reach consistent after pg_rewind
