pg_rewind and Diverging Timelines with Identical TLI Numbers
The Core Problem
PostgreSQL's timeline identifier (TLI) is a monotonically increasing integer that gets bumped every time a standby is promoted or point-in-time recovery (PITR) forks history. The TLI is fundamental to how pg_rewind, streaming replication, and archive recovery determine whether two servers share a common history, and where their histories diverged. The implicit assumption baked into the design is: if two servers are on the same TLI, they share the same WAL history for that timeline (modulo how far each has progressed).
Mats Kindahl, using a TLA+ model of streaming replication checked with TLC, discovered a scenario in which this invariant is silently violated. The scenario requires two crashes during promotion sequences, which is uncommon but plausible in production clusters with automated failover.
The Divergence Scenario
The failure mode exploits the fact that TLI allocation is local: when a server promotes, it reads the latest timeline history file it knows about and increments. If two servers independently promote from the same parent timeline without having seen each other's promotion, they will both pick the same next TLI number.
- S1 (primary, TLI 1) crashes.
- S1 restarts, performs end-of-recovery promotion, writes XLOG_END_OF_RECOVERY switching to TLI 2, and writes some records W1.
- S1 crashes again before W1 or the TLI-2 history reaches any standby.
- S2 (standby, still on TLI 1) is promoted. Since S2 has never seen TLI 2, it also writes XLOG_END_OF_RECOVERY for TLI 2 and begins writing W2.
- Now two physically distinct "TLI 2" histories exist, diverging immediately at the TLI-1 promotion LSN. Call them TLI 2.a (S1's) and TLI 2.b (S2's).
- S1 recovers and attempts to rejoin the cluster using pg_rewind against S2.
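The sequence above can be sketched as a toy simulation. This is a minimal model, not PostgreSQL code: the Server class, its known_tlis set, and the rule "next TLI = newest known TLI + 1" are illustrative assumptions standing in for the local allocation described earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    tli: int = 1
    # Timelines this node has seen a history for (hypothetical field).
    known_tlis: set = field(default_factory=lambda: {1})
    # Toy WAL: a list of (tli, payload) tuples.
    wal: list = field(default_factory=list)

    def promote(self) -> int:
        # Local allocation: one past the newest timeline this node knows about.
        new_tli = max(self.known_tlis) + 1
        self.known_tlis.add(new_tli)
        self.tli = new_tli
        self.wal.append((new_tli, "END_OF_RECOVERY"))
        return new_tli

    def write(self, payload: str) -> None:
        self.wal.append((self.tli, payload))


s1 = Server("S1")
s2 = Server("S2")

s1.promote()     # S1 restarts and switches to TLI 2
s1.write("W1")   # S1 crashes before any of this reaches S2
s2.promote()     # S2 has never seen TLI 2, so it also picks 2
s2.write("W2")

same_tli = s1.tli == s2.tli
same_history = s1.wal == s2.wal
print(same_tli, same_history)  # prints: True False
```

Both nodes report TLI 2, yet their records on that timeline differ, which is exactly the state a number-only comparison cannot distinguish.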
Why pg_rewind Is Fooled
pg_rewind's core algorithm compares the source's and target's timeline history files to find the last common ancestor timeline, then identifies the divergence LSN and rewinds all blocks modified on the target after that LSN. The comparison is purely by TLI number. When both servers report "I'm on TLI 2", pg_rewind concludes they share all of TLI 2 up to the smaller of their current insert LSNs and only rewinds blocks beyond that point. But in this scenario TLI 2 itself is different history on the two nodes — W1 on S1 and W2 on S2 are physically distinct records with possibly overlapping LSNs.
The result is silent data divergence: after rewind and re-attachment as a standby, S1 retains blocks dirtied by W1 that were never part of S2's history, and may also fail to truncate/rewind blocks that would have been modified differently by W2. Replication proceeds without error because the LSN cursor advances normally, but the on-disk state is inconsistent with the primary.
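A simplified sketch of a number-only comparison shows where the wrong divergence point comes from. It is loosely modeled on the idea of walking two timeline histories in parallel and is not the actual pg_rewind source. The Entry type, the LSN values, and the assumption that both promotions happened at the same switch LSN are all illustrative.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    tli: int
    begin: int  # LSN at which this timeline starts
    end: int    # LSN at which it was left, or the node's current position


def divergence_by_tli(a: List[Entry], b: List[Entry]) -> Tuple[int, int]:
    """Return (tli, lsn) of the last point treated as shared, using TLI numbers only."""
    last = None
    for ea, eb in zip(a, b):
        if ea.tli != eb.tli or ea.begin != eb.begin:
            break
        # Same number and same start: assume shared up to the shorter one.
        last = (ea.tli, min(ea.end, eb.end))
    if last is None:
        raise ValueError("no common ancestor timeline")
    return last


# Hypothetical positions: both nodes forked TLI 2 at 0/3000000.
s1_history = [Entry(1, 0x0, 0x3000000), Entry(2, 0x3000000, 0x3000500)]
s2_history = [Entry(1, 0x0, 0x3000000), Entry(2, 0x3000000, 0x3000400)]

tli, lsn = divergence_by_tli(s1_history, s2_history)
print(tli, hex(lsn))  # prints: 2 0x3000400
```

The true divergence in this toy setup is at 0x3000000, the start of TLI 2, so everything S1 wrote between 0x3000000 and 0x3000400 is wrongly treated as shared.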
The Proposed Fix
The patch introduces a timeline UUID — a unique identifier generated at the moment a new timeline is created — stored in the timeline history file. The invariant becomes: two timelines are "the same" iff both TLI and UUID match.
Evolution Across the Two Patch Versions
- v1 (2026-04-30): Added the UUID to both XLOG_END_OF_RECOVERY WAL records and to .history files. The intuition was that when a standby streams from the new primary, it would see the UUID in the EOR record and could detect a mismatch even without fetching the history file.
- v2 (2026-05-01): Mats found the UUID in the EOR record was unnecessary; it suffices to put it in the history file alone. This is architecturally cleaner: timeline identity is a property recorded in history files, not in the WAL stream itself, so the change is localized to the timeline-history machinery and pg_rewind's comparison logic. v2 also adds a test exercising divergence going back three promotions, verifying the recursive "walk back until TLI and UUID match" logic.
Algorithmic Change in pg_rewind
The traditional algorithm walks both history files and finds the deepest TLI entry that appears in both. With the patch, it finds the deepest entry where both TLI and UUID match. If a TLI entry matches by number but not UUID, that TLI is treated as divergent — rewind must go further back, to the parent of the mismatched TLI, and use that parent's fork LSN as the divergence point.
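The patched comparison can be sketched the same way. This is a minimal illustration under stated assumptions, not the patch itself: the Entry type, its ident field, the use of Python's uuid.uuid4 as the identifier source, and the LSN values are all hypothetical.

```python
import uuid
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    tli: int
    begin: int        # LSN at which this timeline starts
    end: int          # LSN at which it was left, or the node's current position
    ident: uuid.UUID  # hypothetical per-timeline identifier


def divergence_by_tli_and_uuid(a: List[Entry], b: List[Entry]) -> Tuple[int, int]:
    """Return (tli, lsn) of the last entry matching on TLI, start LSN, and UUID."""
    last = None
    for ea, eb in zip(a, b):
        if (ea.tli, ea.begin, ea.ident) != (eb.tli, eb.begin, eb.ident):
            break  # same number but different identity counts as divergent
        last = (ea.tli, min(ea.end, eb.end))
    if last is None:
        raise ValueError("no common ancestor timeline")
    return last


shared = uuid.uuid4()  # TLI 1 is genuinely common to both nodes
s1_hist = [Entry(1, 0x0, 0x3000000, shared),
           Entry(2, 0x3000000, 0x3000500, uuid.uuid4())]
s2_hist = [Entry(1, 0x0, 0x3000000, shared),
           Entry(2, 0x3000000, 0x3000400, uuid.uuid4())]

tli, lsn = divergence_by_tli_and_uuid(s1_hist, s2_hist)
print(tli, hex(lsn))  # prints: 1 0x3000000
```

Because the two TLI 2 entries carry different identifiers, the walk stops at TLI 1 and reports the parent's fork LSN as the divergence point.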
Architectural Implications and Tradeoffs
Compatibility
Adding a UUID field to history files is an on-disk format change. History files are plain text, one line per ancestor timeline with tab-separated fields (parent TLI, switch-point LSN, reason), so the patch presumably extends them with an extra column. Old history files without UUIDs need a compatibility path. Typically, absence of a UUID is treated as "legacy timeline, match by TLI only", which preserves current (broken) behavior for upgrades but allows new timelines to benefit from the fix. Mixed-version clusters during rolling upgrades would need care.
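One way such a compatibility path could look is sketched below. The existing layout parsed here is tab-separated parent TLI, switch-point LSN, and reason, with "#" comment lines. The fourth UUID column and the "missing UUID means match by number only" rule are assumptions made for illustration; the actual patch may encode the UUID differently.

```python
import uuid
from typing import List, NamedTuple, Optional


class HistoryLine(NamedTuple):
    parent_tli: int
    switch_lsn: int
    reason: str
    ident: Optional[uuid.UUID]  # None for legacy lines (hypothetical column)


def parse_lsn(text: str) -> int:
    # LSNs are written as two hex halves, e.g. "0/3000000".
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def parse_history(text: str) -> List[HistoryLine]:
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        reason = fields[2] if len(fields) > 2 else ""
        # Assumed extension: an optional fourth column holding the UUID.
        ident = uuid.UUID(fields[3]) if len(fields) > 3 else None
        entries.append(HistoryLine(int(fields[0]), parse_lsn(fields[1]), reason, ident))
    return entries


def same_timeline(a: HistoryLine, b: HistoryLine) -> bool:
    if a.parent_tli != b.parent_tli or a.switch_lsn != b.switch_lsn:
        return False
    if a.ident is None or b.ident is None:
        return True  # legacy fallback: match by number only
    return a.ident == b.ident


legacy = parse_history("1\t0/3000000\tno recovery target specified\n")
print(legacy[0].ident)  # prints: None
```

The fallback keeps old files readable at the cost of leaving pre-upgrade timelines exposed to the original collision.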
Why This Wasn't Caught Earlier
The scenario requires: (a) a promotion that fails to propagate its EOR before crashing, (b) a second independent promotion from the same parent TLI, and (c) the first node then trying to rejoin as a standby. This is a double-fault scenario that most HA tooling (Patroni, repmgr) would arguably prevent by fencing the crashed former primary. However, fencing is not bulletproof, and PostgreSQL's own correctness guarantees should not depend on external orchestration. The TLA+ model's value here is exactly in finding such low-probability but high-impact interleavings.
Alternative Designs Not Taken
One could imagine avoiding collisions by making TLI itself globally coordinated (e.g., derived from a system identifier plus counter), but this would break a lot of existing tooling and archive layouts (WAL filenames embed TLI as a fixed-width hex number). Attaching a separate UUID is minimally invasive — the TLI still serves its role as ordering/naming key, while the UUID disambiguates identity.
Another alternative is to use the system identifier of the promoting node plus the LSN of the fork point as a natural composite key. This avoids a new random identifier but couples timeline identity to the cluster's ControlFile system identifier, which in scenarios involving pg_resetwal or cloning could itself be ambiguous. A dedicated UUID is more robust.
Scope Beyond pg_rewind
Although the report frames this as a pg_rewind bug, the same TLI-collision logic affects:
- Archive recovery: a standby pulling WAL from an archive could pick up WAL segments from the "wrong" TLI 2 if archives are shared or merged.
- Streaming replication startup: the walreceiver's TLI negotiation could accept a primary on a divergent TLI 2.
- pg_basebackup and base backup matching against archived WAL.
A complete fix likely needs the UUID check propagated into all these paths, not just pg_rewind. The patch as described focuses on pg_rewind and history-file handling; whether it also hardens walreceiver and recovery is a question for review.
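The archive-recovery concern can be made concrete by computing file names. WAL segment names are 24 hex digits, with the TLI in the first eight, and history files are named by TLI alone. The sketch assumes the default 16 MB segment size; the function names and the LSN are illustrative.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default; can be changed at initdb time


def wal_file_name(tli: int, lsn: int, seg_size: int = WAL_SEGMENT_SIZE) -> str:
    """Build the 24-hex-digit segment name: TLI, then the two halves of the segment number."""
    segno = lsn // seg_size
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (tli, segno // segs_per_xlogid, segno % segs_per_xlogid)


def history_file_name(tli: int) -> str:
    return "%08X.history" % tli


# S1 (TLI 2.a) and S2 (TLI 2.b) both write at 0/3000000 on "TLI 2".
name_on_s1 = wal_file_name(2, 0x3000000)
name_on_s2 = wal_file_name(2, 0x3000000)
print(name_on_s1)            # prints: 000000020000000000000003
print(history_file_name(2))  # prints: 00000002.history
```

Since neither name carries anything beyond the TLI number and position, the two nodes produce identically named files with different contents, and a shared archive has no way to tell them apart by name.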
Assessment
Mats Kindahl is not a core committer, but arrives with a formal model (TLA+/TLC) that demonstrates the bug is real rather than hypothetical — this lends significant weight. The diagnosis is sound and matches known folklore about TLI collisions in split-brain-ish scenarios. The proposed fix is conceptually correct and minimally invasive.
Open questions a reviewer (likely Heikki Linnakangas, Michael Paquier, or Álvaro Herrera given their history with timeline and rewind code) would raise:
- On-disk compat: how are old history files handled, and what's the upgrade story?
- Propagation completeness: does the UUID check cover walreceiver, archive recovery, and basebackup, or only pg_rewind?
- UUID generation: what source of randomness, and is it logged/auditable?
- Replay of existing EOR records: the v2 simplification (no UUID in WAL) means a standby can't detect the divergence from the WAL stream alone; it must fetch and compare history files. Is that sufficient in all replication paths?
- TAP test coverage: the three-promotion regression test is good; does it also cover streaming-replication-based attach (not just pg_rewind)?
The fact that this was discovered by model checking rather than field reports suggests it is genuinely rare in practice, but the silent-corruption character of the failure makes it a correctness issue worth fixing regardless of frequency.