Improve Conflict Detection When Replication Origins Are Reused
The Core Problem: Stale Origin IDs Causing Silent Conflict Misses
This thread addresses a subtle but architecturally significant bug in PostgreSQL's logical replication conflict detection mechanism. The issue stems from the intersection of two subsystems: replication origins (the identity tracking mechanism for replicated tuples) and commit_ts (the commit timestamp SLRU that stores per-transaction origin metadata).
How Conflict Detection Works Today
In logical replication, each subscription is assigned a RepOriginId (a small integer, uint16) that is stored alongside the commit timestamp in the commit_ts SLRU. When the apply worker processes an incoming change that conflicts with an existing tuple (e.g., an UPDATE of a row that already exists locally), it checks whether tuple_origin == current_origin. If they match, the system assumes the tuple was written by this same subscription — meaning it's "our own" data — and skips raising an update_origin_differ conflict.
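A minimal sketch of today's check, with invented names (this is illustrative shorthand for the behavior described above, not PostgreSQL's actual code):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t RepOriginId;   /* compact origin identifier, as described above */

/*
 * Illustrative sketch: if the tuple's recorded origin matches the
 * applying subscription's own origin, the tuple is treated as "our
 * own" data and no update_origin_differ conflict is raised.
 */
static bool
should_raise_origin_conflict(RepOriginId tuple_origin,
                             RepOriginId current_origin)
{
    return tuple_origin != current_origin;
}
```

Note that the comparison is purely on the integer ID; nothing here distinguishes two different subscriptions that held the same ID at different times.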
The Reuse Problem
RepOriginId values are allocated from a limited namespace and are reused after a replication origin is dropped (e.g., via DROP SUBSCRIPTION). The dangerous sequence is:
- Subscription sub1 gets roident = 1 and replicates rows into table t1
- sub1 is dropped, freeing origin ID 1
- New subscription sub2 is created and gets roident = 1 (reused)
- Updates arrive for rows previously written by sub1
- Conflict detection sees tuple_origin (1) == current_origin (1), so no conflict is raised
This is a false negative: the system believes the row belongs to the current subscription when it actually belongs to a completely different (now-defunct) subscription. This becomes genuinely dangerous when sub2 connects to a different publisher than sub1 did — real data conflicts are silently swallowed.
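The reuse itself can be shown with a toy allocator (entirely hypothetical: real origin IDs live in pg_replication_origin, and the names below are invented to mimic the lowest-free-ID reuse pattern):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ORIGINS 4                        /* tiny namespace for illustration */

typedef uint16_t RepOriginId;

static bool origin_in_use[MAX_ORIGINS + 1];  /* slot 0 = InvalidRepOriginId */

/* Hand out the lowest free ID, mimicking how a freed ID becomes
 * available again after DROP SUBSCRIPTION. */
static RepOriginId
origin_alloc(void)
{
    for (RepOriginId id = 1; id <= MAX_ORIGINS; id++)
    {
        if (!origin_in_use[id])
        {
            origin_in_use[id] = true;
            return id;
        }
    }
    return 0;                                /* namespace exhausted */
}

static void
origin_free(RepOriginId id)
{
    origin_in_use[id] = false;
}
```

Allocating for sub1, freeing it, and then allocating for sub2 yields the same ID, at which point tuples stamped by sub1 are indistinguishable from sub2's own writes.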
Why This Matters Architecturally
This bug exposes a fundamental design tension: PostgreSQL uses a compact integer ID for origin tracking (good for SLRU storage efficiency) but provides no mechanism to distinguish between different temporal uses of the same ID. The commit_ts SLRU retains stale origin data indefinitely after a subscription is dropped, creating a ghost reference problem. This is particularly concerning as multi-master and bidirectional replication topologies become more common — silent conflict misses can lead to data divergence that's extremely difficult to detect and repair after the fact.
The thread also references related issues with tablesync worker origins ([1]), suggesting this is part of a broader class of problems around origin lifecycle management.
Proposed Solutions
Approach 1: Scrub Stale Origins from commit_ts SLRU on DROP SUBSCRIPTION
Mechanism: When a subscription is dropped and its replication origin is freed, scan the entire commit_ts SLRU and replace all occurrences of the old origin ID with InvalidRepOriginId (0). This ensures that any future subscription reusing the same ID will see origin 0 on old tuples, which will correctly differ from the new subscription's origin, triggering conflict detection.
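As a sketch of the mechanism (a flat in-memory array stands in for the page-based, disk-backed SLRU; scrub_origin is an invented name):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t RepOriginId;
#define InvalidRepOriginId ((RepOriginId) 0)

/*
 * Illustrative Approach 1: walk every commit_ts entry and blank out
 * references to the dropped origin.  A future subscription that
 * reuses the ID will then see origin 0 on old tuples, which differs
 * from its own origin and correctly triggers conflict detection.
 */
static void
scrub_origin(RepOriginId *entries, size_t nentries, RepOriginId dropped)
{
    for (size_t i = 0; i < nentries; i++)
    {
        if (entries[i] == dropped)
            entries[i] = InvalidRepOriginId;
    }
}
```

The O(n) shape of this loop over the whole retained history is exactly the DROP-time cost at issue, and because the writes are not WAL-logged, a crash mid-loop leaves a partially scrubbed state.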
Technical Implications:
- Performance cost at DROP time: The commit_ts SLRU can be very large on busy systems. A full scan requires reading every page, checking every entry, and rewriting modified pages. This is an O(n) operation over the entire transaction history window, which could make DROP SUBSCRIPTION unacceptably slow on large installations.
- Crash safety gap: This is the critical flaw. The SLRU scrubbing is not WAL-logged: if the server crashes partway through the scan, some entries retain the stale origin ID, and after restart the reuse problem silently returns. Making this crash-safe would require either:
  - WAL-logging each modified SLRU page (expensive, generating significant WAL volume), or
  - a recovery-time replay mechanism that re-scrubs after a crash.
  Both options add substantial complexity.
- Lock contention: Modifying commit_ts pages while concurrent transactions may be reading them raises questions about SLRU buffer locking and potential contention.
Approach 2: Store Origin Creation Timestamp (Preferred)
Mechanism: Add a creation timestamp to each replication origin's metadata. During conflict detection, when tuple_origin == current_origin, perform an additional check: if the tuple's commit timestamp is ≤ the origin's creation time, it must have been written by a previous incarnation of this origin ID, so raise a conflict.
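A sketch of the extended check, assuming a per-origin creation timestamp is available (TimestampTz here is simply a microsecond count, and the function name is invented):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t RepOriginId;
typedef int64_t TimestampTz;         /* microseconds since some epoch */

/*
 * Illustrative Approach 2: a matching origin ID is trusted only if the
 * tuple was committed after the current origin was created.  A tuple
 * committed at or before the creation time must belong to a previous
 * incarnation of the reused ID, so a conflict is raised.
 */
static bool
should_raise_origin_conflict(RepOriginId tuple_origin,
                             RepOriginId current_origin,
                             TimestampTz tuple_commit_ts,
                             TimestampTz origin_created_at)
{
    if (tuple_origin != current_origin)
        return true;                               /* ordinary origin mismatch */
    return tuple_commit_ts <= origin_created_at;   /* stale incarnation */
}
```

The <= at the exact boundary is the source of the narrow false-positive window discussed below.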
Technical Implications:
- Catalog schema change: This requires adding a column to pg_replication_origin (or a field in the underlying shared memory state). Since replication origin creation is already WAL-logged, the timestamp can be naturally persisted and recovered.
- Runtime overhead: Minimal. The only addition is one timestamp comparison in the conflict detection hot path, performed only when tuple_origin == current_origin; that is the common case for self-originated tuples, but the comparison is cheap.
- False positives at the boundary: If a tuple was committed at the exact same microsecond as the origin's creation, the <= comparison will incorrectly flag it as a conflict. This is an extremely narrow window, and a false positive (raising a conflict that isn't real) is far safer than a false negative (missing a real conflict). The author correctly identifies this as an acceptable tradeoff.
- Upgrade path: Existing origins created before the schema change won't have a creation timestamp, so a default (e.g., epoch or InvalidTimestamp) would need to be chosen. Defaulting to epoch would make every tuple appear to postdate origin creation, so the new check would never fire for pre-existing origins and their false-negative exposure would persist; a too-recent default instead risks false positives until old data ages out. This needs careful thought.
- No DROP-time cost: Unlike Approach 1, there's no expensive operation at subscription drop time.
Analysis of Design Tradeoffs
The two approaches represent a classic systems design tradeoff between eager cleanup (Approach 1: fix the data when the origin is freed) and lazy detection (Approach 2: detect the problem at query time using metadata).
Approach 2 is clearly superior for several reasons:
- Crash safety is inherent rather than requiring additional engineering
- No pathological performance cases (the SLRU scan in Approach 1 has unbounded cost)
- The additional metadata is small (one timestamp per origin, and the origin namespace is small)
- The false-positive edge case is harmless in practice (microsecond-level collision is rare, and conflict over-detection is safe)
The main risk with Approach 2 is the catalog/schema change and upgrade handling, but PostgreSQL regularly handles such changes across major versions.
Relationship to Broader Issues
The referenced threads [1] and [2] discuss related problems with tablesync origins — the temporary replication origins created during initial table synchronization in logical replication. These origins can similarly cause stale-reference problems. The origin reuse fix proposed here may synergize with solutions for the tablesync issue, as both stem from the same fundamental problem: origin IDs lack temporal disambiguation.
This also touches on the broader question of whether RepOriginId should be a richer data type, or whether the origin lifecycle needs more formal state management (e.g., tombstoning rather than immediate ID reuse).