Preserving Replication Origin OIDs in pg_upgrade
The Core Architectural Problem
This thread addresses a silent data-correctness hazard at the intersection of three PostgreSQL subsystems: pg_upgrade, logical replication origins, and commit timestamps (pg_commit_ts). The bug manifests as spurious update_origin_differs conflicts after a major-version upgrade of a logical replication subscriber, and in the worst case causes the subscriber to attribute row modifications to the wrong upstream publisher.
Why roidents are embedded in pg_commit_ts
When track_commit_timestamp is enabled, each committed transaction's SLRU record in pg_commit_ts stores not just the commit time but also a RepOriginId (a 2-byte roident). This is the mechanism by which conflict detection on the subscriber side — specifically the update_origin_differs and delete_origin_differs conflict types introduced for logical replication conflict detection — determines whether a local row was last modified by the local node or by some remote origin. Crucially, the roident stored in the SLRU is a numeric identifier, not the textual origin name (pg_<suboid>). The mapping from roident → roname lives in the pg_replication_origin catalog.
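The entry layout and lookup described above can be illustrated with a minimal sketch. It assumes a 10-byte record (an 8-byte timestamp followed by a 2-byte origin ID) packed little-endian, with 0 standing for "no origin". The byte order, the helper names, and the catalog contents are assumptions made for illustration, not PostgreSQL's actual on-disk code.

```python
import struct

# Assumed layout of one commit-timestamp entry: 8-byte timestamp followed by
# a 2-byte origin id, 10 bytes in total with no padding (little-endian here).
ENTRY = struct.Struct("<qH")


def pack_entry(commit_ts_usec: int, roident: int) -> bytes:
    """Serialize one (timestamp, roident) pair."""
    return ENTRY.pack(commit_ts_usec, roident)


def origin_name(entry: bytes, origin_catalog: dict[int, str]) -> str:
    """Resolve an entry's origin through a roident -> roname mapping."""
    _, roident = ENTRY.unpack(entry)
    if roident == 0:  # 0 models "no origin", i.e. a local change
        return "<local>"
    return origin_catalog[roident]


# Hypothetical catalog contents; 16394 is a made-up subscription OID.
catalog = {1: "pg_16394"}
entry = pack_entry(800_000_000_000_000, 1)
print(len(entry), origin_name(entry, catalog))
```

The point of the sketch is that the record carries only the number: the name it resolves to depends entirely on whatever catalog is consulted at read time.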
The breakage under pg_upgrade
The chain of fragility is:
- pg_upgrade preserves relfilenodes, TOAST OIDs, relation OIDs, type OIDs, etc., but historically does not preserve subscription OIDs.
- A subscription's replication origin is conventionally named pg_<suboid> (see ApplyWorkerMain / replorigin_by_name usage). If the suboid changes, the origin name changes.
- During CREATE SUBSCRIPTION on the new cluster, replorigin_create() allocates a fresh roident by scanning pg_replication_origin for the lowest unused 2-byte ID. The order of assignment depends on the order in which CREATE SUBSCRIPTION runs and on prior allocations; it is not stable across upgrades.
- Meanwhile, pg_commit_ts was proposed (in the sibling thread referenced as [1]) to be copied byte-for-byte from the old cluster to preserve conflict-detection metadata.
The result is a semantic mismatch: SLRU records say "roident 1 wrote this row" meaning subA in the old cluster, but the new cluster's catalog says roident 1 is subB. Conflict detection will incorrectly fire (or fail to fire). Ajin's opening message frames this crisply with the "swap" scenario — the most dangerous case because it converts silent metadata into actively wrong conflict verdicts.
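The swap scenario can be reproduced with a small model of the allocation rule. This is a sketch only: lowest_unused_roident and create_origins are hypothetical helpers that mimic the "lowest unused ID" behavior described above, and subA/subB stand in for the real pg_<suboid> origin names.

```python
def lowest_unused_roident(used: set[int]) -> int:
    """Model of the allocation rule: lowest free ID in 1..65535."""
    for candidate in range(1, 0x10000):
        if candidate not in used:
            return candidate
    raise RuntimeError("no free replication origin ID")


def create_origins(names: list[str]) -> dict[int, str]:
    """Create origins in the given order and return roident -> name."""
    catalog: dict[int, str] = {}
    for name in names:
        catalog[lowest_unused_roident(set(catalog))] = name
    return catalog


old_cluster = create_origins(["subA", "subB"])
new_cluster = create_origins(["subB", "subA"])  # restore ran in another order

# pg_commit_ts was copied verbatim, so a row still carries roident 1.
row_roident = 1
print(old_cluster[row_roident], "->", new_cluster[row_roident])
```

The same stored roident resolves to subA before the upgrade and subB after it, which is exactly the misattribution the thread sets out to prevent.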
Design Evolution: Two Competing Approaches
Approach 1 (v1): Special-case subscription origins
Ajin's initial patch took a surgical approach:
- pg_dumpall emits non-subscription origins with their roidents and LSNs via a new binary_upgrade_create_replication_origin() support function.
- pg_dump records the old roident alongside each subscription's dump entry.
- On restore, CREATE SUBSCRIPTION is told (via binary_upgrade_set_next_replorigin_oid()) to skip its normal origin creation and instead adopt the preserved roident.
- binary_upgrade_replorigin_advance() restores the replication progress LSN.
This treats subscription-associated origins as a distinct class from user-created origins (e.g., those created manually via pg_replication_origin_create() for custom replication solutions like pglogical or bidirectional setups).
Approach 2 (v2/v3): Preserve subscription OIDs, then everything falls out
Kuroda-san's response reframes the problem: if subscription OIDs were preserved across pg_upgrade, the origin name pg_<suboid> would be stable. As Shveta correctly points out in reply, however, name stability alone does not imply roident stability, because roident allocation is independent of the name and depends on creation order.
Shveta's counterexample is important and technically precise: with two subscriptions at roidents 2 and 3 (because roident 1 had been used and dropped), re-creation from scratch would allocate from 1 upward, yielding 1 and 2. Same names, different numeric IDs, same bug.
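Shveta's counterexample can be walked through with the same kind of toy model. The allocate helper and the origin names (including the OIDs 16401 and 16402) are hypothetical and only illustrate the lowest-unused-ID rule.

```python
def allocate(catalog: dict[int, str], name: str) -> int:
    """Assign the lowest unused ID in 1..65535 to a new origin."""
    roident = next(i for i in range(1, 0x10000) if i not in catalog)
    catalog[roident] = name
    return roident


# Old cluster: roident 1 was used by an origin that was later dropped.
old: dict[int, str] = {}
allocate(old, "scratch_origin")
allocate(old, "pg_16401")
allocate(old, "pg_16402")
del old[1]

# New cluster: the same two subscriptions recreated from scratch,
# with their OIDs (and therefore origin names) preserved.
new: dict[int, str] = {}
allocate(new, "pg_16401")
allocate(new, "pg_16402")

print(sorted(old.items()))
print(sorted(new.items()))
```

Both catalogs hold the same names, but the old one maps them to 2 and 3 while the new one maps them to 1 and 2, so copied pg_commit_ts entries would still resolve incorrectly.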
Vignesh then produces the missing piece: a rebased patch that preserves subscription OIDs themselves. Ajin's v3 composes these:
- 0001 (Vignesh): Preserve subscription OIDs through pg_upgrade (analogous to how relation and type OIDs are preserved, using the existing binary_upgrade_set_next_*_pg_*_oid machinery).
- 0002 (Ajin): With OIDs preserved, origin names are automatically stable. Now pg_dumpall can emit all replication origins, subscription-associated and user-created alike, uniformly with their (roname, roident, remote_lsn) triple, recreating them on the new cluster via a single code path. CREATE SUBSCRIPTION is told not to create its own origin (since it already exists from the dumpall phase).
This is architecturally cleaner: it eliminates the special case, reduces the number of pg_upgrade support functions, and ensures the roident preservation is a consequence of the same generic mechanism that handles manually-created origins.
Key Technical Tradeoffs and Subtleties
- OID preservation as a precondition. Kuroda's reference to the earlier thread [1] notes that subscription-OID preservation had been previously proposed but rejected for lack of motivation. The conflict-detection correctness argument here is the "strong motivation" that was missing; this is an important procedural point. Preserving subscription OIDs has minor implications (the OID namespace is shared; GetNewOidWithIndex on pg_subscription must accept the preserved value), but no downsides surface in the thread.
- Non-subscription origins matter too. The v3 approach correctly handles user-created origins (e.g., those used by logical replication extensions, custom apply workers, or bidirectional configurations). The v1 approach already handled them; v3 unifies them with subscription origins. This is important because pg_commit_ts records reference any roident, not just subscription-derived ones.
- LSN position (remote_lsn) preservation. Origin state isn't just an OID; it includes the replication progress LSN (replorigin_session_origin_lsn / the value advanced by pg_replication_origin_advance). Both approaches preserve this via binary_upgrade_replorigin_advance(). Without it, the subscriber would re-request changes from the publisher starting at an earlier LSN, causing duplicate apply and conflicts.
- Ordering of restore steps. In v3: dumpall creates origins (with preserved roident+name+LSN) → per-database CREATE SUBSCRIPTION runs but is instructed to skip origin creation → subscriptions are re-enabled. The skip is necessary because CREATE SUBSCRIPTION would otherwise try replorigin_create() on an already-existing name.
- Coupling with the pg_commit_ts migration patch. This patch is only useful if pg_commit_ts is actually being migrated (the sibling thread [1]). Without that migration, there are no stale roident references to worry about. The two patches are logically a unit.
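The restore ordering and the need for the skip can be sketched as a toy model. Everything here is hypothetical: the Origin and NewCluster classes, the create_origin flag, and the sample values only mirror the v3 flow as summarized above and are not the patch's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Origin:
    roname: str
    roident: int
    remote_lsn: int


class NewCluster:
    """Toy stand-in for the target cluster's origin and subscription state."""

    def __init__(self) -> None:
        self.origins: dict[int, Origin] = {}
        self.subscriptions: dict[int, str] = {}

    def _has_origin(self, roname: str) -> bool:
        return any(o.roname == roname for o in self.origins.values())

    def restore_origin(self, origin: Origin) -> None:
        # Step 1 (dumpall phase): recreate the origin with its preserved triple.
        if origin.roident in self.origins or self._has_origin(origin.roname):
            raise ValueError(f"{origin.roname!r} / roident {origin.roident} already exists")
        self.origins[origin.roident] = origin

    def create_subscription(self, suboid: int, subname: str, *, create_origin: bool) -> None:
        # Step 2 (per-database phase): the origin name derives from the preserved OID.
        roname = f"pg_{suboid}"
        if create_origin:
            if self._has_origin(roname):
                raise ValueError(f"origin {roname!r} already exists")
            roident = next(i for i in range(1, 0x10000) if i not in self.origins)
            self.origins[roident] = Origin(roname, roident, 0)
        elif not self._has_origin(roname):
            raise LookupError(f"origin {roname!r} was not restored")
        self.subscriptions[suboid] = subname


# Hypothetical dump contents: one subscription origin and one user-created origin.
dumped = [Origin("pg_16401", 2, 0x16B3748), Origin("custom_origin", 3, 0x2000000)]

cluster = NewCluster()
for origin in dumped:
    cluster.restore_origin(origin)
cluster.create_subscription(16401, "sub_a", create_origin=False)

print(cluster.origins[2])
```

In this model, calling create_subscription with create_origin=True after the origins have been restored raises an error, which is the collision the skip avoids; the restored origin keeps both its roident and its remote_lsn.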
Participant Dynamics
- Ajin Cherian (Fujitsu) — patch author, driving the design. His willingness to rebase onto Vignesh's OID-preservation patch and subsume his own special-case logic shows good design judgment.
- Hayato Kuroda (Fujitsu) — senior reviewer; his observation that subscription-OID preservation would simplify the problem catalyzed the redesign. He also provided the historical context on the earlier rejected OID-preservation proposal.
- Vignesh C — produced the subscription-OID-preservation patch (0001), which became the foundation. His contribution is mechanically small but structurally load-bearing.
- Shveta Malik — provided the critical correctness check on Kuroda's suggestion, identifying that name preservation ≠ roident preservation. This sharpened the requirement: we need OID preservation plus explicit roident preservation via binary-upgrade functions.
All four participants are active in the logical replication area, and the discussion converges quickly (within ~8 days from proposal to v3) — suggesting broad agreement on the problem and solution shape.
Assessment
The v3 design is the right one. It:
- Fixes a real correctness bug (not just a cosmetic issue).
- Generalizes cleanly via existing pg_upgrade conventions (binary_upgrade_set_next_*).
- Removes the asymmetry between subscription-associated and user-created origins.
- Is a natural companion to the pg_commit_ts migration work.
Open questions not fully resolved in the visible thread: behavior when the new cluster already has a conflicting roident (shouldn't happen on a fresh target, but worth asserting), and whether the preserved-origin creation should be gated on track_commit_timestamp being enabled on the old cluster (arguably always preserve, since the overhead is negligible and extensions may rely on origin identity).