[PATCH] Preserve replication origin OIDs in pg_upgrade

First seen: 2026-04-28 11:19:38+00:00 · Messages: 17 · Participants: 6

Latest Update

2026-06-01 · claude-opus-4-6

Monthly Summary: Preserve Replication Origin OIDs in pg_upgrade (May 2026)

Problem Statement

When pg_commit_ts SLRU data is migrated byte-for-byte during pg_upgrade, the embedded 2-byte RepOriginId values become semantically incorrect if replication origin OIDs are not preserved. This causes spurious update_origin_differs / delete_origin_differs conflicts after upgrading a logical replication subscriber, and in the worst case attributes row modifications to the wrong upstream publisher.

The root cause is a three-way mismatch: pg_upgrade historically doesn't preserve subscription OIDs → origin names (pg_<suboid>) change → replorigin_create() assigns fresh roidents in unpredictable order → SLRU records reference stale roident values.

Design Evolution

v1–v2: Special-case subscription origins

Ajin's initial approach treated subscription-associated origins differently from user-created ones, using separate code paths and special pg_dump/pg_dumpall support functions.

v3: Unified approach via subscription OID preservation

After Kuroda observed that preserving subscription OIDs would stabilize origin names, and Shveta correctly noted that name stability alone doesn't guarantee roident stability, the design converged on:

v4–v5: Implementation hardening

Addressed review findings including silent OID truncation (uint16 vs Oid mismatch), dead code removal, redundant lock cycling (eliminated replorigin_create_with_reploriginid entirely by inlining), and parallel safety annotations.

v6: Current state under review

Remaining issues identified but not yet resolved in a new version.

Key Design Decisions

  1. Backward compatibility from PG17+ (not PG19+): Ajin argues that restricting to PG19+ would leave subscriptions without origins when upgrading from PG17/18, since the new code suppresses CREATE SUBSCRIPTION's normal origin creation.

  2. Direct arguments over global-variable convention: Since binary_upgrade_create_replication_origin is itself the creation call (not a side-channel before CREATE), passing (roname, roident) directly is cleaner than the set-global-then-create pattern used for tables/types.

  3. Single-function architecture: v5 eliminated the intermediate replorigin_create_with_reploriginid function, consolidating all work into one function with one lock acquisition.

Outstanding Bugs (as of month-end)

  1. Origin name truncation (Shveta): Function declares NAME type (64-byte fixed) for the origin name parameter, but pg_replication_origin.roname is TEXT (unbounded). Origins with names >63 characters are silently truncated during upgrade. Fix: change parameter to TEXT type.

  2. Pre-existing origins on target cluster (Shlok): If the new cluster already has replication origins, upgrade fails late (during "Performing Upgrade") with a hard error. Needs a pre-flight consistency check during the early validation phase.

  3. Minor cleanup: Unnecessary includes, unexplained Assert on TOAST relid, and insufficient code comments explaining the PG17 version gate.

Month Trajectory

The thread progressed rapidly from problem identification through three major revision cycles. The architectural design was settled by mid-month (v3), with subsequent work focused on implementation correctness. Two significant bugs remain (name truncation, missing pre-flight check) requiring a v7. The patch is approaching committable shape but needs one more revision cycle.

History (1 prior analysis)
2026-06-01 · claude-opus-4-6

v7: Addressing All Outstanding Review Items

Ajin posts v7, which is a direct response to the review findings from Shlok and Shveta (covered in prior analyses #5 and #6). No new design decisions or architectural changes — this is a pure fix-up revision addressing all identified issues.

Fixes Confirmed in v7

  1. Pre-flight check for pre-existing origins on target cluster (Shlok's finding): Added. The new cluster is now verified during the "Performing Consistency Checks" phase to have no replication origins, matching the pattern used for replication slots.

  2. Unnecessary includes removed (Shveta's access/skey.h and catalog/indexing.h): Confirmed removed.

  3. Unexplained Assert removed (Shveta's Assert(!OidIsValid(rel->rd_rel->reltoastrelid))): Confirmed removed.

  4. PG17 version gate rationale (Shveta's question): Ajin provides the justification inline — PG17 introduced migration of pg_subscription_rel and remote LSN, so it's the natural boundary. Without origin migration from PG17+, you'd get an inconsistent state: subscription relation state and remote LSN are migrated but the corresponding replication origin doesn't exist.

  5. Origin name truncation bug fixed (Shveta's NAME→TEXT finding): Ajin confirms the parameter type is changed to handle origin names the same way pg_replication_origin_create does, eliminating the silent truncation of names longer than 64 characters.

Assessment

This is a clean revision cycle — all five outstanding items addressed, no new issues raised yet. The patch is now past the point where any reviewer has identified unresolved problems. The next step would be either another review pass looking for new issues, or committer attention.