Technical Analysis: Migration of pg_commit_ts Directory During pg_upgrade
Core Problem
PostgreSQL's pg_upgrade utility does not migrate the pg_commit_ts directory when upgrading a cluster. This directory stores commit timestamps for transactions when track_commit_timestamp is enabled — data that maps transaction IDs (XIDs) to their commit timestamps and replication origin IDs. Users who rely on this feature lose all commit timestamp history upon every major version upgrade, forcing them to manually copy the directory and run pg_resetwal — a fragile, undocumented workaround.
Architectural Context
The pg_commit_ts subsystem (introduced in PostgreSQL 9.5) stores fixed-size records mapping XIDs to (timestamp, replication_origin_id) tuples in SLRU-managed segment files under pg_commit_ts/. The control file tracks oldestCommitTsXid and newestCommitTsXid to bound the valid range.
During pg_upgrade, the tool uses pg_resetwal to set these control file fields in the new cluster. Previously, both were set to chkpnt_nxtxid (effectively invalidating all old commit timestamp data). The patch modifies this to preserve the old cluster's actual oldest/newest values when the data is migrated.
Proposed Solution (Evolution)
Initial Approach (v1-v7)
- Copy
pg_commit_tsdirectory from old to new cluster during upgrade - Pass correct oldestCommitTsXid/newestCommitTsXid to
pg_resetwalinstead of zeroing them out - Check
track_commit_timestampsetting on the new cluster — error out if disabled but old cluster has commit_ts data
Key Technical Changes
- In
get_control_data(): Parse oldestCommitTsXid/newestCommitTsXid from the old cluster's control data - In the resetwal invocation: Use preserved values instead of
chkpnt_nxtxid - New check function: Query
pg_settingson the new cluster to verifytrack_commit_timestamp = on - File copy: Use
copy_subdir_files("pg_commit_ts", "pg_commit_ts")when data exists
Critical Complication: Replication Origin ID Mismatch
Masahiko Sawada identified a fundamental flaw: the commit timestamp records store roident (replication origin ID), but pg_upgrade reassigns OIDs to subscriptions, causing origin IDs to change. This means:
- A tuple committed via subscription
mysub(origin 1) on the old cluster - After upgrade,
mysubmight get origin 2, while origin 1 now belongs to a different subscription - This causes false
update_origin_differsconflicts in logical replication
This is not merely a cosmetic issue — it can cause incorrect conflict detection behavior that defeats one of the primary use cases for migrating commit timestamps in the first place.
Later Approaches (v12-v14)
- v12: Added
--pg-commit-tscommand-line flag making migration opt-in - v13-v14: Attempted to also migrate/fix
pg_replication_originmappings via direct catalog UPDATE (architecturally problematic) - Emerging consensus: The patch should be split — replication origin migration handled separately via a proper binary-upgrade-mode SQL function (similar to
binary_upgrade_add_sub_rel_state())
Key Design Decisions & Disagreements
Error vs. Warning on Parameter Mismatch
- Amit Kapila: Should error out if old cluster has commit_ts data but new cluster has
track_commit_timestamp=off(analogous tocheck_new_cluster_replication_slots()) - Masahiko Sawada: Migration isn't mandatory; a warning suffices since users may intentionally want to disable it
- Resolution: The early patches used
pg_fatal(), but Sawada's argument introduced the opt-in flag approach
Mandatory vs. Optional Migration
- Early consensus (Amit, Maxim): Migration should be automatic with error on misconfiguration
- Later (after Sawada's replication origin issue): Migration became opt-in via command-line flag
- Kuroda suggested this might be too ad-hoc and the root cause (origin ID stability) should be fixed first
How to Detect New Cluster Configuration
- Kuroda initially suggested using
pg_controldataoutput (which showstrack_commit_timestamp setting) - Patch author noted that if the new cluster hasn't been started yet, the control file won't reflect
postgresql.confchanges — requiring a live query topg_settings
Use Cases Established
- Audit/forensics: Knowing when any record was last modified (patch author's primary use case — 2-year retention for detecting application errors)
- Logical replication conflict detection:
update_origin_differsanddelete_origin_differsconflicts rely on commit timestamps surviving upgrade - External system integration: Pulling records modified within a time window
Hayato Kuroda demonstrated use case #2 works correctly (when origin IDs are stable) through a detailed pub-sub upgrade test.
Current Status
The patch is at v14 with unresolved architectural concerns:
- Direct UPDATE of
pg_replication_originis not acceptable (must use binary-upgrade-mode functions) - A separate patch by Ajin has been proposed to handle replication origin migration properly
- The commit_ts migration patch should likely wait for or coordinate with that work
- The opt-in
--pg-commit-tsflag is debatable as a permanent interface
Technical Subtleties
- Version gating: Commit timestamps only exist in PG ≥ 9.5, so the patch gates on
major_version >= 905 - Zero initialization: Global
ClusterInfostructs are zero-initialized, making explicit zeroing of commit_ts fields unnecessary (Kuroda noted this) - Boolean GUC representation:
pg_settings.settingalways shows "on"/"off" for booleans regardless of how the user specified the value (true/yes/1 all normalize to "on") - SLRU segment files: The copy is a simple file-level copy since SLRU segments are self-contained and XID-indexed