Proposal: Conflict log history table for Logical Replication

First seen: 2025-08-05 12:24:01+00:00 · Messages: 311 · Participants: 9

Latest Update

2026-05-07 · opus 4.7

Round Update: Schema Rename Policy, Permissions Edge Cases, and Minor Cleanup

This round is largely consolidation rather than new architectural ground. The substantive items:

1. Schema Rename Debate — Settled in Favor of Permissiveness

Shveta re-opened the pg_conflict schema rename question with two concrete arguments.

Amit's ruling: don't make pg_conflict a special case. Other reserved schemas (including pg_toast, pg_catalog) allow rename-by-superuser, and the patch should not diverge. "We can try to prevent hard-coding schema names where possible but not sure we can guarantee that nothing related to pg_conflict schema won't break" — i.e., best-effort on the server side, but no hard block on rename.

Dilip accepted this and added "Analyze and avoid hardcoding the 'pg_conflict' schema name wherever possible" to the open-items list. He noted pg_toast/pg_catalog are themselves hardcoded in places like listPartitionedTables(), and that pushing the schema name into pg_subscription "is not a route we should go" — client-side psql describe output will need some other mechanism (TBD).

2. Permission Error-Message Audit

Shveta enumerated the four permission paths for ALTER TABLE on a conflict log table and documented the error each produces:

Case  Role                        Owns sub?  Error
1     Plain user                  -          permission denied for schema pg_conflict
2     Has pg_create_subscription  No         must be owner of table pg_conflict_NNNNN
3     Has pg_create_subscription  Yes        permission denied: "..." is a system catalog
4     Superuser                   N/A        permission denied: "..." is a system catalog

She flagged that case 2 arguably should also produce the "system catalog" error since even if the user owned it, the ALTER would still be blocked. Amit reviewed the code path (RangeVarCallbackForAlterRelation) and confirmed the two-stage check (ownership/superuser first, then operation permission) is by design and matches the DROP case. No change — the current ordering is the PostgreSQL norm.
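
The ordering Amit confirmed can be modeled as a chain of checks in which the first failure wins. The Python sketch below is a toy model of the table above, not PostgreSQL code: the function name and its boolean parameters are hypothetical, the error strings are abbreviated, and it assumes the conflict log table is owned by the subscription owner.

```python
def alter_clt_error(superuser: bool,
                    has_create_subscription: bool,
                    owns_subscription: bool) -> str:
    """Return the error an ALTER TABLE on a conflict log table would hit (toy model)."""
    # Schema lookup comes first: without USAGE on pg_conflict nothing else is reached.
    if not (superuser or has_create_subscription):
        return "permission denied for schema pg_conflict"
    # Stage 1: ownership (or superuser) check.
    if not (superuser or owns_subscription):
        return "must be owner of table"
    # Stage 2: operation permission; the table is treated like a system catalog.
    return "permission denied: system catalog"
```

Case 2 stops at stage 1, which is why it never reaches the "system catalog" message even though stage 2 would also have rejected the command.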

3. Vignesh's Minor Review Pass

Four minor comments, all cosmetic or defensive.

4. Open Items Named for Next Revision (Dilip)

  1. Purge hardcoded pg_conflict literals on server side; accept client-side hardcoding as unavoidable.
  2. Change how the CLT is displayed in \dRs+ (details not yet specified).
  3. Ownership transfer on ALTER SUBSCRIPTION OWNER TO — Dilip has a PoC but is still validating corner cases (this was Vignesh's earlier bug).

Amit also posted top-up cosmetic patches for Dilip to fold in.

Bottom Line

No design pivots. The pg_conflict schema model is holding up under late-stage scrutiny; the only policy question (should rename be blocked?) was answered "no, consistent with other reserved schemas." The remaining work is a cleanup pass and the known-open ownership-transfer fix.

History (1 prior analysis)
2026-05-06 · opus 4.7

Conflict Log History Table for Logical Replication

The Core Problem

PostgreSQL's logical replication conflict handling is a relatively young feature. Conflict detection landed in PG18, and while apply workers now emit structured conflict information, that information goes exclusively to the server log as plain text. For operators running active-active or bidirectional topologies, this creates three concrete pain points:

  1. Unqueryable: Parsing postgresql.log for conflict analysis is brittle — text formats change, and log rotation destroys history.
  2. Unstructured: Critical attributes (conflicting tuples, LSN, commit timestamps, origin) are buried inside free-form DETAIL: lines.
  3. Inaccessible to tooling: External monitors, resolution scripts, and audit systems cannot consume conflict data directly — they must scrape text.

The architectural significance is that without structured conflict capture, PostgreSQL cannot evolve toward richer features like automatic conflict resolution, conflict statistics, or cross-node consistency auditing. Commercial solutions (BDR's bdr.conflict_history, pgactive, Oracle GoldenGate) all provide a dedicated conflict table — this proposal brings that capability in-tree.

Design Evolution: Catalog vs. User Table vs. System-Managed Table

The thread's longest-running design axis was where the conflict history lives. Three models were evaluated:

Option A — System Catalog

Shveta's early instinct was a catalog table, noting that pg_statistic already grows with user activity without being problematic. Rejected because: (a) catalogs aren't designed for ever-growing data and lack purge semantics; (b) catalogs are hard to extend with user-directed partitioning strategies; (c) pg_dump/upgrade semantics for ever-growing catalog rows are awkward.

Option B — User Table (created internally)

This dominated the middle of the thread. Amit Kapila laid out the rationale early: ever-growing data, need for user-controlled retention/partitioning, alignment with the parallel "COPY error table" proposal [Tom Lane's thread]. A user-provided table name was contemplated (similar to slot_name), with internal auto-creation as the first-version simplification.

This raised several thorny sub-problems:

  • FOR ALL TABLES publication contamination: The conflict table would be picked up and replicated, which makes no sense (node-local data, and schema may change across major versions).
  • Ownership/permissions: If the subscription owner owns the table, they can DROP/ALTER it, breaking logging. Sawada-san raised a security concern — if run-as-owner semantics apply, who inserts?
  • Dependency lifecycle: If the subscription is dropped, does the table go too?

Multiple mitigation mechanisms were prototyped:

  • HEAP_INSERT_NO_LOGICAL flag to skip WAL decoding of these rows
  • user_catalog_table reloption — rejected because it changes decoding semantics (combo CIDs logged)
  • A new NON_PUBLISHABLE_TABLE reloption — deferred in favor of thread [1]'s EXCLUDE clause
  • Scanning pg_subscription at pg_get_publication_tables() time

Option C — Dedicated pg_conflict Schema (accepted)

Amit Kapila's pivotal proposal: create a new system-reserved schema pg_conflict, analogous to pg_toast. The table is placed there with auto-generated name pg_conflict_log_<subid>. This single design choice resolved nearly every open issue simultaneously:

  • Protection from user modification: Tables in pg_conflict are treated as system catalogs via IsSystemClass extension (IsConflictClass), blocking ALTER/DROP/TRUNCATE/INSERT/UPDATE by default.
  • Publication exclusion: is_publishable_class() rejects relations in pg_conflict namespace, with no runtime per-change scan needed.
  • pg_dump semantics: pg_dump can be taught to ignore the schema for normal dumps but preserve data across upgrades (important for audit/regulatory retention — Amit's point (c)).
  • Cross-database safety: Each database has its own pg_conflict schema, mirroring pg_toast.

The cost: a new built-in namespace with a hardcoded OID (1382) in pg_namespace.dat.
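
As a rough illustration of why a namespace test is enough, the Python sketch below models the naming rule and the publication filter. It is a simplified stand-in, not the patch: the function names are hypothetical Python spellings, the relkind test covers only a subset of what the real is_publishable_class() checks, and 2200 is used as an example of an ordinary schema OID.

```python
PG_CONFLICT_NAMESPACE = 1382  # OID cited in this summary for the pg_conflict schema


def conflict_log_table_name(subid: int) -> str:
    """Auto-generated table name for a subscription, per the naming rule above."""
    return f"pg_conflict_log_{subid}"


def is_conflict_class(relnamespace: int) -> bool:
    """True when the relation lives in the reserved pg_conflict schema."""
    return relnamespace == PG_CONFLICT_NAMESPACE


def is_publishable(relkind: str, relnamespace: int) -> bool:
    """Simplified filter: plain or partitioned tables outside pg_conflict."""
    return relkind in ("r", "p") and not is_conflict_class(relnamespace)
```

Because the decision depends only on the relation's namespace, a FOR ALL TABLES publication skips conflict log tables without consulting pg_subscription for each change.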

The Dependency Question

An extended sub-debate concerned whether to record a pg_depend entry between subscription and conflict log table. The fundamental obstacle: subscriptions are shared objects (in pg_subscription, a shared catalog), but the conflict table is database-local. Dependencies between a local object and a shared object normally go in pg_shdepend, but pg_shdepend was designed for the reverse semantics: the shared referenced object cannot be dropped while local dependents exist, yet the local dependents can be dropped independently. DEPENDENCY_INTERNAL requires the opposite: the dependent cannot be dropped on its own, only together with the object it belongs to.

Dilip's analysis identified that findDependentObjects() distinguishes between outermost drops and recursive drops, and doDeletion() refuses shared objects entirely. A small patch was proposed to treat DEPENDENCY_INTERNAL on shared referenced objects specially. Amit framed this as architecturally sound because:

  • Tablespaces need pg_shdepend (cross-DB visibility during DROP TABLESPACE)
  • Subscriptions are pinned to subdbid, so DROP SUBSCRIPTION always runs in the owning database where pg_depend is directly accessible

Ultimately, with the pg_conflict schema providing protection via IsSystemClass, the dependency became mostly cosmetic: it still makes drops explicit via performDeletion(), but it is not load-bearing for user protection.

The Multi-Unique-Conflict Data Model

A surprisingly difficult schema question was how to represent multiple_unique_conflicts — where one remote tuple collides with multiple local tuples via different unique indexes. Options considered:

  1. Multiple rows per conflict (initial prototype): bloats the table, breaks correspondence with pg_stat_subscription_stats counters (1 conflict = 1 counter increment).
  2. Two-table header/detail (Amit's proposal): normalized, but requires JOIN for every query.
  3. Array of conflict tuples (initial array-of-JSON attempt): each local field (xid, commit_ts, origin) becomes its own array column — messy.
  4. Single JSONB/JSON array of objects (accepted): one row per conflict, local_conflicts column holds [{"xid":..., "commit_ts":..., "origin":..., "key":..., "tuple":...}, ...].

Amit's market analysis was the decisive argument — active-active systems routinely experience multi-unique-key conflicts (user profiles colliding on both email and SSN; SKU collisions during data migration; Ops retry loops that hit ID error → fix → email error → fix → phone error). So the "this is rare" argument (initially Dilip's) was overruled, and the array-of-objects format was chosen.

The final schema:

relid, schemaname, relname, conflict_type,
remote_xid, remote_commit_lsn, remote_commit_ts, remote_origin,
replica_identity (JSON), remote_tuple (JSON),
local_conflicts (JSON[])
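
To make the accepted shape concrete, here is a small Python sketch of one log row for a multiple_unique_conflicts event. It covers only a subset of the columns, and every value (xids, timestamps, origin names, tuples) is invented for illustration; the helper name is hypothetical.

```python
import json


def build_conflict_row(conflict_type: str, remote_tuple: dict,
                       local_conflicts: list) -> dict:
    """Return one log row; each local tuple becomes one element of local_conflicts."""
    return {
        "conflict_type": conflict_type,
        "remote_tuple": json.dumps(remote_tuple),
        # One JSON object per conflicting local tuple, all in the same row.
        "local_conflicts": [json.dumps(c) for c in local_conflicts],
    }


row = build_conflict_row(
    "multiple_unique_conflicts",
    {"id": 7, "email": "a@example.com", "ssn": "000-00-0000"},
    [
        {"xid": 741, "commit_ts": "2026-01-01T00:00:00Z", "origin": "node_b",
         "key": {"email": "a@example.com"},
         "tuple": {"id": 3, "email": "a@example.com", "ssn": "111-11-1111"}},
        {"xid": 752, "commit_ts": "2026-01-01T00:00:05Z", "origin": "node_c",
         "key": {"ssn": "000-00-0000"},
         "tuple": {"id": 5, "email": "b@example.com", "ssn": "000-00-0000"}},
    ],
)
```

The remote tuple collides with two local tuples, yet the result is a single row, which keeps the table in step with the one-increment-per-conflict counters in pg_stat_subscription_stats.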

Subscription Options: From conflict_log_table to conflict_log_destination

The subscription option evolved through several generations:

  1. conflict_log_table='name' — user-provided table name, with NONE to disable.
  2. Two options: conflict_log_format + conflict_log_name — allow future extensibility to XML/file output.
  3. conflict_log_destination = log | table | all (with bitmask implementation). Sawada-san noted that with destination=table, conflicts wouldn't reach monitoring tools watching the server log, so a combined mode was needed.
  4. Debate on whether to support comma-separated multi-destination lists (deferred — only two destinations exist, all suffices).

When destination is table only, the server log still emits a brief "Conflict details logged to internal table with OID %u" message so monitors can still detect conflicts occurred.
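
The destination handling can be sketched with a Python IntFlag. The numeric values mirror the CONFLICT_LOG_DEST_* constants cited elsewhere in this summary; the class, the function names, and the exact routing logic are assumptions made for illustration, not the patch's code.

```python
from enum import IntFlag


class ConflictLogDest(IntFlag):
    LOG = 0x01
    TABLE = 0x02
    ALL = 0x03  # LOG | TABLE


def parse_destination(value: str) -> ConflictLogDest:
    """Map the subscription option string to its bitmask."""
    mapping = {"log": ConflictLogDest.LOG,
               "table": ConflictLogDest.TABLE,
               "all": ConflictLogDest.ALL}
    try:
        return mapping[value.lower()]
    except KeyError:
        raise ValueError(f"unrecognized conflict_log_destination: {value!r}")


def server_log_lines(dest: ConflictLogDest, detail: str, table_oid: int) -> list:
    """Lines written to the server log for one conflict."""
    if dest & ConflictLogDest.LOG:
        return [detail]
    # Table-only mode still leaves a brief breadcrumb for log-based monitors.
    return [f"Conflict details logged to internal table with OID {table_oid}"]


def writes_to_table(dest: ConflictLogDest) -> bool:
    return bool(dest & ConflictLogDest.TABLE)
```

With this shape, each call site tests a single bit, so "all" needs no special handling beyond having both bits set.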

Transaction Semantics for Conflict Inserts

For conflicts that raise ERROR (notably multiple_unique_conflicts), naively inserting the log row in the same transaction would roll it back with the apply transaction. The solution:

  1. prepare_conflict_log_tuple() builds the heap tuple and stashes it in MyLogicalRepWorker->conflict_log_tuple.
  2. For elevel < ERROR, insert immediately.
  3. For elevel >= ERROR, the PG_CATCH block in start_apply() opens a fresh transaction and inserts the deferred tuple before the worker exits.
  4. HEAP_INSERT_NO_LOGICAL flag is passed to heap_insert() so these tuples don't generate logical decoding output (belt-and-braces alongside the pg_conflict namespace exclusion).
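
The deferral in steps 1 to 3 can be demonstrated by analogy with Python's sqlite3 module. This is only a model of the transaction behavior: the table names, the ApplyError class, and the numeric severity levels are stand-ins, and SQLite's rollback plays the role of the aborted apply transaction.

```python
import sqlite3

LOG, ERROR = 15, 21  # illustrative severity levels; only their ordering matters here


class ApplyError(Exception):
    """Stands in for an ERROR-level conflict raised during apply."""


def run_apply(conn: sqlite3.Connection, elevel: int) -> None:
    deferred = None  # plays the role of MyLogicalRepWorker->conflict_log_tuple
    try:
        conn.execute("INSERT INTO applied(change) VALUES ('remote row')")
        log_tuple = (f"conflict at elevel {elevel}",)  # step 1: build the tuple
        if elevel < ERROR:
            # Step 2: non-fatal conflict, log inside the apply transaction.
            conn.execute("INSERT INTO conflict_log(detail) VALUES (?)", log_tuple)
            conn.commit()
        else:
            deferred = log_tuple
            raise ApplyError("conflict")
    except ApplyError:
        conn.rollback()  # the apply transaction is lost
        # Step 3: a fresh transaction persists the stashed tuple.
        conn.execute("INSERT INTO conflict_log(detail) VALUES (?)", deferred)
        conn.commit()


def demo(elevel: int) -> tuple:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE applied(change TEXT)")
    conn.execute("CREATE TABLE conflict_log(detail TEXT)")
    conn.commit()
    run_apply(conn, elevel)
    applied = conn.execute("SELECT count(*) FROM applied").fetchone()[0]
    logged = conn.execute("SELECT count(*) FROM conflict_log").fetchone()[0]
    conn.close()
    return applied, logged
```

In the ERROR case the applied change disappears with the rollback while the conflict row survives, which is the property the deferred insert exists to guarantee.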

This also drives why heap_create_with_catalog() is used directly rather than SPI — SPI would be affected by default_tablespace, default_toast_compression, event triggers, and utility hooks, making debugging harder and behavior less predictable.

Permissions Model (Late-Breaking)

Nisha discovered that non-superuser subscription owners couldn't SELECT from their own conflict log table because pg_conflict schema usage is blocked at parse time (LookupExplicitNamespace → aclcheck_error). Amit's solution in the final iterations:

  • Grant USAGE on pg_conflict to pg_create_subscription role.
  • Allow SELECT/DELETE/TRUNCATE on tables to table owners.
  • Block INSERT/UPDATE via pg_class_aclmask_ext() tweak (apply worker bypasses via direct heap_insert()).
  • Ownership of the CLT must track subscription ownership (Vignesh's bug report: ALTER SUBSCRIPTION OWNER TO left CLT owner stale).
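
The resulting privilege rules for SQL-level access reduce to a small decision function. The sketch below is a hypothetical Python model of the bullets above, not the pg_class_aclmask_ext() change itself; it ignores superuser handling and the apply worker, which bypasses these checks by calling heap_insert() directly.

```python
OWNER_ALLOWED = frozenset({"SELECT", "DELETE", "TRUNCATE"})
ALWAYS_BLOCKED = frozenset({"INSERT", "UPDATE"})


def clt_command_allowed(command: str, is_owner: bool) -> bool:
    """Would a SQL command on a conflict log table pass the privilege check?"""
    command = command.upper()
    if command in ALWAYS_BLOCKED:
        # Only the apply worker writes rows, and it does not go through SQL.
        return False
    return is_owner and command in OWNER_ALLOWED
```

The last bullet is what keeps is_owner meaningful: if table ownership does not follow ALTER SUBSCRIPTION OWNER TO, the new subscription owner fails this check for their own conflict history.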

Participant Weight and Domain Expertise

  • Dilip Kumar (patch author): drove implementation, made most design calls. Senior contributor with prior logical replication patches.
  • Amit Kapila (committer, logical replication subsystem owner): the decisive voice on every major design pivot — user vs. catalog table, the pg_conflict schema proposal, the multi-unique-conflict JSON array, upgrade semantics. His opinions consistently carried the day.
  • Masahiko Sawada (committer): raised the critical security concern about table insertion privilege (run-as-owner parallel), and the user_catalog_table cross-major-version compatibility argument.
  • Shveta Malik: most thorough reviewer on behavioral correctness — discovered the Assert failure in multiple_unique_conflicts, the key_tuple vs. replica_identity representation gap, the SELECT-FOR-UPDATE hole, and multiple schema-qualification issues.
  • Peter Smith: primary code-style/consistency reviewer across ~15 review rounds; drove the bitmask enum representation and consistent CLT terminology.
  • Vignesh C: caught the lock-level bug (RowExclusiveLock vs AccessExclusiveLock), the rename-publication-bypass, pg_dump dependency cycle, the ownership-transfer bug, and the pg_conflict schema rename case.
  • Bharath Rupireddy: raised the alternative of streaming to a separate log file; the idea was debated but ultimately rejected.
  • Alastair Turner: suggested the typed-table + dependency approach early on (influenced the eventual pg_conflict design).
  • Nisha Moond: late-stage testing, found the NULL-column apply-worker crash and permissions gap.

Key Architectural Insights

  1. Mirroring pg_toast: By treating pg_conflict exactly like pg_toast (reserved schema, IsSystemClass membership, internal dependency, auto-create/drop with owning object), the patch inherits a well-understood pattern rather than inventing new machinery.

  2. Bitmask destination enum: Although the thread oscillated on sparse-array waste, the bitmask form (CONFLICT_LOG_DEST_LOG=0x01, CONFLICT_LOG_DEST_TABLE=0x02, CONFLICT_LOG_DEST_ALL=0x03) future-proofs for additional destinations without cascading changes to if-conditions.

  3. Cross-major-version upgrade is non-negotiable: Amit's argument that regulatory/audit retention means conflict history must survive pg_upgrade shaped the table placement decision — an ordinary catalog table wouldn't naturally survive upgrade, while a pg_conflict-schema user-relation does (with pg_dump special-casing).

  4. The ever-growing problem isn't solved here: Retention/partitioning is deferred to future patches. The DELETE/TRUNCATE permissions granted to subscription owners are the user-level escape hatch.