[PATCH] Fix stale relation close in sequence synchronization

First seen: 2026-04-28 11:24:24+00:00 · Messages: 16 · Participants: 5

Latest Update

2026-05-06 · opus 4.7

Core Problem: Stale Relation Pointer in Sequence Sync Worker

The bug resides in the logical replication sequence synchronization code path (copy_sequences() and its helper get_and_validate_seq_info() in the tablesync infrastructure). Sequence sync is a relatively recent capability in logical replication that allows sequence values to be copied from publisher to subscriber in batches.

The Defect

copy_sequences() iterates over tuples returned by the publisher's pg_get_sequence_data() SRF, calling get_and_validate_seq_info() for each row. This helper is expected to:

  1. Open the local sequence relation (returning it via an output parameter Relation *sequence_rel).
  2. Validate the remote data against local catalog state.
  3. Return a status code such as COPYSEQ_OK, COPYSEQ_SKIPPED, etc.

The bug: when pg_get_sequence_data() returns NULL columns (which it does when the publisher role lacks privileges on the sequence, or the sequence was concurrently dropped, or it's another session's temp sequence), get_and_validate_seq_info() takes an early-return path with COPYSEQ_SKIPPED before assigning the output Relation parameter. The caller's sequence_rel variable — declared outside the loop in the original code — therefore still holds the Relation pointer from the previous successfully-processed row. copy_sequences() then calls table_close() on that stale pointer, double-closing a relation that was already closed in the prior iteration. On assert-enabled builds this trips:

TRAP: failed Assert("rel->rd_refcnt > 0"); relcache.c

Why It Matters Architecturally

This is a classic output-parameter contract violation: a function with multiple return paths must either (a) guarantee all paths initialize out-params, or (b) the caller must not consume the out-param on failure paths. Here neither invariant held, and the bug only surfaces when the skipped row is not the first in the batch — which is why it escaped testing. The reproduction requires a carefully crafted privilege setup: a replication role that can read some sequences in the publication but not others, triggering pg_get_sequence_data()'s privilege check to emit NULLs rather than error out. This NULL-on-lack-of-privilege behavior of pg_get_sequence_data() is itself a deliberate design so a single inaccessible sequence does not break an entire sequence batch.

Proposed Solutions and Evolution

Two complementary fixes were considered:

  1. Defensive initialization: Clear *sequence_rel = NULL at entry to get_and_validate_seq_info() and sequence_rel = NULL after table_close() in the caller. This guarantees the out-param contract.
  2. Tightening variable lifetime: Move the sequence_rel declaration inside the per-tuple loop so C scoping rules eliminate any carry-over. Hayato Kuroda advocated this — the underlying problem is a "wrong lifetime" for the variable.

The committed fix (v4) combined both: scope the variable to the loop body (eliminating the bug class structurally) while retaining the NULL-init at function entry as defense-in-depth. In subsequent review Vignesh noted the explicit NULL-init in the helper was redundant once the scoping fix was in place; Ayush kept it as insurance but this was later trimmed.

Secondary Issue: Misleading Error Message

report_sequence_errors() previously logged "missing sequence on publisher" for any skipped sequence. Vignesh pointed out this misdiagnoses privilege failures as missing objects. Ayush's v5 generalized the wording to "missing or inaccessible sequence on publisher" and renamed internal variables missing_*unavailable_*.

Amit Kapila pushed v4 rather than v5, explicitly noting that even v5 was incomplete: sequence sync is also skipped for another session's temporary sequences on the publisher, so the taxonomy of "why was this skipped" is broader than "missing vs inaccessible." He left the door open for a more comprehensive message-generalization patch rather than block the crash fix on bikeshedding the wording.

Buildfarm Follow-up

After commit, the drongo animal (Windows) failed. The cause was the TAP test's newly-created regress_seq_repl role not being permitted by pg_hba.conf on Windows builds that don't use UNIX sockets. The fix followed the established pattern from commit def0ce3370689b939c6d7a3c3eb824d69989ef6e: pass auth_extra => [ '--create-role' => 'regress_seq_repl' ] to $node->init(), which causes pg_regress's pg_hba setup to include the role. This is a well-known pitfall for TAP tests that introduce new login roles.

Key Design Decisions and Tradeoffs

Participant Dynamics