Core Problem: Stale Relation Pointer in Sequence Sync Worker
The bug resides in the logical replication sequence synchronization code path (copy_sequences() and its helper get_and_validate_seq_info() in the tablesync infrastructure). Sequence sync is a relatively recent capability in logical replication that allows sequence values to be copied from publisher to subscriber in batches.
The Defect
copy_sequences() iterates over tuples returned by the publisher's pg_get_sequence_data() SRF, calling get_and_validate_seq_info() for each row. This helper is expected to:
- Open the local sequence relation (returning it via an output parameter
Relation *sequence_rel). - Validate the remote data against local catalog state.
- Return a status code such as
COPYSEQ_OK,COPYSEQ_SKIPPED, etc.
The bug: when pg_get_sequence_data() returns NULL columns (which it does when the publisher role lacks privileges on the sequence, or the sequence was concurrently dropped, or it's another session's temp sequence), get_and_validate_seq_info() takes an early-return path with COPYSEQ_SKIPPED before assigning the output Relation parameter. The caller's sequence_rel variable — declared outside the loop in the original code — therefore still holds the Relation pointer from the previous successfully-processed row. copy_sequences() then calls table_close() on that stale pointer, double-closing a relation that was already closed in the prior iteration. On assert-enabled builds this trips:
TRAP: failed Assert("rel->rd_refcnt > 0"); relcache.c
Why It Matters Architecturally
This is a classic output-parameter contract violation: a function with multiple return paths must either (a) guarantee all paths initialize out-params, or (b) the caller must not consume the out-param on failure paths. Here neither invariant held, and the bug only surfaces when the skipped row is not the first in the batch — which is why it escaped testing. The reproduction requires a carefully crafted privilege setup: a replication role that can read some sequences in the publication but not others, triggering pg_get_sequence_data()'s privilege check to emit NULLs rather than error out. This NULL-on-lack-of-privilege behavior of pg_get_sequence_data() is itself a deliberate design so a single inaccessible sequence does not break an entire sequence batch.
Proposed Solutions and Evolution
Two complementary fixes were considered:
- Defensive initialization: Clear
*sequence_rel = NULLat entry toget_and_validate_seq_info()andsequence_rel = NULLaftertable_close()in the caller. This guarantees the out-param contract. - Tightening variable lifetime: Move the
sequence_reldeclaration inside the per-tuple loop so C scoping rules eliminate any carry-over. Hayato Kuroda advocated this — the underlying problem is a "wrong lifetime" for the variable.
The committed fix (v4) combined both: scope the variable to the loop body (eliminating the bug class structurally) while retaining the NULL-init at function entry as defense-in-depth. In subsequent review Vignesh noted the explicit NULL-init in the helper was redundant once the scoping fix was in place; Ayush kept it as insurance but this was later trimmed.
Secondary Issue: Misleading Error Message
report_sequence_errors() previously logged "missing sequence on publisher" for any skipped sequence. Vignesh pointed out this misdiagnoses privilege failures as missing objects. Ayush's v5 generalized the wording to "missing or inaccessible sequence on publisher" and renamed internal variables missing_* → unavailable_*.
Amit Kapila pushed v4 rather than v5, explicitly noting that even v5 was incomplete: sequence sync is also skipped for another session's temporary sequences on the publisher, so the taxonomy of "why was this skipped" is broader than "missing vs inaccessible." He left the door open for a more comprehensive message-generalization patch rather than block the crash fix on bikeshedding the wording.
Buildfarm Follow-up
After commit, the drongo animal (Windows) failed. The cause was the TAP test's newly-created regress_seq_repl role not being permitted by pg_hba.conf on Windows builds that don't use UNIX sockets. The fix followed the established pattern from commit def0ce3370689b939c6d7a3c3eb824d69989ef6e: pass auth_extra => [ '--create-role' => 'regress_seq_repl' ] to $node->init(), which causes pg_regress's pg_hba setup to include the role. This is a well-known pitfall for TAP tests that introduce new login roles.
Key Design Decisions and Tradeoffs
- Scope reduction over defensive nulling: The community preferred narrowing variable scope as the primary remedy — a structural fix — rather than relying purely on discipline around out-param initialization. This is consistent with broader PostgreSQL style guidance.
- Test strategy: Initial test used
CREATE SUBSCRIPTIONagainst a second connection string. Vignesh pushed forALTER SUBSCRIPTION ... SET CONNECTION+REFRESH SEQUENCESto avoid spawning a full new apply worker and resyncing all relations — a meaningful TAP test runtime saving. This reflects the reviewer's experience with the cost structure of logical replication test setup. - Message generality vs. shipping the fix: Amit chose correctness-first (ship v4's crash fix) over a more polished message (v5), recognizing that message-wording is orthogonal to the bug and has its own design question (the full set of skip reasons).
Participant Dynamics
- Ayush Tiwari — reporter and patch author; responsive to review, turned around 5 revisions in 3 days.
- Hayato Kuroda (Fujitsu) — confirmed reproduction, suggested the lifetime-scoping approach that became the structural fix.
- Vignesh C — original author of the sequence sync feature (hence CC'd by Kuroda); provided the most detailed review covering test efficiency, naming consistency (
regress_s5), comment quality, and error-message semantics. His domain ownership here gives his feedback substantial weight. - Amit Kapila — committer; made the final call on what to push and what to defer, and handled the buildfarm follow-up commit.
- Ajin Cherian — flagged the drongo buildfarm failure promptly.