pg_createsubscriber: Fix incorrect handling of cleanup flags

First seen: 2025-05-05 03:45:35+00:00 · Messages: 6 · Participants: 3

Latest Update

2026-05-11 · opus 4.7

pg_createsubscriber: Cleanup Flag Handling Bug

The Core Problem

pg_createsubscriber is a utility that converts a physical standby into a logical replication subscriber. During this conversion, it must create several objects on both the primary (publisher) and the target (subscriber), including:

Because this multi-step operation can fail at any point, the tool uses an atexit cleanup handler (cleanup_objects_atexit()) to roll back partial state. Two boolean flags in the per-database dbinfo structure gate this cleanup:

These flags exist because the tool must distinguish between objects it created (which it owns and should drop on failure) versus pre-existing objects the user asked it to interact with (e.g., via --remove=publications, which drops user-owned publications on the subscriber side after physical replication completes).

The Bug

The functions drop_publication() and drop_replication_slot() are reused for multiple distinct purposes:

  1. Dropping the tool-created publication/slot on the primary during cleanup
  2. Dropping replicated publications on the subscriber (they arrive via the base backup and must be removed)
  3. Dropping user-requested publications on the subscriber (--remove=publications)
  4. Dropping the physical replication slot on the primary (different slot, not the logical one)
  5. Dropping failover-synced slots on the subscriber

The bug: regardless of which of these scenarios is executing, the drop functions unconditionally set made_publication = false / made_replslot = false on failure. So if the tool fails to drop a subscriber-side replicated publication, it flips made_publication to false — and then when cleanup_objects_atexit() later fires due to a subsequent failure, it sees the flag as false and skips dropping the publication the tool created on the primary. Result: orphaned publications and logical replication slots on the primary, which a user then must clean up manually.

Design Debate and Resolution

Nisha's initial v1 patch added conditional logic to only reset the flags when the drop was occurring on the primary for the tool-created object. David Johnston (a committer-adjacent active reviewer with strong opinions on interface design) pushed back with two observations:

  1. The real problem is overloading: the drop functions mutate state belonging to a specific named object, but are invoked for many differently-named objects. The fix should narrow the mutation to only occur when the name being dropped matches the tracked object — or better, eliminate the mutation entirely.

  2. The "don't retry" semantics are pointless: resetting the flag to false exists to prevent cleanup_objects_atexit() from re-attempting a drop that just failed. But since atexit cleanup is by definition the last action before process termination, "don't retry" is a non-concern. The worst case of leaving the flag set is a duplicate error message — harmless.

David's preferred (simpler) option: just delete the *made_publication = false and dbinfo->made_replslot = false assignments altogether. Nisha empirically validated this by attempting to induce recursion or repeated-cleanup scenarios and found none — the cleanup path does not re-enter these drops in a way that matters.

The v2+ patches adopt this minimal approach. Fujii Masao (committer) reviewed v4 favorably and suggested the natural follow-up: since made_publication is no longer written inside drop_publication(), the parameter itself should be removed from the function signature — a small API cleanup that reflects the reduced responsibility of the function.

Architectural Insight

This bug is a textbook example of feature creep in a helper function eroding its invariants. drop_publication() began, presumably, as a cleanup helper tied to the tool-created publication (hence the flag mutation made sense — "I dropped it, so clear the 'needs dropping' flag"). As the function was reused for --remove=publications and for dropping replicated publications on the subscriber, the original invariant silently broke, because the mutation now occurred in contexts where the flag had no semantic connection to the object being dropped.

The fix is notable for what it doesn't do: it doesn't add more conditional logic or more parameters. It removes state mutation from a function that no longer has a legitimate reason to perform it, shrinking the function's contract to match its actual multi-purpose usage. The atexit handler becomes the sole authority on whether cleanup is needed, which is the correct locus of that decision.

Rebasing History

The patch sat for nearly a year between v2 (May 2025) and v3 (April 2026), requiring a rebase after commit d6628a5 touched pg_createsubscriber, and another rebase a month later for v4. This suggests the patch was not contentious but was low-priority relative to other pg_createsubscriber work happening concurrently.