Technical Analysis: Fix NULL Dereference in Subscription REFRESH on Concurrent DROP
Core Problem
This thread addresses a crash bug (segfault) in PostgreSQL's logical replication subscription management code. The issue manifests when ALTER SUBSCRIPTION ... REFRESH PUBLICATION is executed concurrently with a DROP TABLE (or DROP SEQUENCE) on a table that is part of the subscription.
Architectural Context
PostgreSQL's logical replication uses subscriptions to track which publications (and their underlying tables/sequences) a subscriber node should replicate. When a user issues ALTER SUBSCRIPTION ... REFRESH PUBLICATION, the system must reconcile the current state of published tables with what the subscription knows about. This involves:
- Collecting a list of OIDs for locally-subscribed relations (subrel_local_oids)
- Iterating over those OIDs to check origin information
- Calling get_rel_name() to resolve OIDs to relation names for diagnostic/error messages
The Race Condition
The critical vulnerability lies in the lack of relation-level locks during the iteration in check_publications_origin_tables() (and the analogous check_publications_origin_sequences()). The sequence of events is:
- Session A begins ALTER SUBSCRIPTION ... REFRESH PUBLICATION and collects subrel_local_oids, a list of OIDs for relations currently in the subscription.
- Session B concurrently executes DROP TABLE on one of those relations, removing it from the catalog.
- Session A continues iterating and calls get_rel_name(oid) for the now-dropped relation. Since the catalog entry no longer exists, get_rel_name() returns NULL.
- That NULL pointer is passed directly to quote_literal_cstr(), which unconditionally dereferences it, causing a segmentation fault.
This is a classic TOCTOU (Time-of-Check-to-Time-of-Use) race condition. The OID list represents a snapshot that becomes stale between collection and use.
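The stale-snapshot problem can be reproduced in miniature without a server. The following is a single-threaded sketch, not PostgreSQL code: toy_catalog, toy_collect_oids(), toy_drop_relation() and toy_get_rel_name() are hypothetical stand-ins, and the only property borrowed from the real get_rel_name() is that it returns NULL when the relation no longer exists.

```c
#include <stddef.h>

typedef unsigned int Oid;

/* Toy stand-in for one catalog row; not a PostgreSQL structure. */
typedef struct ToyRel
{
    Oid         oid;
    const char *name;
    int         live;           /* 0 once the relation has been dropped */
} ToyRel;

ToyRel toy_catalog[] = {
    {16384, "orders", 1},
    {16385, "customers", 1},
    {16386, "invoices", 1},
};

#define TOY_NRELS (sizeof(toy_catalog) / sizeof(toy_catalog[0]))

/* Session A, first step: copy the OIDs of all live relations into out[]. */
size_t
toy_collect_oids(Oid *out, size_t max)
{
    size_t      n = 0;

    for (size_t i = 0; i < TOY_NRELS && n < max; i++)
        if (toy_catalog[i].live)
            out[n++] = toy_catalog[i].oid;
    return n;
}

/* Session B: drop a relation. Previously collected OID lists are not updated. */
void
toy_drop_relation(Oid oid)
{
    for (size_t i = 0; i < TOY_NRELS; i++)
        if (toy_catalog[i].oid == oid)
            toy_catalog[i].live = 0;
}

/* Session A, later step: returns NULL for a dropped relation. */
const char *
toy_get_rel_name(Oid oid)
{
    for (size_t i = 0; i < TOY_NRELS; i++)
        if (toy_catalog[i].oid == oid && toy_catalog[i].live)
            return toy_catalog[i].name;
    return NULL;
}
```

Collecting the OIDs, dropping 16385, and then resolving the collected list yields a NULL for the second entry even though the list still has three elements. Any caller that passes the result straight to a string function is in the position described above.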
Why This Matters Architecturally
- Server crash from a race condition in DDL: ALTER SUBSCRIPTION requires ownership of the subscription, so this is not reachable by arbitrary unprivileged users, but any subscription owner who issues ALTER SUBSCRIPTION ... REFRESH PUBLICATION can trigger it if concurrent DDL is happening. This is a reliability and availability concern.
- Pattern recurrence: The same unsafe pattern (get_rel_name() result used without a NULL check) likely exists elsewhere in the codebase; this fix establishes the correct defensive pattern for subscription code.
- Lock granularity tradeoff: The broader architectural question is whether the code should acquire AccessShareLock on each relation before resolving names (preventing concurrent DROP) or simply handle the NULL case gracefully. The patch author chose the latter (skip the relation if it's gone), which is the lighter-weight approach and appropriate here since a dropped table is no longer relevant to the subscription refresh.
Proposed Solution
The patch adds NULL checks after calls to get_rel_name() and get_namespace_name() in both:
- check_publications_origin_tables()
- check_publications_origin_sequences()
If the relation name resolves to NULL (indicating the relation was dropped concurrently), the code simply skips that relation and continues processing the rest. This is semantically correct because:
- A dropped relation cannot be part of any publication anymore.
- The subscription refresh will naturally remove it from the subscription's relation set.
- There's no useful diagnostic or error to emit about a relation that no longer exists.
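The skip-on-NULL behavior can be sketched as follows. This is an illustration of the pattern only, not the patch itself: the toy_* names, the catalog contents, and the clause text are assumptions, and the sketch formats names with snprintf() where the real code quotes them with quote_literal_cstr() (the toy names contain no quote characters, so no escaping is needed here).

```c
#include <stddef.h>
#include <stdio.h>

typedef unsigned int Oid;

typedef struct ToyRel
{
    Oid         oid;
    const char *nspname;
    const char *relname;
} ToyRel;

/* OID 16385 is deliberately absent, as if it was dropped after the OID list was built. */
const ToyRel toy_catalog[] = {
    {16384, "public", "orders"},
    {16386, "sales", "invoices"},
};

const ToyRel *
toy_lookup(Oid oid)
{
    for (size_t i = 0; i < sizeof(toy_catalog) / sizeof(toy_catalog[0]); i++)
        if (toy_catalog[i].oid == oid)
            return &toy_catalog[i];
    return NULL;
}

/*
 * Append one filter clause per relation that still exists and return how
 * many relations were kept. Relations whose names resolve to NULL are skipped.
 */
size_t
toy_build_filter(const Oid *oids, size_t noids, char *buf, size_t buflen)
{
    size_t      used = 0;
    size_t      kept = 0;

    if (buflen == 0)
        return 0;
    buf[0] = '\0';

    for (size_t i = 0; i < noids; i++)
    {
        const ToyRel *rel = toy_lookup(oids[i]);
        const char *nspname = rel ? rel->nspname : NULL;
        const char *relname = rel ? rel->relname : NULL;
        int         n;

        /* The defensive check: a concurrently dropped relation is ignored. */
        if (nspname == NULL || relname == NULL)
            continue;

        n = snprintf(buf + used, buflen - used,
                     "AND NOT (N.nspname = '%s' AND C.relname = '%s')\n",
                     nspname, relname);
        if (n < 0 || (size_t) n >= buflen - used)
        {
            buf[used] = '\0';   /* discard the partial clause and stop */
            break;
        }
        used += (size_t) n;
        kept++;
    }
    return kept;
}
```

Called with the OIDs 16384, 16385 and 16386, the function emits clauses for the two surviving relations and silently omits the missing one, which mirrors the behavior the three points above argue for.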
Alternative Approaches Not Taken
- Acquiring locks on each relation: Would prevent the race entirely but adds overhead and potential deadlock risk during what should be a lightweight metadata operation.
- Re-checking the OID list inside a single snapshot: More complex and still wouldn't guarantee the relation exists by the time the name is resolved without holding a lock.
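For contrast, the lock-based alternative follows a lock-then-recheck shape: take a share lock, then confirm the relation still exists, and release the lock if it does not. The sketch below is a single-threaded toy model under stated assumptions. All toy_* names are hypothetical, and the toy DROP simply fails while a share lock is held, whereas a real DROP TABLE would wait for the conflicting lock to be released.

```c
#include <stddef.h>

typedef unsigned int Oid;

typedef struct ToyRel
{
    Oid         oid;
    int         live;           /* 0 once dropped */
    int         share_locks;    /* number of share locks currently held */
} ToyRel;

ToyRel toy_rels[] = {
    {16384, 1, 0},
    {16385, 1, 0},
};

ToyRel *
toy_find(Oid oid)
{
    for (size_t i = 0; i < sizeof(toy_rels) / sizeof(toy_rels[0]); i++)
        if (toy_rels[i].oid == oid)
            return &toy_rels[i];
    return NULL;
}

/* Toy DROP: needs exclusive access. Returns 1 if the relation was dropped. */
int
toy_try_drop(Oid oid)
{
    ToyRel     *rel = toy_find(oid);

    if (rel == NULL || !rel->live || rel->share_locks > 0)
        return 0;
    rel->live = 0;
    return 1;
}

/*
 * Lock first, then recheck. Returns 1 with the share lock held if the
 * relation still exists; returns 0 with no lock held if it is gone.
 */
int
toy_lock_if_exists(Oid oid)
{
    ToyRel     *rel = toy_find(oid);

    if (rel == NULL)
        return 0;
    rel->share_locks++;
    if (!rel->live)
    {
        rel->share_locks--;
        return 0;
    }
    return 1;
}

void
toy_unlock(Oid oid)
{
    ToyRel     *rel = toy_find(oid);

    if (rel != NULL && rel->share_locks > 0)
        rel->share_locks--;
}
```

Even in this form the caller still has to handle the "already gone" result from the recheck, so locking adds cost without removing the need for a missing-relation branch. That is consistent with the choice to handle NULL directly.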
Key Technical Insights
The fix is minimal and defensive. It follows the same pattern used elsewhere in PostgreSQL where catalog lookups on OIDs may return NULL for concurrently-dropped objects (e.g., pg_stat views, autovacuum workers). The principle is: if a system catalog lookup returns NULL for an OID that was valid moments ago, treat the object as gone and proceed gracefully.
Assessment
This appears to be a straightforward, low-risk bug fix for a genuine crash scenario. The patch is small in scope and follows established PostgreSQL patterns for handling concurrent DDL. It would likely be back-patched to every supported branch that contains the affected code. Which branches those are is not established here: the subscription origin option dates to PostgreSQL 16, and check_publications_origin_sequences() may exist only in newer code, so the back-patch range depends on where the unguarded get_rel_name() call was introduced.