Fix race condition in pg_get_publication_tables with concurrent DROP TABLE

First seen: 2026-04-22 19:31:18+00:00 · Messages: 25 · Participants: 5

Latest Update

2026-05-06 · opus 4.7

Race Condition in pg_get_publication_tables with Concurrent DROP TABLE

The Core Problem

pg_get_publication_tables() is a set-returning function (SRF) that backs the pg_publication_tables system view and, critically, is consumed by logical replication tablesync workers to decide which relations to synchronize. It is implemented in the classic value-per-call SRF style: on the first call (SRF_IS_FIRSTCALL) it materializes a List *table_infos of table OIDs into funcctx->user_fctx; on each subsequent call it pulls the next OID out of the list and constructs a row.

The race window is structural:

  1. First call: The list of OIDs is built by walking pg_publication/pg_publication_rel (and, for FOR ALL TABLES, scanning pg_class) without holding locks on the target relations. This is deliberate — taking AccessShareLock on potentially thousands of tables just to report metadata would be prohibitively expensive and would block DDL.
  2. Subsequent calls: For each OID, when no explicit column list is published, the code calls table_open(relid, AccessShareLock) in order to enumerate publishable columns via RelationGetDescr(rel) and emit them as an int2vector in the attrs output column.

If a DROP TABLE commits between step 1 and step 2, table_open() raises ERROR: could not open relation with OID nnnnn. In workloads with churny schemas or FOR ALL TABLES publications, this turns a simple SELECT * FROM pg_publication_tables — and worse, internal tablesync queries — into an unreliable operation.

Bharath's archaeology pinpointed the regression to commit b7ae03953690 (PG16, "Ignore dropped and generated columns from the column list"), which introduced the table_open() in this path. PG15 and earlier used only get_rel_namespace() and syscache lookups, both of which tolerate missing relations by returning InvalidOid/false. Hence the fix is PG16+ only.

Design Decisions and Tradeoffs

Decision 1: Skip vs. error

All reviewers quickly converged on skipping silently as the correct semantics. The function returns a point-in-time snapshot; tables created after the first call are already not reported, so symmetrically not reporting tables dropped after the first call is consistent. Callers (including the view definition itself) already JOIN pg_class and thus naturally filter stale OIDs. This is the standard MVCC-ish "best effort enumeration" contract shared by many catalog-reporting SRFs.

Decision 2: How to skip an SRF row cleanly

The v1 patch manually did funcctx->call_cntr++; continue; to advance the SRF without emitting a tuple. Shveta flagged this as an API abuse — SRF internals should only be mutated through SRF_* macros, and the only precedent in the tree (hstore_svals) literally has a comment "ugly ugly ugly. why no macro for this?".

Bertrand proposed the accepted alternative: store both the List and a private curr_idx inside a small state struct attached to funcctx->user_fctx (mirroring pg_timezone_abbrevs_zone()). This decouples "which row to emit next" from the SRF machinery's call_cntr, so continue in a skip path is trivially correct. Bertrand further refined this by suggesting the curr_idx++ be placed immediately after list_nth(), before any skip checks, so future skip conditions cannot accidentally create an infinite loop.

Bharath considered and rejected converting the function to a materialized SRF (InitMaterializedSRF + tuplestore_putvalues). That would eliminate the per-call state problem entirely but balloons the diff, and — importantly for backpatching — the materialization helper APIs differ across PG14/15/16+, complicating back-branch patches.

Decision 3: Partial fix concern (column-list path)

Evan Li raised a substantive objection: try_table_open() is only invoked in the no-column-list branch. If a publication row specifies an explicit column list, the function reads the columns from pg_publication_rel without ever opening the relation — so it could still emit rows for a concurrently dropped table.

Bharath and Shveta's rebuttal is architecturally correct: when an explicit column list exists, the row is read from pg_publication_rel, and DROP TABLE's dependency cleanup removes those rows via pg_depend; on top of that, the loop's get_rel_namespace() check returns InvalidOid for a vanished relation, so the column-list branch already skips dropped tables without ever needing table_open().

Evan accepted this ("Fair") but usefully pointed out that the existing comment /* Skip if the relation has been dropped */ is misleading because the actual check is on schemaid. Shveta's reformulation makes the intent explicit:

/*
 * get_rel_namespace() returns InvalidOid if the relation no longer exists
 * (e.g., dropped concurrently). Skip such entries.
 */
if (!OidIsValid(schemaid = get_rel_namespace(relid)))
    continue;

Decision 4: Testability — polling vs. injection point

v1 used a polling-based TAP test that created hundreds of tables. Bertrand correctly diagnosed it as racy in the other direction: the drops could complete before the first poll, making the test pass even against an unpatched server (demonstrated by inserting pg_sleep(2)). Bharath switched to an INJECTION_POINT named pg-get-publication-tables-after-list-built, fired right after the OID list is materialized in the first SRF call. This deterministically pauses execution inside the window where the race is possible, letting the test drop a relation and then unblock the SRF.

A secondary portability issue surfaced on FreeBSD's debug_parallel_query=regress build: forced parallelism causes the injection point to fire inside parallel workers too, breaking the test. v6 disables forced parallel query for this test case.

Key Technical Insights

  1. SRF state discipline: funcctx->call_cntr is an opaque cursor for the SRF machinery. When a function needs "skip semantics," it should maintain its own index in user_fctx rather than mutating call_cntr. The pg_timezone_abbrevs_zone() pattern is the idiomatic template.

  2. Catalog SRFs are inherently snapshot-like: Any function that enumerates catalog objects and does follow-up work on each one faces this race. The defensive primitive is try_relation_open/try_table_open, which returns NULL for a dropped relation instead of erroring. This pattern appears throughout the backend (e.g., get_rel_name behavior, many pg_stat_* functions).

  3. Dependency-driven catalog cleanup is a soft guarantee: The column-list path survives without try_table_open() specifically because DROP TABLE cascades through pg_depend and removes pg_publication_rel rows — but that guarantee holds only because the reading query sees a consistent catalog snapshot via syscache/MVCC. The table_open() path is different because pg_class lookup through the relcache can still find a valid-looking entry at scan time, only to fail when opening.

  4. Backpatch scoping via regression archaeology: Identifying b7ae03953690 as the offending commit narrows the backpatch target to PG16+. Understanding why earlier branches are unaffected (only OID-tolerant syscache lookups used) is essential for a correct back-branch story.

  5. Injection points as the preferred race-reproduction mechanism: This thread is a textbook case for the INJECTION_POINT infrastructure: racy tests that pass accidentally are worse than no test, and timing-based polling hacks mask real bugs. The injection point gives deterministic reproduction at the exact window the fix targets.

Participant Dynamics

Bharath drives the patch and defends the scope against over-engineering (rejecting both the materialized-SRF rewrite and the defensive open-on-column-list-path). Shveta and Bertrand provide the substantive review — Bertrand contributes the two key design improvements (state struct, injection point) and Shveta catches the SRF-internals abuse and naming/comment clarity issues. Evan's column-list concern is ultimately addressed by argument rather than code change, but his follow-up on the misleading comment leads to a real improvement. Ajin's contribution is cosmetic (include ordering). No committer is visible in the excerpted thread; the work is done at the reviewer level and is presumably being prepared for a logical-replication-area committer to push.