[PATCH] Release replication slot on error in SQL-callable slot functions

First seen: 2026-05-09 20:45:37+00:00 · Messages: 23 · Participants: 5

Latest Update

2026-06-01 · claude-opus-4-6

Monthly Summary: Release Replication Slot on Error in SQL-Callable Slot Functions (May 2026)

Problem Statement

PostgreSQL's SQL-callable replication slot functions (pg_replication_slot_advance, pg_logical_slot_get_changes, pg_logical_slot_peek_changes, pg_logical_slot_get_binary_changes, pg_logical_slot_peek_binary_changes, pg_create_logical_replication_slot, pg_copy_logical_replication_slot) can leak an acquired replication slot when an ERROR is thrown between ReplicationSlotAcquire() and ReplicationSlotRelease().

The bug manifests when the error is caught by a PL/pgSQL EXCEPTION block (which uses subtransactions). The session continues with MyReplicationSlot still pointing at the acquired slot, meaning:

The root cause is architectural: replication slots have no ResourceOwner integration and no subtransaction abort callback. The SQL-callable wrappers assume straight-line execution to the release call — an assumption violated by exception handling.

Proposed Fix

The patch wraps the error-prone regions of all affected SQL-callable functions in PG_TRY { ... } PG_CATCH { ReplicationSlotRelease(); PG_RE_THROW(); } PG_END_TRY(), guaranteeing slot release on any error path before propagation to exception handlers.

Patch Evolution (v1 → v4)

Version Date Key Changes
v1 Pre-May Initial PG_TRY/PG_CATCH wrapping for pg_replication_slot_advance
v2 2026-05-25 Extended to all 7 affected functions; addressed Fujii's review (NULL guard, acquire inside PG_TRY, temporary slot drop in error path)
v3 2026-05-25 Fixed invalid test for pg_copy_logical_replication_slot (uses nonexistent_plugin to trigger error after acquisition); added test coverage for pg_logical_slot_get_changes_guts()
v4 2026-05-26 Cosmetic cleanup; patch format fixes

Key Review Issues Resolved

  1. Temporary slot leak (Fujii): ReplicationSlotRelease() doesn't auto-drop RS_TEMPORARY slots — error path must explicitly drop them
  2. NULL guard (Fujii): PG_CATCH must check MyReplicationSlot != NULL before calling release
  3. Acquire placement (Fujii): ReplicationSlotAcquire() itself can throw after setting the global, so it must be inside PG_TRY
  4. Invalid test case (Shveta): Original test for pg_copy_logical_replication_slot failed before acquisition — replaced with nonexistent_plugin trigger

Current Status

The patch is converging. Shveta's final review of v3/v4 raised only cosmetic issues and explicitly stated no further technical objections. The patch covers all affected entry points in both slotfuncs.c and logicalfuncs.c. The fix is considered back-patchable to all supported branches.

Alternative Designs Discussed (Not Taken)

History (1 prior analysis)
2026-06-01 · claude-opus-4-6

Incremental Update: v5/v6 Patches, Compilation Fix, LWLock Safety Concern, and Architectural Debate on PG_TRY Placement

New Technical Issue: LWLock Self-Deadlock Risk (Hou Zhijie)

Hou Zhijie raises a substantive correctness concern about the entire PG_TRY/PG_CATCH approach: if an ERROR is thrown while the backend already holds an LWLock (e.g., ReplicationSlotControlLock), and the PG_CATCH block then calls ReplicationSlotRelease() or ReplicationSlotDropPtr() which attempt to acquire the same LWLock, the backend will self-deadlock. LWLocks do not track per-backend ownership, so the backend would block on itself indefinitely and become uninterruptible.

This is a fundamental concern about whether calling ReplicationSlotRelease() in a PG_CATCH block is safe in all error scenarios. Hou acknowledges he doesn't have a better alternative but asks the community to evaluate whether the scenario is rare enough to be acceptable, and suggests documenting the risk in comments.

New Affected Function: pg_create_physical_replication_slot()

Fujii Masao identifies that pg_create_physical_replication_slot() may have the same vulnerability if it throws an error after ReplicationSlotCreate(). He also asks about pg_copy_physical_replication_slot(), pg_drop_replication_slot(), and ALTER_REPLICATION_SLOT.

Satya responds that ALTER_REPLICATION_SLOT is not exploitable from SQL because it only runs in a walsender context where session termination on error prevents the dangling state. The physical slot functions are addressed in v5.

Shveta confirms via debugging that pg_copy_physical_replication_slot() is indeed affected — forcing create_physical_replication_slot() to fail during a copy operation leaves the assert-triggering dangling state.

v5 Patch (2026-05-28)

Extends coverage to pg_create_physical_replication_slot() and incorporates Shveta's earlier cosmetic/test changes. However, v5 has a compilation failure.

v6 Patch (2026-05-29) — Compilation Fix

v5 called ReplicationSlotDropAcquired() without the required bool try_disable argument, which was added by upstream commit 2af1dc89282 after the patch was originally generated. v6 fixes these three call sites.

Architectural Debate: Where Should PG_TRY Live?

Two reviewers converge on a structural criticism of v5/v6:

Hou Zhijie observes that placing PG_CATCH inside create_logical_replication_slot() while the caller (pg_copy_logical_replication_slotcopy_replication_slot()) can still error out before releasing the slot creates incomplete coverage. He suggests catching errors at the caller level instead.

Shveta Malik takes the opposite (and ultimately complementary) position: the PG_TRY/PG_CATCH should live inside the internal helper functions (create_logical_replication_slot() and create_physical_replication_slot()) rather than in their SQL-callable callers. Her reasoning:

  1. Automatic coverage of all callerscreate_logical_replication_slot() is called from multiple places; internal handling means callers don't need to duplicate cleanup logic.
  2. Consistency — the current patch inconsistently places PG_TRY in the caller for physical slots but inside the function for logical slots.
  3. Future-proofing — she also argues ReplicationSlotCreate() itself should be inside the PG_TRY block, since if it ever gains post-creation logic that can throw, the current placement would leave an unsafe state.

This is a meaningful design divergence about the correct abstraction boundary for error cleanup in the slot subsystem.

Test Regression

Fujii reports that v4 + Shveta's cosmetic diff causes test failures because the .out expected-output file wasn't updated. Shveta provides a corrected top-up patch, which Satya incorporates into v5.