Fix race during concurrent logical decoding activation

First seen: 2026-05-28 09:09:13+00:00 · Messages: 5 · Participants: 2

Latest Update

2026-06-01 · claude-opus-4-6

Fix Race During Concurrent Logical Decoding Activation

Background: Dynamic Logical Decoding Toggle

This thread addresses a race condition in a newer PostgreSQL feature that allows logical decoding to be dynamically enabled/disabled based on the presence of logical replication slots — the "Toggle logical decoding dynamically based on logical slot presence" feature. This feature allows effective_wal_level to be promoted from replica to logical at runtime when a logical slot is created, and demoted back when the last logical slot is dropped, without requiring a server restart.

The activation process (EnableLogicalDecoding()) involves:

  1. Setting shared memory flags (xlog_logical_info, logical_decoding_enabled)
  2. Issuing a WAL barrier to ensure all backends start including logical decoding information in WAL records
  3. Writing a logical-decoding status WAL record
  4. Emitting process barriers so all backends pick up the new state

The deactivation path (DisableLogicalDecoding()) reverses this process, while abort_logical_decoding_activation() handles cleanup when activation is interrupted (e.g., by query cancellation).

The Race Condition

The bug manifests when two sessions concurrently create logical replication slots when no slots previously existed (i.e., effective_wal_level is replica):

  1. Session 1 begins creating a logical slot and enters EnableLogicalDecoding(). It sets xlog_logical_info = true but has not yet set logical_decoding_enabled = true (blocked at injection point mid-activation).
  2. Session 2 creates another logical slot. Since Session 1 hasn't completed, Session 2 also runs EnableLogicalDecoding(), which completes successfully, setting logical_decoding_enabled = true.
  3. Session 1 is cancelled. It enters abort_logical_decoding_activation(), which contains the assertion:
    Assert(!LogicalDecodingCtl->logical_decoding_enabled);
    
    This assertion fails because Session 2 already set logical_decoding_enabled = true.

The fundamental problem is that abort_logical_decoding_activation() assumes it is the only activator and that no one else could have completed activation in the meantime.

Proposed Solutions

Solution 1: Serialization with Condition Variable (Evan's initial patch)

Evan's first approach adds an activation_in_progress flag to shared memory, protected by LogicalDecodingControlLock, with a condition variable. When a second session wants to activate logical decoding, it waits on the CV until the first activation completes or aborts.

Rejected because: It adds a new synchronization mechanism that could introduce its own race conditions, and is more complex than necessary. Sawada noted this was similar to older discarded approaches.

Solution 2: Delegate Abort Cleanup to DisableLogicalDecoding() (Sawada's approach, accepted)

The key insight: the complexity arises because abort_logical_decoding_activation() and DisableLogicalDecoding() clear the xlog_logical_info flag in different ways. The simpler fix is to make the abort path not immediately clean up xlog_logical_info. Instead:

This eliminates the race because:

Key Technical Design Decisions

Critical Sections and Shared Memory Updates

A discussion emerged about whether shared memory assignments (LogicalDecodingCtl->xlog_logical_info = false, etc.) need to be inside critical sections. Evan argued they don't strictly need to be (critical sections primarily protect against elog interruptions). Sawada argued for keeping them inside critical sections for consistency with EnableLogicalDecoding(). The final patch keeps them in critical sections for consistency.

Log Message Ordering Under Lock

A subtle design trade-off: should the "logical decoding is disabled" log message be emitted while still holding LogicalDecodingControlLock?

Sawada prevailed, adding comments explaining why the log is emitted under the lock. Evan acknowledged this is a reasonable trade-off.

pending_disable Flag Cleanup

Evan caught a bug in Sawada's initial v2: in the new else branch of DisableLogicalDecoding() that handles the transient state, LogicalDecodingCtl->pending_disable also needs to be cleared. Without this, a stale pending_disable = true could cause incorrect behavior on subsequent operations.

Test Infrastructure

The fix includes regression tests in 051_effective_wal_level.pl using injection points:

Evan's final review caught a test reliability issue: wait_for_log("canceling statement due to user request") could match a log line from a previous cancellation in the same test file. The fix: capture the log offset before the cancel and pass it to wait_for_log().

Current Status

The patch is at v3, with only minor cosmetic fixes remaining (grammar: "appear" → "appears", and the log offset fix). The core design is agreed upon and the patch appears ready for committer review.