Monthly Summary: synchronized_standby_slots Behavior Inconsistent with Quorum-Based Synchronous Replication (May 2026)
Overview
This thread addresses a fundamental availability mismatch in PostgreSQL's logical replication failover infrastructure. The synchronized_standby_slots GUC enforces ALL-of-N semantics (every listed physical slot must catch up before logical decoding proceeds), which conflicts with synchronous_standby_names' support for ANY M-of-N (quorum) semantics. In quorum-based HA deployments, a single standby failure blocks all logical consumers indefinitely even though synchronous commits continue to succeed.
Problem Statement
In a typical 3-node HA deployment:
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
synchronized_standby_slots = 'sb1_slot, sb2_slot'
If standby1 goes down, synchronous commits still succeed via standby2, but logical decoding blocks in WaitForStandbyConfirmation() waiting for sb1_slot. WAL then accumulates silently on the primary and can eventually fill the disk.
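The mismatch can be reproduced with a toy model. This is a minimal sketch, not PostgreSQL code: the function names and the dictionaries of standby and slot positions are hypothetical, and LSNs are modeled as plain integers.

```python
# Toy model of the scenario above; names are illustrative, not PostgreSQL APIs.

def sync_commit_ok(confirmed, standbys, quorum, commit_lsn):
    """ANY-quorum rule: commit succeeds once `quorum` standbys confirm commit_lsn."""
    acks = sum(1 for s in standbys if confirmed.get(s, 0) >= commit_lsn)
    return acks >= quorum


def logical_decoding_ok(slot_lsn, slots, commit_lsn):
    """ALL-of-N rule: every listed slot must have reached commit_lsn."""
    return all(slot_lsn.get(s, 0) >= commit_lsn for s in slots)


commit_lsn = 1000
# standby1 is down and stuck at LSN 400; standby2 is current.
standby_pos = {"standby1": 400, "standby2": 1000}
slot_pos = {"sb1_slot": 400, "sb2_slot": 1000}

commits_proceed = sync_commit_ok(standby_pos, ["standby1", "standby2"], 1, commit_lsn)
decoding_proceeds = logical_decoding_ok(slot_pos, ["sb1_slot", "sb2_slot"], commit_lsn)
print(commits_proceed, decoding_proceeds)
```

With one standby down, the quorum check passes while the all-of-N check fails. That is the availability gap the thread sets out to close.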
Proposed Solution
Extend synchronized_standby_slots to accept quorum/priority syntax mirroring synchronous_standby_names:
- Plain list (slot1, slot2): ALL-mode (backward compatible)
- ANY N (...): Quorum, proceed once N slots catch up
- FIRST N (...): Priority, wait for first N valid slots in order
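One way to read the three modes is sketched below. It is an interpretation under stated assumptions rather than the patch's logic: the Slot class, its `valid` flag (taken here to mean the slot exists and has not been invalidated), and the integer LSNs are all hypothetical, and FIRST N is read as "the first N valid slots in listed order must all have caught up".

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    valid: bool          # hypothetical: slot exists and is not invalidated
    confirmed_lsn: int   # LSN modeled as an integer for simplicity


def slots_caught_up(mode, n, slots, wait_lsn):
    """Return True if logical decoding may advance to wait_lsn.

    mode is "ALL" (plain list), "ANY" (quorum) or "FIRST" (priority);
    n is ignored for "ALL".
    """
    def ok(slot):
        return slot.valid and slot.confirmed_lsn >= wait_lsn

    if mode == "ALL":
        return all(ok(s) for s in slots)
    if mode == "ANY":
        return sum(1 for s in slots if ok(s)) >= n
    if mode == "FIRST":
        # Assumed reading: pick the first n valid slots in listed order,
        # then require every one of them to have caught up.
        chosen = [s for s in slots if s.valid][:n]
        return len(chosen) == n and all(ok(s) for s in chosen)
    raise ValueError(f"unknown mode: {mode}")


slots = [
    Slot("sb1_slot", valid=True, confirmed_lsn=400),   # lagging
    Slot("sb2_slot", valid=True, confirmed_lsn=1000),
    Slot("sb3_slot", valid=True, confirmed_lsn=1000),
]

print(slots_caught_up("ALL", 0, slots, 1000))
print(slots_caught_up("ANY", 2, slots, 1000))
print(slots_caught_up("FIRST", 2, slots, 1000))
```

In this model a lagging but valid high-priority slot still blocks FIRST 2, whereas ANY 2 is satisfied by the other two slots.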
Key Technical Debates Resolved
Quorum Safety and Failover Correctness
Ashutosh Sharma raised concerns that, with quorum semantics, logical replicas could end up ahead of a new primary after failover. Amit Kapila argued, and the thread agreed, that this mirrors the existing synchronous_standby_names situation: failover orchestrators are responsible for selecting the most-caught-up standby.
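The orchestrator's side of that contract can be sketched as follows. This is an illustration only: the function names and the sample LSN values are hypothetical, and it assumes the orchestrator has already collected one pg_lsn string per standby (for example from pg_last_wal_receive_lsn() on each node).

```python
def parse_lsn(text):
    """Convert a pg_lsn string such as '0/3000060' to an integer byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def pick_promotion_candidate(reported_lsns):
    """Return the standby name with the highest reported LSN."""
    return max(reported_lsns, key=lambda name: parse_lsn(reported_lsns[name]))


# Hypothetical values gathered from the surviving standbys.
reported = {"standby1": "0/3000060", "standby2": "0/30001A8"}
candidate = pick_promotion_candidate(reported)
print(candidate)
```

LSNs are compared numerically rather than as strings, since the two hexadecimal halves are not zero-padded.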
Independent GUC Configuration
A proposal to default synchronized_standby_slots to match synchronous_standby_names was rejected because they operate in different namespaces (slot names vs. application names), synchronous replication doesn't require slots, and tools like pg_receivewal can appear in sync standby names.
Parser Design
A disagreement between Ashutosh Sharma (local helper function to detect syntax) and Hou Zhijie (modify shared syncrep grammar to emit SYNC_REP_DEFAULT for bare lists) was resolved in favor of the grammar approach, which avoids bugs like slot names starting with "first" being misidentified as the FIRST keyword.
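The class of bug mentioned above is easy to demonstrate. The sketch below is not the syncrep grammar, which is a flex/bison parser; it uses a hypothetical regular expression as a stand-in for proper tokenization, ignores quoted names, and reuses the label DEFAULT for bare lists in the spirit of SYNC_REP_DEFAULT.

```python
import re


def naive_mode(value):
    """Buggy heuristic: decide the mode from the leading characters alone."""
    lowered = value.strip().lower()
    if lowered.startswith("first"):
        return "FIRST"
    if lowered.startswith("any"):
        return "ANY"
    return "DEFAULT"


# Keyword, whitespace, count, then a parenthesised list.
_EXPLICIT = re.compile(r"^\s*(FIRST|ANY)\s+(\d+)\s*\((.*)\)\s*$", re.IGNORECASE)


def parse_slots(value):
    """Return (mode, n, names); a bare list yields mode 'DEFAULT'."""
    match = _EXPLICIT.match(value)
    if match:
        names = [part.strip() for part in match.group(3).split(",")]
        return match.group(1).upper(), int(match.group(2)), names
    names = [part.strip() for part in value.split(",")]
    return "DEFAULT", len(names), names


bare = "first_slot, second_slot"
print(naive_mode(bare))    # the prefix check misreads the slot name as a keyword
print(parse_slots(bare))
print(parse_slots("ANY 1 (sb1_slot, sb2_slot)"))
```

Requiring the full keyword, count, and parenthesis structure before choosing a mode is what keeps a slot named first_slot from being treated as the FIRST keyword.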
Testing Active-but-Lagging Slots
recovery_min_apply_delay, pg_wal_replay_pause(), and SIGSTOP were all explored and found inadequate. The adopted approach uses psql as a replication client and issues START_REPLICATION SLOT <slot_name> PHYSICAL <lsn>. This acquires the slot without sending feedback, which creates a deterministic active-but-lagging condition (~6 seconds vs. 60-140 seconds).
Bug Discovery: Duplicate Slot Entries
Shveta Malik identified a correctness bug where duplicate slot names in quorum/priority mode are counted multiple times:
ALTER SYSTEM SET synchronized_standby_slots = 'FIRST 2 (standby_1, standby_1, standby_2, standby_3)';
This allows decoding to proceed with only standby_1 caught up, because it is counted twice. Unlike synchronous_standby_names, which waits on distinct walsender processes, the slot code iterates over name strings without deduplication. Ashutosh Sharma acknowledged the bug and said it will be fixed in the next patch version.
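The double counting can be shown in a few lines. This is a minimal sketch with hypothetical function names and integer LSNs. The summary does not say how the next patch version will fix the bug, so the deduplicating variant is only one possible approach.

```python
def count_caught_up_buggy(names, confirmed, wait_lsn):
    """Count list entries that have caught up; duplicates are counted again."""
    return sum(1 for name in names if confirmed.get(name, 0) >= wait_lsn)


def count_caught_up_dedup(names, confirmed, wait_lsn):
    """Count distinct slots that have caught up, preserving listed order."""
    seen = set()
    count = 0
    for name in names:
        if name in seen:
            continue  # skip repeated entries
        seen.add(name)
        if confirmed.get(name, 0) >= wait_lsn:
            count += 1
    return count


names = ["standby_1", "standby_1", "standby_2", "standby_3"]
confirmed = {"standby_1": 1000, "standby_2": 400, "standby_3": 400}
required = 2

buggy_ok = count_caught_up_buggy(names, confirmed, 1000) >= required
dedup_ok = count_caught_up_dedup(names, confirmed, 1000) >= required
print(buggy_ok, dedup_ok)
```

Only the buggy count reaches the required 2 here, since standby_1 is the sole slot that has actually caught up.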
Final Patch Structure
- 0001: Refactors syncrep parser to introduce SYNC_REP_DEFAULT for bare lists (distinguishes implicit from explicit FIRST)
- 0002: Adds ANY N quorum semantics to synchronized_standby_slots
- 0003: Adds FIRST N priority syntax support
Current Status
Awaiting next patch version addressing the duplicate slot deduplication bug and minor cosmetic feedback from Shveta Malik's review.