2026-05-22 · claude-opus-4-6

Optimize LISTEN/NOTIFY: Deep Technical Analysis

Core Problem

PostgreSQL's LISTEN/NOTIFY mechanism suffers from a fundamental scalability bottleneck: SignalBackends() has no knowledge of which backend listens on which channel. When a transaction commits with pending NOTIFYs, it must send kill(pid, SIGUSR1) to every registered listener in the database, regardless of whether that backend cares about the notified channel. This results in O(N) system calls and, more critically, O(N) context switches as idle backends wake up, scan the queue, find nothing relevant, and go back to sleep.

The degradation is dramatic and well-documented:

0 idle listeners: ~9,100 TPS
10 idle listeners: ~6,200 TPS (32% degradation)
100 idle listeners: ~2,000 TPS (78% degradation)
1,000 idle listeners: ~238 TPS (97% degradation)
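
The percentages follow directly from the throughput figures. A quick check, using Python purely as a calculator on the numbers listed above (the helper name degradation_pct is ours, not something from the patch):

```python
# TPS figures copied from the list above; keys are idle listener counts.
tps = {0: 9100, 10: 6200, 100: 2000, 1000: 238}

def degradation_pct(baseline: float, measured: float) -> int:
    """Throughput loss relative to the baseline, as a rounded percentage."""
    return round((1 - measured / baseline) * 100)

degradation = {n: degradation_pct(tps[0], v) for n, v in tps.items()}
```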

This is a thundering herd problem. The existing implementation behaves like an Ethernet hub (broadcasting to all ports) when many workloads need switch-like behavior (targeted delivery). Real-world systems like Concourse CI heavily use per-session channels for worker queues, making this a practical bottleneck.

Solution Architecture (Final Committed Form - v35)

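The committed solution rests on two complementary optimizations (targeted signaling and direct advancement), backed by two safety mechanisms (two-phase staging and wake-laggards preservation):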

1. Targeted Signaling via Shared Channel Hash

A lazily-created dshash table (backed by DSA - dynamic shared memory area) maps (database OID, channel name) → array of listening backend ProcNumbers. When NOTIFY commits, SignalBackends() looks up only the backends that are actually listening on the notified channels, rather than iterating all listeners.

Key design decisions:

dshash over fixed shared memory: Tom Lane directed this choice, noting that fixed-size hash tables require guessing capacity at startup (problematic), whereas dshash grows dynamically and doesn't consume resources if LISTEN/NOTIFY is never used.
Lazy initialization: The DSA area and dshash are created on first LISTEN, with handles stored in AsyncQueueControl for other backends to attach.
No multicast threshold GUC: Early versions had notify_multicast_threshold limiting tracked listeners per channel. The final version tracks all listeners, eliminating the dual code path.
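
The lookup idea can be illustrated with a toy model. This is a sketch in Python, not the dshash-based C code: the class and method names are hypothetical, and a plain in-process dict stands in for shared memory. It shows only the mapping from (database OID, channel name) to the set of listening backends, and how a notifier selects its targets.

```python
from collections import defaultdict

class ChannelTable:
    """Toy stand-in for the shared channel hash: (db_oid, channel) -> listeners."""

    def __init__(self):
        self._listeners = defaultdict(set)

    def listen(self, db_oid, channel, proc_number):
        self._listeners[(db_oid, channel)].add(proc_number)

    def unlisten(self, db_oid, channel, proc_number):
        key = (db_oid, channel)
        self._listeners[key].discard(proc_number)
        if not self._listeners[key]:
            del self._listeners[key]  # drop empty entries so the table can shrink

    def backends_to_signal(self, db_oid, notified_channels):
        """Return only the backends listening on at least one notified channel."""
        targets = set()
        for channel in notified_channels:
            targets |= self._listeners.get((db_oid, channel), set())
        return targets

# 1,000 backends, each on its own private channel (the per-session pattern).
table = ChannelTable()
for proc in range(1000):
    table.listen(5, f"worker_{proc}", proc)
targets = table.backends_to_signal(5, ["worker_42"])
```

In this model a NOTIFY on worker_42 selects a single backend, where a broadcast would have touched all 1,000.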

2. Direct Advancement

Even with targeted signaling, idle backends on unrelated channels would still need periodic waking to advance their queue positions (required for SLRU truncation). Direct advancement eliminates this:

PreCommit_Notify captures queueHeadBeforeWrite and queueHeadAfterWrite while holding the heavyweight lock that serializes all NOTIFY writers.
The range [before, after) contains only notifications from this transaction (guaranteed by the heavyweight lock).
SignalBackends() can directly move idle backends' queue positions forward over this range without waking them, since the channel hash proves they're not interested.

Critical correctness mechanism: A per-backend isAdvancing flag prevents direct advancement of backends that are currently reading the queue, avoiding the race condition where truncation could remove pages a backend still has referenced in local variables.
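
A minimal Python sketch of the decision a notifier makes per backend, under simplifying assumptions: queue positions are plain integers, only a backend whose position equals the old head is eligible for direct advancement, and the names Backend and signal_backends are hypothetical stand-ins for the C structures.

```python
from dataclasses import dataclass, field

@dataclass
class Backend:
    proc_number: int
    pos: int                      # next queue position this backend will read
    channels: set = field(default_factory=set)
    is_advancing: bool = False    # True while the backend itself reads the queue

def signal_backends(backends, notified_channels, head_before, head_after):
    """Return proc numbers to wake; advance uninterested, caught-up backends in place."""
    to_wake = []
    for b in backends:
        if b.channels & notified_channels:
            to_wake.append(b.proc_number)   # interested: must be signaled
        elif b.pos == head_before and not b.is_advancing:
            b.pos = head_after              # skip [before, after) without a wakeup
        # otherwise: behind or mid-read, so its position is left untouched
    return to_wake

backends = [
    Backend(1, pos=100, channels={"jobs"}),
    Backend(2, pos=100, channels={"other"}),
    Backend(3, pos=100, channels={"other"}, is_advancing=True),
]
woken = signal_backends(backends, {"jobs"}, head_before=100, head_after=103)
```

Backend 1 is woken, backend 2 is moved to position 103 without a signal, and backend 3 is skipped because it is reading the queue itself.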

3. Two-Phase Staging Pattern (OOM Safety)

Tom Lane identified that Exec_ListenCommit runs in a critical section (post-clog-commit), where errors cause PANIC. The solution uses three-layer state management:

pendingListenActions (local hash): Records LISTEN/UNLISTEN intents during PreCommit_Notify (where OOM is safe)
localChannelTable (local hash): Backend's committed listening state, used for fast IsListeningOn() checks
globalChannelTable (dshash): Cluster-wide shared state consulted by SignalBackends()

PreCommit_Notify does all allocation-heavy work (creating hash entries, growing listener arrays). AtCommit_Notify only flips boolean flags in pre-allocated entries, making post-commit processing allocation-free.
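
The staging pattern can be modeled in a few lines of Python. This is an illustrative sketch only: the class and method names are hypothetical, the shared table is omitted, and Python dict operations stand in for the pre-allocated hash entries of the C code.

```python
class ListenState:
    def __init__(self):
        self.pending = {}   # channel -> "listen" | "unlisten", staged intents
        self.local = {}     # channel -> bool (True once the LISTEN is committed)

    def stage(self, channel, action):
        self.pending[channel] = action          # last action per channel wins

    def pre_commit(self):
        """Phase 1: may allocate and may fail; nothing is visible yet."""
        for channel, action in self.pending.items():
            if action == "listen":
                self.local.setdefault(channel, False)   # pre-create, inactive

    def at_commit(self):
        """Phase 2: only flips flags or removes entries that already exist."""
        for channel, action in self.pending.items():
            if action == "listen":
                self.local[channel] = True
            else:
                self.local.pop(channel, None)
        self.pending.clear()

    def abort(self):
        """Discard staged intents and any pre-created, still-inactive entries."""
        for channel in [c for c, active in self.local.items() if not active]:
            del self.local[channel]
        self.pending.clear()

    def is_listening_on(self, channel):
        return self.local.get(channel, False)
```

The point of the split is that a failure in pre_commit happens while the transaction can still abort cleanly, whereas at_commit has nothing left to do that could fail.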

4. Wake-Laggards Preservation

The final version retains the existing QUEUE_CLEANUP_DELAY behavior: backends that fall too far behind (regardless of channel interest) get signaled to advance the global tail pointer. This prevents queue exhaustion and maintains the existing robustness guarantee that has worked for 25+ years.
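
As a rough illustration of the laggard check, here is a Python sketch that treats queue positions as page numbers and ignores wraparound. The threshold value of 4 pages and the function name laggards are assumptions made for the example.

```python
QUEUE_CLEANUP_DELAY = 4  # pages; value assumed for illustration

def laggards(head_page, backend_pages, delay=QUEUE_CLEANUP_DELAY):
    """Proc numbers whose read position trails the head by at least `delay` pages."""
    return [proc for proc, page in backend_pages.items()
            if head_page - page >= delay]

# proc number -> page of that backend's current read position
behind = laggards(10, {1: 10, 2: 7, 3: 6, 4: 2})
```

Backends 1 and 2 are within the threshold and stay asleep; backends 3 and 4 are signaled so that the tail can move forward.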

Key Technical Debates and Design Evolution

The Advisory Position Race Condition

Arseniy Mukhin discovered a critical bug in the "direct advancement" approach: if a notifier advances a listener's shared pos while the listener has a stale local copy, asyncQueueAdvanceTail() could truncate pages the listener is still reading.

Solutions explored:

Advisory position field (Tom Lane's suggestion, adopted): A separate advisoryPos that suggests skip-ahead but isn't used for truncation decisions.
Split pos into read-pos and done-pos (Joel's alt2): Separate "next to read" from "definitively done with."
Only update shared pos atomically (alt3, ultimately adopted in simplified form): Backend holds an isAdvancing flag; notifiers don't touch advancing backends.

The final solution (v35) simplifies to: don't direct-advance backends that have isAdvancing set. This eliminates the need for advancingPos entirely.

Background Worker vs. Inline Checking

Joel initially proposed a notify_bgworker to wake lagging backends. Tom Lane rejected this firmly: "The existing code does not have any provision that guarantees a lost signal will eventually be re-sent... AFAIR we've had zero complaints about that in 25+ years." The check-for-laggards logic was kept inline in SignalBackends().

Eager Wake vs. Lazy Wake of Uninterested Backends

Tom Lane's most significant architectural intervention came in v35. When a backend is not interested in the current notifications and cannot be direct-advanced (because its position is still behind the old queue head), it is not woken eagerly. Instead, the same QUEUE_CLEANUP_DELAY threshold used for cross-database backends applies. This preserves the anti-thundering-herd property.

Joel's benchmarks showed this matters significantly on macOS (3x throughput difference) but negligibly on Linux, suggesting OS-level differences in context switch cost.

Local List vs. Hash Table for listenChannels

Heikki Linnakangas noted the existing linked-list listenChannels is O(n) for lookups. The patch converts it to a local HTAB (localChannelTable), making IsListeningOn() O(1). This matters when backends listen on thousands of channels.

pendingNotifyChannels Deduplication

Chao Li identified that building the list of unique channels for SignalBackends() was O(N²) when a transaction issues thousands of NOTIFYs with different channels. The solution integrates with the existing AsyncExistsPendingNotify deduplication: when notification count exceeds MIN_HASHABLE_NOTIFIES, a channelSet (hash table of unique channel names) is maintained alongside the notifications list.
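
The switch from linear scan to hashed lookup can be sketched as follows. This Python model tracks only unique channel names, not the deduplication of identical notifications; the threshold value of 16 and the class name PendingNotifies are assumptions for illustration.

```python
MIN_HASHABLE_NOTIFIES = 16  # assumed threshold for illustration

class PendingNotifies:
    def __init__(self):
        self.events = []          # (channel, payload) in arrival order
        self.channels = []        # unique channels, in first-seen order
        self.channel_set = None   # built lazily once the list gets long

    def add(self, channel, payload=""):
        self.events.append((channel, payload))
        if self.channel_set is None and len(self.events) > MIN_HASHABLE_NOTIFIES:
            self.channel_set = set(self.channels)   # switch to hashed lookups
        if self.channel_set is not None:
            if channel not in self.channel_set:     # O(1) membership test
                self.channel_set.add(channel)
                self.channels.append(channel)
        elif channel not in self.channels:          # short list: linear scan
            self.channels.append(channel)
```

Small transactions never pay for the hash table, while a transaction that notifies thousands of distinct channels avoids the quadratic scan.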

Platform-Specific Observations

Tom Lane's benchmarking revealed a striking platform difference:

Linux (RHEL 8, Xeon): v34 and v35 perform nearly identically (~20,000 TPS)
macOS (M4 Pro): v34 achieves ~14,600 TPS but v35 drops to ~4,650 TPS

The difference is attributed to macOS's sleep/wakeup implementation being less optimized than Linux's. The QUEUE_CLEANUP_DELAY gate in v35 causes more latency on macOS because wakeups are more expensive there. Tom concluded this is a macOS kernel issue, not something to optimize in async.c.

Locking Architecture

The patch maintains a strict lock ordering hierarchy to prevent deadlocks:

Heavyweight lock on "database 0": Serializes all NOTIFY writers (existing)
NotifyQueueLock (LWLock, exclusive): Protects queue head/tail and per-backend status
dshash partition locks: Per-partition locks within globalChannelTable

SignalBackends() holds NotifyQueueLock while consulting the dshash (shared lock on partition). LISTEN/UNLISTEN operations only need dshash partition locks, enabling concurrent LISTEN operations on different channels.

Because dshash is partitioned, LISTEN/UNLISTEN on channels that hash to different partitions proceed in parallel. This is a critical property for workloads with many distinct channels.
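
To make the partition-lock idea concrete, here is a small Python sketch. It is a model of the concept rather than of dshash itself: the partition count, the CRC32-based partition choice, and all names are hypothetical.

```python
import threading
import zlib

NUM_PARTITIONS = 128  # hypothetical partition count

class PartitionedTable:
    def __init__(self, num_partitions=NUM_PARTITIONS):
        self._locks = [threading.Lock() for _ in range(num_partitions)]
        self._parts = [dict() for _ in range(num_partitions)]

    def partition_of(self, db_oid, channel):
        # Deterministic hash of the key selects the partition.
        key = f"{db_oid}:{channel}".encode()
        return zlib.crc32(key) % len(self._locks)

    def listen(self, db_oid, channel, proc_number):
        p = self.partition_of(db_oid, channel)
        with self._locks[p]:      # only this one partition is locked
            self._parts[p].setdefault((db_oid, channel), set()).add(proc_number)

    def listeners(self, db_oid, channel):
        p = self.partition_of(db_oid, channel)
        with self._locks[p]:
            return set(self._parts[p].get((db_oid, channel), ()))
```

Two LISTEN calls contend only when their keys fall into the same partition; calls on keys in different partitions take different locks and do not block each other.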
