Optimize LISTEN/NOTIFY

First seen: 2025-07-12 22:35:21+00:00 · Messages: 127 · Participants: 8

Latest Update

2026-05-22 · claude-opus-4-6

Optimize LISTEN/NOTIFY: Deep Technical Analysis

Core Problem

PostgreSQL's LISTEN/NOTIFY mechanism suffers from a fundamental scalability bottleneck: SignalBackends() has no knowledge of which backend listens on which channel. When a transaction commits with pending NOTIFYs, it must send kill(pid, SIGUSR1) to every registered listener in the database, regardless of whether that backend cares about the notified channel. This results in O(N) system calls and, more critically, O(N) context switches as idle backends wake up, scan the queue, find nothing relevant, and go back to sleep.

The degradation is dramatic and well documented.

This is a thundering herd problem. The existing implementation behaves like an Ethernet hub (broadcasting to all ports) when many workloads need switch-like behavior (targeted delivery). Real-world systems like Concourse CI heavily use per-session channels for worker queues, making this a practical bottleneck.

Solution Architecture (Final Committed Form - v35)

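The committed solution introduces two complementary optimizations (targeted signaling and direct advancement), supported by two safeguards (OOM-safe staging and laggard wake-ups):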

1. Targeted Signaling via Shared Channel Hash

A lazily-created dshash table (backed by DSA - dynamic shared memory area) maps (database OID, channel name) → array of listening backend ProcNumbers. When NOTIFY commits, SignalBackends() looks up only the backends that are actually listening on the notified channels, rather than iterating all listeners.
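The lookup described above can be sketched as a toy model. This is not the dshash API: the class and method names are hypothetical, and an ordinary Python dict stands in for the shared table. It only illustrates how keying on (database OID, channel name) narrows the set of backends to signal.

```python
from collections import defaultdict


class ChannelTable:
    """Toy stand-in for the shared channel hash (hypothetical names)."""

    def __init__(self):
        # (db_oid, channel) -> list of listening backend numbers
        self._listeners = defaultdict(list)

    def listen(self, db_oid, channel, proc_number):
        entry = self._listeners[(db_oid, channel)]
        if proc_number not in entry:
            entry.append(proc_number)

    def unlisten(self, db_oid, channel, proc_number):
        key = (db_oid, channel)
        entry = self._listeners.get(key)
        if entry and proc_number in entry:
            entry.remove(proc_number)
            if not entry:
                del self._listeners[key]  # drop empty entries

    def targets(self, db_oid, channels):
        """Backends to signal for a commit that notified `channels`."""
        found = set()
        for channel in channels:
            found.update(self._listeners.get((db_oid, channel), ()))
        return found

    def all_listeners(self, db_oid):
        """What a broadcast would signal: every listener in the database."""
        return {
            proc
            for (oid, _channel), procs in self._listeners.items()
            if oid == db_oid
            for proc in procs
        }
```

With 100 backends each listening on its own channel, `targets()` for one notified channel yields a single backend, whereas `all_listeners()` yields all 100, which is the hub-versus-switch contrast drawn earlier.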

Key design decisions:

2. Direct Advancement

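Even with targeted signaling, idle backends on unrelated channels would still need periodic waking to advance their queue positions (required for SLRU truncation). Direct advancement eliminates this: the notifying backend itself moves an uninterested listener's shared queue position forward past the newly written notifications, so the listener never has to wake up just to skip them.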

Critical correctness mechanism: A per-backend isAdvancing flag prevents direct advancement of backends that are currently reading the queue, avoiding the race condition where truncation could remove pages a backend still has referenced in local variables.
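A minimal sketch of that rule, under simplifying assumptions: queue positions are single integers rather than (page, offset) pairs, only listeners sitting exactly at the old head are eligible, and all names other than isAdvancing are hypothetical. It is a model of the idea, not the code in async.c.

```python
from dataclasses import dataclass


@dataclass
class Listener:
    pos: int                    # shared queue position (toy: one integer)
    is_advancing: bool = False  # set while the backend itself reads the queue


def direct_advance(listeners, interested, old_head, new_head):
    """Move uninterested, idle listeners at old_head straight to new_head.

    Returns the ids that were advanced without being woken.
    """
    advanced = []
    for ident, listener in listeners.items():
        if ident in interested:
            continue  # must be signaled so it can read its notifications
        if listener.is_advancing:
            continue  # it is reading with a local copy of pos; leave it alone
        if listener.pos == old_head:
            listener.pos = new_head
            advanced.append(ident)
    return advanced


def queue_tail(listeners):
    """Truncation may only go as far as the slowest listener's position."""
    return min(listener.pos for listener in listeners.values())
```

Skipping a listener with the flag set means its shared position stays where it was, so the tail computed for truncation cannot move past pages that listener may still be reading.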

3. Two-Phase Staging Pattern (OOM Safety)

Tom Lane identified that Exec_ListenCommit runs in a critical section (post-clog-commit), where errors cause PANIC. The solution stages changes in two phases across three layers of state:

  1. pendingListenActions (local hash): Records LISTEN/UNLISTEN intents during PreCommit_Notify (where OOM is safe)
  2. localChannelTable (local hash): Backend's committed listening state, used for fast IsListeningOn() checks
  3. globalChannelTable (dshash): Cluster-wide shared state consulted by SignalBackends()

PreCommit_Notify does all allocation-heavy work (creating hash entries, growing listener arrays). AtCommit_Notify only flips boolean flags in pre-allocated entries, making post-commit processing allocation-free.
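The staging order can be illustrated with a toy model. Assumptions: Python cannot express "allocation-free", so the sketch only shows which phase creates entries and which phase merely flips them; the class and method names are hypothetical; and inactive entries are left in place after UNLISTEN rather than removed.

```python
class Backend:
    """Toy model of staged LISTEN/UNLISTEN (hypothetical names)."""

    def __init__(self, ident, shared):
        self.ident = ident
        self.shared = shared  # toy globalChannelTable: (channel, ident) -> active
        self.local = {}       # toy localChannelTable: channel -> active
        self.pending = {}     # toy pendingListenActions: channel -> True/False

    def stage(self, channel, listen=True):
        """Record the intent only; nothing is visible to others yet."""
        self.pending[channel] = listen

    def pre_commit(self):
        """Create every entry that commit will need; failing here is safe."""
        for channel, listen in self.pending.items():
            if listen:
                self.shared.setdefault((channel, self.ident), False)
                self.local.setdefault(channel, False)

    def at_commit(self):
        """Only overwrite flags in entries that already exist."""
        for channel, listen in self.pending.items():
            key = (channel, self.ident)
            if key in self.shared:
                self.shared[key] = listen
                self.local[channel] = listen
        self.pending.clear()

    def at_abort(self):
        """Discard entries that were prepared but never activated."""
        for channel, listen in self.pending.items():
            key = (channel, self.ident)
            if listen and self.shared.get(key) is False:
                del self.shared[key]
                if self.local.get(channel) is False:
                    del self.local[channel]
        self.pending.clear()

    def is_listening_on(self, channel):
        return self.local.get(channel, False)
```

The point of the split is that anything able to fail for lack of memory happens while the transaction can still abort cleanly; the post-commit step touches only what the earlier step prepared.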

4. Wake-Laggards Preservation

The final version retains the existing QUEUE_CLEANUP_DELAY behavior: backends that fall too far behind (regardless of channel interest) get signaled to advance the global tail pointer. This prevents queue exhaustion and maintains the existing robustness guarantee that has worked for 25+ years.

Key Technical Debates and Design Evolution

The Advisory Position Race Condition

Arseniy Mukhin discovered a critical bug in the "direct advancement" approach: if a notifier advances a listener's shared pos while the listener has a stale local copy, asyncQueueAdvanceTail() could truncate pages the listener is still reading.

Solutions explored:

  1. Advisory position field (Tom Lane's suggestion, adopted): A separate advisoryPos that suggests skip-ahead but isn't used for truncation decisions.
  2. Split pos into read-pos and done-pos (Joel's alt2): Separate "next to read" from "definitively done with."
  3. Only update shared pos atomically (alt3, ultimately adopted in simplified form): Backend holds an isAdvancing flag; notifiers don't touch advancing backends.

The final solution (v35) simplifies to: don't direct-advance backends that have isAdvancing set. This eliminates the need for advancingPos entirely.

Background Worker vs. Inline Checking

Joel initially proposed a notify_bgworker to wake lagging backends. Tom Lane rejected this firmly: "The existing code does not have any provision that guarantees a lost signal will eventually be re-sent... AFAIR we've had zero complaints about that in 25+ years." The check-for-laggards logic was kept inline in SignalBackends().

Eager Wake vs. Lazy Wake of Uninterested Backends

Tom Lane's most significant architectural intervention in v35 concerns a backend that is not interested in the current notifications and cannot be direct-advanced (because its position lies between some prior position and the old head): such a backend is not woken eagerly. Instead, the same QUEUE_CLEANUP_DELAY threshold used for cross-database backends applies, so it is signaled only once it has fallen far enough behind. This preserves the anti-thundering-herd property.
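The per-backend decision can be sketched as follows. Assumptions: positions are reduced to page numbers, the threshold value of 4 pages is chosen for illustration, and the function name and return labels are hypothetical.

```python
QUEUE_CLEANUP_DELAY = 4  # pages; value assumed for this sketch


def plan_for_backend(interested, pos_page, old_head_page, new_head_page,
                     is_advancing):
    """Decide what a notifier does with one listening backend."""
    if interested:
        return "signal"          # it has notifications to read
    if pos_page == old_head_page and not is_advancing:
        return "direct-advance"  # skip it ahead without waking it
    if new_head_page - pos_page >= QUEUE_CLEANUP_DELAY:
        return "signal"          # laggard: wake it so the tail can move
    return "leave"               # slightly behind; not worth a wakeup yet
```

Under these assumptions, an uninterested backend one page behind the new head is left asleep, while one six pages behind is signaled.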

Joel's benchmarks showed this matters significantly on macOS (3x throughput difference) but negligibly on Linux, suggesting OS-level differences in context switch cost.

Local List vs. Hash Table for listenChannels

Heikki Linnakangas noted the existing linked-list listenChannels is O(n) for lookups. The patch converts it to a local HTAB (localChannelTable), making IsListeningOn() O(1). This matters when backends listen on thousands of channels.

pendingNotifyChannels Deduplication

Chao Li identified that building the list of unique channels for SignalBackends() was O(N²) when a transaction issues thousands of NOTIFYs with different channels. The solution integrates with the existing AsyncExistsPendingNotify deduplication: when notification count exceeds MIN_HASHABLE_NOTIFIES, a channelSet (hash table of unique channel names) is maintained alongside the notifications list.
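A rough sketch of the threshold-switch idea, with stated assumptions: the threshold value of 16 is assumed, the class and attribute names other than MIN_HASHABLE_NOTIFIES are hypothetical, and the existing deduplication of identical (channel, payload) pairs is not modeled.

```python
MIN_HASHABLE_NOTIFIES = 16  # threshold value assumed for this sketch


class PendingNotifies:
    """Toy model: linear scan while small, hash set once the list grows."""

    def __init__(self):
        self.events = []           # (channel, payload) in arrival order
        self.unique_channels = []  # distinct channels, first-seen order
        self.channel_set = None    # built lazily once the list is large

    def add(self, channel, payload=""):
        self.events.append((channel, payload))
        if self.channel_set is None:
            # Few notifications so far: a linear scan is cheap enough.
            if channel not in self.unique_channels:
                self.unique_channels.append(channel)
            if len(self.events) >= MIN_HASHABLE_NOTIFIES:
                self.channel_set = set(self.unique_channels)
        elif channel not in self.channel_set:
            # Many notifications: constant-time membership test.
            self.channel_set.add(channel)
            self.unique_channels.append(channel)
```

Small transactions never pay for building the set, and large ones avoid the quadratic scan when collecting distinct channels.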

Platform-Specific Observations

Tom Lane's benchmarking revealed a striking difference between macOS and Linux.

The difference is attributed to macOS's sleep/wakeup implementation being less optimized than Linux's. The QUEUE_CLEANUP_DELAY gate in v35 causes more latency on macOS because wakeups are more expensive there. Tom concluded this is a macOS kernel issue, not something to optimize in async.c.

Locking Architecture

The patch maintains a strict lock ordering hierarchy to prevent deadlocks:

  1. Heavyweight lock on "database 0": Serializes all NOTIFY writers (existing)
  2. NotifyQueueLock (LWLock, exclusive): Protects queue head/tail and per-backend status
  3. dshash partition locks: Per-partition locks within globalChannelTable

SignalBackends() holds NotifyQueueLock while consulting the dshash, taking a shared lock on each partition it reads. LISTEN/UNLISTEN operations need only dshash partition locks, enabling concurrent LISTEN operations on different channels.

The partitioned nature of dshash means LISTEN/UNLISTEN on different channels proceeds fully in parallel - a critical property for workloads with many distinct channels.
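The ordering of the two lower levels can be illustrated with ordinary thread locks. This sketch makes several assumptions: the partition count and hash function are arbitrary, Python locks are exclusive (so the shared mode of the partition lock is not modeled), the heavyweight lock is omitted, and all names are hypothetical.

```python
import threading
import zlib

NUM_PARTITIONS = 8  # assumed for this sketch

notify_queue_lock = threading.Lock()
partition_locks = [threading.Lock() for _ in range(NUM_PARTITIONS)]
# One dict per partition: (db_oid, channel) -> set of listener ids
partitions = [dict() for _ in range(NUM_PARTITIONS)]


def partition_for(db_oid, channel):
    """Map a key to a partition with a stable hash."""
    return zlib.crc32(f"{db_oid}/{channel}".encode()) % NUM_PARTITIONS


def listen(db_oid, channel, proc_number):
    """LISTEN takes only the partition lock for its own key."""
    index = partition_for(db_oid, channel)
    with partition_locks[index]:
        partitions[index].setdefault((db_oid, channel), set()).add(proc_number)


def signal_targets(db_oid, channels):
    """Notifier path: queue lock first, then partition locks, never reversed."""
    targets = set()
    with notify_queue_lock:
        for channel in channels:
            index = partition_for(db_oid, channel)
            with partition_locks[index]:
                targets |= partitions[index].get((db_oid, channel), set())
    return targets
```

Because `listen()` never takes the queue lock and `signal_targets()` always takes it before any partition lock, no two code paths acquire the same pair of locks in opposite order.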