Add Wait Events for Server Logging Destination Writes
Core Problem
PostgreSQL's observability infrastructure has a blind spot in the logging subsystem. When a backend calls write(2) to emit log output — either through the syslogger pipe (when logging_collector = on) or directly to stderr/console — these writes are not wrapped in wait events. This means that if the write blocks (due to a full pipe buffer, a slow log device, or a slow console), pg_stat_activity.wait_event shows NULL.
The architectural significance is subtle but important: NULL in wait_event is conventionally interpreted as "the backend is on CPU doing useful work." This creates a false positive for CPU activity when the backend is actually stalled on I/O in the logging path. In production scenarios with high log volume or slow log destinations, this can mask the true cause of backend latency. DBAs investigating performance issues would see backends appearing to burn CPU cycles when they're actually blocked waiting for log writes to complete.
This is particularly insidious because:
- Logging writes happen inside the error-reporting path — any allocation or complex instrumentation risks recursion or deadlock.
- Pipe writes can block indefinitely — the syslogger pipe has a finite kernel buffer; if syslogger falls behind, backends block on
write(2). - The stall is proportional to log volume — high-verbosity logging configurations (common during debugging) are exactly when observability matters most.
Proposed Solution
The patch introduces two new WaitEventIO events:
| Event | Covers | Context |
|---|---|---|
IO/SysloggerWrite |
write(2) to the syslogger pipe |
write_pipe_chunks() in elog.c |
IO/StderrWrite |
write(2) to stderr; WriteConsoleW() on Windows |
write_console() in elog.c |
The implementation uses the existing pgstat_report_wait_start() / pgstat_report_wait_end() helpers, which are:
- Allocation-free — they just write to shared memory (
MyProc->wait_event_info) - Safe to call from inside error reporting — no risk of recursive elog calls
- Minimal overhead — a single atomic store on entry and exit
Patch Structure
- 0001: Adds both wait event constants and wraps the
write(2)calls inwrite_pipe_chunks()andwrite_console(). Also adds the required#include "utils/wait_event.h"toelog.c(fixed in v2 after CI caught a missing include). - 0002: Separately covers the Windows
WriteConsoleW()path, split out because it's platform-specific code.
Design Decisions and Tradeoffs
-
Wrapping only the leaf write call: The instrumentation is placed tightly around just the
write(2)/WriteConsoleW()system call, not around the entire log emission sequence. This ensures the wait event precisely reflects time spent blocked in the kernel, not time spent formatting the message. -
Two separate events rather than one generic "LogWrite": This allows operators to distinguish between the two fundamentally different blocking scenarios — pipe backpressure (syslogger) vs. device I/O (stderr). The remediation for each is different (tune syslogger vs. fix the underlying log device).
-
Splitting Windows into a separate patch: The author expressed uncertainty about this but it's reasonable — it keeps platform-specific
#ifdef WIN32changes reviewable separately and allows 0001 to be committed independently if there are objections to the Windows approach.
Empirical Validation
The author provided before/after sampling data from a synthetic workload (8 backends emitting large RAISE LOG lines, sampled every 50ms for 20 seconds):
- Syslogger pipe path: On master, 100% of samples show
NULL(invisible wait). With the patch, 99.1% correctly showIO/SysloggerWrite. - Stderr path: On master, 100%
NULL. With patch, 90.7% showIO/StderrWrite(the remaining 9.3% NULL is expected — some samples catch the backend doing actual CPU work between writes).
This demonstrates the patch surfaces previously invisible wait time without introducing measurable overhead.
Architectural Context
This patch fits into a broader multi-year effort in PostgreSQL to eliminate "dark" wait states. Historically, many blocking operations showed NULL in pg_stat_activity, and successive releases have added granular wait events (LWLock, IO, IPC, etc.). The logging path was an overlooked gap because logging is often considered "just works" infrastructure — until it doesn't, at which point the lack of observability makes diagnosis difficult.
The choice to classify these as WaitEventIO (rather than WaitEventIPC) is correct: even though the syslogger pipe is inter-process communication, the blocking behavior is I/O-like (kernel buffer exhaustion), and the fix (faster log destination) is an I/O concern.