Add errdetail() with PID and UID about source of termination signal

First seen: 2026-02-18 07:32:25+00:00 · Messages: 54 · Participants: 10

Latest Update

2026-05-06 · opus 4.7

Analysis: Adding errdetail() with PID/UID of Signal Sender

Core Problem and Motivation

When a PostgreSQL backend is terminated via SIGTERM (e.g., from pg_terminate_backend(), pg_ctl stop, or an external kill), the user-visible message has historically been terse:

FATAL: terminating connection due to administrator command

This leaves DBAs with no forensic information about who sent the signal. In multi-tenant or shared environments, distinguishing between a legitimate pg_ctl stop, a rogue administrator, an OOM-killer, or a misbehaving monitoring script requires correlating timestamps across multiple logs. The proposal from Jakub Wartak aims to surface the sender's PID and UID directly in the error detail, leveraging POSIX SA_SIGINFO / siginfo_t to extract si_pid and si_uid inside the signal handler.

Architecturally this matters because signals are one of the few cross-process communication channels in PostgreSQL that previously carried no provenance into the logging subsystem. Adding this plumbing turns an opaque event into an auditable one.

Evolution of the Design — Three Successive Architectures

v1–v6: Direct globals set inside the wrapper signal handler

The early patches added HAVE_SA_SIGINFO autoconf/meson probes and extended wrapper_handler() in src/port/pqsignal.c to take the siginfo_t *info, void *context signature. The wrapper stashed info->si_pid and info->si_uid into file-scope globals (proc_die_sender_pid, proc_die_sender_uid, later ProcDieSenderPid/Uid), which die() / ProcessInterrupts() / SyncRepWaitForLSN() would then read and format into errdetail().

Review cycles refined several secondary concerns:

Andrew Dunstan committed v6 on 2026-04-07.

Andres's objection — layering violation

Shortly after commit, Andres Freund raised three serious objections:

  1. Translatability: conditional errdetail() appended via the proc_die_sender_pid == 0 ? 0 : errdetail(...) trick produces a separate translatable string for every callsite.
  2. Layering: pqsignal.c's wrapper_handler has no business knowing that SIGTERM specifically means "process death" and thus no business setting ProcDieSenderPid. Not every SIGTERM handler wants this (e.g., some auxiliary processes have custom handlers).
  3. Signal nesting: On platforms where signal handlers can be interrupted by other handlers, globals set by the wrapper can be overwritten before the inner handler reads them — a classic async-signal race.

Andres's verdict: "nowhere near ready to have been committed." This is notable because Andres is a senior committer with deep expertise in PostgreSQL's signal/latch infrastructure (author of much of the latch and async notification code), so his architectural objections carried decisive weight.

The rewrite: extending SIGNAL_ARGS itself

Andres sketched the correct fix: rather than smuggle info through globals, change the signal handler ABI so every handler receives a platform-agnostic pg_signal_info * as a second argument. Key components:

typedef struct pg_signal_info {
    pid_t pid;   /* 0 if unknown */
    uid_t uid;   /* only meaningful when pid != 0 */
} pg_signal_info;

#define SIGNAL_ARGS int postgres_signal_arg, const pg_signal_info *pg_siginfo

The wrapper in pqsignal.c translates platform-specific siginfo_t (when SA_SIGINFO is available) into the postgres-neutral struct, or passes a zeroed one otherwise. Individual handlers like die() can then set ProcDieSenderPid/Uid from their argument — no globals shared across the handler boundary, no nesting hazard, and pqsignal.c stops knowing anything domain-specific.

This required an ecosystem-wide change: every pass of SIG_IGN / SIG_DFL to pqsignal() now produces a function-pointer type mismatch warning. The resolution (suggested by Chao) was to define:

#define PG_SIG_DFL (pqsigfunc) (pg_funcptr_t) SIG_DFL
#define PG_SIG_IGN (pqsigfunc) (pg_funcptr_t) SIG_IGN

The intermediate cast through pg_funcptr_t silences -Wcast-function-type-mismatch from Clang. Every callsite across the tree was updated from SIG_IGN/SIG_DFL to PG_SIG_IGN/PG_SIG_DFL.

Windows / mingw portability pain

Jakub did the heavy lifting on Windows. Several compounding issues:

Post-commit fallout

OpenBSD: si_pid always zero for SIGTERM

The buildfarm's OpenBSD members failed the new TAP test. Tom Lane reproduced locally on OpenBSD 7.7: HAVE_SA_SIGINFO is defined, SA_SIGINFO is set, but info->si_pid and si_uid arrive as zero even for pg_terminate_backend() from a different session. Jakub confirmed this is intentional in the OpenBSD kernel — sys/kern/kern_sig.c:initsiginfo() zeros the struct and only waitid(2) / dowait*() populate si_pid. This matches OpenBSD's general posture of minimizing cross-process information leakage.

Resolution: make the regex in the TAP test match the DETAIL line as optional rather than required. No runtime code change — the feature simply degrades gracefully on platforms that don't populate the fields.

Information-disclosure concern

Evan Chao raised a late concern: exposing UID 501 (or any OS-level UID) to an ordinary unprivileged client is a potential information leak in shared hosting. Proposed fix: use errdetail_log() instead of errdetail(), which routes the detail to the server log only, hiding it from the client. Jakub concurred. This is the last unresolved thread item in the captured discussion.

Key Takeaways

  1. Signal handler ABIs are a commitment. The v6 commit attempted to retrofit new information through globals; the proper fix required changing SIGNAL_ARGS itself and touching every signal registration in the tree. This is the kind of change that tends to get rejected later in review exactly as Andres did.
  2. siginfo_t portability is weaker than the feature-test macro suggests. HAVE_SA_SIGINFO being defined says nothing about whether the kernel bothers to fill si_pid/si_uid. OpenBSD's behavior is a reminder that POSIX is a floor, not a contract.
  3. Function-pointer type-checking in modern Clang (-Wcast-function-type-mismatch) is increasingly strict and interacts badly with SIG_IGN/SIG_DFL being void(*)(int) when you extend the handler signature. The pg_funcptr_t double-cast pattern is the escape hatch.
  4. The "fix for a feature committed before feature freeze" exception (invoked by Andrew when Kirill Reshke questioned the post-freeze rework commit) is a legitimate but easily abused mechanism; in this case the rework arguably went well beyond a fix and was probably only acceptable because the alternative was reverting entirely.