[PATCH] Don't call ereport(ERROR) from recovery target GUC assign hooks

First seen: 2026-04-13 08:21:39+00:00 · Messages: 8 · Participants: 4

Latest Update

2026-05-06 · opus 4.7

Analysis: Moving recovery target conflict detection out of GUC assign hooks

The architectural problem

PostgreSQL's GUC subsystem (src/backend/utils/misc/guc.c) is built around a strict contract documented in its README: assign hooks must not fail. The contract exists because set_config_option and ProcessConfigFile commit GUC state changes in a specific sequence, and the rollback path is not designed to unwind from the middle of an assign callback. Validation that can fail belongs in a check_hook, which runs earlier and returns a boolean.

The five recovery_target_* GUCs (recovery_target, recovery_target_lsn, recovery_target_name, recovery_target_time, recovery_target_xid) violate this contract. Each assign hook implements mutual exclusion: if any other recovery target is already set to a non-empty value, error_multiple_recovery_targets() calls ereport(ERROR). This was introduced by commit f2cbffc7a618 (cited by Michael Paquier). An XXX comment in the source has flagged this as "broken by design" for years, tolerated only because all five GUCs are PGC_POSTMASTER — a failure there aborts postmaster startup anyway, so the corrupted guc.c state is never observed.

The violation has two theoretical consequences:

  1. Ordering dependencies — the outcome depends on which of the conflicting GUCs is assigned first, which is fragile.
  2. Inconsistent internal state — the error unwinds with guc.c mid-transaction; if any of these GUCs were ever promoted to a lower context (SIGHUP/SUSET) the latent bug would surface.

Why cross-GUC validation fundamentally doesn't fit assign hooks

Michael Paquier's comment crystallizes the core design lesson: GUC hooks operate on a single variable at a time; they are not the right place for cross-variable invariants. A check_hook sees only the new value of one GUC — it cannot reliably reason about combinations because other GUCs may not have been parsed yet (ProcessConfigFile applies values in textual order). By the time all values are known, the hooks have already fired. Any attempt to enforce "at most one of these five may be set" inside a per-variable hook is structurally unsound.

The correct pattern is a post-load validation pass, which is exactly what validateRecoveryParameters() in xlogrecovery.c already is — it runs from InitWalRecovery() in the startup process, after all GUCs are fully loaded, and already performs other cross-parameter checks for recovery configuration.

The v1 patch and its two regressions

The v1 approach was mechanically simple: delete the conflict check from each assign hook, add a loop in validateRecoveryParameters() that uses GetConfigOption() to count non-empty recovery target GUCs, FATAL if > 1. Two behavioral regressions surfaced in review:

Regression 1 (Greg): deferred error detection

validateRecoveryParameters() short-circuits when ArchiveRecoveryRequested is false (no recovery.signal/standby.signal). Under v1, a misconfigured postgresql.conf with two conflicting recovery targets would boot successfully, and the conflict would only surface later when an operator dropped a signal file to start PITR. The original behavior caught the mistake at pg_ctl start time. Shin's commit message framed this as "arguably more correct" (since the GUCs are inert outside recovery), but Greg pushed back: from an operator's perspective, losing early detection of a config error is a real downgrade regardless of whether the GUC is "effective."

Regression 2 (Fujii): empty-string clobber

Fujii Masao identified a more serious, subtler bug. Each original assign hook contained logic roughly equivalent to:

if (newval_is_nonempty) {
    if (any_other_target_set) error_multiple_recovery_targets();
    recoveryTarget = RECOVERY_TARGET_XID;  /* or whatever */
    recoveryTargetXid = parsed_value;
} else {
    recoveryTarget = RECOVERY_TARGET_UNSET;   /* <-- the clobber */
}

The else branch unconditionally resets recoveryTarget to UNSET when the GUC's new value is empty. Because GUCs are assigned in the order they appear in the config/command line, a sequence like:

recovery_target_xid = '9999'
recovery_target_time = ''

would: (1) set recoveryTarget = RECOVERY_TARGET_XID via the xid hook, then (2) the time hook fires with empty newval and hits its else branch, resetting recoveryTarget back to UNSET. In master this is masked because the multiple-target error path catches conflicts; but more importantly, an explicit empty assignment silently wipes an earlier valid setting.

v1 removed the conflict-checking code but kept the else UNSET clobber, turning a latent bug into an observable one: recovery_target_xid=9999 followed by recovery_target_time='' would start recovery with no target at all — a dangerous silent behavior for PITR where you'd expect recovery to stop at xid 9999 but instead it would replay to the end of WAL.

Michael Paquier initially argued this was fine ("empty == default == do nothing"), but Fujii pushed back that operator intent matters: if someone writes recovery_target_xid=9999, they expect that target to take effect. Michael conceded.

v2 resolution

Shin's v2 addresses both regressions:

  1. Conflict check factored into CheckRecoveryTargetConflicts(), called from validateRecoveryParameters() before its ArchiveRecoveryRequested early return. This restores early detection — the check runs at every startup regardless of whether recovery is requested, preserving master's user-visible behavior. Shin kept the check in the startup process rather than moving it to PostmasterMain (Greg had floated that option); this is the right call because startup process is where other recovery GUC validation already lives, keeping related logic co-located.

  2. Removed the else recoveryTarget = UNSET branch from each assign hook. Now an empty assignment to recovery_target_time no longer clobbers a previously-set target type. Combined with the post-load conflict check, semantics are preserved: multiple non-empty targets → FATAL; one non-empty target wins regardless of assignment order; empty strings are no-ops.

  3. Trivial fixes: maycan in errdetail (per error-style-guide), updated TAP test.

The v2 edge case Fujii identified

Fujii's final message exposes a remaining semantic wrinkle with v2:

postgres -D data -c "recovery_target_xid=700" -c "recovery_target_xid="

In master, the second (empty) assignment triggers the assign hook's else branch, resetting recoveryTarget to UNSET — so recovery_target_xid is effectively unset, matching naive expectation that "the last value on the command line wins."

Under v2, the hook no longer has the else branch. The question is what recoveryTargetXid ends up as. The GUC value itself is empty (guc.c stores the last assignment), but the module-local recoveryTargetXid variable was set to 700 on the first hook call and never reset. So recovery runs with xid=700 even though the user explicitly cleared the GUC.

This is the flip side of the v2 fix: by removing the else UNSET clobber, v2 correctly handles the cross-GUC case (xid=9999 then time='') but breaks the same-GUC case (xid=700 then xid=''). The correct resolution is probably to recompute the recoveryTarget* module-local state from current GUC values at the start of validateRecoveryParameters() rather than relying on assign hooks to maintain it incrementally. That is, treat the assign hooks as pure GUC storage and derive all recovery state in one pass at validation time — fully decoupling the module state from the GUC assignment sequence.

As of the thread's last message, this remains open. The fix is conceptually clear but requires more restructuring than v2 attempted.

Weight of opinions

Broader implication

This thread is a textbook case of why GUC hooks should be "dumb": store, parse, range-check. Any logic involving multiple GUCs, or even reasoning about sequence-of-assignments semantics within one GUC, belongs in a well-defined post-load validation phase. The recovery_target_* fix is small, but the pattern — factor cross-cutting validation into a single pass that reads final GUC values — is applicable anywhere else guc.c has similar violations (a grep for ereport(ERROR inside assign_* functions would be worthwhile).