[PATCH] Don't call ereport(ERROR) from recovery target GUC assign hooks

First seen: 2026-04-13 08:21:39+00:00 · Messages: 11 · Participants: 5

Latest Update

2026-06-01 · claude-opus-4-6

Monthly Summary: Moving recovery target conflict detection out of GUC assign hooks (May 2026)

Problem Statement

PostgreSQL's five recovery_target_* GUCs (recovery_target, recovery_target_lsn, recovery_target_name, recovery_target_time, recovery_target_xid) violate the GUC subsystem's contract that assign hooks must not fail. Each assign hook calls ereport(ERROR) to enforce mutual exclusion (only one recovery target may be set at a time). This was introduced by commit f2cbffc7a618 and has been flagged with an XXX comment for years. It's tolerated only because these are PGC_POSTMASTER GUCs — failures abort startup before corrupted guc.c state is observable.

The fundamental issue: GUC hooks operate on a single variable at a time and are not the right place for cross-variable invariants. The correct location is a post-load validation pass, which validateRecoveryParameters() in xlogrecovery.c already provides.

Patch Evolution

v1: Naive refactoring

Moved the conflict check to validateRecoveryParameters() using GetConfigOption() to count non-empty targets. Two regressions identified:

  1. Deferred error detection (Greg): The check ran after an ArchiveRecoveryRequested early-return, meaning misconfigurations wouldn't surface until recovery was actually attempted rather than at startup.
  2. Empty-string clobber (Fujii): Removing conflict checks exposed a latent bug where an empty assignment to one GUC (e.g., recovery_target_time='') would reset recoveryTarget to UNSET, silently wiping a previously-set valid target from a different GUC.

v2: Addressing both regressions

v3: Guarded reset (key design breakthrough)

Introduced a conditional else branch in each assign hook:

else if (recoveryTarget == MY_TYPE)
    recoveryTarget = RECOVERY_TARGET_UNSET;

Each hook only resets recoveryTarget if its own target type is currently active. This elegantly handles both cases:

Added seven new TAP test cases in 003_recovery_targets.pl covering all combinations, using CLI path (since postgresql.conf deduplicates keys).

v4: Windows portability fix

CFBot caught POSIX single-quote issues on Windows. Fixed by dropping single quotes around GUC values in --options arguments and switching to ISO 8601 timestamps with T separator (no spaces, no quoting needed). No functional change.

Design Principles Established

Current Status

v4 is the latest patch version, addressing all identified functional issues and test portability. The approach is architecturally sound and has consensus among reviewers.

History (1 prior analysis)
2026-06-01 · claude-opus-4-6

New contribution: scott@scottray.io proposes UX improvements and extended test coverage (v1-0001 atop v4)

A new participant (Scott Ray) enters the thread with a concrete patch (v1-0001, layered on top of Shin's v4) addressing three distinct items:

1. ALTER SYSTEM + restart timing trap (new issue raised)

Scott identifies a real-world operational scenario not previously discussed: a user sets two conflicting recovery targets via ALTER SYSTEM, reloads (which logs "cannot be changed without restarting" but otherwise succeeds), and only hits the FATAL much later at next restart. The conflict is latent and invisible until the worst possible moment.

Scott asks whether CheckRecoveryTargetConflicts (or something similar) should also inspect pending GUC values during reload to surface the conflict early. This is architecturally interesting — it extends the "validate cross-GUC invariants in a dedicated pass" principle from startup-time to reload-time. However, it's explicitly framed as a follow-up question rather than something included in the patch.

2. Richer error detail in conflict message (implemented)

The patch improves the FATAL message by listing which recovery targets are actually set and their values, rather than just enumerating all five possible GUC names. This moves the conflict detail from generic ("at most one of these five...") to specific ("recovery_target_time = '2026-01-01 00:00:00', recovery_target_xid = '700'"). The generic text moves to a HINT. This is a straightforward UX improvement that helps operators diagnose misconfiguration without grepping config files.

3. Extended TAP test coverage for all five GUCs

The existing v4 tests only exercised recovery_target_xid for the cleared-then-set CLI scenario. Scott's patch adds analogous test cases for recovery_target, recovery_target_lsn, recovery_target_name, and recovery_target_time, ensuring all five GUCs have coverage for the same-GUC clearing behavior validated by v3's conditional else branch.

Assessment

This is a constructive review-with-patch contribution. The UX improvement (item 2) is the most likely to be accepted as-is. The ALTER SYSTEM timing trap (item 1) is a genuine usability gap but would require more invasive changes to the reload path and is appropriately scoped as a separate discussion. The test coverage extension (item 3) is straightforward and uncontroversial.