race condition when writing pg_control

2026-05-06 · opus 4.7

Core Problem: Two Distinct Race Conditions on pg_control

This thread starts as a narrowly scoped bug report but evolves, over six years, into a much deeper architectural discussion about how PostgreSQL reads and writes the pg_control file. Two fundamentally different bugs are covered, both producing the same user-visible symptom: FATAL: incorrect checksum in control file.

Bug #1 (2020): Missing ControlFileLock in xlog_redo()

The original bug Nathan Bossart reported is a classic shared-memory data race. CreateRestartPoint() runs in the checkpointer process and calls UpdateControlFile(), which computes a CRC over ControlFileData and writes the struct out while holding ControlFileLock. However, xlog_redo(), running in the startup process, mutates fields of that same shared ControlFileData (notably checkPointCopy.nextFullXid and the parameters carried by XLOG_PARAMETER_CHANGE records) without taking ControlFileLock. If the startup process changes a field after the checkpointer has computed the CRC but before it has written the struct, the file lands on disk with a checksum that does not match its contents. On the next restart, the control file fails its checksum check and the server refuses to start.
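The interleaving described above can be replayed deterministically in a few lines. This is only an illustrative sketch: ToyControlFile, toy_crc32c() and replay_unlocked_interleaving() are hypothetical names, the struct is a three-field stand-in for ControlFileData, and the "processes" are simply consecutive statements rather than real concurrent execution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for ControlFileData (assumption: not the real layout). */
typedef struct ToyControlFile {
    uint64_t next_full_xid;    /* mutated by the "startup process" */
    uint32_t max_connections;  /* e.g. set from a parameter-change record */
    uint32_t crc;              /* last field: covers all bytes before it */
} ToyControlFile;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t toy_crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Returns 1 if the stored CRC matches the payload, 0 otherwise. */
static int toy_crc_ok(const ToyControlFile *cf)
{
    return toy_crc32c(cf, offsetof(ToyControlFile, crc)) == cf->crc;
}

/* Replays the bad interleaving step by step. Returns 1 if the image
 * that would reach disk fails its own checksum. */
int replay_unlocked_interleaving(void)
{
    ToyControlFile shared;

    memset(&shared, 0, sizeof(shared));
    shared.next_full_xid = 1000;
    shared.max_connections = 100;

    /* Checkpointer: computes the CRC over the current contents. */
    shared.crc = toy_crc32c(&shared, offsetof(ToyControlFile, crc));
    assert(toy_crc_ok(&shared));

    /* Startup process: mutates a field without the lock, in between. */
    shared.next_full_xid = 1001;

    /* Checkpointer: "writes" the struct, now a stale CRC over new data. */
    ToyControlFile on_disk = shared;

    return !toy_crc_ok(&on_disk);
}
```

Calling replay_unlocked_interleaving() returns 1: the written image carries a CRC computed before the field changed, which is the shape of failure the reader later reports as an incorrect checksum.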

Thomas Munro's archaeology pinpoints the regression: as of commit 35af5422f64 (2006), RecoveryRestartPoint() ran UpdateControlFile() directly in the startup process immediately after the mutation, so there was nothing to interleave with. Commit cdd46c76548 (2009) moved restart points into a separate process (the bgwriter at the time, later the checkpointer), inadvertently introducing the race.

Fujii Masao then noted that XLogReportParameters() had the same shape of bug: it writes the control file without the lock. Michael Paquier escalated the hygiene concern: the right fix is not just to plug individual holes but to assert inside UpdateControlFile() that ControlFileLock is held, so future callers are forced to think about it.

The only legitimate unlocked caller is early in StartupXLOG() (near the backup_label check), before concurrent readers exist. The debate was whether to (a) thread a boolean already_locked parameter through, or (b) simply acquire the lock unconditionally in that one spot. Consensus (Thomas, Fujii Masao weakly, Amul Sul, Nathan) favored taking the lock: it is simpler, adds no measurable contention, and needs no special case. Michael mildly preferred the parameter approach on purist grounds ("we don't need the lock"). Thomas pushed the locking fix and the assertion as a single squashed commit in June 2020 and back-patched it.
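The shape of the 2020 fix (take the lock around every mutation, assert it inside the writer, and let the early-startup caller take it too) can be sketched as follows. All names here are hypothetical stand-ins: ToyLock only tracks ownership and does not block, and toy_update_control_file() merely counts writes instead of computing a CRC and calling write(). It illustrates the discipline, not the committed PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy lock that only tracks ownership; a real LWLock also blocks waiters. */
typedef struct ToyLock {
    bool held_by_me;
} ToyLock;

static void toy_lock_acquire(ToyLock *lock)
{
    assert(!lock->held_by_me);
    lock->held_by_me = true;
}

static void toy_lock_release(ToyLock *lock)
{
    assert(lock->held_by_me);
    lock->held_by_me = false;
}

typedef struct ToyControlFile {
    uint64_t next_full_xid;
    uint32_t writes;          /* counts simulated writes to disk */
} ToyControlFile;

static ToyLock control_file_lock;
static ToyControlFile control_file;

/* Writer: in assert-enabled builds, refuses to run without the lock. */
void toy_update_control_file(void)
{
    assert(control_file_lock.held_by_me);
    control_file.writes++;    /* stands in for CRC + write + fsync */
}

/* Redo-side mutation: the field update happens under the lock. */
void toy_redo_set_next_xid(uint64_t xid)
{
    toy_lock_acquire(&control_file_lock);
    control_file.next_full_xid = xid;
    toy_lock_release(&control_file_lock);
}

/* Early-startup caller: nothing can race yet, but it takes the lock
 * anyway so that the assertion needs no special case. */
void toy_early_startup_write(void)
{
    toy_lock_acquire(&control_file_lock);
    toy_update_control_file();
    toy_lock_release(&control_file_lock);
}
```

The design choice mirrors option (b) above: the early caller pays for an uncontended lock acquisition, and in exchange the writer has a single unconditional invariant instead of an already_locked flag.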

Bug #2 (2024+): Torn Reads of pg_control (EXEC_BACKEND / basebackup)

In May 2024 Melanie Plageman observed the same FATAL message on a Windows buildfarm run. Nathan initially thought it unrelated to the 2020 fix, and Andres Freund reframed the problem entirely: the on-disk file can be read in a torn state even if every writer is perfectly locked. This is a file-system semantics problem, not an LWLock problem.

Tom Lane's reflex — "writes to pg_control had better be atomic, or we have big trouble" — is countered by Andres: on ext4 and NTFS, a concurrent read() can observe a partially-written file even when the write() is a single syscall of a sub-sector payload. POSIX's atomicity guarantees for concurrent read/write are famously ambiguous (Andres links the well-known cks.utoronto.ca post).

Thomas diagnosed the specific failure mode: LocalProcessControlFile() is invoked very early in every EXEC_BACKEND child, i.e. every backend started by re-executing the postgres binary rather than by a plain fork(), which is the only mode available on Windows. That call read()s pg_control from disk without any interlock, possibly while the checkpointer is rewriting it. A filesystem that permits torn reads, combined with EXEC_BACKEND, is the toxic combination that makes buildfarm animals like culicidae flap.
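Why a torn read surfaces as a checksum failure even when every writer is correctly locked can be shown with a small deterministic model. The sketch below assumes a toy struct (ToyControlFile) and a hypothetical helper torn_read_is_valid(split) that builds the buffer a reader would see if the first split bytes came from the new image and the rest from the old one. It simulates the outcome of the race; it does not exercise a real filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the on-disk control file (assumption: not the real layout). */
typedef struct ToyControlFile {
    uint64_t redo;      /* changes at every checkpoint */
    uint64_t next_xid;
    uint32_t state;
    uint32_t crc;       /* last field: covers all bytes before it */
} ToyControlFile;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t toy_crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

static void toy_finalize(ToyControlFile *cf)
{
    cf->crc = toy_crc32c(cf, offsetof(ToyControlFile, crc));
}

/* Builds the image a reader sees when the first `split` bytes come from
 * the new file contents and the remainder from the old contents.
 * Returns 1 if that image passes the CRC check, 0 if it does not. */
int torn_read_is_valid(size_t split)
{
    ToyControlFile old_img = { .redo = 0x1000, .next_xid = 500, .state = 1 };
    ToyControlFile new_img = { .redo = 0x2000, .next_xid = 501, .state = 1 };
    unsigned char buf[sizeof(ToyControlFile)];
    ToyControlFile seen;

    assert(split <= sizeof(buf));
    toy_finalize(&old_img);   /* both images are individually valid */
    toy_finalize(&new_img);

    memcpy(buf, &new_img, split);
    memcpy(buf + split, (const unsigned char *) &old_img + split,
           sizeof(buf) - split);
    memcpy(&seen, buf, sizeof(seen));

    return toy_crc32c(&seen, offsetof(ToyControlFile, crc)) == seen.crc;
}
```

With split equal to 0 or sizeof(ToyControlFile) the reader sees a complete old or new image and the check passes; with split equal to 8 it sees the new redo value next to the old CRC and the check fails, although neither writer did anything wrong.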

Design Space for Fixing the Torn Read

Thomas lays out why the obvious fixes don't work:

"Just take ControlFileLock in the child" — impossible. The child is too early in startup to use LWLocks; shared memory attachment and lock infrastructure aren't ready.
"Have the postmaster take a copy under the lock and pass it via BackendParameters" — the postmaster itself cannot acquire LWLocks (it must never block on backend state), so it can't safely snapshot the live shared ControlFileData.
"Reorder startup so locks are available earlier" — risky; affects preload library ordering and children that don't attach to shared memory. Not back-patchable.

Thomas's "Proto-ControlFile" Proposal (May 2024)

Pass a copy of the control file that the postmaster read once at its own startup (before any checkpointer exists to race with it) through BackendParameters to each EXEC_BACKEND child. This "proto-controlfile" is used only for the handful of fields LocalProcessControlFile() extracts — WAL segment size, page size, checksum version — which are effectively immutable after initdb (or at least across a running cluster's lifetime). Once the child finishes early init, it attaches to the real shared-memory ControlFileData for all subsequent uses.

This elegantly mirrors what non-EXEC_BACKEND children get "for free" via fork() inheritance: a snapshot of postmaster memory at fork time.
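A minimal sketch of the proto-controlfile idea follows. ToyControlFile, ToyBackendParams and ToyEarlySettings are hypothetical stand-ins for ControlFileData, BackendParameters and whatever LocalProcessControlFile() derives; the field selection is an assumption based on the fields named above, not the actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for ControlFileData. */
typedef struct ToyControlFile {
    uint32_t xlog_seg_size;          /* fixed at initdb */
    uint32_t blcksz;                 /* fixed at initdb */
    uint32_t data_checksum_version;  /* treated as stable in this sketch */
    uint64_t redo;                   /* changes at every checkpoint */
} ToyControlFile;

/* Hypothetical analogue of BackendParameters: handed to each child. */
typedef struct ToyBackendParams {
    ToyControlFile proto_control_file;
} ToyBackendParams;

/* What a child needs before it can attach to shared memory. */
typedef struct ToyEarlySettings {
    uint32_t wal_segment_size;
    uint32_t block_size;
    uint32_t checksum_version;
} ToyEarlySettings;

/* Postmaster side: snapshot taken once at startup, before any
 * checkpointer exists that could be rewriting the file. */
void toy_postmaster_capture(ToyBackendParams *bp,
                            const ToyControlFile *read_at_startup)
{
    assert(bp != NULL && read_at_startup != NULL);
    bp->proto_control_file = *read_at_startup;
}

/* Child side: uses only the stable fields and never re-reads the file. */
ToyEarlySettings toy_child_early_init(const ToyBackendParams *bp)
{
    ToyEarlySettings s;

    assert(bp != NULL);
    s.wal_segment_size = bp->proto_control_file.xlog_seg_size;
    s.block_size = bp->proto_control_file.blcksz;
    s.checksum_version = bp->proto_control_file.data_checksum_version;
    return s;
}
```

Note that the copy also carries redo, frozen at its postmaster-start value. The child above never looks at it, but nothing in the types prevents a later caller from doing so.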

Noah Misch's Critique and Refinement (July 2024)

Noah initially endorses the idea ("recreates what !EXEC_BACKEND backends inherit from postmaster"), then corrects himself: !EXEC_BACKEND children actually inherit ControlFile as a pointer into shared memory, so they see live updates. The value read through ControlFile->checkPointCopy.redo does change over time as checkpoints happen. So the proto-copy is not a perfect imitation of fork inheritance.

Noah identifies the real hazard: ControlFileData is a structural mishmash of three lifetimes:

initdb-time fields (page size, WAL segment size, checksum version) — safe to read from a proto-copy forever.
postmaster-start-time fields — safe if the proto-copy is captured after the last such change.
changes-anytime fields (checkpoint redo, nextXid, etc.) — reading these from a stale proto-copy is a latent bug waiting to happen.

He proposes a middle path: use the proto-copy but poison the changes-anytime fields with bogus values so any accidental read of them crashes loudly rather than silently using stale data. He also sketches a more ambitious refactor: never let ControlFile point to anything but NULL or shmem, and split the struct by lifetime.
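The poisoning idea can be sketched like this. The sentinel value, the field set and all toy_ names are assumptions made for illustration; which fields would actually be poisoned, and with what values, is not specified here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sentinel. It is in-band, so a real value could in
 * principle collide with it; a production design would need more care. */
#define TOY_POISON_U64 UINT64_C(0xDEADBEEFDEADBEEF)

/* Toy stand-in for ControlFileData, grouped by field lifetime. */
typedef struct ToyControlFile {
    uint32_t blcksz;                 /* initdb-time: safe in a proto-copy */
    uint32_t data_checksum_version;  /* treated as stable in this sketch */
    uint64_t redo;                   /* changes-anytime */
    uint64_t next_xid;               /* changes-anytime */
} ToyControlFile;

/* Overwrites every changes-anytime field of a proto-copy with the sentinel. */
void toy_poison_volatile_fields(ToyControlFile *proto)
{
    proto->redo = TOY_POISON_U64;
    proto->next_xid = TOY_POISON_U64;
}

bool toy_is_poisoned(uint64_t value)
{
    return value == TOY_POISON_U64;
}

/* Accessor that fails loudly (in assert-enabled builds) when pointed at
 * a poisoned proto-copy instead of the live shared-memory struct. */
uint64_t toy_get_redo(const ToyControlFile *cf)
{
    assert(!toy_is_poisoned(cf->redo));
    return cf->redo;
}
```

Reading through toy_get_redo() works on the live struct and trips the assertion on a poisoned copy, which turns a silent stale read into an immediate, attributable failure. Code that reads the field directly would still get the sentinel rather than a crash, which is one reason the larger split-by-lifetime refactor is the stronger guarantee.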

Alexander Korotkov's Minimalist Counter-Proposal (May 2026)

Two years later, Korotkov observes that, empirically, the only field needed from ControlFileData before shmem attachment is data_checksum_version. His patch passes just that single value through BackendParameters, eliminating the proto-copy entirely and, with it, Noah's stale-read hazard class by construction. This is the tightest fix proposed in the thread: it leaves no proto-copy from which a stale field could later be read.

Alvaro Herrera, meanwhile, had rebased Thomas's original patch in February 2026 after noting that culicidae was still failing as recently as January 2026, which brought the thread back to life.

Why This Matters Architecturally

pg_control is the single most critical file in a PostgreSQL cluster. A corrupted CRC on restart means the cluster refuses to start: a hard outage that, if the file cannot be repaired or restored, leaves pg_resetwal as the last resort, with the risk of data loss that entails. The bugs here span the full stack of concurrency concerns PostgreSQL has to manage:

In-memory concurrency (Bug #1): classic LWLock discipline, solved with the standard pattern + a defensive assertion.
Filesystem atomicity (Bug #2, writer side): POSIX provides no real guarantee for concurrent read/write on regular files; ext4 and NTFS both violate the naive mental model. The durable fix (referenced elsewhere) is to write a temporary file and atomically rename it into place, so that readers outside the locking protocol, especially pg_basebackup, see either the old image or the new one.
Process model asymmetry (Bug #2, reader side): EXEC_BACKEND is a second-class citizen in PostgreSQL's concurrency model; code that works because of fork() memory inheritance needs explicit plumbing under EXEC_BACKEND. This is a recurring source of Windows-only bugs.
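The write-to-tempfile plus atomic rename pattern mentioned in the list above looks roughly like this on POSIX systems. This is a generic sketch, not the PostgreSQL change the summary alludes to: toy_atomic_replace() is a hypothetical name, the ".tmp" suffix is an arbitrary choice, and a durable version would also fsync the containing directory after the rename, which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replaces `path` with `len` bytes from `buf` so that a concurrent
 * reader opening `path` sees either the complete old contents or the
 * complete new contents. Returns 0 on success, -1 on failure. */
int toy_atomic_replace(const char *path, const void *buf, size_t len)
{
    char tmp[1024];
    const char *p = buf;
    size_t left = len;
    int n, fd;

    assert(path != NULL && buf != NULL);

    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t) n >= sizeof(tmp))
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until done. */
    while (left > 0) {
        ssize_t written = write(fd, p, left);

        if (written < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += written;
        left -= (size_t) written;
    }

    /* Flush the new contents before making them visible under `path`. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* rename() atomically swaps the directory entry on POSIX systems. */
    if (rename(tmp, path) != 0) {
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

The trade-off against an in-place overwrite is an extra file creation and rename per update, in exchange for readers never observing a half-written image. Replacing an existing file behaves differently on Windows, so this sketch should be read as POSIX-only.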

The six-year tail of this thread illustrates how hard it is to land back-patchable fixes when the clean fix requires reordering process startup, and how buildfarm flakiness (culicidae) is both the symptom that keeps the issue alive and the main forcing function.

Authority and Stance

Thomas Munro (committer, recovery/WAL expert): drove both fixes, owns the proto-controlfile proposal, has the deepest diagnostic credibility here.
Michael Paquier (committer): pushed for the UpdateControlFile() assertion as a durable invariant, the most lasting contribution to code hygiene in the 2020 round.
Andres Freund (committer, low-level I/O expert): reframed Bug #2 from a locking problem to a filesystem-atomicity problem, correcting Tom Lane's initial dismissal.
Noah Misch (committer): provided the most rigorous analysis of the hazard classes in the proto-copy approach.
Tom Lane (committer): initially under-estimated the torn-read problem but deferred to Andres's evidence.
Nathan Bossart (later became a committer): reporter and author of Bug #1's fix; pragmatic preference for "just take the lock."
Alexander Korotkov (committer): proposed the minimalist data_checksum_version-only fix in 2026 — likely the path forward because it collapses the design-space debate.

Open Question at Thread's End

The open question is whether to take Korotkov's surgical approach (pass only data_checksum_version), Thomas's proto-controlfile (with Noah's poisoning), or the larger refactor that splits ControlFileData by field lifetime. As of the last message, Korotkov's patch is the most recent concrete proposal and has the appealing property that it makes the stale-read hazard structurally impossible.

Latest Update