Changing the state of data checksums in a running cluster

First seen: 2024-07-03 06:41:01+00:00 · Messages: 137 · Participants: 15

Latest Update

2026-05-07 · opus 4.7

Round Update: Minor Commit Acknowledgment

Single message from Daniel Gustafsson acknowledging a report/patch/testing from two unnamed contributors (referenced as "both of you") and confirming that patches 0001-0003 were pushed with minor tweaks. No technical discussion, no new design decisions, no controversy. This is a routine commit-acknowledgment message closing out a small follow-up fix series on top of the already-committed online checksums feature.

The content of those 0001-0003 patches is not described in this message, so no substantive technical analysis can be extracted from the current round alone.

History (1 prior analysis)
2026-05-06 · opus 4.7

Online Data Checksums: Architectural Analysis

1. The Core Problem

PostgreSQL's data checksums detect silent data corruption by storing a 16-bit checksum in every data page header. Historically, the feature could only be enabled at initdb time, burning the decision into pg_control permanently. The only way to change it afterwards was pg_checksums, a tool that requires the cluster to be shut down, which for large production databases can mean many hours of downtime.

The feature's goal: change pg_control-level cluster state (data_checksum_version) while the cluster is running and serving traffic, without corruption, without blocking concurrent writes indefinitely, and with enough robustness to survive crashes and replica restarts.

This is architecturally much harder than it sounds because:

  1. Checksums are verified on read and written on write. Both paths are hot and exist across every backend.
  2. The current state must be atomically observable by all backends: if any backend starts verifying before every backend is writing checksums, reads of not-yet-checksummed pages fail spuriously; if any backend stops writing checksums before every backend stops verifying, pages written without checksums will later fail verification.
  3. Hint-bit writes don't generate WAL but do update checksums. The interaction with full_page_writes, FSM, and VM forks (none of which are fully WAL-logged) is delicate.
  4. Standbys must arrive at the correct state through WAL replay alone, without direct user control.
  5. Backup/restore (basebackup, PITR) can straddle state transitions.

2. Architectural Design

State Machine

Four states replace the old binary on/off:

  • off — neither written nor verified
  • inprogress-on — written but NOT verified (transitioning from off to on)
  • on — written and verified
  • inprogress-off — still written but NOT verified (transitioning from on to off)

The intermediate states exist because of the read/write asymmetry. When enabling: every backend must be writing checksums before any backend verifies them, and every existing page must have been rewritten with a checksum before verification begins. When disabling: every backend must stop verifying before any backend is allowed to write a page without a checksum.

Synchronization via ProcSignalBarrier

The patch relies heavily on the ProcSignalBarrier infrastructure (invented partly to enable this work). Each state transition emits a barrier; all backends must absorb it before the next transition proceeds. A barrier-absorb function validates the local-to-global state transition using a declarative table of allowed (from, to) pairs.
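
To make this concrete, here is a minimal self-contained sketch of the state machine and a declarative transition table of the kind described above. All identifiers are modeled on the thread's description rather than copied from the committed code; the two hot-path predicates show the write/verify asymmetry the inprogress-* states exist to protect.

    #include <stdbool.h>

    typedef enum DataChecksumVersion
    {
        DATA_CHECKSUMS_OFF,
        DATA_CHECKSUMS_INPROGRESS_ON,   /* written, not yet verified */
        DATA_CHECKSUMS_ON,              /* written and verified */
        DATA_CHECKSUMS_INPROGRESS_OFF   /* still written, no longer verified */
    } DataChecksumVersion;

    /* Declarative table of legal (from, to) transitions. Same-state
     * "transitions" are legal so that a redundant barrier is harmless. */
    static const bool valid_transition[4][4] = {
        /* to:                off    inprog-on  on     inprog-off */
        /* from off        */ {true,  true,      false, false},
        /* from inprog-on  */ {true,  true,      true,  false},
        /* from on         */ {false, false,     true,  true},
        /* from inprog-off */ {true,  false,     false, true},
    };

    static DataChecksumVersion LocalDataChecksumVersion = DATA_CHECKSUMS_OFF;

    /* Barrier-absorb handler: validate, then adopt the new state. */
    static bool
    AbsorbChecksumsBarrier(DataChecksumVersion new_version)
    {
        if (!valid_transition[LocalDataChecksumVersion][new_version])
            return false;       /* the server would Assert/elog here */
        LocalDataChecksumVersion = new_version;
        return true;
    }

    /* Hot-path predicates: writing and verifying are asymmetric. */
    static bool
    DataChecksumsNeedWrite(void)
    {
        return LocalDataChecksumVersion != DATA_CHECKSUMS_OFF;
    }

    static bool
    DataChecksumsNeedVerify(void)
    {
        return LocalDataChecksumVersion == DATA_CHECKSUMS_ON;
    }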

Launcher/Worker Architecture

A "datachecksumsworker launcher" bgworker spawns a per-database worker. Each worker traverses every block of every relation, dirties the buffer, and lets the checkpointer/bgwriter compute and write checksums. It uses log_newpage_buffer() to WAL-log pages (generating massive WAL volume — a key operational cost).

Where the State Lives (the hardest part)

There are effectively four places tracking checksum state:

  1. LocalDataChecksumVersion — per-backend cache, updated via barriers
  2. XLogCtl->data_checksum_version — the authoritative current shmem value
  3. ControlFile->data_checksum_version — the durable value, lazily written
  4. The on-disk pg_control file

The patch went through a major redesign (March 2025, driven by Tomas Vondra's testing and Andres Freund's feedback) when it became clear that using ControlFile as the source of truth was fundamentally wrong. The control file represents "state at last checkpoint / safe recovery starting point" — not "current state." A replica could persist the control file ahead of WAL, then crash, then restart from an older redo point with a control file "from the future," tripping asserts and causing real corruption risks.

The fix: move the current value to XLogCtl, and only copy to ControlFile when a checkpoint naturally persists it. This matches the XLogCtl / ControlFile pattern used elsewhere.
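
Sketched as it might look inside xlog.c (the data_checksum_version fields and helper names are assumptions; XLogCtl, ControlFile, info_lck, ControlFileLock, and UpdateControlFile() are real xlog.c internals):

    /* Current state: always read and written under XLogCtl->info_lck. */
    static uint32
    GetCurrentDataChecksumVersion(void)
    {
        uint32  version;

        SpinLockAcquire(&XLogCtl->info_lck);
        version = XLogCtl->data_checksum_version;
        SpinLockRelease(&XLogCtl->info_lck);

        return version;
    }

    /* Durable state: refreshed only when a checkpoint persists it, the
     * same lazy pattern pg_control follows for its other fields. */
    static void
    PersistDataChecksumVersion(void)
    {
        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
        ControlFile->data_checksum_version = GetCurrentDataChecksumVersion();
        UpdateControlFile();
        LWLockRelease(ControlFileLock);
    }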

The PITR Problem and Andres's Checkpoint-Record Fix

A later insight from Andres Freund (Dec 2025) was that even the XLogCtl approach was flawed for PITR and basebackup: a base backup copies pg_control last, so redo starts with the final checksum state but replays XLOG_CHECKSUMS records from earlier states, tripping the state-machine assertions.

Andres's proposal, eventually implemented: embed the checksum state in XLOG_CHECKPOINT_REDO / XLOG_CHECKPOINT_SHUTDOWN records. Recovery starts from the checkpoint's checksum state, so the WAL stream is self-consistent regardless of what pg_control says. This eliminated the need for forced checkpoints/restartpoints on every state change — a major operational win, especially for sync replication where forcing restartpoints blocks redo.
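
The recovery-side consequence, sketched with a stand-in struct (the real CheckPoint payload lives in catalog/pg_control.h; the embedded field and startup hook are assumptions based on the thread):

    #include "postgres.h"
    #include "access/xlogdefs.h"

    /* Stand-in for the real CheckPoint record payload. */
    typedef struct CheckPointSketch
    {
        XLogRecPtr  redo;                   /* start of redo */
        uint32      data_checksum_version;  /* embedded checksum state */
    } CheckPointSketch;

    static uint32 replay_checksum_version;

    /*
     * Seed the state machine from the checkpoint record itself, never
     * from pg_control: a base backup copies pg_control last, so it can
     * describe a later state than the redo point, while the record is
     * always consistent with the WAL that follows it. Every subsequent
     * XLOG_CHECKSUMS record then replays as a legal transition.
     */
    static void
    StartupDataChecksums(const CheckPointSketch *checkPoint)
    {
        replay_checksum_version = checkPoint->data_checksum_version;
    }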

3. Key Technical Battles

Battle 1: Restartability (Historical)

Earlier versions supported resuming a partial checksum-enable across crashes. This ballooned complexity and repeatedly killed the patch. Daniel explicitly descoped this, and Bruce Momjian repeatedly voiced support: "saying 'yes' to every feature improvement can lead to failure." The committed version simply aborts back to off on any crash during transition.

Battle 2: Race in InitPostgres vs. State Changes

Tomas Vondra discovered (March 2025) a race: a backend could read XLogCtl->data_checksum_version before registering in ProcSignal, then miss the barrier, ending up with stale LocalDataChecksumVersion forever. This was a silent corruption bug — the backend would write pages without checksums in a cluster that believed checksums were on.

Fix: reorder InitPostgres so ProcSignalInit runs before InitLocalDataChecksumVersion, then tolerate seeing a redundant barrier for the initial value.
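
A sketch of the corrected ordering (simplified; ProcSignalInit() is the real registration call, though its argument list has varied across versions, and the local initializer is hypothetical):

    #include "postgres.h"
    #include "storage/procsignal.h"

    /* Hypothetical: snapshot the shared checksum state locally. */
    extern void InitLocalDataChecksumVersion(void);

    void
    backend_init_ordering_sketch(void)
    {
        /*
         * Register for ProcSignalBarrier FIRST: from here on, any
         * checksum state change is guaranteed to deliver a barrier to
         * this backend.
         */
        ProcSignalInit();

        /*
         * Only then read the shared value. A barrier for the very
         * state we just read may still arrive, so the absorb function
         * must accept a redundant same-state transition.
         */
        InitLocalDataChecksumVersion();
    }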

Battle 3: FSM/VM Forks

The VM and FSM forks are not fully WAL-logged. After crash+recovery, pages in these forks could have stale checksums. This manifested as recurring test failures and was eventually partially punted to a separate thread fixing VM WAL-logging. The FSM remains a known weakness (not a regression introduced by this patch).

Battle 4: Online Checkpoint Record Staleness (Post-Commit)

After commit, Tomas's aggressive TAP testing exposed that CreateCheckPoint was writing a stale data_checksum_version into XLOG_CHECKPOINT_ONLINE because the value was captured at checkpoint start but the record was emitted at the end, so a concurrent checksum transition could make redo attempt an illegal state change. Fix: don't update checksum state from online checkpoint records at all; only XLOG_CHECKPOINT_REDO and XLOG_CHECKSUMS records drive state.
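
The resulting redo rule, sketched (the XLOG_* info codes are real and defined in catalog/pg_control.h; the state variable and function shape are stand-ins):

    #include "postgres.h"
    #include "catalog/pg_control.h"

    static uint32 replay_checksum_version;

    static void
    absorb_checkpoint_checksum_state(uint8 info, uint32 record_version)
    {
        switch (info)
        {
            case XLOG_CHECKPOINT_SHUTDOWN:
            case XLOG_CHECKPOINT_REDO:
                /* Emitted at a well-defined point in the stream: trust it. */
                replay_checksum_version = record_version;
                break;

            case XLOG_CHECKPOINT_ONLINE:
                /*
                 * Deliberately ignored: the value was captured when the
                 * checkpoint started but the record is written at the
                 * end, so interleaved XLOG_CHECKSUMS records may have
                 * already moved the state machine past it.
                 */
                break;
        }
    }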

Battle 5: Duplicate Launcher Race

Ayush Tiwari found post-commit that two rapid pg_enable_data_checksums() calls could spawn two launchers; the losing launcher's on_shmem_exit cleanup handler would wipe the winning launcher's state. Fix: defer installing the exit handler until after winning the race.
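
A sketch of the fix (claim_launcher_slot() and launcher_cleanup() are hypothetical; on_shmem_exit() and proc_exit() are the real backend primitives):

    #include "postgres.h"
    #include "storage/ipc.h"

    static bool claim_launcher_slot(void);  /* hypothetical: atomically win or lose */

    static void
    launcher_cleanup(int code, Datum arg)
    {
        /* reset the shared launcher state */
    }

    void
    launcher_start_sketch(void)
    {
        /*
         * Before the fix, the cleanup handler was installed up here, so
         * a launcher that then LOST the race still ran it on exit and
         * wiped the winner's shared state.
         */
        if (!claim_launcher_slot())
            proc_exit(0);       /* lost the race: exit touching nothing */

        /* After the fix: only the proven winner installs the handler. */
        on_shmem_exit(launcher_cleanup, (Datum) 0);
    }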

Battle 6: Unlogged Relations

Satyanarayana Narlapuram found that log_newpage_buffer() was called unconditionally during the checksum-enable scan, producing WAL for unlogged relations, which by definition shouldn't emit WAL for user data. The fix preserved WAL-logging of the INIT_FORKNUM: recovery copies init forks over main forks to reset unlogged relations, so those pages must carry valid checksums on standbys and after crash recovery.
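
The condition can be sketched as follows (RelationNeedsWAL() and log_newpage_buffer() are real; the wrapper is illustrative):

    #include "postgres.h"
    #include "access/xloginsert.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"

    static void
    maybe_wal_log_page(Relation rel, ForkNumber forknum, Buffer buf)
    {
        /*
         * Unlogged relations must not emit WAL for user data, but their
         * init forks still must be logged: recovery copies the init fork
         * over the main fork to reset an unlogged relation, so those
         * pages need valid checksums on standbys and after recovery.
         */
        if (RelationNeedsWAL(rel) || forknum == INIT_FORKNUM)
            log_newpage_buffer(buf, false);
    }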

Battle 7: Vacuum Cost Delay Throttling Broken

Satya also discovered that the cost-delay throttling was broken post-commit: the worker wrote to the GUC-backed VacuumCostDelay but vacuum_delay_point() read the separately-maintained vacuum_cost_delay, kept in sync only by VacuumUpdateCosts() which the worker never called. A classic drift bug from refactoring the vacuum cost API during the long review cycle.
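
In sketch form (VacuumCostDelay, vacuum_cost_delay, VacuumUpdateCosts(), and vacuum_delay_point() are the real identifiers named above; the wrapper is illustrative):

    #include "postgres.h"
    #include "commands/vacuum.h"
    #include "miscadmin.h"

    static void
    worker_set_cost_delay(double cost_delay)
    {
        VacuumCostDelay = cost_delay;   /* the GUC-backed variable */

        /*
         * Without this call, vacuum_delay_point() keeps reading the
         * stale working copy vacuum_cost_delay and never throttles.
         */
        VacuumUpdateCosts();
    }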

4. Why This Matters Architecturally

Andres Freund's comment captures the broader value: "I think this is actually a good feature to build the infrastructure for features (i.e. dynamically reconfiguring the cluster while running) like this." The patch establishes a pattern for any future pg_control-level setting that needs to change online: state machine with inprogress-* states, ProcSignalBarrier for global synchronization, XLogCtl for current state, pg_control for durable state, checkpoint-record embedding for recovery consistency.

5. Testing Methodology

A notable feature of this thread is the unusually rigorous testing process. Tomas's stress tests (TAP + bash harnesses with pgbench, random restarts, fast/immediate shutdown mixes, primary+standby scenarios) repeatedly uncovered bugs that conventional review missed. Late in the cycle, ~80 injection points were added to enable deterministic interleaving tests. The post-commit TAP suite (tests 010-014) exercises concurrent state changes against checkpoints and crashes; it found several subtle race conditions that had shipped in the initial commit and were patched before the release freeze closed.