Analysis: Reintroducing Spread Checkpoints for Online Checksum Changes
Background and Architectural Context
PostgreSQL 18 introduced the ability to enable or disable data checksums on a running cluster via pg_enable_data_checksums() / pg_disable_data_checksums(). Previously, toggling checksums required pg_checksums on an offline cluster or an initdb-time decision. The online facility works by:
- Persisting a state transition in the control file (e.g., inprogress-on, on, inprogress-off, off).
- For enabling: scanning every relation, reading each page, computing the checksum, and writing it back (WAL-logged as an FPI so standbys/crash recovery learn the new checksum value).
- Forcing checkpoints at state transition boundaries so that the on-disk state is guaranteed to reflect the new checksum regime before moving to the next phase. Without these checkpoints, a crash could leave pages with old (now-invalid) checksums under a control-file state that demands checksum verification.
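The barrier role of these checkpoints can be sketched as a toy state machine. This is an illustrative model, not PostgreSQL source: the class, state strings, and counters are invented for the example, and a real checkpoint flushes dirty buffers rather than zeroing a counter.

```python
# Toy model of the enable path: the control-file state only advances
# past a phase boundary after a checkpoint has made the on-disk pages
# consistent with that state. All names here are illustrative.
class Cluster:
    def __init__(self):
        self.control_file_state = "off"
        self.dirty_pages = 0
        self.checkpoints = []

    def checkpoint(self, reason):
        # Stand-in for a real checkpoint: flush dirty buffers so disk
        # contents match what the control-file state promises.
        self.dirty_pages = 0
        self.checkpoints.append(reason)

    def enable_checksums(self):
        self.control_file_state = "inprogress-on"
        self.checkpoint("persist inprogress-on before rewrites begin")
        # Rewrite phase: every page gets a checksum, WAL-logged as an FPI.
        self.dirty_pages += 1000
        self.checkpoint("rewritten pages durable before declaring 'on'")
        self.control_file_state = "on"

c = Cluster()
c.enable_checksums()
print(c.control_file_state, len(c.checkpoints), c.dirty_pages)
```

Skipping either checkpoint in this model would let a crash land the cluster in a state that demands checksum verification while some on-disk pages still lack valid checksums.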
The checkpoints are therefore not optional — they are a correctness barrier between phases of the state machine. The only question is how aggressively they should be driven.
The Core Problem
In an earlier iteration of the patch series (v20251201, referenced via Daniel Gustafsson's message [1]), the API exposed a fast boolean parameter controlling whether the checkpoints issued during the transition used CHECKPOINT_FAST (equivalent to CHECKPOINT IMMEDIATE) or a spread checkpoint governed by checkpoint_completion_target. Somewhere between that version and what was committed, the fast parameter was dropped and the implementation hard-coded CHECKPOINT_FAST. Vondra notes there is no recorded rationale on the thread for this removal — it appears to have been either an oversight during a simplification pass or a deliberate minimization of surface area that was never revisited.
The hard-coded fast behavior is problematic for two reasons Vondra articulates:
1. Interaction with cost-based throttling of the rewrite phase
The online checksum worker honors vacuum_cost_limit / vacuum_cost_delay semantics, allowing operators to deliberately slow the page-rewrite phase to minimize I/O impact on production workloads. A fast (immediate) checkpoint at the end of such a throttled rewrite phase is architecturally inconsistent: the user has explicitly asked for low-impact background work, and the final checkpoint then dumps a large volume of dirty buffers (including all the just-rewritten pages) as fast as the I/O subsystem will accept them. On a large, busy system this single synchronous event can dwarf the disruption of the throttled rewrite itself and defeat the purpose of throttling.
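The vacuum-style cost model the worker honors can be sketched as follows. The loop structure and per-page cost are illustrative, not PostgreSQL source; only the GUC names (vacuum_cost_limit, vacuum_cost_delay) come from the text above.

```python
# Minimal sketch of cost-based throttling applied to the rewrite phase:
# accumulate a cost per page and sleep once the balance reaches the
# limit. The page cost of 10 is an arbitrary illustrative number.
def rewrite_with_throttling(n_pages, vacuum_cost_limit=200,
                            vacuum_cost_delay_ms=2, page_cost=10):
    cost_balance = 0
    sleeps = 0
    for _ in range(n_pages):
        # rewrite one page: read, set checksum, WAL-log the FPI ...
        cost_balance += page_cost
        if vacuum_cost_delay_ms > 0 and cost_balance >= vacuum_cost_limit:
            sleeps += 1          # a real worker would sleep here
            cost_balance = 0
    return sleeps

# 1000 pages at cost 10 each against a limit of 200: one nap every
# 20 pages, stretching the rewrite phase out deliberately.
print(rewrite_with_throttling(1000))  # → 50
```

The point of the inconsistency Vondra identifies is visible here: the operator tunes these knobs precisely to pace the I/O, and a hard-coded fast checkpoint then releases all of the paced-out dirty pages in one burst at the end.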
2. The disable path has no amortizing rewrite work
When disabling checksums, no pages are rewritten and nothing is WAL-logged beyond the control-file state change. The entire cost of the operation is effectively the forced checkpoint(s). Arguing "the checkpoint is a small fraction of the total work" — which may be defensible for enabling on a huge database because the WAL volume from FPIs dominates — simply does not apply here. Forcing a fast checkpoint on a busy production system merely to flip a control-file flag is gratuitously disruptive.
The Proposed Patch
The patch is described as "mostly extracted from v20251201": it re-adds a fast boolean parameter to both pg_enable_data_checksums() and pg_disable_data_checksums(), threading it down to the RequestCheckpoint() calls so that CHECKPOINT_FAST is set conditionally rather than unconditionally. This is a small, mechanical change — the machinery already existed and was removed; the patch restores it.
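The mechanical shape of the change can be sketched in miniature. CHECKPOINT_FAST and RequestCheckpoint are real PostgreSQL identifiers mentioned above, but the flag values and the Python wrapper function here are purely illustrative.

```python
# Sketch of the patch's mechanics: build the checkpoint request flags
# conditionally instead of hard-coding the fast bit. Flag values are
# made up for the example; PostgreSQL's actual constants differ.
CHECKPOINT_WAIT = 0x01   # caller blocks until the checkpoint completes
CHECKPOINT_FAST = 0x02   # skip checkpoint_completion_target spreading

def request_transition_checkpoint(fast: bool) -> int:
    flags = CHECKPOINT_WAIT           # transitions are always synchronous
    if fast:
        flags |= CHECKPOINT_FAST      # pre-patch: this was unconditional
    return flags

old_flags = request_transition_checkpoint(fast=True)
new_flags = request_transition_checkpoint(fast=False)
print(bool(old_flags & CHECKPOINT_FAST), bool(new_flags & CHECKPOINT_FAST))
```

Note that the wait bit is set in both cases: choosing a spread checkpoint changes how long the caller blocks, not whether it blocks.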
Key Design Questions Raised
Default value of fast
Vondra chose fast=true to match:
- PG19's general direction of defaulting to fast checkpoints for admin-triggered operations.
- The precedent of VACUUM, which does no throttling by default (cost-based delay is opt-in via vacuum_cost_delay > 0).
He suspects, however, that most production deployments would prefer fast=false. This is the classic tension between "safe/fast default for small systems and tests" versus "least-surprise for large production clusters where these functions are most likely to be invoked deliberately and carefully." No consensus is expressed in this single message; it is explicitly flagged as an open question.
TAP test coverage
Exercising fast=false in the test_checksums TAP suite would be valuable for coverage but painful for test runtime because spread checkpoints wait on checkpoint_timeout × checkpoint_completion_target. Vondra notes a workaround — aggressively lowering checkpoint_timeout in the test cluster (as he did in the TAP tests in [2]) — but is unconvinced it is worth the complexity, especially since the checkpoints remain synchronous from the caller's perspective regardless of speed.
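A back-of-envelope calculation shows why the runtime cost is real. A spread checkpoint paces its writes over roughly checkpoint_timeout × checkpoint_completion_target; the numbers below use PostgreSQL's defaults (300 s and 0.9), and the lowered timeout is an illustrative test-cluster setting, not a value taken from the patch.

```python
# Approximate wall-clock duration of a spread checkpoint: writes are
# paced to finish within this fraction of the checkpoint interval.
def spread_seconds(checkpoint_timeout_s, completion_target):
    return checkpoint_timeout_s * completion_target

print(spread_seconds(300, 0.9))  # defaults: each transition waits ~270 s
print(spread_seconds(30, 0.9))   # lowered timeout for a TAP cluster: ~27 s
```

With multiple state transitions per test, minutes of waiting per checkpoint quickly dominates suite runtime, which is the trade-off Vondra weighs against the added test-cluster configuration complexity.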
Technical Insights and Implications
- Checkpoints as correctness barriers, not just performance events. The online checksum state machine relies on checkpoints to serialize the visibility of control-file state transitions against the durability of on-disk page contents. Any discussion of "fast vs. spread" is purely about how the barrier is erected, not whether it is.
- Asymmetry between enable and disable. The argument "FPI volume dwarfs checkpoint cost" is a property of the enable path only. Baking that assumption into the shared API by hard-coding CHECKPOINT_FAST ignores the disable path entirely. This is a good example of why API decisions should be driven by the worst-case caller, not the dominant one.
- Consistency with throttling philosophy. PostgreSQL has a well-established pattern (VACUUM, autovacuum, the checksum worker itself) of offering cost-based throttling for long-running maintenance. Removing the spread-checkpoint option creates an unthrottlable tail event at the end of an otherwise throttleable operation, a violation of that pattern.
- Minor API surface, significant operational impact. The patch is small, but giving DBAs the ability to say "do this slowly" on a production cluster is exactly the kind of knob that determines whether a feature is usable in production at all. Features that must be invoked during maintenance windows because they cannot be throttled end-to-end are substantially less valuable than those that can run alongside live traffic.
Participant Dynamics
This is a single-message thread (as presented) authored by Tomas Vondra, a major committer with deep history in checkpoint, WAL, and storage work. He is essentially self-auditing a regression introduced between patch versions of a feature he was involved in reviewing, and proposing a restoration. The tone ("I don't buy that, for two reasons") indicates he is pushing back against an argument made earlier in the original thread — likely by whoever justified dropping the fast parameter — but he does not name that person and does not have a recorded counter-argument to quote.
The referenced message [1] points to Daniel Gustafsson (yesql.se), the primary author/maintainer of the online-checksums feature, whose response to this patch will likely determine its fate.