Monthly Summary: Online Data Checksums — May 2026
Overview
May 2026 saw continued post-commit stabilization of the online data checksums feature. The major commit had already landed, and this month's activity focused on identifying and fixing race conditions discovered through aggressive stress testing, plus routine patch pushes for minor follow-up fixes.
Key Developments
Promotion Race Condition Identified
Tomas Vondra identified a missing EmitProcSignalBarrier() call on the path where a standby is promoted while in the inprogress-on state. During promotion, StartupXLOG resets the checksum state to off via a direct XLogCtl update but never signals existing hot standby backends, leaving them with a stale LocalDataChecksumVersion. This is not a data corruption risk in practice (neither inprogress-on nor off triggers verification failures), but it violates the architectural invariant that all state changes flow through the barrier mechanism. The same issue exists on the inprogress-off → off path.
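The effect of the missing barrier can be illustrated with a small toy model. This is only a sketch under stated assumptions, not PostgreSQL code: the Cluster and Backend classes, the promote() helper, and the emit_barrier flag are hypothetical names invented for illustration, and the barrier is modelled as a simple synchronous loop over backends.

```python
from dataclasses import dataclass, field

# State names taken from the summary; everything else here is hypothetical.
OFF = "off"
INPROGRESS_ON = "inprogress-on"


@dataclass
class Backend:
    """Toy stand-in for a hot standby backend holding a cached local state."""
    local_version: str

    def absorb_barrier(self, shared_version: str) -> None:
        # On a barrier, the backend refreshes its cached copy.
        self.local_version = shared_version


@dataclass
class Cluster:
    """Toy stand-in for shared memory plus the set of connected backends."""
    shared_version: str
    backends: list = field(default_factory=list)

    def connect(self) -> Backend:
        # A new backend copies the shared state at connection time.
        backend = Backend(local_version=self.shared_version)
        self.backends.append(backend)
        return backend

    def set_state(self, new_version: str, emit_barrier: bool) -> None:
        self.shared_version = new_version
        if emit_barrier:
            # Assumption: the barrier reaches every backend synchronously.
            for backend in self.backends:
                backend.absorb_barrier(new_version)


def promote(cluster: Cluster, emit_barrier: bool) -> None:
    """Model promotion resetting an in-progress enable back to off."""
    if cluster.shared_version == INPROGRESS_ON:
        cluster.set_state(OFF, emit_barrier=emit_barrier)


def stale_backends(cluster: Cluster) -> list:
    """Return backends whose cached state differs from the shared state."""
    return [b for b in cluster.backends
            if b.local_version != cluster.shared_version]
```

In this model, promoting without the barrier leaves every already-connected backend holding inprogress-on while the shared state reads off; promoting with the barrier leaves none stale. The model says nothing about whether the stale value is harmful, which matches the report's assessment that the issue is an invariant violation rather than a corruption risk.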
Checkpointer/Worker Interleaving Non-Determinism
Tomas also reported that the checkpointer and datachecksum worker can interleave in ways that produce non-deterministic intermediate states. Specifically, checkpoint_redo can occur between XLogChecksums() WAL emission and the corresponding XLogCtl update, creating a window where WAL and shared memory disagree. A PoC fix using a new LWLock to serialize the critical section confirmed the hypothesis, though it was explicitly not proposed for commit (it caused deadlocks in some tests). No actual checksum failures or incorrect final states have been demonstrated from this race.
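The window described above can be made concrete by enumerating schedules in a toy model. This is a sketch under stated assumptions rather than the actual code path: the step names, the interleavings() and run() helpers, and the inprogress-on to on transition used as the example are all hypothetical, and the "serialized" flag only stands in for the idea of the PoC lock without modelling its deadlocks.

```python
# Hypothetical step names: the worker does two steps in order, and the
# checkpointer's redo point may fall before, between, or after them.
WORKER_STEPS = ("emit_wal", "update_shared")
CHECKPOINT = "checkpoint_redo"


def interleavings(serialized: bool):
    """Yield every schedule that keeps the worker's two steps in order.

    With serialized=True the two worker steps are treated as one critical
    section, so the checkpoint can no longer land between them.
    """
    yield (CHECKPOINT, *WORKER_STEPS)
    if not serialized:
        yield (WORKER_STEPS[0], CHECKPOINT, WORKER_STEPS[1])
    yield (*WORKER_STEPS, CHECKPOINT)


def run(schedule, old="inprogress-on", new="on"):
    """Replay a schedule; return (wal_state, shared_state) seen at checkpoint."""
    wal_state = old      # state implied by the WAL emitted so far
    shared_state = old   # state stored in the toy shared memory
    seen = None
    for step in schedule:
        if step == "emit_wal":
            wal_state = new
        elif step == "update_shared":
            shared_state = new
        elif step == CHECKPOINT:
            seen = (wal_state, shared_state)
    return seen


def disagreements(serialized: bool):
    """Return the schedules in which the checkpoint sees WAL != shared state."""
    result = []
    for schedule in interleavings(serialized):
        wal_state, shared_state = run(schedule)
        if wal_state != shared_state:
            result.append(schedule)
    return result
```

Of the three unserialized schedules, only the one where the checkpoint falls between the two worker steps observes a mismatch; with the steps serialized, no schedule does. The toy model covers only the intermediate observation, which is consistent with the report that no incorrect final state has been demonstrated.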
Minor Follow-up Patches Committed
Daniel Gustafsson pushed patches 0001-0003 (a small follow-up fix series) with minor tweaks, acknowledging contributions from two collaborators. The specific content of these patches was not detailed in the thread messages this month.
Open Questions
- The role of DELAY_CHKPT_START in the protocol remains not fully understood by at least one reviewer (Tomas), suggesting documentation or code comments may need improvement.
Status
The feature is in a post-commit hardening phase. Stress testing continues to uncover subtle race conditions in edge cases (promotion, crash recovery, checkpoint interleaving), but none identified this month represents a data corruption risk. The overall architecture (a state machine with barrier synchronization, XLogCtl as the source of truth, and checkpoint-record embedding for recovery consistency) remains sound.