log_checkpoints: count WAL segment creations from all processes

First seen: 2026-03-23 07:39:51+00:00 · Messages: 8 · Participants: 4

Latest Update

2026-05-18 · claude-opus-4-6

Deep Technical Analysis: Fixing WAL Segment Creation Accounting in log_checkpoints

The Core Problem

PostgreSQL's log_checkpoints output includes a field "WAL file(s) added" (internally ckpt_segs_added) that is supposed to tell operators how many new WAL segments were created during a checkpoint cycle. However, this counter has been fundamentally misleading: it only counts segments created via PreallocXlogFiles() — the preallocation path that runs at the end of a checkpoint. It completely misses WAL segments created through other code paths:

  1. Backend WAL growth — when backends extend the WAL by writing new records that cross segment boundaries, new segments are created via XLogFileInitInternal().
  2. Walreceiver — on standbys, the WAL receiver creates segments as it writes incoming WAL data.
  3. Timeline initialization — during timeline switches (e.g., promotion), XLogFileCopy() creates new segments.

On a write-heavy system, this means the reported value can show 0 or 1 even when dozens or hundreds of new WAL segments were actually created during the checkpoint interval. This has been a known deficiency, documented on Andres Freund's "Desired Changes" wiki page, making it a community-recognized issue.

The architectural significance is that log_checkpoints is a primary observability tool for DBAs monitoring WAL volume, storage consumption, and replication lag. An inaccurate counter here directly undermines capacity planning and incident response.

Proposed Solution Architecture

Shared-Memory Atomic Counter

The patch introduces a new pg_atomic_uint64 field, walSegmentsCreated, into the XLogCtlData shared memory structure. This is the central WAL control structure visible to all processes, making it the natural home for a cross-process counter.

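The choice of pg_atomic_uint64 is deliberate and architecturally sound: any process that creates a segment can increment the counter without taking a lock, and a 64-bit cumulative value will not wrap in practice.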

Instrumentation Points

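The counter is incremented in two specific functions. Going by the creation paths listed under "The Core Problem", these are XLogFileInitInternal() and XLogFileCopy().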

This covers all WAL segment creation paths in the system, replacing the narrow PreallocXlogFiles()-only counting.
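
The counter and its instrumentation can be sketched in portable C11. This is a minimal illustration under stated assumptions, not the patch itself: it uses <stdatomic.h> in place of PostgreSQL's pg_atomic_uint64 API, and WalCtlSketch, sketch_segment_init() and sketch_segment_copy() are hypothetical stand-ins for XLogCtlData and the instrumented creation functions.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared WAL control structure. */
typedef struct WalCtlSketch
{
    _Atomic uint64_t segments_created;  /* cumulative, never reset */
} WalCtlSketch;

static void
sketch_ctl_init(WalCtlSketch *ctl)
{
    atomic_init(&ctl->segments_created, 0);
}

/* Lock-free increment, callable from any process that created a segment. */
static void
sketch_count_segment(WalCtlSketch *ctl)
{
    atomic_fetch_add(&ctl->segments_created, 1);
}

/* Hypothetical creation path: a brand-new, zero-filled segment. */
static void
sketch_segment_init(WalCtlSketch *ctl)
{
    /* ... create and install the segment file (omitted) ... */
    sketch_count_segment(ctl);
}

/* Hypothetical creation path: a segment copied during a timeline switch. */
static void
sketch_segment_copy(WalCtlSketch *ctl)
{
    /* ... copy the source segment into a new file (omitted) ... */
    sketch_count_segment(ctl);
}

/* Read the cumulative total, e.g. from the checkpointer. */
static uint64_t
sketch_segments_created(WalCtlSketch *ctl)
{
    return atomic_load(&ctl->segments_created);
}
```

Funnelling both creation paths through one increment helper means the counter cannot drift between paths, which is the property the old PreallocXlogFiles()-only accounting lacked.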

Delta-Based Reporting

Rather than resetting the counter at each checkpoint (which would introduce a race condition), the patch uses a delta approach:

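/* Snapshot the cumulative counter; other processes may keep incrementing it. */
current = pg_atomic_read_u64(&XLogCtl->walSegmentsCreated);
/* Report only the segments created since the previous checkpoint. */
CheckpointStats.ckpt_segs_added = (int)
    (current - XLogCtl->walSegsCreatedLastCheckpoint);
/* Advance the baseline for the next checkpoint. */
XLogCtl->walSegsCreatedLastCheckpoint = current;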

The walSegsCreatedLastCheckpoint field stores the baseline value. This is a non-atomic regular field, which is safe because only the checkpointer process reads and writes it (checkpoints are serialized). The baseline starts at 0, meaning the first checkpoint after startup captures all segments created during recovery, including timeline-initialization segments created before the end-of-recovery checkpoint is requested.
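
The delta approach can be isolated in a few lines of plain C. This is a sketch under stated assumptions rather than the patch's code: CheckpointBaselineSketch and sketch_take_delta() are hypothetical names, and the cumulative value is passed in as an ordinary uint64_t instead of being read from shared memory.

```c
#include <stdint.h>

/* Hypothetical holder for the checkpointer-private baseline. */
typedef struct CheckpointBaselineSketch
{
    uint64_t last_seen;  /* cumulative counter value at the previous checkpoint */
} CheckpointBaselineSketch;

/*
 * Return how many segments were created since the previous call and
 * advance the baseline. The cumulative counter itself is never written,
 * so increments made concurrently by other processes are never lost;
 * they are simply picked up by the next call. Unsigned subtraction also
 * stays correct if the cumulative counter ever wraps around.
 */
static uint64_t
sketch_take_delta(CheckpointBaselineSketch *baseline, uint64_t current)
{
    uint64_t delta = current - baseline->last_seen;

    baseline->last_seen = current;
    return delta;
}
```

A reset-based design would have to read the counter and then zero it, and any increment landing between those two steps would vanish. With a baseline, the only write is to a field owned by a single process.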

Semantic Change

The patch explicitly changes the meaning of ckpt_segs_added: from "segments preallocated by the checkpointer" to "new WAL segments created since the previous successful checkpoint or restartpoint, by any process." This is a behavioral change but a correct one — the old semantics were misleading. The DTrace probe TRACE_POSTGRESQL_CHECKPOINT_DONE argument arg2 carries the new semantics with unchanged arity, maintaining binary compatibility.

Key Technical Debates

Integer Overflow Concern

Japin Li raised the question of whether the cast from the uint64 delta to int could overflow. The patch author argued that overflow is practically impossible (2^31 segments at 16 MB each is 2^55 bytes, or 32 PiB of WAL in a single checkpoint interval), but added comments documenting the threshold. Huseyin Demir later correctly pointed out an inconsistency: the comments mention a 2^32 overflow threshold, but since ckpt_segs_added is int (signed 32-bit), the actual overflow threshold should be 2^31. This is a minor documentation bug but reflects careful review attention to numeric precision.
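
The threshold arithmetic is easy to check mechanically. The sketch below assumes the default 16 MiB segment size and a 32-bit int. Both helper names are hypothetical, and the saturating cast is shown only as one possible guard, not something the patch is known to contain.

```c
#include <limits.h>
#include <stdint.h>

/* Assumed default WAL segment size: 16 MiB = 2^24 bytes. */
#define SKETCH_WAL_SEGMENT_BYTES (UINT64_C(16) * 1024 * 1024)

/* Bytes of WAL represented by a given number of segments. */
static uint64_t
sketch_wal_bytes(uint64_t segments)
{
    return segments * SKETCH_WAL_SEGMENT_BYTES;
}

/*
 * Hypothetical saturating cast: clamp the uint64 delta to INT_MAX
 * instead of letting an oversized value be mangled by the conversion.
 */
static int
sketch_delta_to_int(uint64_t delta)
{
    return delta > (uint64_t) INT_MAX ? INT_MAX : (int) delta;
}
```

With these definitions, sketch_wal_bytes(2^31) is 2^55 bytes, which is 32 PiB, so a signed 32-bit result first becomes unrepresentable at 2^31 segments rather than 2^32.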

Atomicity of walSegsCreatedLastCheckpoint

Demir raised whether walSegsCreatedLastCheckpoint should also be atomic. Currently it doesn't need to be because only the checkpointer reads/writes it, but if it were ever exposed via a view (e.g., pg_stat_checkpointer), concurrent readers would need atomic access. This is a forward-looking design consideration — the current implementation is correct but not future-proofed for that specific extension.

Exposing to pg_stat_checkpointer

Demir proposed adding a wal_segments_created column to the pg_stat_checkpointer view, providing a companion patch. This is a natural extension: pg_stat_checkpointer already exposes other checkpoint-related statistics, and a cumulative WAL segments created counter would be valuable for monitoring dashboards and capacity planning without requiring log parsing. This would likely need the atomicity change mentioned above, since queries against the view run in backend processes, not in the checkpointer.

Testing

Zsolt Parragi's request for tests prompted the addition of test coverage in v2. Demir further suggested testing with wal_level = minimal, which exercises different WAL writing paths and could affect segment creation patterns. This reflects the principle that WAL-related changes need testing across multiple WAL configurations.

Documentation Cleanup

Japin Li identified unnecessary changes to the DTrace probe documentation and an indentation issue, both cleaned up in v3. This is typical of the community's attention to minimizing patch footprint — changes should be strictly relevant to the functional goal.

Patch Evolution

The patch has gone through four revisions with constructive review but no fundamental design objections, suggesting the approach is sound. The main open question is whether to expand scope to include the pg_stat_checkpointer view exposure (Demir's suggestion).

Architectural Implications

This change is small in code footprint but significant in observability accuracy. It establishes a pattern of using shared-memory atomic counters for cross-process WAL statistics that could be extended to other metrics (e.g., WAL segments recycled, removed). The delta-based reporting against a checkpoint-local baseline is a clean pattern that avoids both lock contention and counter-reset races.

The change also touches several observability surfaces: the log_checkpoints log output and the DTrace probe interface directly (both through ckpt_segs_added), and the pg_stat_checkpointer system view if Demir's proposed column is adopted. All of them benefit from accurate data.