Technical Analysis: Exposing Checkpoint Timestamp and Duration in pg_stat_checkpointer
Core Problem
PostgreSQL's checkpointer process is critical to database reliability — it ensures dirty buffers are flushed to disk at regular intervals, bounding crash recovery time. However, monitoring checkpoint behavior has significant gaps:
-
Cumulative-only statistics:
pg_stat_checkpointerexposes only cumulative counters (write_time,sync_time,buffers_written, etc.). To determine the duration of any individual checkpoint, external tools must compute deltas between successive polls — a fragile approach when multiple checkpoints occur between sampling intervals. -
No checkpoint type visibility: The checkpoint completion log message (controlled by
log_checkpoints) doesn't indicate whether the checkpoint was triggered by timeout, WAL volume (immediate/wal), or manualCHECKPOINTcommand. This makes it harder to correlate write patterns with trigger reasons. -
No direct "last event" gauge: Unlike some other subsystems, there's no way to ask "how long did the most recent checkpoint take?" via SQL without parsing server logs.
Proposed Solution
The patch adds two new fields to pg_stat_checkpointer:
last_checkpoint_duration(originally namedcheckpoint_total_time): Duration in milliseconds of the most recently completed checkpoint. This is a gauge (overwritten each checkpoint), not a counter.last_checkpoint_time:TimestampTzof when the most recent checkpoint completed.
Additionally, the patch modifies checkpoint completion log messages to include the checkpoint type (immediate/timed/manual), making log-based tools like pgbadger able to categorize checkpoints by trigger reason.
Implementation Details
The changes touch:
src/backend/access/transam/xlog.c(LogCheckpointEnd): Stores the total duration and end timestamp intoPendingCheckpointerStats— the pending stats structure that gets flushed to shared memory via the stats subsystem.src/include/pgstat.h(PgStat_CheckpointerStats): Addslast_checkpoint_durationandlast_checkpoint_timefields to the stats structure.- Catalog changes: Extends the
pg_stat_checkpointerview definition to expose these new columns.
The key architectural decision is that these are gauge values (overwritten per checkpoint) rather than cumulative counters. This breaks the pattern used by all other fields in pg_stat_checkpointer and is the primary point of contention.
Key Technical Debate: Gauges vs. Counters in pg_stat Views
The Case For (Álvaro Herrera, Soumya)
- Monitoring systems already poll stats views periodically and store values externally. A gauge for "last checkpoint duration" gives them a concrete per-event signal they can archive.
- A sudden jump in
last_checkpoint_duration(e.g., 300ms → 5000ms) is an immediate red flag indicating I/O saturation, backend stalls, or WAL flush bottlenecks. This is hard to detect reliably from cumulative counters alone, especially when multiple checkpoints occur between polls. - Checkpoint timing is somewhat unique: a single unusually long checkpoint often indicates systemic problems (unlike, say, a single slow sequential scan).
The Case Against (Michael Banck)
- Missed checkpoints: If the monitoring system polls every 60 seconds but checkpoints happen every 20 seconds during a traffic spike, intermediate checkpoints are invisible — you only see the last one.
- Precedent concerns: PostgreSQL doesn't store "last duration" for any other operation. Why is checkpoint special? If this pattern is valuable, shouldn't it apply more broadly?
- Existing alternatives: Monitoring systems can already compute checkpoint durations from delta calculations on
write_time + sync_time. The new field offers marginal convenience rather than fundamentally new capability.
Álvaro's Deeper Insight on Duration Semantics
Álvaro Herrera raised an important point: raw checkpoint duration is often misleading because much of it is intentional sleep (governed by checkpoint_completion_target). A 5-minute checkpoint might be perfectly healthy if the checkpointer spent 4.5 minutes sleeping to spread I/O. What's actually useful is active time vs. sleep time — how much time was spent doing real work versus throttling. The existing write_time and sync_time partially capture this, but only cumulatively. This insight suggests the patch may not go far enough in providing actionable data.
Patch Quality Issues
- The initial submission was an incremental diff rather than a full patch (noted by Michael Banck).
- Formatting issues: unnecessary newlines,
stats_resetcolumn not kept last. - The logging change and stats view change were mixed in one patch; reviewers recommended splitting them.
- As of the last message, a rebased/updated patch had not been submitted.
Architectural Context
The pg_stat_checkpointer view was refactored in PostgreSQL 15 (split from pg_stat_bgwriter). It uses the pgstat shared memory stats infrastructure introduced in PG15, where pending stats accumulate in backend-local memory and get flushed to shared memory. The proposed change works within this framework — PendingCheckpointerStats accumulates the values in LogCheckpointEnd() and they propagate through the normal stats flush path.
The gauge vs. counter tension reflects a broader question about PostgreSQL's monitoring philosophy. The current design strongly favors cumulative counters that external systems can differentiate. Adding gauges creates a hybrid model that may confuse the semantics of stats reset operations.
Current Status
The thread went dormant after December 2025. Soumya attempted to revive it in May 2026 but received no responses. The patch needs:
- Rebase on current HEAD
- Split into two patches (logging change + stats view change)
- Stronger justification for why gauges are appropriate here, addressing the "missed checkpoints" problem
- Potentially incorporating Álvaro's suggestion about active vs. sleep time decomposition