2026-05-20 · claude-opus-4-6

Relfilenode Statistics: Deep Technical Analysis

The Core Problem

PostgreSQL's statistics subsystem (pgstats) currently keys relation-level statistics by the relation's OID. This creates a fundamental architectural limitation: when writing buffers to disk, the system does not have access to the relation OID — it only knows the RelFileLocator (comprising dbOid, spcOid, and relNumber). This means:

No per-relation write counters exist — While PostgreSQL tracks heap_blks_read and heap_blks_hit, there is no heap_blks_written because the buffer manager (bufmgr) operates at the relfilenode level, not the relation level.
Statistics cannot be populated during WAL replay — The startup process during crash recovery has no database connection and cannot perform catalog lookups. It only knows relfilenode identifiers from WAL records. This means after a crash (or failover from a physical replica), critical statistics like n_dead_tup are completely lost, leaving autovacuum blind about which relations need attention.
Abstraction layer mismatches — The buffer manager currently needs awareness of relation kinds (table vs. index) to route statistics, which violates the layering between the storage engine and the statistics subsystem.

Why This Matters Architecturally

Post-Crash/Failover Reliability

The most compelling argument (raised by Andres Freund) is reliability after crashes and failovers. Currently:

After a crash, n_dead_tup is zero for all relations
Autovacuum skips relations with no stats entirely
Bloat accumulates silently until pgstat_analyze happens to detect dead tuples
Physical replicas promoted to primary have no vacuum-relevant statistics

If stats are keyed by relfilenode, they can be incrementally rebuilt during WAL replay since WAL records contain RelFileLocator information. This would be a "huge step forward" for a more reliable stats system.

Enabling Write Tracking

DBAs currently cannot answer "which tables are being written most?" — a fundamental observability gap compared to read statistics. The relfilenode-keyed approach naturally enables this since checkpoint/bgwriter operations already know the relfilenode.

Proposed Solution & Design Evolution

Initial Approach (v1-v6, May-November 2024)

The original POC added a new PGSTAT_KIND_RELFILENODE alongside the existing PGSTAT_KIND_RELATION, with a new relfile field added to PgStat_HashKey. This created problems:

Two kinds tracking overlapping statistics
Confusing user experience (stats split across two views)
Extra field in the central hash key that most stat kinds don't need

Current Approach (v7+, November 2025 onwards)

The design shifted to replacing the existing PGSTAT_KIND_RELATION keying mechanism. Instead of the relation OID, the hash key now uses:

dboid ← RelFileLocator.dbOid
objoid ← A 64-bit value computed from RelFileLocator.spcOid (upper 32 bits) and RelFileLocator.relNumber (lower 32 bits) via RelFileLocatorToPgStatObjid() macro

This avoids adding any new field to PgStat_HashKey by leveraging the existing 8-byte objoid field (expanded in a prior commit b14e9ce7d55c).

Key Design Decisions

1. Partitioned Tables

Partitioned tables have no relfilenode (rd_rel->relfilenode == 0). The solution introduces PSEUDO_PARTITION_TABLE_SPCOID (value 1665, an existing pg_default tablespace OID that won't conflict) to construct a synthetic key using the partition's OID, ensuring uniqueness even under OID wraparound.

2. Handling Rewrites (TRUNCATE, CLUSTER, VACUUM FULL)

When a relation is rewritten, its relfilenode changes. The patch introduces PgStat_PendingRewrite — a per-transaction list tracking rewrite operations with:

old_locator: the previous relfilenode
new_locator: the new relfilenode
original_locator: for rewrite chains (multiple rewrites in one transaction)

At commit time, stats are copied from the original locator to the final locator. This is integrated with 2PC (pgstat_twophase_postcommit/postabort) and subtransaction abort handling.

3. Avoiding Extra Syscache Lookups

A new pgstat_fetch_stat_tabentry_by_locator() function allows do_autovacuum() to avoid redundant syscache lookups — critical since autovacuum iterates over every table in a database.

Uniqueness Concern

Andres identified that GetNewRelFileNumber() does NOT check for conflicts when creating a new relfilenode for an existing relation (only when the relfilenode is also the OID). The current solution relies on the composite key (spcOid + relNumber) making collisions astronomically unlikely in practice, though this point may need further hardening.

Key Technical Disagreements

Should Stats Be Copied Across Rewrites?

Bertrand (pro-copy):

Not copying would be a user-visible regression (all existing stats like seq_scan, last_analyze would vanish after TRUNCATE/CLUSTER)
A rewrite doesn't change dead tuple counts or insert-since-vacuum counts
DBAs make decisions based on cumulative stats in pg_stat_all_tables

Michael (questioning the need):

The main motivation for relfilenode stats is WAL-replay recovery, where post-rewrite state is fundamentally different
relation_needs_vacanalyze() cares about fields that would naturally be zero after a rewrite
Not copying makes the code significantly simpler

Andres (nuanced):

Simple copy or start-from-scratch are both acceptable
Having both "prior relfilenodes" and "current relfilenode" counters isn't obviously useful given stats can be reset manually or lost on crash anyway

Should We Split PgStat_StatTabEntry Into Multiple Kinds?

Michael (pro-split):

Separate Table, Index, and Relfilenode stat kinds from the start
Cleaner separation of concerns
Relfilenode kind would only contain what's needed for recovery: dead_tuples, ins_since_vacuum, mod_since_analyze

Andres (against premature split):

More fields should be populated during recovery than just those three (e.g., n_tup_ins, n_tup_upd, n_tup_del, n_live_tup)
The right split boundary isn't clear until WAL-replay population is actually implemented
Splitting index stats off is orthogonal to the relfilenode keying question

Bertrand (incremental approach):

Current patch is a contained step that doesn't preclude future splitting
Splitting now is a much larger change with risk of drawing wrong boundaries
Better to key by locator first, then split later when needs are clearer

Current Status (as of v16, May 2026)

The patch set consists of:

0001 (committed by Michael): Rewrite-related stats tests to establish behavioral baseline
0002: Key PGSTAT_KIND_RELATION by RelFileLocator instead of relation OID
0003: Handle relation statistics correctly during rewrites (the PgStat_PendingRewrite mechanism)

The patch is awaiting commitment decision. Bertrand has explicitly asked whether Andres would drive it toward commit, and whether Michael would review despite preferring a different design. No definitive answer has been recorded in the thread.

Implications for Future Work

Once committed, this patch enables:

Write counters (heap_blks_written) trackable per-relation
WAL-replay stat population — startup process can increment stats during crash recovery
Checksum error tracking per relfilenode
Replication stat population — physical replicas can maintain accurate stats for post-failover autovacuum
Future table/index stat split on a well-understood foundation

relfilenode statistics

Latest Update