Relfilenode Statistics: Deep Technical Analysis
The Core Problem
PostgreSQL's statistics subsystem (pgstats) currently keys relation-level statistics by the relation's OID. This creates a fundamental architectural limitation: when writing buffers to disk, the system does not have access to the relation OID — it only knows the RelFileLocator (comprising dbOid, spcOid, and relNumber). This means:
-
No per-relation write counters exist — While PostgreSQL tracks
heap_blks_readandheap_blks_hit, there is noheap_blks_writtenbecause the buffer manager (bufmgr) operates at the relfilenode level, not the relation level. -
Statistics cannot be populated during WAL replay — The startup process during crash recovery has no database connection and cannot perform catalog lookups. It only knows relfilenode identifiers from WAL records. This means after a crash (or failover from a physical replica), critical statistics like
n_dead_tupare completely lost, leaving autovacuum blind about which relations need attention. -
Abstraction layer mismatches — The buffer manager currently needs awareness of relation kinds (table vs. index) to route statistics, which violates the layering between the storage engine and the statistics subsystem.
Why This Matters Architecturally
Post-Crash/Failover Reliability
The most compelling argument (raised by Andres Freund) is reliability after crashes and failovers. Currently:
- After a crash,
n_dead_tupis zero for all relations - Autovacuum skips relations with no stats entirely
- Bloat accumulates silently until
pgstat_analyzehappens to detect dead tuples - Physical replicas promoted to primary have no vacuum-relevant statistics
If stats are keyed by relfilenode, they can be incrementally rebuilt during WAL replay since WAL records contain RelFileLocator information. This would be a "huge step forward" for a more reliable stats system.
Enabling Write Tracking
DBAs currently cannot answer "which tables are being written most?" — a fundamental observability gap compared to read statistics. The relfilenode-keyed approach naturally enables this since checkpoint/bgwriter operations already know the relfilenode.
Proposed Solution & Design Evolution
Initial Approach (v1-v6, May-November 2024)
The original POC added a new PGSTAT_KIND_RELFILENODE alongside the existing PGSTAT_KIND_RELATION, with a new relfile field added to PgStat_HashKey. This created problems:
- Two kinds tracking overlapping statistics
- Confusing user experience (stats split across two views)
- Extra field in the central hash key that most stat kinds don't need
Current Approach (v7+, November 2025 onwards)
The design shifted to replacing the existing PGSTAT_KIND_RELATION keying mechanism. Instead of the relation OID, the hash key now uses:
dboid←RelFileLocator.dbOidobjoid← A 64-bit value computed fromRelFileLocator.spcOid(upper 32 bits) andRelFileLocator.relNumber(lower 32 bits) viaRelFileLocatorToPgStatObjid()macro
This avoids adding any new field to PgStat_HashKey by leveraging the existing 8-byte objoid field (expanded in a prior commit b14e9ce7d55c).
Key Design Decisions
1. Partitioned Tables
Partitioned tables have no relfilenode (rd_rel->relfilenode == 0). The solution introduces PSEUDO_PARTITION_TABLE_SPCOID (value 1665, an existing pg_default tablespace OID that won't conflict) to construct a synthetic key using the partition's OID, ensuring uniqueness even under OID wraparound.
2. Handling Rewrites (TRUNCATE, CLUSTER, VACUUM FULL)
When a relation is rewritten, its relfilenode changes. The patch introduces PgStat_PendingRewrite — a per-transaction list tracking rewrite operations with:
old_locator: the previous relfilenodenew_locator: the new relfilenodeoriginal_locator: for rewrite chains (multiple rewrites in one transaction)
At commit time, stats are copied from the original locator to the final locator. This is integrated with 2PC (pgstat_twophase_postcommit/postabort) and subtransaction abort handling.
3. Avoiding Extra Syscache Lookups
A new pgstat_fetch_stat_tabentry_by_locator() function allows do_autovacuum() to avoid redundant syscache lookups — critical since autovacuum iterates over every table in a database.
Uniqueness Concern
Andres identified that GetNewRelFileNumber() does NOT check for conflicts when creating a new relfilenode for an existing relation (only when the relfilenode is also the OID). The current solution relies on the composite key (spcOid + relNumber) making collisions astronomically unlikely in practice, though this point may need further hardening.
Key Technical Disagreements
Should Stats Be Copied Across Rewrites?
Bertrand (pro-copy):
- Not copying would be a user-visible regression (all existing stats like
seq_scan,last_analyzewould vanish after TRUNCATE/CLUSTER) - A rewrite doesn't change dead tuple counts or insert-since-vacuum counts
- DBAs make decisions based on cumulative stats in
pg_stat_all_tables
Michael (questioning the need):
- The main motivation for relfilenode stats is WAL-replay recovery, where post-rewrite state is fundamentally different
relation_needs_vacanalyze()cares about fields that would naturally be zero after a rewrite- Not copying makes the code significantly simpler
Andres (nuanced):
- Simple copy or start-from-scratch are both acceptable
- Having both "prior relfilenodes" and "current relfilenode" counters isn't obviously useful given stats can be reset manually or lost on crash anyway
Should We Split PgStat_StatTabEntry Into Multiple Kinds?
Michael (pro-split):
- Separate Table, Index, and Relfilenode stat kinds from the start
- Cleaner separation of concerns
- Relfilenode kind would only contain what's needed for recovery:
dead_tuples,ins_since_vacuum,mod_since_analyze
Andres (against premature split):
- More fields should be populated during recovery than just those three (e.g.,
n_tup_ins,n_tup_upd,n_tup_del,n_live_tup) - The right split boundary isn't clear until WAL-replay population is actually implemented
- Splitting index stats off is orthogonal to the relfilenode keying question
Bertrand (incremental approach):
- Current patch is a contained step that doesn't preclude future splitting
- Splitting now is a much larger change with risk of drawing wrong boundaries
- Better to key by locator first, then split later when needs are clearer
Current Status (as of v16, May 2026)
The patch set consists of:
- 0001 (committed by Michael): Rewrite-related stats tests to establish behavioral baseline
- 0002: Key
PGSTAT_KIND_RELATIONbyRelFileLocatorinstead of relation OID - 0003: Handle relation statistics correctly during rewrites (the
PgStat_PendingRewritemechanism)
The patch is awaiting commitment decision. Bertrand has explicitly asked whether Andres would drive it toward commit, and whether Michael would review despite preferring a different design. No definitive answer has been recorded in the thread.
Implications for Future Work
Once committed, this patch enables:
- Write counters (
heap_blks_written) trackable per-relation - WAL-replay stat population — startup process can increment stats during crash recovery
- Checksum error tracking per relfilenode
- Replication stat population — physical replicas can maintain accurate stats for post-failover autovacuum
- Future table/index stat split on a well-understood foundation