relfilenode statistics

First seen: 2024-05-25 07:52:02+00:00 · Messages: 56 · Participants: 6

Latest Update

2026-05-20 · claude-opus-4-6

Relfilenode Statistics: Deep Technical Analysis

The Core Problem

PostgreSQL's statistics subsystem (pgstats) currently keys relation-level statistics by the relation's OID. This creates a fundamental architectural limitation: when writing buffers to disk, the system does not have access to the relation OID — it only knows the RelFileLocator (comprising dbOid, spcOid, and relNumber). This means:

  1. No per-relation write counters exist — While PostgreSQL tracks heap_blks_read and heap_blks_hit, there is no heap_blks_written because the buffer manager (bufmgr) operates at the relfilenode level, not the relation level.

  2. Statistics cannot be populated during WAL replay — The startup process during crash recovery has no database connection and cannot perform catalog lookups. It only knows relfilenode identifiers from WAL records. This means after a crash (or failover from a physical replica), critical statistics like n_dead_tup are completely lost, leaving autovacuum blind about which relations need attention.

  3. Abstraction layer mismatches — The buffer manager currently needs awareness of relation kinds (table vs. index) to route statistics, which violates the layering between the storage engine and the statistics subsystem.

Why This Matters Architecturally

Post-Crash/Failover Reliability

The most compelling argument (raised by Andres Freund) is reliability after crashes and failovers. Currently:

If stats are keyed by relfilenode, they can be incrementally rebuilt during WAL replay since WAL records contain RelFileLocator information. This would be a "huge step forward" for a more reliable stats system.

Enabling Write Tracking

DBAs currently cannot answer "which tables are being written most?" — a fundamental observability gap compared to read statistics. The relfilenode-keyed approach naturally enables this since checkpoint/bgwriter operations already know the relfilenode.

Proposed Solution & Design Evolution

Initial Approach (v1-v6, May-November 2024)

The original POC added a new PGSTAT_KIND_RELFILENODE alongside the existing PGSTAT_KIND_RELATION, with a new relfile field added to PgStat_HashKey. This created problems:

Current Approach (v7+, November 2025 onwards)

The design shifted to replacing the existing PGSTAT_KIND_RELATION keying mechanism. Instead of the relation OID, the hash key now uses:

This avoids adding any new field to PgStat_HashKey by leveraging the existing 8-byte objoid field (expanded in a prior commit b14e9ce7d55c).

Key Design Decisions

1. Partitioned Tables

Partitioned tables have no relfilenode (rd_rel->relfilenode == 0). The solution introduces PSEUDO_PARTITION_TABLE_SPCOID (value 1665, an existing pg_default tablespace OID that won't conflict) to construct a synthetic key using the partition's OID, ensuring uniqueness even under OID wraparound.

2. Handling Rewrites (TRUNCATE, CLUSTER, VACUUM FULL)

When a relation is rewritten, its relfilenode changes. The patch introduces PgStat_PendingRewrite — a per-transaction list tracking rewrite operations with:

At commit time, stats are copied from the original locator to the final locator. This is integrated with 2PC (pgstat_twophase_postcommit/postabort) and subtransaction abort handling.

3. Avoiding Extra Syscache Lookups

A new pgstat_fetch_stat_tabentry_by_locator() function allows do_autovacuum() to avoid redundant syscache lookups — critical since autovacuum iterates over every table in a database.

Uniqueness Concern

Andres identified that GetNewRelFileNumber() does NOT check for conflicts when creating a new relfilenode for an existing relation (only when the relfilenode is also the OID). The current solution relies on the composite key (spcOid + relNumber) making collisions astronomically unlikely in practice, though this point may need further hardening.

Key Technical Disagreements

Should Stats Be Copied Across Rewrites?

Bertrand (pro-copy):

Michael (questioning the need):

Andres (nuanced):

Should We Split PgStat_StatTabEntry Into Multiple Kinds?

Michael (pro-split):

Andres (against premature split):

Bertrand (incremental approach):

Current Status (as of v16, May 2026)

The patch set consists of:

  1. 0001 (committed by Michael): Rewrite-related stats tests to establish behavioral baseline
  2. 0002: Key PGSTAT_KIND_RELATION by RelFileLocator instead of relation OID
  3. 0003: Handle relation statistics correctly during rewrites (the PgStat_PendingRewrite mechanism)

The patch is awaiting commitment decision. Bertrand has explicitly asked whether Andres would drive it toward commit, and whether Michael would review despite preferring a different design. No definitive answer has been recorded in the thread.

Implications for Future Work

Once committed, this patch enables:

  1. Write counters (heap_blks_written) trackable per-relation
  2. WAL-replay stat population — startup process can increment stats during crash recovery
  3. Checksum error tracking per relfilenode
  4. Replication stat population — physical replicas can maintain accurate stats for post-failover autovacuum
  5. Future table/index stat split on a well-understood foundation