Add pg_stat_vfdcache view for VFD cache statistics

First seen: 2026-03-21 16:58:36+00:00 · Messages: 13 · Participants: 4

Latest Update

2026-05-06

Overview

This thread proposes adding observability into PostgreSQL's Virtual File Descriptor (VFD) cache — a per-backend LRU-managed cache of open OS file descriptors bounded by max_files_per_process (default 1000). The patch, authored by Ayoub Kazar and inspired by a blog post from Tomas Vondra, introduces a pg_stat_vfdcache view exposing hit/miss counters so DBAs can tell whether the VFD cache is thrashing and whether max_files_per_process needs to be raised.

The Core Architectural Problem

The VFD layer in fd.c exists because PostgreSQL backends can easily need more open file descriptors than the OS ulimit allows (one relation = 1 fd per 1 GB segment × forks; a partitioned table with 1500 children easily blows past 1000 fds). When the LRU is full, the VFD layer closes the OS fd but keeps the metadata entry; the next access calls open() again, a hot-path syscall.
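
A minimal sketch of that hit/miss behavior, loosely modeled on fd.c's FileAccess()/LruInsert() pair (heavily simplified: the real code also maintains the LRU ring and, on EMFILE, releases another entry and retries):

    /* Simplified sketch of the VFD hit/miss path; the field names follow
     * fd.c's Vfd struct, but all LRU bookkeeping is omitted. */
    #include <fcntl.h>

    #define VFD_CLOSED (-1)

    typedef struct Vfd
    {
        int   fd;           /* OS fd, or VFD_CLOSED after LRU eviction */
        char *fileName;     /* survives eviction so the file can be reopened */
        int   fileFlags;    /* open(2) flags to reopen with */
        int   fileMode;     /* open(2) mode to reopen with */
    } Vfd;

    static int
    vfd_access(Vfd *vfdP)
    {
        if (vfdP->fd == VFD_CLOSED)
        {
            /* miss: LRU pressure closed our OS fd; reopen on the hot path */
            vfdP->fd = open(vfdP->fileName, vfdP->fileFlags, vfdP->fileMode);
            if (vfdP->fd < 0)
                return -1;  /* real code would release an LRU entry and retry */
        }
        /* hit: the fd is still open, no syscall needed */
        return vfdP->fd;
    }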

The architectural pain points the thread surfaces:

  1. No introspection. There is currently no way to observe cache pressure. Tomas's blog claimed 4–5× throughput improvements from simply raising max_files_per_process, but operators have no signal to know when to do so.
  2. Unbounded cache metadata. While the number of open OS fds is bounded by max_files_per_process, the number of VFD cache entries (SizeVfdCache) is not. David Geier reports a production system with ~300,000 files and thousands of backends burning ~30 GiB on VFD metadata alone. At sizeof(Vfd) ≈ 56–80 bytes plus a pstrdup()'d fileName of 10–55 bytes, this adds up (see the arithmetic after this list).
  3. No shrinking. The cache is direct-indexed by VFD index (File is a 4-byte int into VfdCache[]), so it can only be truncated down to the maximum live index, never compacted after a transient spike.
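
A back-of-the-envelope check of David's figure, assuming roughly 100 bytes per entry (a 56–80-byte struct plus a 10–55-byte name and its palloc chunk header) and on the order of a thousand backends:

    300,000 entries × ~100 B   ≈ 30 MB of VFD metadata per backend
    30 MB × ~1,000 backends    ≈ 30 GiB cluster-wide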

Design Decisions and Tradeoffs

Per-backend vs. global statistics

Jakub Wartak's first response reframed the patch's usefulness: local-only counters are not actionable because applications won't query them. The stats must live in shared memory (cumulative pgstat infrastructure) so an operator can see the global picture across all backends. Kazar pivoted to shared stats in v2, following the pattern of pgstat_bgwriter / pgstat_checkpointer: accumulate locally in PendingVfdCacheStats, then flush to a shared PgStatShared_VfdCache struct protected by an LWLock.
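
The shape of that pattern, as a hedged sketch: PendingVfdCacheStats and PgStatShared_VfdCache are the names used in the thread, while the fields and the flush function around them are assumptions modeled on the existing bgwriter/checkpointer stats code:

    #include <string.h>

    /* Stand-ins for PostgreSQL's LWLock API (storage/lwlock.h). */
    typedef struct LWLock LWLock;
    extern void LWLockAcquire(LWLock *lock, int mode);
    extern void LWLockRelease(LWLock *lock);
    #define LW_EXCLUSIVE 1

    typedef struct PgStat_VfdCacheStats
    {
        long hits;
        long misses;
    } PgStat_VfdCacheStats;

    /* Shared-memory side: one global struct protected by an LWLock. */
    typedef struct PgStatShared_VfdCache
    {
        LWLock              *lock;
        PgStat_VfdCacheStats stats;
    } PgStatShared_VfdCache;

    /* Local side: counters bumped lock-free on the fd.c hot path. */
    static PgStat_VfdCacheStats PendingVfdCacheStats;

    /* Flush callback: fold pending counts into shared memory, then reset. */
    static void
    pgstat_vfdcache_flush(PgStatShared_VfdCache *shared)
    {
        LWLockAcquire(shared->lock, LW_EXCLUSIVE);
        shared->stats.hits   += PendingVfdCacheStats.hits;
        shared->stats.misses += PendingVfdCacheStats.misses;
        LWLockRelease(shared->lock);

        memset(&PendingVfdCacheStats, 0, sizeof(PendingVfdCacheStats));
    }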

Tomas pushed this further in review: global and per-backend stats are qualitatively different features and should be split. Global hit/miss is "virtually free" and clearly useful for tuning max_files_per_process — possibly PG19 material. Per-backend stats (identifying which backend is thrashing) have higher overhead and a weaker cost/benefit case. Kazar accepted the split: v3-0001 = global only, 0002 = per-backend plus cache-size reporting.

What counters to expose

Tomas questioned whether evictions is meaningful separate from misses, since in steady state the two converge. Kazar agreed and dropped it, leaving just the hit and miss counters as the view's final shape.

Tomas suggested pgstat_count_vfd_access(bool hit) as a cleaner single entry point at the top of the access path, called as pgstat_count_vfd_access(!FileIsNotOpen(file)), rather than two separate hit/miss macros wrapped in if/else.
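
In sketch form (FileIsNotOpen() stands in for the real fd.c macro; the pending counters follow the flush sketch above):

    #include <stdbool.h>

    typedef int File;                       /* fd.c: index into VfdCache[] */
    extern bool FileIsNotOpen(File file);   /* stand-in for the fd.c macro */

    static struct { long hits; long misses; } PendingVfdCacheStats;

    static inline void
    pgstat_count_vfd_access(bool hit)
    {
        if (hit)
            PendingVfdCacheStats.hits++;
        else
            PendingVfdCacheStats.misses++;
    }

    /* One call at the top of the access path, before any reopen happens: */
    static void
    vfd_access_example(File file)
    {
        pgstat_count_vfd_access(!FileIsNotOpen(file));
        /* ... reopen on miss, then perform the I/O ... */
    }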

Exposing max_safe_fds

max_safe_fds (computed by postmaster, often < max_files_per_process because of ulimit probing) is the real operational ceiling. Kazar exposed it via the view; David countered that a PGC_INTERNAL GUC is more idiomatic for static read-only internal values (e.g., server_version_num). The comment on max_safe_fds currently advertises it as exposed only for save/restore_backend_variables(), so reading it from pgstat code is a slight layering violation that warrants updating the comment.
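
What David's suggestion might look like as an entry in guc_tables.c's ConfigureNamesInt[] array, following the server_version_num precedent; this fragment is a guess at the eventual shape, not code from the patch:

    {
        {"max_safe_fds", PGC_INTERNAL, PRESET_OPTIONS,
            gettext_noop("Shows the number of file descriptors the server "
                         "believes it can safely open per process."),
            NULL,
            GUC_NOT_IN_SAMPLE | GUC_DISALLOW_IN_FILE
        },
        &max_safe_fds,
        32, 0, INT_MAX,     /* boot value; set_max_safe_fds() computes the
                             * real value during postmaster startup */
        NULL, NULL, NULL
    },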

Cache-size accounting accuracy

David flagged that Kazar's use of sizeof(VfdCache[i]) undercounts because fileName is pstrdup'd. The correct accounting is:

    sizeof(VfdCache[i]) + GetMemoryChunkSpace(VfdCache[i].fileName)

Kazar asked whether ignoring filename bytes is acceptable (cheaper — no per-entry palloc header walk). David strongly pushed back: file names are 10–55 bytes (base/5/1249, pg_wal/…, pg_xact/0000), often comparable to the 56-byte struct itself, so ignoring them would materially mislead. A constant "average" would likely be wrong.
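
A sketch of the corrected per-backend accounting; Vfd, VfdCache, SizeVfdCache, and GetMemoryChunkSpace() are the real names, while the helper itself is hypothetical:

    #include <stddef.h>

    typedef size_t Size;

    /* Minimal stand-in for fd.c's Vfd; only the field used here is shown. */
    typedef struct Vfd { char *fileName; /* ... */ } Vfd;

    extern Vfd *VfdCache;       /* fd.c: per-backend VFD array */
    extern Size SizeVfdCache;   /* fd.c: allocated length of VfdCache[] */
    extern Size GetMemoryChunkSpace(void *pointer);    /* utils/memutils.h */

    /* Count the struct plus the pstrdup'd name, including its chunk header. */
    static Size
    vfdcache_memory_usage(void)
    {
        Size total = 0;

        for (Size i = 0; i < SizeVfdCache; i++)
        {
            total += sizeof(VfdCache[i]);
            if (VfdCache[i].fileName != NULL)
                total += GetMemoryChunkSpace(VfdCache[i].fileName);
        }
        return total;
    }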

Cache shrinkability — a deeper refactor lurking

Kazar raised that a backend that once needed 100k entries keeps that cache forever. The current direct-mapped layout prevents compaction. Two forward-looking options from David:

  1. Convert to simplehash.h. Adds ~1 byte status + optionally 4 bytes cached hash per entry; the cache can be rebuilt at a smaller size when load drops (sketched below).
  2. Palloc VFDs individually. Change File from a 4-byte index into a pointer (ABI break), but enables: moving entries freely for compaction, variable-length inline filenames (eliminating the pstrdup allocation and saving the chunk header), and type-specialized layouts (non-temp files don't need ResourceOwner or file-size fields).

Option 2 is more invasive but strictly better long-term. Neither is in scope for this patch; they're explicitly teed up as follow-on work that the stats view would help motivate and measure.
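
A sketch of what option 1 could look like inside the PostgreSQL tree using lib/simplehash.h; the entry layout and key choice are assumptions (the char status byte is the per-entry overhead mentioned in the list above):

    #include <stdint.h>
    typedef uint32_t uint32;

    typedef struct Vfd { int fd; char *fileName; } Vfd;   /* abbreviated */

    /* The File id callers hold becomes the hash key, so the table can be
     * rebuilt at a smaller size without invalidating outstanding Files. */
    typedef struct VfdHashEntry
    {
        uint32 file;     /* key */
        char   status;   /* required by simplehash: empty vs. in use */
        Vfd    vfd;      /* payload */
    } VfdHashEntry;

    #define SH_PREFIX        vfdhash
    #define SH_ELEMENT_TYPE  VfdHashEntry
    #define SH_KEY_TYPE      uint32
    #define SH_KEY           file
    #define SH_HASH_KEY(tb, key)  murmurhash32(key)   /* common/hashfn.h */
    #define SH_EQUAL(tb, a, b)    ((a) == (b))
    #define SH_SCOPE         static inline
    #define SH_DECLARE
    #define SH_DEFINE
    #include "lib/simplehash.h"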

io_uring interaction

Tomas noted that io_uring can consume "quite a few descriptors" and has caused issues; the VFD stats view would make such resource pressure observable — a non-obvious benefit beyond partition-scan tuning.

Patch Quality / Review Items

Tomas's review of v2 surfaced several infrastructure-level issues, all covered above: splitting global from per-backend stats, dropping the redundant evictions counter, collapsing the counting macros into a single pgstat_count_vfd_access() entry point, and the max_safe_fds layering and comment cleanup.

Architectural Significance

The patch is small but sits at an interesting junction:

  1. It closes a long-standing observability gap in a subsystem (fd.c) that has been essentially unobservable since its creation.
  2. It exposes just enough information to justify — and measure — the much more significant VFD cache redesign (hash-table or palloc-based) that David is contemplating. In that sense it is infrastructure for future optimization work, not just a monitoring nicety.
  3. It's a textbook use of the cumulative pgstat framework (shared entry + local pending + flush callback), making it a good reference for future per-subsystem stats additions.

The global-only 0001 patch is on a realistic path to PG19 given Tomas's endorsement; per-backend stats (0002) and the cache-layout refactor are longer-horizon work.