Overview
This thread proposes adding observability into PostgreSQL's Virtual File Descriptor (VFD) cache — a per-backend LRU-managed cache of open OS file descriptors bounded by max_files_per_process (default 1000). The patch, authored by Ayoub Kazar and inspired by a blog post from Tomas Vondra, introduces a pg_stat_vfdcache view exposing hit/miss counters so DBAs can tell whether the VFD cache is thrashing and whether max_files_per_process needs to be raised.
The Core Architectural Problem
The VFD layer in fd.c exists because PostgreSQL backends can easily want to hold more file descriptors open than the OS ulimit allows (one relation = 1 fd per 1 GB segment × forks; a partitioned table with 1500 children easily blows past 1000 fds). When the LRU is full, VFD closes the OS fd but keeps the metadata entry; the next access calls open() again — a hot-path syscall.
The architectural pain points the thread surfaces:
- No introspection. There is currently no way to observe cache pressure. Tomas's blog claimed 4–5× throughput improvements from simply raising max_files_per_process, but operators have no signal to know when to do so.
- Unbounded cache metadata. While the number of open OS fds is bounded by max_files_per_process, the number of VFD cache entries (SizeVfdCache) is not. David Geier reports a production system with ~300,000 files and thousands of backends burning ~30 GiB on VFD metadata alone. At sizeof(Vfd) ≈ 56–80 bytes plus a pstrdup()'d fileName of 10–55 bytes, this adds up.
- No shrinking. The cache is direct-indexed by VFD index (File is a 4-byte int into VfdCache[]), so it can only be truncated down to the maximum live index, never compacted after a transient spike.
Design Decisions and Tradeoffs
Per-backend vs. global statistics
Jakub Wartak's first response reframed the patch's usefulness: local-only counters are not actionable because applications won't query them. The stats must live in shared memory (cumulative pgstat infrastructure) so an operator can see the global picture across all backends. Kazar pivoted to shared stats in v2, following the pattern of pgstat_bgwriter / pgstat_checkpointer: accumulate locally in PendingVfdCacheStats, then flush to a shared PgStatShared_VfdCache struct protected by an LWLock.
Tomas pushed this further in review: global and per-backend stats are qualitatively different features and should be split. Global hit/miss is "virtually free" and clearly useful for tuning max_files_per_process — possibly PG19 material. Per-backend stats (identifying which backend is thrashing) have higher overhead and a weaker cost/benefit case. Kazar accepted the split: v3-0001 = global only, 0002 = per-backend plus cache-size reporting.
What counters to expose
Tomas questioned whether an evictions counter is meaningful separately from misses — in steady state the two converge. Kazar agreed and dropped it. The final shape is:
- vfd_hits — fd was already open
- vfd_misses — fd was VFD_CLOSED, open() was required
- stat_reset_timestamp
- (proposed) total entries, total bytes, max_safe_fds
Tomas suggested pgstat_count_vfd_access(hit bool) as a cleaner single entry point at the top of the access path — pgstat_count_vfd_access(!FileIsNotOpen(file)) — rather than two separate macros with if/else.
Exposing max_safe_fds
max_safe_fds (computed by postmaster, often < max_files_per_process because of ulimit probing) is the real operational ceiling. Kazar exposed it via the view; David countered that a PGC_INTERNAL GUC is more idiomatic for static read-only internal values (e.g., server_version_num). The comment on max_safe_fds currently advertises it as exposed only for save/restore_backend_variables(), so reading it from pgstat code is a slight layering violation that warrants updating the comment.
Cache-size accounting accuracy
David flagged that Kazar's use of sizeof(VfdCache[i]) undercounts because fileName is pstrdup'd. The correct accounting is:
sizeof(VfdCache[i]) + GetMemoryChunkSpace(VfdCache[i].fileName)
Kazar asked whether ignoring filename bytes is acceptable (cheaper — no per-entry palloc header walk). David strongly pushed back: file names are 10–55 bytes (base/5/1249, pg_wal/…, pg_xact/0000), often comparable to the 56-byte struct itself, so ignoring them would materially mislead. A constant "average" would likely be wrong.
Cache shrinkability — a deeper refactor lurking
Kazar raised that a backend that once needed 100k entries keeps that cache forever. The current direct-mapped layout prevents compaction. Two forward-looking options from David:
- Convert to simplehash.h. Adds ~1 byte of status plus optionally 4 bytes of cached hash per entry; the cache can be rebuilt at a smaller size when load drops.
- Palloc VFDs individually. Change File from a 4-byte index into a pointer (an ABI break), but this enables moving entries freely for compaction, variable-length inline filenames (eliminating the pstrdup allocation and saving the chunk header), and type-specialized layouts (non-temp files don't need ResourceOwner or file-size fields).
Option 2 is more invasive but strictly better long-term. Neither is in scope for this patch; they're explicitly teed up as follow-on work that the stats view would help motivate and measure.
io_uring interaction
Tomas noted that io_uring can consume "quite a few descriptors" and has caused issues; the VFD stats view would make such resource pressure observable — a non-obvious benefit beyond partition-scan tuning.
Patch Quality / Review Items
Tomas's review of v2 surfaced several infrastructure-level issues:
- usagecount is wrong for this stat kind. usagecount is for single-writer fixed stats; VFD stats have many writers (every backend), so an LWLock-protected flush is required — which the patch does via PgStatShared_VfdCache.lock.
- Don't double-update PgStatShared_Backend from pgstat_vfdcache.c. The proper pattern, already used for WAL (pgstat_flush_backend_entry_wal in pgstat_backend.c), is to have pgstat_backend sync from the vfdcache subsystem. This is an important layering point in the modular pgstat framework introduced for PG15+.
- GetVfdCacheOccupancy was dead code in v2.
- Missing REVOKE on pg_stat_vfdcache in system_views.sql (standard for stats views that shouldn't be world-readable by default — or at least consistent with surrounding views).
- pgindent misformatting of the two new structs — a recurring pitfall when struct name alignment interacts with typedef lists.
Architectural Significance
The patch is small but sits at an interesting junction:
- It closes a long-standing observability gap in a subsystem (fd.c) that has been essentially unobservable since its creation.
- It exposes just enough information to justify — and measure — the much more significant VFD cache redesign (hash-table or palloc-based) that David is contemplating. In that sense it is infrastructure for future optimization work, not just a monitoring nicety.
- It's a textbook use of the cumulative pgstat framework (shared entry + local pending + flush callback), making it a good reference for future per-subsystem stats additions.
The global-only 0001 patch is on a realistic path to PG19 given Tomas's endorsement; per-backend stats (0002) and the cache-layout refactor are longer-horizon work.