Analysis: Freeing BufFile Metadata in BufFileClose() and BufFileAppend()
Core Problem
BufFile is PostgreSQL's buffered-file abstraction layered on top of the virtual file descriptor (File) API in src/backend/storage/file/. It is used heavily by the executor for spill files: hash joins, hash aggregates, tuplestores, logical tape sets for external sort, and parallel-worker temp-file handoff via SharedFileSet. Each BufFile wrapper carries a BLCKSZ (typically 8 KiB) I/O buffer plus bookkeeping (the files array, and for fileset-backed BufFiles, a pstrdup'd logical name).
DaeMyung (charsyam) identifies two narrowly-scoped but real palloc-lifetime issues:
-
BufFileClose()leaks thepstrdup'd name. ForSharedFileSet-basedBufFiles (created throughBufFileCreateFileSet/BufFileOpenFileSet), the constructor duplicates the logical name into the owning memory context.BufFileClose()releases the underlyingFilehandles and the wrapper, but never frees the duplicated name string. Until the owning context is reset, the string sits around. -
BufFileAppend()leaks the entire source wrapper.BufFileAppend(target, source)semantically consumes the source: it transfers the array ofFilehandles into the target and the caller is contractually told not to touch the source again. However, the implementation never frees the source wrapper itself — including itsBLCKSZ-sized I/O buffer, itsfilesarray, and its name. In a parallel external sort at high DOP (max_parallel_workers_per_gatherhigh), the leader'sLogicalTapeSetCreatepath issues manyBufFileAppend()calls to glue worker tapes together within a single long-lived context, so the retained wrappers aggregate into hundreds of KiB of dead memory before the context finally resets.
Architecturally this is not a correctness bug — nothing is unbounded, everything dies at context reset — but it is wasteful in exactly the workload (parallel sort) where memory pressure is already tight against work_mem/maintenance_work_mem.
Proposed Patch
The patch touches only src/backend/storage/file/buffile.c:
- In
BufFileClose():pfree(file->name)when the BufFile is fileset-backed (name is non-NULL). - In
BufFileAppend(): after splicingsource->filesintotarget->files, free the source's buffer, files array, name, and the wrapper struct itself. EffectivelyBufFileAppend()becomes the destructor of its source argument — matching the documented contract. - Adds an assertion (apparently validating the consumed-source invariant).
The diff is small: +17/-5.
Design Tradeoff and Michael Paquier's Objection
Michael Paquier (committer, owner of large swaths of the file/WAL/backup infrastructure) accepts the assertion and the overall direction ("we care about resources in this specific call") but pushes back on the pfree()s in BufFileAppend() on API-compatibility grounds:
"What if the caller cares about keeping a track of the source data? Your assumptions are based on the sole caller of
BufFileAppend()in the tree. There could be callers outside the core tree, in extension code."
This is the crux of the design tension. The header comment for BufFileAppend() says the source must not be used again, which in-tree is treated as "logically dead." But:
- A third-party extension might still be inspecting harmless fields of the source wrapper (e.g.,
numFilesfor logging, or the name pointer for diagnostics) after the append, relying on the current lenient behavior. - Turning
BufFileAppend()into a destructor is a silent ABI/behavior change. Code that previously "worked" (because the struct was merely logically orphaned, not physically freed) would now dereference freed memory.
Michael's stance reflects a recurring PostgreSQL committer value: file-level and storage-level APIs are considered extension surface area, and semantics should not be tightened just because in-tree usage happens to conform. The same conservatism has historically applied to fd.c, md.c, and buffile.c itself.
By contrast, the BufFileClose() name-free seems uncontested — BufFileClose() is unambiguously a destructor, so freeing an internally-allocated, internally-owned string cannot surprise any caller.
Procedural Outcome
Michael explicitly defers this out of the v19 cycle ("None of this is material for v19") and requests a CommitFest entry (CF #59 / the one following v19 feature freeze) with himself attached as reviewer/committer. This is the canonical disposition for a cleanup patch that arrives late in a cycle and needs more design discussion: park it in the next CF rather than rush it.
Technical Implications
- If accepted as-is:
BufFileAppend()gains destructor semantics for its source. Parallel sort leader memory duringLogicalTapeSetCreatefrom worker tapes drops by roughlyN_workers * (BLCKSZ + sizeof(BufFile) + name_len). - Likely v2 shape: To address Michael's concern, the patch may need either (a) a new explicit API like
BufFileFreeConsumedSource()that existing callers opt into, or (b) stronger documentation in the header that the source is freed, plus a release-notes note for extension authors. Option (a) is more conservative and more in keeping with how PostgreSQL has historically handled this kind of API sharpening. - The name-free in
BufFileClose()will likely land independently and uncontroversially.
Broader Context
DaeMyung frames this as part of a "storage/file audit" — a systematic sweep for metadata leaks in the file layer. Such audits periodically surface on -hackers and tend to produce a cluster of small, targeted patches. The pattern here — free everything the constructor allocated, even if context reset would eventually catch it — is consistent with recent work tightening memory behavior in long-lived contexts (e.g., logical replication workers, autovacuum, and parallel leader processes) where "eventually freed" is not good enough.