[PATCH] Free BufFile metadata in close and append paths

First seen: 2026-04-29 15:46:59+00:00 · Messages: 2 · Participants: 2

Latest Update

2026-05-09 · opus 4.7

Analysis: Freeing BufFile Metadata in BufFileClose() and BufFileAppend()

Core Problem

BufFile is PostgreSQL's buffered-file abstraction layered on top of the virtual file descriptor (File) API in src/backend/storage/file/. It is used heavily by the executor for spill files: hash joins, hash aggregates, tuplestores, logical tape sets for external sort, and parallel-worker temp-file handoff via SharedFileSet. Each BufFile wrapper carries a BLCKSZ (typically 8 KiB) I/O buffer plus bookkeeping (the files array, and for fileset-backed BufFiles, a pstrdup'd logical name).

DaeMyung (charsyam) identifies two narrowly-scoped but real palloc-lifetime issues:

  1. BufFileClose() leaks the pstrdup'd name. For SharedFileSet-based BufFiles (created through BufFileCreateFileSet/BufFileOpenFileSet), the constructor duplicates the logical name into the owning memory context. BufFileClose() releases the underlying File handles and the wrapper, but never frees the duplicated name string. Until the owning context is reset, the string sits around.

  2. BufFileAppend() leaks the entire source wrapper. BufFileAppend(target, source) semantically consumes the source: it transfers the array of File handles into the target and the caller is contractually told not to touch the source again. However, the implementation never frees the source wrapper itself — including its BLCKSZ-sized I/O buffer, its files array, and its name. In a parallel external sort at high DOP (max_parallel_workers_per_gather high), the leader's LogicalTapeSetCreate path issues many BufFileAppend() calls to glue worker tapes together within a single long-lived context, so the retained wrappers aggregate into hundreds of KiB of dead memory before the context finally resets.

Architecturally this is not a correctness bug — nothing is unbounded, everything dies at context reset — but it is wasteful in exactly the workload (parallel sort) where memory pressure is already tight against work_mem/maintenance_work_mem.

Proposed Patch

The patch touches only src/backend/storage/file/buffile.c:

The diff is small: +17/-5.

Design Tradeoff and Michael Paquier's Objection

Michael Paquier (committer, owner of large swaths of the file/WAL/backup infrastructure) accepts the assertion and the overall direction ("we care about resources in this specific call") but pushes back on the pfree()s in BufFileAppend() on API-compatibility grounds:

"What if the caller cares about keeping a track of the source data? Your assumptions are based on the sole caller of BufFileAppend() in the tree. There could be callers outside the core tree, in extension code."

This is the crux of the design tension. The header comment for BufFileAppend() says the source must not be used again, which in-tree is treated as "logically dead." But:

Michael's stance reflects a recurring PostgreSQL committer value: file-level and storage-level APIs are considered extension surface area, and semantics should not be tightened just because in-tree usage happens to conform. The same conservatism has historically applied to fd.c, md.c, and buffile.c itself.

By contrast, the BufFileClose() name-free seems uncontested — BufFileClose() is unambiguously a destructor, so freeing an internally-allocated, internally-owned string cannot surprise any caller.

Procedural Outcome

Michael explicitly defers this out of the v19 cycle ("None of this is material for v19") and requests a CommitFest entry (CF #59 / the one following v19 feature freeze) with himself attached as reviewer/committer. This is the canonical disposition for a cleanup patch that arrives late in a cycle and needs more design discussion: park it in the next CF rather than rush it.

Technical Implications

Broader Context

DaeMyung frames this as part of a "storage/file audit" — a systematic sweep for metadata leaks in the file layer. Such audits periodically surface on -hackers and tend to produce a cluster of small, targeted patches. The pattern here — free everything the constructor allocated, even if context reset would eventually catch it — is consistent with recent work tightening memory behavior in long-lived contexts (e.g., logical replication workers, autovacuum, and parallel leader processes) where "eventually freed" is not good enough.