2026-05-18 · claude-opus-4-6
Bug Analysis: mdunlinkfiletag Unlinks Wrong Fork/Segment
Core Problem
The mdunlinkfiletag() function in PostgreSQL's magnetic disk (MD) storage manager has a latent bug where it always unlinks segment 0 of MAIN_FORKNUM, regardless of what fork number and segment number are specified in the FileTag structure passed to it. This means the FileTag infrastructure's forknum and segno fields are silently ignored for unlink operations.
Technical Context
PostgreSQL's MD storage manager uses a checkpoint-based sync request queue (SyncOps) to defer certain filesystem operations. The FileTag structure carries metadata about which relation file to operate on:
typedef struct FileTag {
int16 handler;
int16 forknum;
RelFileLocator rlocator;
uint32 segno;
} FileTag;
When a relation is dropped, PostgreSQL must ensure the underlying files are removed. For the main fork's segment 0, this is handled specially: rather than unlinking immediately (which could cause problems if the relfilenode number is recycled before the next checkpoint), a "tombstone" entry is registered via register_unlink_segment(). This tombstone prevents relfilenode recycling and is processed at the next checkpoint via mdunlinkfiletag().
The bug exists in mdunlinkfiletag():
// BUG: Always uses MAIN_FORKNUM regardless of ftag->forknum
p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
// Also ignores ftag->segno (always unlinks segment 0)
Severity and Impact
The bug is practically harmless in core PostgreSQL because:
register_unlink_segment() is only called from mdunlinkfork() when forknum == MAIN_FORKNUM
- All non-main-fork segments and all segments > 0 are unlinked directly on commit, not deferred
- The tombstone mechanism is specifically designed for main fork segment 0 only
However, the bug is architecturally misleading because:
- The function signature and
FileTag structure suggest general-purpose fork/segment unlink capability
- Extensions using the SMGR infrastructure could reasonably assume these fields are honored
- During recovery, incorrect unlinks could cause data loss if an extension relied on this path
Historical Origin
- PG12 (commit 3eb77eba): Introduced the
FileTag/SyncOps infrastructure. The segment number was never properly used in mdunlinkfiletag() from the start.
- PG16 (commit b0a55e43): Refactored
RelFileNode → RelFileLocator. This introduced the explicit MAIN_FORKNUM hard-coding where previously the fork was implicitly fixed.
Proposed Solutions
v1: Fix the Bug Directly
The initial patch simply corrects mdunlinkfiletag() to use the FileTag's actual fork and segment values:
- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = _mdfd_segpath_rflb(rlfb, ftag->forknum, ftag->segno);
This makes the function behave as its interface contract implies.
v2: Narrow the Interface (Adopted Direction)
Thomas Munro suggested going the opposite direction — rather than making the implementation more general, make the interface more restrictive to match what the implementation actually does and what's actually needed:
- Rename
register_unlink_segment() → register_unlink_tombstone() — removes forknum and segno parameters entirely
- Add Assert() in
mdunlinkfiletag() verifying that any incoming FileTag contains MAIN_FORKNUM and segment 0
- Make it explicit that this is a tombstone-only mechanism, not a general-purpose deferred unlink
This design choice reflects a key architectural principle: don't expose generality you don't intend to support. The tombstone mechanism is inherently unreliable (crash between checkpoint record write and queue processing leaks the file), so building more functionality on top of it is undesirable.
Key Design Tradeoffs
Generality vs. Clarity
The v2 approach sacrifices potential extensibility for clarity of intent. This is the right call because:
- The mechanism has a fundamental reliability gap (crash can leak files)
- No known consumer needs general fork/segment unlink via this path
- A better long-term solution for relfilenode recycling would eliminate this mechanism entirely
Extension API Surface
This touches on the broader question of what SMGR extension points are "supported." The SyncOps / FileTag infrastructure was designed with some generality (the handler field allows extensions to register their own sync handlers), but SYNC_UNLINK_REQUEST through md.c is considered internal. The thread implicitly acknowledges that SMGR extension authors might stumble into this code path but argues against treating it as a stable API.
Reliability Gap (Noted for Future Work)
Thomas Munro flags that the entire tombstone mechanism has an inherent race condition: if the system crashes after writing the checkpoint record but before processing the unlink queue, the tombstone file leaks permanently. This is a pre-existing issue not addressed by this patch, but it motivates the desire to eventually replace this mechanism entirely rather than extending it.