Bug: mdunlinkfiletag unlinks mainfork seg.0 instead of indicated fork+segment

First seen: 2024-12-20 22:41:16+00:00 · Messages: 6 · Participants: 4

Latest Update

2026-05-20 · claude-opus-4-6

No Substantive Progress

The new message from Surya Poondla is a brief acknowledgment thanking Solai for testing/confirming and agreeing with a comment suggestion. There is no new technical content, no new patch version, no architectural discussion, and no position changes.

History (1 prior analysis)
2026-05-18 · claude-opus-4-6

Bug Analysis: mdunlinkfiletag Unlinks Wrong Fork/Segment

Core Problem

The mdunlinkfiletag() function in PostgreSQL's magnetic disk (MD) storage manager has a latent bug where it always unlinks segment 0 of MAIN_FORKNUM, regardless of what fork number and segment number are specified in the FileTag structure passed to it. This means the FileTag infrastructure's forknum and segno fields are silently ignored for unlink operations.

Technical Context

PostgreSQL's MD storage manager uses a checkpoint-based sync request queue (SyncOps) to defer certain filesystem operations. The FileTag structure carries metadata about which relation file to operate on:

typedef struct FileTag {
    int16       handler;
    int16       forknum;
    RelFileLocator rlocator;
    uint32      segno;
} FileTag;

When a relation is dropped, PostgreSQL must ensure the underlying files are removed. For the main fork's segment 0, this is handled specially: rather than unlinking immediately (which could cause problems if the relfilenode number is recycled before the next checkpoint), a "tombstone" entry is registered via register_unlink_segment(). This tombstone prevents relfilenode recycling and is processed at the next checkpoint via mdunlinkfiletag().

The bug exists in mdunlinkfiletag():

// BUG: Always uses MAIN_FORKNUM regardless of ftag->forknum
p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
// Also ignores ftag->segno (always unlinks segment 0)

Severity and Impact

The bug is practically harmless in core PostgreSQL because:

  1. register_unlink_segment() is only called from mdunlinkfork() when forknum == MAIN_FORKNUM
  2. All non-main-fork segments and all segments > 0 are unlinked directly on commit, not deferred
  3. The tombstone mechanism is specifically designed for main fork segment 0 only

However, the bug is architecturally misleading because:

  • The function signature and FileTag structure suggest general-purpose fork/segment unlink capability
  • Extensions using the SMGR infrastructure could reasonably assume these fields are honored
  • During recovery, incorrect unlinks could cause data loss if an extension relied on this path

Historical Origin

  • PG12 (commit 3eb77eba): Introduced the FileTag/SyncOps infrastructure. The segment number was never properly used in mdunlinkfiletag() from the start.
  • PG16 (commit b0a55e43): Refactored RelFileNodeRelFileLocator. This introduced the explicit MAIN_FORKNUM hard-coding where previously the fork was implicitly fixed.

Proposed Solutions

v1: Fix the Bug Directly

The initial patch simply corrects mdunlinkfiletag() to use the FileTag's actual fork and segment values:

- p = relpathperm(ftag->rlocator, MAIN_FORKNUM);
+ p = _mdfd_segpath_rflb(rlfb, ftag->forknum, ftag->segno);

This makes the function behave as its interface contract implies.

v2: Narrow the Interface (Adopted Direction)

Thomas Munro suggested going the opposite direction — rather than making the implementation more general, make the interface more restrictive to match what the implementation actually does and what's actually needed:

  1. Rename register_unlink_segment()register_unlink_tombstone() — removes forknum and segno parameters entirely
  2. Add Assert() in mdunlinkfiletag() verifying that any incoming FileTag contains MAIN_FORKNUM and segment 0
  3. Make it explicit that this is a tombstone-only mechanism, not a general-purpose deferred unlink

This design choice reflects a key architectural principle: don't expose generality you don't intend to support. The tombstone mechanism is inherently unreliable (crash between checkpoint record write and queue processing leaks the file), so building more functionality on top of it is undesirable.

Key Design Tradeoffs

Generality vs. Clarity

The v2 approach sacrifices potential extensibility for clarity of intent. This is the right call because:

  • The mechanism has a fundamental reliability gap (crash can leak files)
  • No known consumer needs general fork/segment unlink via this path
  • A better long-term solution for relfilenode recycling would eliminate this mechanism entirely

Extension API Surface

This touches on the broader question of what SMGR extension points are "supported." The SyncOps / FileTag infrastructure was designed with some generality (the handler field allows extensions to register their own sync handlers), but SYNC_UNLINK_REQUEST through md.c is considered internal. The thread implicitly acknowledges that SMGR extension authors might stumble into this code path but argues against treating it as a stable API.

Reliability Gap (Noted for Future Work)

Thomas Munro flags that the entire tombstone mechanism has an inherent race condition: if the system crashes after writing the checkpoint record but before processing the unlink queue, the tombstone file leaks permanently. This is a pre-existing issue not addressed by this patch, but it motivates the desire to eventually replace this mechanism entirely rather than extending it.