Random pg_upgrade 004_subscription test failure on drongo

First seen: 2025-03-13 09:04:15+00:00 · Messages: 8 · Participants: 5

Latest Update

2026-05-11 · opus 4.7

Deep Analysis: Random pg_upgrade 004_subscription Test Failure on drongo

Core Problem: Windows STATUS_DELETE_PENDING Race in Relation Storage Creation

The thread concerns a long-standing, flaky failure mode of pg_upgrade on Windows buildfarm animals (drongo, fairywren) in which schema restoration into the new cluster fails with:

ERROR: could not create file "base/1/2683": File exists

during a TRUNCATE pg_catalog.pg_largeobject executed under binary_upgrade_set_next_index_relfilenode/binary_upgrade_set_next_heap_relfilenode. The failure occurs on the mdcreate() path immediately after an mdunlink() of the old relfilenode, which on POSIX is perfectly safe — unlink(2) makes the name disappear immediately, and open(O_CREAT|O_EXCL) can proceed without conflict.

Why Windows breaks the POSIX contract

On Windows, DeleteFile() / NtClose() on a file that still has open handles (including handles held in other processes, e.g. checkpointer or bgwriter that had the segment open via md.c) does not remove the directory entry. Instead the file enters the STATUS_DELETE_PENDING state: the name still exists in the namespace, but any attempt to CreateFile() it returns ERROR_ACCESS_DENIED (which Postgres translates to EEXIST via pgwin32_open_handle()), until the last handle closes.

The call stacks captured in the failure logs establish this unambiguously.

The comment in pgwin32_open_handle() quoted by Vignesh confirms this is a known Windows kernel behavior, and pg_RtlGetLastNtStatus() is already used in tree to distinguish STATUS_DELETE_PENDING from a genuine EEXIST.

Why this matters beyond the test

Heikki Linnakangas immediately widens the scope: this is not a pg_upgrade-specific bug. Any relfilenumber reuse in a running server is vulnerable, notably TRUNCATE, CLUSTER, VACUUM FULL, and REINDEX, all of which go through RelationSetNewRelfilenumber. The test surfaces it only because the workaround applied earlier for 003_logical_slots (bgwriter_lru_maxpages = 0 and checkpoint_timeout = 1h), which keeps bgwriter/checkpointer from ever opening the doomed segment, was never extended to 004_subscription.

The precedent for a proper fix already exists in dropdb() and DropTableSpace():

WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));

This forces every backend to process a PROCSIGNAL_BARRIER_SMGRRELEASE, which invokes smgrreleaseall() and closes dangling HANDLEs so Windows can actually free the pending-delete inode.

Proposed Solutions

v1 (Vignesh, 2025-03-13): GUC workaround in the test

Mirror the 003_logical_slots fix — set bgwriter_lru_maxpages = 0 and checkpoint_timeout = 1h on the target cluster. This masks the failure rather than fixing it, but is low-risk and backpatchable.

v2 (Vignesh, 2025-03-21, after Heikki's suggestion): mdcreate() retry with barrier

Patch mdcreate() to detect a Windows STATUS_DELETE_PENDING error, emit a PROCSIGNAL_BARRIER_SMGRRELEASE, wait for all backends to process it, and retry the open. Roughly:

#if defined(WIN32) && !defined(__CYGWIN__)
    /*
     * The failed create may be a previously-unlinked file still held
     * open elsewhere (STATUS_DELETE_PENDING).  Ask every backend to
     * close its smgr file descriptors, wait for compliance, then
     * retry the open exactly once.
     */
    if (!retryattempted && pg_RtlGetLastNtStatus() == STATUS_DELETE_PENDING)
    {
        retryattempted = true;
        WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
        goto retry;
    }
#endif

This is the architecturally correct place in the sense that it handles the generic case (TRUNCATE/CLUSTER/etc.) and not just pg_upgrade.

Michael Paquier's objection (2025-10-20)

Michael pushes back on two grounds:

  1. Layering concern: Placing a Windows-specific barrier inside md.c pollutes a code path the project actively tries to keep POSIX-shaped. More importantly, mdcreate isn't the only consumer of BasicOpenFilePerm that could be bitten by a pending-delete; any FD acquisition against a recently-unlinked path is vulnerable. A more general place would be inside pgwin32_open_handle() itself (the existing retry loop in src/port/open.c), so all callers benefit.

  2. Verification concern: Because the failure is a timing-dependent cross-process handle race, neither the reporter nor reviewers can reproduce it. Michael proposes a dedicated test module that deliberately induces concurrent filesystem operations (holding FDs while other backends try to unlink/create) to produce a reliable regression harness. Without it, we're shipping fixes we can't verify.

Michael's critique carries significant weight — he is a committer with deep expertise in the storage and portability layers, and his observation that src/port/open.c already contains a retry loop partially targeting this exact condition is the strongest argument for centralizing the fix there rather than in md.c.

Key Technical Insights

1. WaitForProcSignalBarrier is the only correct remediation

A simple sleep/retry loop (as already exists in src/port/open.c) cannot suffice: if the handle is held by a long-lived auxiliary process like checkpointer, it may never close within any bounded retry window. The barrier mechanism forces smgrreleaseall() in every backend, which is the only way to guarantee the pending-delete resolves. This is why dropdb/DropTableSpace already use it.

2. Retry scope inside vs. outside md.c

The md.c placement gives a clean "once per creation" retry with access to SMGR-level state. The pgwin32_open_handle / src/port/open.c placement is more general but riskier: emitting a procsignal barrier from deep inside a libc-like wrapper violates layering far worse than an #ifdef WIN32 in md.c does — open() callers do not expect to block on IPC. This is a genuine design tension with no obvious right answer; Michael leans toward the wrapper, Heikki toward mdcreate.

3. The buildfarm visibility problem

A meta-problem that repeatedly blocks progress: pg_upgrade_output.d/ is not captured by the buildfarm client, so the actual error text in pg_upgrade_dump_1.log is lost. Andrew Dunstan acknowledges this is "not supposed to happen" and is working on a buildfarm-client fix. Alexander Lakhin's 2026 follow-up shows the visibility gap persists over a year later and now affects 006_transfer_modes as well — meaning whatever fix was (or wasn't) committed hasn't resolved the underlying race.

4. Relfilenumber reuse within a transaction is the real hazard

RelationSetNewRelfilenumber both unlinks the old storage and creates the new one in quick succession. On POSIX the window between them is benign; on Windows it is the entire bug. A different architectural approach (never reuse, always allocate fresh) would sidestep the issue, but that's a far larger change and would interact with binary-upgrade OID preservation.

Status and Open Questions

As of the last message (Alexander, 2026-05-10), the race is still occurring on fairywren and drongo, now also triggering 006_transfer_modes failures. Michael's October 2025 review appears to have stalled the patch pending either a relocation of the retry into pgwin32_open_handle() or a dedicated test module that can reproduce the race on demand.

Neither has materialized. The practical workaround (the GUC tweaks applied for 003_logical_slots) has not been extended to 004_subscription or 006_transfer_modes, leaving the buildfarm sporadically red.