Deep Analysis: Random pg_upgrade 004_subscription Test Failure on drongo
Core Problem: Windows STATUS_DELETE_PENDING Race in Relation Storage Creation
The thread concerns a long-standing, flaky failure mode of pg_upgrade on Windows buildfarm animals (drongo, fairywren) in which schema restoration into the new cluster fails with:
ERROR: could not create file "base/1/2683": File exists
during a TRUNCATE pg_catalog.pg_largeobject executed under binary_upgrade_set_next_index_relfilenode/binary_upgrade_set_next_heap_relfilenode. The failure occurs on the mdcreate() path immediately after an mdunlink() of the old relfilenode, which on POSIX is perfectly safe — unlink(2) makes the name disappear immediately, and open(O_CREAT|O_EXCL) can proceed without conflict.
Why Windows breaks the POSIX contract
On Windows, DeleteFile() / NtClose() on a file that still has open handles (including handles held in other processes, e.g. checkpointer or bgwriter that had the segment open via md.c) does not remove the directory entry. Instead the file enters the STATUS_DELETE_PENDING state: the name still exists in the namespace, but any attempt to CreateFile() it returns ERROR_ACCESS_DENIED (which Postgres translates to EEXIST via pgwin32_open_handle()), until the last handle closes.
The relevant call stacks establish this unambiguously:
- Unlink path:
  ExecuteTruncate → RelationSetNewRelfilenumber → smgrdounlinkall → mdunlink → pgunlink() — returns success, but the file is only pending deletion.
- Create path (same transaction, same relation):
  RelationSetNewRelfilenumber → RelationCreateStorage → mdcreate → PathNameOpenFile → BasicOpenFilePerm → open() → pgwin32_open() — fails because another backend (most plausibly bgwriter or checkpointer, which had opened the segment to flush dirty buffers) still holds a HANDLE to the old inode.
The comment in pgwin32_open_handle() quoted by Vignesh confirms this is a known Windows kernel behavior, and pg_RtlGetLastNtStatus() is already used in tree to distinguish STATUS_DELETE_PENDING from a genuine EEXIST.
Why this matters beyond the test
Heikki Linnakangas immediately widens the scope: this is not a pg_upgrade-specific bug. Any relfilenumber reuse in a running server is vulnerable — notably TRUNCATE, CLUSTER, VACUUM FULL, and REINDEX, all of which go through RelationSetNewRelfilenumber. The test surfaces it only because the workaround applied earlier for 003_logical_slots (bgwriter_lru_maxpages = 0 and checkpoint_timeout = 1h, which keeps bgwriter/checkpointer from ever opening the doomed segment) was never extended to 004_subscription.
The precedent for a proper fix already exists in dropdb() and DropTableSpace():
WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
This forces every backend to process a PROCSIGNAL_BARRIER_SMGRRELEASE, which invokes smgrreleaseall() and closes dangling HANDLEs so Windows can actually free the pending-delete inode.
Proposed Solutions
v1 (Vignesh, 2025-03-13): GUC workaround in the test
Mirror the 003_logical_slots fix — set bgwriter_lru_maxpages = 0 and checkpoint_timeout = 1h on the target cluster. This masks the failure rather than fixing it, but is low-risk and backpatchable.
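A sketch of what the workaround would add to the new cluster's configuration, using the two GUCs named in the thread (the exact mechanism for applying them in the TAP test is assumed, not quoted):

```
# Keep bgwriter/checkpointer from opening relation segments during the upgrade
bgwriter_lru_maxpages = 0   # bgwriter never writes (and so never opens) segments
checkpoint_timeout = 1h     # checkpointer stays quiescent for the test's duration
```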
v2 (Vignesh, 2025-03-21, after Heikki's suggestion): mdcreate() retry with barrier
Patch mdcreate() to detect a Windows STATUS_DELETE_PENDING error, emit a PROCSIGNAL_BARRIER_SMGRRELEASE, wait for all backends to process it, and retry the open. Roughly:
#if defined(WIN32) && !defined(__CYGWIN__)
	if (!retryattempted && pg_RtlGetLastNtStatus() == STATUS_DELETE_PENDING)
	{
		/* Force every backend to run smgrreleaseall(), then retry once */
		retryattempted = true;
		WaitForProcSignalBarrier(EmitProcSignalBarrier(PROCSIGNAL_BARRIER_SMGRRELEASE));
		goto retry;
	}
#endif
This is the architecturally correct place in the sense that it handles the generic case (TRUNCATE/CLUSTER/etc.) and not just pg_upgrade.
Michael Paquier's objection (2025-10-20)
Michael pushes back on two grounds:
- Layering concern: Placing a Windows-specific barrier inside md.c pollutes a code path the project actively tries to keep POSIX-shaped. More importantly, mdcreate isn't the only consumer of BasicOpenFilePerm that could be bitten by a pending delete; any FD acquisition against a recently-unlinked path is vulnerable. A more general place would be inside pgwin32_open_handle() itself (the existing retry loop in src/port/open.c), so all callers benefit.
- Verification concern: Because the failure is a timing-dependent cross-process handle race, neither the reporter nor reviewers can reproduce it. Michael proposes a dedicated test module that deliberately induces concurrent filesystem operations (holding FDs while other backends try to unlink/create) to produce a reliable regression harness. Without it, we're shipping fixes we can't verify.
Michael's critique carries significant weight — he is a committer with deep expertise in the storage and portability layers, and his observation that src/port/open.c already contains a retry loop partially targeting this exact condition is the strongest argument for centralizing the fix there rather than in md.c.
Key Technical Insights
1. WaitForProcSignalBarrier is the only correct remediation
A simple sleep/retry loop (as already exists in src/port/open.c) cannot suffice: if the handle is held by a long-lived auxiliary process like checkpointer, it will never close on its own timescale. The barrier mechanism forces smgrreleaseall() in every backend, which is the only way to guarantee the pending-delete resolves. This is why dropdb/DropTableSpace already use it.
2. Retry scope inside vs. outside md.c
The md.c placement gives a clean "once per creation" retry with access to SMGR-level state. The pgwin32_open_handle / src/port/open.c placement is more general but riskier: emitting a procsignal barrier from deep inside a libc-like wrapper violates layering far worse than an #ifdef WIN32 in md.c does — open() callers do not expect to block on IPC. This is a genuine design tension with no obvious right answer; Michael leans toward the wrapper, Heikki toward mdcreate.
3. The buildfarm visibility problem
A meta-problem that repeatedly blocks progress: pg_upgrade_output.d/ is not captured by the buildfarm client, so the actual error text in pg_upgrade_dump_1.log is lost. Andrew Dunstan acknowledges this is "not supposed to happen" and is working on a buildfarm-client fix. Alexander Lakhin's 2026 follow-up shows the visibility gap persists over a year later and now affects 006_transfer_modes as well — meaning whatever fix was (or wasn't) committed hasn't resolved the underlying race.
4. Relfilenumber reuse within a transaction is the real hazard
RelationSetNewRelfilenumber both unlinks the old storage and creates the new one in quick succession. On POSIX the window between them is benign; on Windows it is the entire bug. A different architectural approach (never reuse, always allocate fresh) would sidestep the issue, but that's a far larger change and would interact with binary-upgrade OID preservation.
Status and Open Questions
As of the last message (Alexander, 2026-05-10), the race is still occurring on fairywren and drongo, now also triggering 006_transfer_modes failures. Michael's October 2025 review appears to have stalled the patch pending either:
- Relocation of the retry logic to the pgwin32_open_handle layer, and/or
- A concurrency test module to prove the fix.
Neither has materialized. The practical workaround (GUC tweaks in the test) has not been applied to 004_subscription or 006_transfer_modes, leaving the buildfarm sporadically red.