Random pg_upgrade 004_subscription test failure on drongo

First seen: 2025-03-13 09:04:15+00:00 · Messages: 10 · Participants: 5

Latest Update

2026-05-18 · claude-opus-4-6

Incremental Update: Buildfarm Client Bug Fixed for Log Capture (2026-05-16)

Andrew Dunstan reports he has identified and fixed the buildfarm client bug that was preventing pg_upgrade_output.d/ logs from being captured on failing runs. The fix is committed to the buildfarm client repository.

What's New

  1. Buildfarm client fix committed: Commit 55fdf7e0 resolves the long-standing diagnostic visibility problem, the meta-issue that has hampered debugging of this race condition for over a year. Previously, when pg_upgrade failed, the actual error text in pg_upgrade_dump_1.log was lost because the buildfarm client did not collect it.

  2. Already deployed on affected animals: The fix has been deployed on drongo and fairywren ahead of the next official buildfarm client release, meaning the next time the STATUS_DELETE_PENDING race triggers a test failure, the full error output should be captured and visible.
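The kind of log capture described above can be sketched as follows. This is an illustration only, not the buildfarm client's code (the real client is written in Perl, and commit 55fdf7e0 is not reproduced here). The helper name collect_upgrade_logs, the max_bytes limit, and the choice to keep only the tail of each file are all assumptions made for the example.

```python
from pathlib import Path


def collect_upgrade_logs(data_dir, max_bytes=65536):
    """Gather the tail of every *.log file under <data_dir>/pg_upgrade_output.d.

    Hypothetical helper: returns a dict mapping each log's path (relative to
    pg_upgrade_output.d, with forward slashes) to its last max_bytes of text.
    """
    out_dir = Path(data_dir) / "pg_upgrade_output.d"
    logs = {}
    if not out_dir.is_dir():
        # Nothing was left behind, e.g. pg_upgrade succeeded and cleaned up.
        return logs
    for path in sorted(out_dir.rglob("*.log")):
        data = path.read_bytes()
        # Keep only the tail, where the failing command's error usually is.
        tail = data[-max_bytes:].decode("utf-8", errors="replace")
        logs[path.relative_to(out_dir).as_posix()] = tail
    return logs
```

A caller would invoke this only after a failed run and attach the returned text to the failure report, so that a file such as pg_upgrade_dump_1.log is visible alongside the test output.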

Significance

This is an infrastructure/diagnostic fix only; it does not address the underlying Windows STATUS_DELETE_PENDING race condition in PostgreSQL itself. However, it removes the major blocker that Michael Paquier and others cited: the inability to verify and reproduce the error details from buildfarm runs. With proper log capture now in place, the next occurrence should provide definitive confirmation of the error path, which may help restart the stalled discussion about where to place the retry logic (md.c vs src/port/open.c).
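For readers unfamiliar with the retry approach under discussion, here is a minimal language-neutral sketch of the idea, written in Python rather than PostgreSQL's C. It is not the proposed patch. The function name open_with_retry, the attempt count, and the delay are hypothetical, and treating PermissionError as the transient condition rests on the assumption that a delete-pending file surfaces to the caller as an access-denied error.

```python
import time


def open_with_retry(opener, path, attempts=10, delay=0.1,
                    transient=(PermissionError,)):
    """Call opener(path), retrying while it raises a transient error.

    Hypothetical sketch: 'transient' stands in for the access-denied error
    that a file in delete-pending state is assumed to produce. Any other
    exception propagates immediately without a retry.
    """
    for attempt in range(attempts):
        try:
            return opener(path)
        except transient:
            if attempt == attempts - 1:
                # Out of retries: report the original failure to the caller.
                raise
            # Give the pending delete time to complete before trying again.
            time.sleep(delay)
```

The open design question in the thread is where such a loop belongs: close to the one caller that hits the race (md.c) or in the shared open wrapper (src/port/open.c), where every caller would inherit the retry and its added latency.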

No Progress on the Core Fix

The actual PostgreSQL code fix remains stalled. No new patch version, no movement on Michael Paquier's design concerns, and no concurrency test module.

History (1 prior analysis)

2026-05-14 · claude-opus-4-6

Incremental Update: Andrew Dunstan's Buildfarm Diagnostic Adjustment (2026-05-11)

Andrew Dunstan responds to Alexander Lakhin's report about continued failures on drongo and fairywren. The message is brief and primarily operational rather than introducing new technical arguments or patches.

What's New

  1. Buildfarm test loosening on drongo/fairywren: Andrew has further relaxed test constraints on these two Windows animals specifically to improve the chances of capturing the output logs (pg_upgrade_output.d/) when the failure occurs. This continues the effort to address the diagnostic visibility problem identified previously.

  2. Acknowledgment of a secondary bug in 004_subscription: Andrew notes that the test's check for the output directory is itself buggy ("The ok test for the output directory is wrong if there's a failure"): it apparently passes even when pg_upgrade has failed, masking the real error. He defers fixing this for now but flags it as a known issue.

  3. No patch progress: No new patch version, no resolution of Michael Paquier's design concerns (retry in md.c vs src/port/open.c), and no concurrency test module. The thread remains stalled on the actual fix.

Summary

This message is primarily an operational/infrastructure update from the buildfarm maintainer. It confirms the issue remains unresolved and that the immediate priority is getting better diagnostic output from the failing animals rather than committing a code fix.