Heads Up: cirrus-ci is shutting down June 1st

First seen: 2026-04-09 20:55:14+00:00 · Messages: 79 · Participants: 19

Latest Update

2026-06-04 · claude-opus-4-6

Incremental Update: v7a-v8 Patches, Test Performance Analysis, and Fork Security Resolution

Major Technical Progress

Andres's Large Incremental Patch (v7a series, June 1)

Andres posted a substantial rework on top of Bilal's v5, introducing numerous architectural changes:

  1. Container registry moved to GitHub (GHCR): Containers now hosted at ghcr.io/anarazel/pg-vm-images/main instead of GCP, making pulls faster and cheaper for GitHub Actions runners. Docs container split out separately; further splits (cross-building, 32-bit) deferred.

  2. Matrix strategy removed: Andres ultimately found the matrix approach too constraining and confusing. Replaced with heavier use of YAML anchors/aliases plus programmatic environment updates — a middle ground between full duplication and matrix rigidity.

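  3. Cancellation fix: always() in CompilerWarnings evaluates to true even when a run is cancelled, so the job kept running and prevented timely cancellation. Replaced with !cancelled(), which still lets the job run after an upstream failure. Also added !cancelled() to SanityCheck.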

  4. Concurrency group redesign: The simple branch-based concurrency was wrong for stable branches — it serialized runs rather than allowing parallel execution. Andres devised a conditional concurrency group using case() that gives each push to master/stable branches a unique group (via github.run_id), while feature branches still cancel superseded runs.

  5. MacPorts cache key fix: Removed run_id from cache key to prevent unnecessary re-uploads when nothing changed, preserving limited cache space.

  6. ccache improvements: Moved to YAML anchor for deduplication, used ${{github.job_id}} for automatic disambiguation, added immediate cache save after build (so cancelled builds still save).

  7. Cross-platform env vars: Converted most variable references to ${{env.varname}} syntax to enable code sharing between Windows and non-Windows tasks.

  8. Sanitizer optimization: Changed sanitizer builds to -O2 to reduce CPU cost at the expense of slower uncached builds. Removed obsolete :detect_stack_use_after_return=0.

  9. Segment size reduction commit: Added a commit reducing segment size in tests to 1MB, decreasing I/O intensity.

  10. Cirrus CI removal commit included.
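
The conditional concurrency group from item 4 can be modeled outside of YAML. The Python sketch below only illustrates the selection rule and is not the workflow's actual expression: the function name, the layout of the group string, and the rule that stable branches are named REL_*_STABLE are assumptions made for this example.

```python
def concurrency_group(workflow: str, ref_name: str, run_id: int) -> str:
    """Return the concurrency group key for one workflow run (illustrative model)."""
    # Assumption: long-lived branches are "master" plus REL_*_STABLE names.
    long_lived = ref_name == "master" or (
        ref_name.startswith("REL_") and ref_name.endswith("_STABLE")
    )
    if long_lived:
        # Unique key per run: no two runs share a group, so pushes to these
        # branches neither queue behind nor cancel one another.
        return f"{workflow}-{ref_name}-{run_id}"
    # Shared key per branch: a newer push lands in the same group, which is
    # what allows the superseded run to be cancelled.
    return f"{workflow}-{ref_name}"


# Two pushes to master get different groups; two pushes to a feature branch share one.
print(concurrency_group("pg-ci", "master", 1001))
print(concurrency_group("pg-ci", "master", 1002))
print(concurrency_group("pg-ci", "my-feature", 1003))
print(concurrency_group("pg-ci", "my-feature", 1004))
```

In this model the run id matters only for the long-lived branches. Feature branches deliberately ignore it so that successive pushes collide in one group.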

Windows-Specific Fixes (v8, June 3)

Peter's FreeBSD Test Migration (June 3)

Peter posted a patch moving previously-FreeBSD-only test configurations (running server checks, PG_TEST_PG_UPGRADE_MODE, debug options) to other platforms. Andres suggested moving them to linux-autoconf and linux-meson-32 instead of macOS, citing a nearly 2x speed advantage. Peter agreed.

Jakub Wartak's Deep Performance Analysis (June 1-2)

Jakub performed detailed local profiling revealing the test suite's fundamental bottleneck:

Andres responded that the segment size reduction commit should help Windows more than Linux (due to synchronous writes), and suggested WAIT FOR and adaptive poll_query_until() sleep times rather than connection pooling.
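
One way to read the "adaptive sleep" suggestion is a polling loop that starts with a very short sleep and backs off toward a cap, so quickly satisfied conditions are noticed early without busy-waiting on slow ones. The sketch below is in Python rather than the Perl of the test framework. poll_until, its parameters, and the specific delay values are hypothetical and not taken from the thread.

```python
import time
from typing import Callable


def poll_until(
    check: Callable[[], bool],
    timeout: float = 180.0,
    first_sleep: float = 0.001,
    max_sleep: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() until it returns True or the timeout expires."""
    deadline = clock() + timeout
    delay = first_sleep
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(delay)
        # Double the delay after every miss, but never exceed the cap.
        delay = min(delay * 2, max_sleep)


# Usage: a condition that becomes true on the fifth check. The sleep function
# is replaced by list.append so the chosen delays can be inspected.
calls = {"n": 0}
recorded = []


def ready() -> bool:
    calls["n"] += 1
    return calls["n"] >= 5


print(poll_until(ready, sleep=recorded.append), recorded)
```

With these assumed defaults, a condition that is ready after a few milliseconds is detected after a few milliseconds. A fixed sleep would round every wait up to one full interval.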

Álvaro's Instance Sharing Proposal

Álvaro proposed reducing test overhead by sharing PostgreSQL instances across test files — noting 14 Cluster->init() calls in src/bin/scripts/t/ where 12 could share one instance. Suggested a 000_create.pl pattern with ->init_from_environment(). This is a longer-term architectural improvement to the test framework.

Fork Security Resolution (June 3)

Jacob Champion raised the fork concern again more urgently. After live testing with Andres:

cfbot Already Working (June 2)

Thomas Munro confirmed cfbot is already showing results through the new GHA pipeline for this very thread's commitfest entry. Log URLs need adjustment and test analysis features need restoration, but the core pipeline is functional.

Workflow File Naming

Settled on pg-ci.yml (Peter's preference for a shorter name).

PR Trigger Removed

Jelte and Daniel Gustafsson confirmed: having both push and pull_request triggers causes duplicate runs when pushing to a PR-linked branch. Decision: push-only trigger.

Remaining Items Before Commit

  1. Merge Peter's runningcheck tests with Andres's approach (both posted implementations)
  2. cpan caching (identified as small project — currently 30s uncached every run)
  3. MacPorts/MinGW base cache in pg-vm-images for cfbot branch efficiency
  4. src/tools/ci/README update (in progress)
  5. Final commit message polishing

History (2 prior analyses)

2026-06-01 · claude-opus-4-6

Monthly Summary: Cirrus CI Shutdown and PostgreSQL CI Infrastructure Migration (May 2026)

Overview

With Cirrus CI's June 1, 2026 shutdown deadline looming, the PostgreSQL community moved from debate to action this month, culminating in an in-person unconference session at PGConf.dev 2026 that produced binding decisions and a working v2 patch for GitHub Actions migration.

The Crisis

Cirrus CI's shutdown threatens two critical workflows:

  • cfbot — automated CI on every commitfest patch submission
  • Personal repository CI — committers running CI on GitHub forks before pushing

The scale is significant: ~1,464 core-hours/day across all CI jobs, ~396 of which are Windows (expensive due to licensing), plus macOS on self-hosted runners.
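
To put those figures in perspective, here is a quick back-of-the-envelope calculation using only the numbers quoted above (the variable names are for illustration only):

```python
TOTAL_CORE_HOURS_PER_DAY = 1464
WINDOWS_CORE_HOURS_PER_DAY = 396

# Core-hours per day divided by 24 hours gives the average number of cores
# that would have to be busy around the clock to carry the load.
avg_busy_cores = TOTAL_CORE_HOURS_PER_DAY / 24
windows_share = WINDOWS_CORE_HOURS_PER_DAY / TOTAL_CORE_HOURS_PER_DAY

print(f"average cores busy around the clock: {avg_busy_cores:.0f}")  # 61
print(f"Windows share of core-hours: {windows_share:.0%}")           # 27%
```

In other words, the workload is equivalent to roughly 61 cores kept busy continuously, with a bit over a quarter of it on Windows.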

Key Decisions (PGConf.dev 2026 Unconference)

The community formally triaged the migration with these binding decisions:

  1. GitHub Actions confirmed as immediate path — all alternatives deferred
  2. BSDs explicitly dropped from v1 — ending the signal-vs-flakiness debate
  3. Backpatch to PG15 (where CI was first introduced)
  4. VS 2019 dropped — only VS 2022 going forward
  5. Public logs deferred — acknowledged as important but not blocking

This represents a "ship working-but-reduced CI first, iterate later" strategy.

Technical Implementation

v2 Patch (Bilal Yavuz)

Bilal produced a v2 patch merging Jelte Fennema-Nio's container-based approach with his own parallel work:

  • Docker containers for Linux: Chosen over raw runner installation for future-proofing — enables custom images and up-to-date Debian rather than being locked to Ubuntu 24.04
  • io_uring working: Confirmed functional on Linux Meson task via prlimit --memlock=unlimited workaround
  • Helper script abstraction deferred: All commands in one file for debugging simplicity in v1
  • Debug options relocated: With FreeBSD removed, extra debug initdb options (debug_copy_parse_plan_trees, debug_write_read_parse_plan_trees, debug_raw_expression_coverage_test, debug_parallel_query=regress) moved to the macOS task
  • Green CI run demonstrated: https://github.com/nbyavuz/postgres/actions/runs/26398508250

Initial Jelte Patch (May 18)

Jelte Fennema-Nio produced the first working GitHub Actions workflow (with Claude Code assistance) achieving green builds across all previously-supported platforms, including BSDs via cross-platform-actions/action (QEMU nested virtualization). Described explicitly as AI-generated with cursory review, reflecting deadline urgency.

Longer-term Architecture Discussion

Self-hosted Open Source CI

Multiple participants advocate for eventually running self-hosted CI:

  • Woodpecker CI (David Wheeler) — Go-based, Forgejo-integrated
  • QEMU-based universal image infrastructure (Thomas Munro) — Publishing standardized images at ci.postgresql.org/images/qemu/ that decouple "what to test" from "where to test"
  • Sponsored cloud + open source CI (Alexander Korotkov)

Resource Optimization

Robert Haas raised efficiency concerns (cfbot runs 14 CI cycles on a 6-line patch). Proposed heuristics:

  • Tom Lane: Run promptly on new submissions, reduce periodic re-testing
  • Michael Paquier: Reduce frequency for small patches
  • Euler Taveira: Allow manual triggering instead of automatic re-runs

Unresolved Issues

  1. Public log access: GitHub Actions logs require authentication — regression from Cirrus CI's public URLs
  2. cfbot integration: Thomas Munro's domain, not yet addressed in patches
  3. Image pipeline: pg-vm-images currently targets GCP; needs adaptation
  4. Cost model: Full workload on GitHub Actions would be expensive without sponsorship
  5. Placement of the PG_TEST_PG_UPGRADE_MODE: --link setting: Previously FreeBSD-specific, needs a new home
  6. Backpatching strategy: Deferred until HEAD version is committed

Key Debates

| Topic | Position A | Position B |
|---|---|---|
| Self-hosted vs. Proprietary | Self-hosted is critical for independence (Bruce Momjian, Peter Eisentraut) | GitHub will outlive underfunded OSS CI (Jelte); use both in layers (Thomas Munro) |
| BSD Support | BSDs catch unique issues (Bilal Yavuz) | Signal-to-flakiness ratio too low (Jelte); formally deferred by unconference |
| Who Should Pay | Active contributors can pay for own CI (Heikki Linnakangas) | Priority: lower barriers for new contributors (Peter Eisentraut) |

2026-06-01 · claude-opus-4-6

Incremental Update: v3-v5 Patches, Performance Analysis, and Security Concerns

Major Technical Progress

v3 Patch: Matrix-based Linux Tasks and MacPorts Migration

Bilal produced v3 addressing Andres's detailed code review of v2. Key architectural changes:

  1. Linux tasks merged via matrix: All Linux CI tasks (Meson 64-bit, Meson 32-bit, Autoconf) now use a GitHub Actions matrix strategy instead of fully duplicated job definitions. This addresses Andres's concern about unmaintainability from repetition, though Peter Eisentraut later notes the matrix form is "quite confusing" — leaving this as an unresolved design tension.

  2. MacPorts replaces Homebrew on macOS: Bilal switched back to MacPorts (matching the previous Cirrus setup) with caching. First run takes ~10 minutes, subsequent runs ~5 seconds. This resolves the "heavy bootstrap" concern Andres raised about Homebrew's ~95s uncached install time.

  3. YAML anchor limitations acknowledged: GitHub Actions supports YAML anchors but NOT merge keys (<<: *anchor), severely limiting deduplication. Bilal uses anchors only for the checkout step and log paths — a fundamental platform limitation that forces more repetition than ideal.

  4. Concurrency replaces custom cancel: The custom "Cancel previous runs" step (originally AI-generated by Claude Code in Jelte's patch) is removed in favor of GitHub Actions' native concurrency feature.

  5. ccache strategy clarified: Uses run_id in the cache key to force a fresh cache save on every run (an exact key match restores the existing entry and skips the save, so a fixed key would never be updated). Bilal proposes a restore/save split pattern using branch names for cross-branch cache population, which Andres had requested for cfbot efficiency.
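
The reasoning behind putting run_id in the ccache key (item 5) can be illustrated with a toy model of a cache whose entries are immutable once written and which falls back to prefix matching on restore. The Python class below is a simplification written for this summary, not GitHub's implementation. The class, method, and key names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CacheStore:
    # (key, payload) pairs in creation order, oldest first.
    entries: list = field(default_factory=list)

    def restore(self, key: str, restore_keys: tuple = ()):
        """Return the payload for an exact key hit, else the newest prefix match."""
        for k, payload in self.entries:
            if k == key:
                return payload
        for prefix in restore_keys:
            matches = [p for k, p in self.entries if k.startswith(prefix)]
            if matches:
                return matches[-1]
        return None

    def save(self, key: str, payload: str) -> bool:
        """Store a new entry; refuse to overwrite an existing key."""
        if any(k == key for k, _ in self.entries):
            return False
        self.entries.append((key, payload))
        return True


# Fixed key: the second run cannot replace the entry, so the cache goes stale.
fixed = CacheStore()
fixed.save("ccache-linux", "objects from run 101")
print(fixed.save("ccache-linux", "objects from run 102"))  # False

# Key with a run id plus a prefix fallback: every run restores the newest
# entry and then saves its own.
rolling = CacheStore()
rolling.save("ccache-linux-101", "objects from run 101")
print(rolling.restore("ccache-linux-102", ("ccache-linux-",)))
rolling.save("ccache-linux-102", "objects from run 102")
print(rolling.restore("ccache-linux-103", ("ccache-linux-",)))
```

The cost of the rolling scheme is that every run writes a new entry. That is why the same thread treats cache space as a limited resource elsewhere.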

v4 Patch: Security Hardening and MinGW Optimization

  • PYTHONHOME override removed: Jacob Champion flagged that globally overriding PYTHONHOME would cause future pain. Bilal instead strips Mercurial from PATH — a more targeted fix.
  • Third-party action pinning: Jacob raised Scorecard compliance concerns about the unpinned msys2/setup-msys2@v2 action. Bilal's solution: eliminate the action entirely, using pacman directly from the pre-installed MSYS2.
  • MinGW D: drive workaround: GitHub Actions' C: drive is significantly slower than D: drive (a known runner-images issue). Moving MSYS2 to D: saves ~13 minutes on Windows/MinGW builds (35m → 22m).

v5 Patch: Peter's Tweaks and Crash Log Fix

  • Applied Peter Eisentraut's minor fixups
  • Removed non-functional Windows crash log collection (the registry settings from pg-vm-images that enable crash dumps aren't present on GHA runners; cdb.exe is available but needs configuration work)

Performance Deep-Dive: Why GHA is Slower

Andres performed CPU utilization analysis and corrected his earlier hypothesis:

  • Not primarily I/O bound (as initially suspected) — CPU utilization is ~100% during test runs
  • The real issue: GHA 4-core runners provide 4 hardware threads on 2 physical cores (SMT), whereas Cirrus provided 4 non-SMT cores. Effectively half the compute capacity.
  • Result: Álvaro's first run took 3 hours 3 minutes total; subsequent cached runs would be faster but still notably slower than Cirrus

Jakub Wartak independently measured test I/O on Linux: 17.1 GB written for a single test run (275k write ops). By tuning dirty page writeback settings (dirty_ratio=50, dirty_expire_centisecs=60000), he reduced this to 5.55 GB written — demonstrating that VM tuning could help on self-hosted runners but isn't available on GHA.
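
The size of that effect is easy to check from the quoted figures. The helper below is just arithmetic on numbers already stated in this summary; the function name is made up for the example.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


written_default_gb = 17.1   # data written with default writeback settings
written_tuned_gb = 5.55     # data written with the tuned dirty_* settings

print(f"{relative_reduction(written_default_gb, written_tuned_gb):.1%}")  # 67.5%
print(f"{written_default_gb / written_tuned_gb:.2f}x")                    # 3.08x

# The same helper applied to the MinGW build times quoted earlier (35m -> 22m).
print(f"{relative_reduction(35, 22):.1%}")                                # 37.1%
```

So the writeback tuning cut written data by about two thirds, and the D: drive move shortened the MinGW job by a bit more than a third.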

Andres notes that PostgreSQL's test suite creates ~36GB of data directories per full run, with many short-lived data directories for sub-second tests — an unfavorable ratio suggesting the test infrastructure itself needs optimization.

Security Discussion: Fork Workflow Implications

A substantive debate emerged about shipping a workflow that auto-runs on all 5,700+ downstream forks:

  • Jacob Champion's concern: GitHub's claimed protection against upstream workflow changes propagating to forks appears broken (referencing community discussions). A leaked actions: write token from the workflow could approve pending workflow runs on forks with permissive default settings.
  • Andres's position: The benefit of easy fork CI (many contributors didn't realize they could run CI individually) outweighs the risk. Fork owners who add their own workflows or blindly run PRs from unknown contributors already have security problems independent of PostgreSQL's workflow.
  • Unresolved: Whether to add an opt-in mechanism (e.g., requiring a repo-level environment variable), which Andres considers "crufty."

Commit Timeline Clarification

Peter Eisentraut notes there's a master branch freeze until beta is tagged — the patch likely can't be committed until late Monday or Tuesday (June 2-3), meaning it will slightly miss the June 1 Cirrus shutdown but be committed very shortly after.

Andres's Backpatching Strategy

Andres proposes: commit to master only first, run cfbot for several days to identify stability issues, then backpatch to older branches. This avoids noisy multi-branch backpatches for each stability fix discovered in the initial period.

Remaining Open Items for Commit

  1. Matrix vs. duplicated jobs for Linux tasks (readability vs. DRY)
  2. Windows crash dump collection (needs registry configuration work)
  3. Third-party action pinning strategy for long-term maintenance
  4. Fork security documentation/mitigation
  5. Credits and expanded commit message (Peter's request)