First seen: 2026-04-09 20:55:14+00:00 · Messages: 79 · Participants: 19
2026-06-04 · claude-opus-4-6
Incremental Update: v7a-v8 Patches, Test Performance Analysis, and Fork Security Resolution
Major Technical Progress
Andres's Large Incremental Patch (v7a series, June 1)
Andres posted a substantial rework on top of Bilal's v5, introducing numerous architectural changes:
- Container registry moved to GitHub (GHCR): Containers now hosted at ghcr.io/anarazel/pg-vm-images/main instead of GCP, making pulls faster and cheaper for GitHub Actions runners. Docs container split out separately; further splits (cross-building, 32-bit) deferred.
- Matrix strategy removed: Andres ultimately found the matrix approach too constraining and confusing. Replaced with heavier use of YAML anchors/aliases plus programmatic environment updates, a middle ground between full duplication and matrix rigidity.
- Cancellation fix: always() in CompilerWarnings was preventing timely job cancellation. Replaced with !cancelled(). Also added !cancelled() to SanityCheck.
- Concurrency group redesign: The simple branch-based concurrency was wrong for stable branches because it serialized runs rather than allowing parallel execution. Andres devised a conditional concurrency group using case() that gives each push to master/stable branches a unique group (via github.run_id), while feature branches still cancel superseded runs.
- MacPorts cache key fix: Removed run_id from cache key to prevent unnecessary re-uploads when nothing changed, preserving limited cache space.
- ccache improvements: Moved to YAML anchor for deduplication, used ${{github.job_id}} for automatic disambiguation, added immediate cache save after build (so cancelled builds still save).
- Cross-platform env vars: Converted most variable references to ${{env.varname}} syntax to enable code sharing between Windows and non-Windows tasks.
- Sanitizer optimization: Changed sanitizer builds to -O2 to reduce CPU cost at the expense of slower uncached builds. Removed obsolete :detect_stack_use_after_return=0.
- Segment size reduction commit: Added a commit reducing segment size in tests to 1MB, decreasing I/O intensity.
- Cirrus CI removal commit included.
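The conditional concurrency grouping described in the list above can be modeled in a few lines of Python. This is only a sketch of the decision logic, not the workflow expression from the patch: the function name, the group-name format, and the `REL_<n>_STABLE` branch pattern are assumptions made for illustration.

```python
import re

# Assumed naming pattern for stable branches, e.g. REL_17_STABLE.
STABLE_BRANCH = re.compile(r"^REL_\d+_STABLE$")


def concurrency_group(workflow: str, branch: str, run_id: int) -> str:
    """Return the concurrency group a run would join (hypothetical naming)."""
    if branch == "master" or STABLE_BRANCH.match(branch):
        # Unique group per run: pushes to these branches never cancel
        # or queue behind each other.
        return f"{workflow}-{branch}-{run_id}"
    # Shared group per branch: a newer push supersedes the older run.
    return f"{workflow}-{branch}"
```

Two pushes to master get different groups and therefore run in parallel, while two pushes to the same feature branch share a group, so the older run can be cancelled.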
Windows-Specific Fixes (v8, June 3)
- Chocolatey eliminated: Replaced with msys/pacman for bison, flex, diff — 10x faster and more reliable (chocolatey was causing intermittent failures).
- OpenSSL: Stopped installing OpenSSL 1.1 manually; using pre-installed OpenSSL 3.6 on GitHub runners instead.
- Error handling: Added missing error handling since Windows shell lacks set -e behavior.
- Test slicing: Split Windows tests across two jobs using meson test --slice, reducing wall-clock time from ~33min to ~23min at the cost of one additional runner slot.
- Runner pinned: Changed from ubuntu-latest to ubuntu-24.04 to prevent breakage from automatic upgrades.
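The test-slicing trade-off can be made explicit with a little arithmetic. The sketch below assumes both slices run for roughly the reported ~23 minutes; the thread only gives the overall wall-clock figures, so the extra runner time is an upper bound.

```python
def slicing_tradeoff(unsliced_min, sliced_wall_min, jobs):
    """Compare wall-clock savings against total runner time.

    Assumes every slice runs for about `sliced_wall_min` minutes, which
    overstates the cost if one slice finishes earlier than the other.
    """
    wall_saved = unsliced_min - sliced_wall_min
    runner_minutes = sliced_wall_min * jobs
    extra_runner_minutes = runner_minutes - unsliced_min
    return wall_saved, extra_runner_minutes


wall_saved, extra = slicing_tradeoff(33, 23, jobs=2)
print(f"wall-clock saved: ~{wall_saved} min, extra runner time: up to ~{extra} min")
```

That ~23 minutes is well over half of ~33 suggests a fixed per-job cost (setup and build, presumably) that slicing the tests does not divide.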
Peter's FreeBSD Test Migration (June 3)
Peter posted a patch moving previously-FreeBSD-only test configurations (running server checks, PG_TEST_PG_UPGRADE_MODE, debug options) to other platforms. Andres suggested moving them to linux-autoconf and linux-meson-32 instead of macOS, citing nearly 2x speed advantage. Peter agreed.
Jakub Wartak's Deep Performance Analysis (June 1-2)
Jakub performed detailed local profiling revealing the test suite's fundamental bottleneck:
- EDR agent (falcon-sensor) was consuming massive CPU via eBPF — removing it dropped test time from 65s to 43s
- Backend startup dominates: "Backend startup costs roughly 2.5x as much as the actual queries" — 24,903 connections in 46s = 541 forks/second
- Subscription tests worst offender: 8,610 connections (35% of all) from subscription tests alone
- XFS reflinks help I/O but not duration (CPU-bound anyway)
- Windows Defender: Despite Set-MpPreference -DisableRealtimeMonitoring, Defender remained active due to TamperProtection, which requires manual GUI intervention (not applicable to GHA, which already has it disabled)
- Windows storage: IsPowerProtected=False, IsDeviceCacheEnabled=False on GHA runners means all writes are synchronous, a significant performance penalty with no obvious workaround
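The connection figures in the list above can be checked directly. The inputs are the numbers quoted from Jakub's profiling; only the variable names are introduced here.

```python
total_connections = 24_903
elapsed_seconds = 46
subscription_connections = 8_610

forks_per_second = total_connections / elapsed_seconds
subscription_share = subscription_connections / total_connections

print(f"{forks_per_second:.0f} forks/second")                   # 541 forks/second
print(f"{subscription_share:.1%} from subscription tests")      # 34.6% from subscription tests
```

Both results agree with the rounded values reported in the thread (541 forks/second, about 35%).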
Andres responded that the segment size reduction commit should help Windows more than Linux (due to synchronous writes), and suggested WAIT FOR and adaptive poll_query_until() sleep times rather than connection pooling.
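One way to picture an adaptive poll sleep is a wait loop whose delay starts very short and doubles up to a cap. The sketch below is in Python, not the Perl of the real test framework, and is not the actual poll_query_until() implementation: the function name, parameters, and timing constants are all hypothetical.

```python
import time
from typing import Callable


def poll_until(check: Callable[[], bool],
               timeout: float = 180.0,
               initial_sleep: float = 0.001,
               max_sleep: float = 0.1,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `check` with a sleep that starts short and doubles up to a cap."""
    deadline = clock() + timeout
    delay = initial_sleep
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(delay)
        # Back off: conditions that become true quickly are noticed quickly,
        # slow ones are polled less and less often.
        delay = min(delay * 2, max_sleep)
```

With a fixed sleep interval, either fast conditions wait longer than needed or slow conditions are polled far too often; the doubling delay is a compromise between the two.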
Álvaro's Instance Sharing Proposal
Álvaro proposed reducing test overhead by sharing PostgreSQL instances across test files — noting 14 Cluster->init() calls in src/bin/scripts/t/ where 12 could share one instance. Suggested a 000_create.pl pattern with ->init_from_environment(). This is a longer-term architectural improvement to the test framework.
Fork Security Resolution (June 3)
Jacob Champion raised the fork concern again more urgently. After live testing with Andres:
- Finding: New forks start with workflows disabled. But existing forks that sync/pull after the workflow is added will have it enabled (Andres initially typo'd "disabled" then corrected to "enabled").
- Jacob's concrete proposal: Bail out of setup if repository isn't postgres/postgres for the initial commit, then evaluate in practice.
- Andres's counter: Requiring per-repo opt-in would recreate the discoverability problem from the Cirrus era, when many contributors didn't know they could run CI. He added README documentation on how to disable the workflow instead.
- Practical constraint: An opt-out mechanism is needed regardless — cfbot needs it, and Andres needs it for testing before merge.
- Outcome: Andres added a section about disabling workflows to src/tools/ci/README but did not add an automatic skip for non-postgres/postgres repos.
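For reference, the check Jacob proposed (and which was not adopted) amounts to comparing the repository name against the upstream one. A minimal sketch follows, assuming the check runs in a script step where GitHub Actions has set GITHUB_REPOSITORY to "owner/name"; the function and constant names are hypothetical.

```python
import os

UPSTREAM_REPOSITORY = "postgres/postgres"


def should_run_ci(environ=os.environ) -> bool:
    """Return True only when running in the upstream repository.

    GITHUB_REPOSITORY is set by GitHub Actions to "owner/name".
    """
    return environ.get("GITHUB_REPOSITORY") == UPSTREAM_REPOSITORY
```

As the discussion notes, a guard like this could not stand alone: cfbot and pre-merge testing both run outside postgres/postgres and would need a way around it.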
cfbot Already Working (June 2)
Thomas Munro confirmed cfbot is already showing results through the new GHA pipeline for this very thread's commitfest entry. Log URLs need adjustment and test analysis features need restoration, but the core pipeline is functional.
Workflow File Naming
Settled on pg-ci.yml (Peter's preference for shorter name).
PR Trigger Removed
Jelte and Daniel Gustafsson confirmed: having both push and pull_request triggers causes duplicate runs when pushing to a PR-linked branch. Decision: push-only trigger.
Remaining Items Before Commit
- Merge Peter's runningcheck tests with Andres's approach (both posted implementations)
- cpan caching (identified as small project — currently 30s uncached every run)
- MacPorts/MinGW base cache in pg-vm-images for cfbot branch efficiency
- src/tools/ci/README update (in progress)
- Final commit message polishing
History (2 prior analyses)
2026-06-01 · claude-opus-4-6
Monthly Summary: Cirrus CI Shutdown and PostgreSQL CI Infrastructure Migration (May 2026)
Overview
With Cirrus CI's June 1, 2026 shutdown deadline looming, the PostgreSQL community moved from debate to action this month, culminating in an in-person unconference session at PGConf.dev 2026 that produced binding decisions and a working v2 patch for GitHub Actions migration.
The Crisis
Cirrus CI's shutdown threatens two critical workflows:
- cfbot — automated CI on every commitfest patch submission
- Personal repository CI — committers running CI on GitHub forks before pushing
The scale is significant: ~1,464 core-hours/day across all CI jobs, ~396 of which are Windows (expensive due to licensing), plus macOS on self-hosted runners.
Key Decisions (PGConf.dev 2026 Unconference)
The community formally triaged the migration with these binding decisions:
- GitHub Actions confirmed as immediate path — all alternatives deferred
- BSDs explicitly dropped from v1 — ending the signal-vs-flakiness debate
- Backpatch to PG15 (where CI was first introduced)
- VS 2019 dropped — only VS 2022 going forward
- Public logs deferred — acknowledged as important but not blocking
This represents a "ship working-but-reduced CI first, iterate later" strategy.
Technical Implementation
v2 Patch (Bilal Yavuz)
Bilal produced a v2 patch merging Jelte Fennema-Nio's container-based approach with his own parallel work:
- Docker containers for Linux: Chosen over raw runner installation for future-proofing — enables custom images and up-to-date Debian rather than being locked to Ubuntu 24.04
- io_uring working: Confirmed functional on Linux Meson task via prlimit --memlock=unlimited workaround
- Helper script abstraction deferred: All commands in one file for debugging simplicity in v1
- Debug options relocated: With FreeBSD removed, extra debug initdb options (debug_copy_parse_plan_trees, debug_write_read_parse_plan_trees, debug_raw_expression_coverage_test, debug_parallel_query=regress) moved to the macOS task
- Green CI run demonstrated: https://github.com/nbyavuz/postgres/actions/runs/26398508250
Initial Jelte Patch (May 18)
Jelte Fennema-Nio produced the first working GitHub Actions workflow (with Claude Code assistance) achieving green builds across all previously-supported platforms, including BSDs via cross-platform-actions/action (QEMU nested virtualization). Described explicitly as AI-generated with cursory review, reflecting deadline urgency.
Longer-term Architecture Discussion
Self-hosted Open Source CI
Multiple participants advocate for eventually running self-hosted CI:
- Woodpecker CI (David Wheeler): Go-based, Forgejo-integrated
- QEMU-based universal image infrastructure (Thomas Munro): Publishing standardized images at ci.postgresql.org/images/qemu/ that decouple "what to test" from "where to test"
- Sponsored cloud + open source CI (Alexander Korotkov)
Resource Optimization
Robert Haas raised efficiency concerns (cfbot runs 14 CI cycles on a 6-line patch). Proposed heuristics:
- Tom Lane: Run promptly on new submissions, reduce periodic re-testing
- Michael Paquier: Reduce frequency for small patches
- Euler Taveira: Allow manual triggering instead of automatic re-runs
Unresolved Issues
- Public log access: GitHub Actions logs require authentication — regression from Cirrus CI's public URLs
- cfbot integration: Thomas Munro's domain, not yet addressed in patches
- Image pipeline: pg-vm-images currently targets GCP; needs adaptation
- Cost model: Full workload on GitHub Actions would be expensive without sponsorship
- PG_TEST_PG_UPGRADE_MODE: --link placement: Previously FreeBSD-specific, needs new home
- Backpatching strategy: Deferred until HEAD version is committed
Key Debates
| Topic | Position A | Position B |
| Self-hosted vs. Proprietary | Self-hosted is critical for independence (Bruce Momjian, Peter Eisentraut) | GitHub will outlive underfunded OSS CI (Jelte); Use both in layers (Thomas Munro) |
| BSD Support | BSDs catch unique issues (Bilal Yavuz) | Signal-to-flakiness ratio too low (Jelte); formally deferred by unconference |
| Who Should Pay | Active contributors can pay for own CI (Heikki Linnakangas) | Priority: lower barriers for new contributors (Peter Eisentraut) |
2026-06-01 · claude-opus-4-6
Incremental Update: v3-v5 Patches, Performance Analysis, and Security Concerns
Major Technical Progress
v3 Patch: Matrix-based Linux Tasks and MacPorts Migration
Bilal produced v3 addressing Andres's detailed code review of v2. Key architectural changes:
- Linux tasks merged via matrix: All Linux CI tasks (Meson 64-bit, Meson 32-bit, Autoconf) now use a GitHub Actions matrix strategy instead of fully duplicated job definitions. This addresses Andres's concern about unmaintainability from repetition, though Peter Eisentraut later notes the matrix form is "quite confusing", leaving this as an unresolved design tension.
- MacPorts replaces Homebrew on macOS: Bilal switched back to MacPorts (matching the previous Cirrus setup) with caching. First run takes ~10 minutes, subsequent runs ~5 seconds. This resolves the "heavy bootstrap" concern Andres raised about Homebrew's ~95s uncached install time.
- YAML anchor limitations acknowledged: GitHub Actions supports YAML anchors but NOT merge keys (<<: *anchor), severely limiting deduplication. Bilal uses anchors only for the checkout step and log paths, a fundamental platform limitation that forces more repetition than ideal.
- Concurrency replaces custom cancel: The custom "Cancel previous runs" step (originally AI-generated by Claude Code in Jelte's patch) is removed in favor of GitHub Actions' native concurrency feature.
- ccache strategy clarified: Uses run_id in the cache key to force cache refresh every run (since exact-match prevents updates). Bilal proposes a restore/save split pattern using branch names for cross-branch cache population, which Andres had requested for cfbot efficiency.
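The restore/save split for ccache can be sketched as a small model of cache-key lookup. It relies on the documented GitHub Actions cache behavior that restore keys match by prefix and the most recently created match wins; the key format, the function names, and the fallback to master are assumptions for illustration, not the keys used in the patch.

```python
from typing import List, Optional


def save_key(job: str, branch: str, run_id: int) -> str:
    """Unique per run, so every run can upload a fresh cache entry."""
    return f"ccache-{job}-{branch}-{run_id}"


def restore_prefixes(job: str, branch: str, base_branch: str = "master") -> List[str]:
    """Prefixes tried in order: the same branch first, then the base branch."""
    return [f"ccache-{job}-{branch}-", f"ccache-{job}-{base_branch}-"]


def restore(existing: List[str], prefixes: List[str]) -> Optional[str]:
    """Return the newest entry matching the first prefix that has any match.

    `existing` is ordered from oldest to newest.
    """
    for prefix in prefixes:
        matches = [key for key in existing if key.startswith(prefix)]
        if matches:
            return matches[-1]
    return None


existing = [save_key("linux-meson", "master", 100),
            save_key("linux-meson", "master", 101)]
# A brand-new branch has no cache of its own and falls back to master's newest entry.
print(restore(existing, restore_prefixes("linux-meson", "cf/1234")))
```

Saving under a per-run key while restoring by prefix is what lets a new cfbot branch start from a warm cache built on another branch.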
v4 Patch: Security Hardening and MinGW Optimization
- PYTHONHOME override removed: Jacob Champion flagged that globally overriding PYTHONHOME would cause future pain. Bilal instead strips Mercurial from PATH — a more targeted fix.
- Third-party action pinning: Jacob raised Scorecard compliance concerns about the unpinned msys2/setup-msys2@v2 action. Bilal's solution: eliminate the action entirely, using pacman directly from the pre-installed MSYS2.
- MinGW D: drive workaround: GitHub Actions' C: drive is significantly slower than D: drive (a known runner-images issue). Moving MSYS2 to D: saves ~13 minutes on Windows/MinGW builds (35m → 22m).
v5 Patch: Peter's Tweaks and Crash Log Fix
- Applied Peter Eisentraut's minor fixups
- Removed non-functional Windows crash log collection (the registry settings from pg-vm-images that enable crash dumps aren't present on GHA runners; cdb.exe is available but needs configuration work)
Performance Deep-Dive: Why GHA is Slower
Andres performed CPU utilization analysis and corrected his earlier hypothesis:
- Not primarily I/O bound (as initially suspected) — CPU utilization is ~100% during test runs
- The real issue: GHA 4-core runners provide 4 hardware threads on 2 physical cores (SMT), whereas Cirrus provided 4 non-SMT cores. Effectively half the compute capacity.
- Result: Álvaro's first run took 3 hours 3 minutes total; subsequent cached runs would be faster but still notably slower than Cirrus
Jakub Wartak independently measured test I/O on Linux: 17.1 GB written for a single test run (275k write ops). By tuning dirty page writeback settings (dirty_ratio=50, dirty_expire_centisecs=60000), he reduced this to 5.55 GB written — demonstrating that VM tuning could help on self-hosted runners but isn't available on GHA.
Andres notes that PostgreSQL's test suite creates ~36GB of data directories per full run, with many short-lived data directories for sub-second tests — an unfavorable ratio suggesting the test infrastructure itself needs optimization.
Security Discussion: Fork Workflow Implications
A substantive debate emerged about shipping a workflow that auto-runs on all 5,700+ downstream forks:
- Jacob Champion's concern: GitHub's claimed protection against upstream workflow changes propagating to forks appears broken (referencing community discussions). A leaked actions: write token from the workflow could approve pending workflow runs on forks with permissive default settings.
- Andres's position: The benefit of easy fork CI (many contributors didn't realize they could run CI individually) outweighs the risk. Fork owners who add their own workflows or blindly run PRs from unknown contributors already have security problems independent of PostgreSQL's workflow.
- Unresolved: Whether to add an opt-in mechanism (e.g., requiring a repo-level environment variable), which Andres considers "crufty."
Commit Timeline Clarification
Peter Eisentraut notes there's a master branch freeze until beta is tagged — the patch likely can't be committed until late Monday or Tuesday (June 2-3), meaning it will slightly miss the June 1 Cirrus shutdown but be committed very shortly after.
Andres's Backpatching Strategy
Andres proposes: commit to master only first, run cfbot for several days to identify stability issues, then backpatch to older branches. This avoids noisy multi-branch backpatches for each stability fix discovered in the initial period.
Remaining Open Items for Commit
- Matrix vs. duplicated jobs for Linux tasks (readability vs. DRY)
- Windows crash dump collection (needs registry configuration work)
- Third-party action pinning strategy for long-term maintenance
- Fork security documentation/mitigation
- Credits and expanded commit message (Peter's request)