Init Connection Time Grows Quadratically: Technical Analysis
Core Problem
The thread investigates a surprising performance characteristic: PostgreSQL's initial connection time (ICT) — the time from pgbench process start until all N connections are established — grows quadratically (O(n²)) with the number of clients, rather than the expected linear O(n) scaling.
The measurements on REL_16_STABLE demonstrate this clearly:
- 1024 clients: ~435ms
- 2048 clients: ~1062ms (expected ~870ms if linear)
- 4096 clients: ~3284ms (expected ~1740ms if linear)
- 8192 clients: ~11617ms
- 16384 clients: ~43391ms
The curve fits roughly y ≈ 0.0002x² (y in milliseconds, x in clients), indicating that each additional connection becomes progressively more expensive as total connections increase.
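One quick way to sanity-check the quadratic claim is to look at how ICT changes each time the client count doubles: a ratio near 2 would suggest linear growth, a ratio near 4 quadratic growth. The sketch below only reuses the five measurements quoted above; the array and function names are illustrative and not taken from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Measurements quoted above for REL_16_STABLE: client count and ICT in ms. */
static const int    ict_clients[] = { 1024, 2048, 4096, 8192, 16384 };
static const double ict_ms[]      = { 435.0, 1062.0, 3284.0, 11617.0, 43391.0 };

#define ICT_POINTS (sizeof(ict_ms) / sizeof(ict_ms[0]))

/*
 * Ratio of ICT at step i + 1 to ICT at step i. The client count doubles
 * between steps, so ~2 points to linear scaling and ~4 to quadratic.
 */
static double ict_doubling_ratio(size_t i)
{
    assert(i + 1 < ICT_POINTS);
    assert(ict_clients[i + 1] == 2 * ict_clients[i]);
    return ict_ms[i + 1] / ict_ms[i];
}
```

For the four doublings the ratios come out at roughly 2.4, 3.1, 3.5, and 3.7, climbing toward 4 as the client count grows, which is what one would expect when a quadratic term gradually dominates a linear one.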
Architectural Context
The PGPROC Array and ProcArrayAdd
PostgreSQL maintains a shared-memory array of PGPROC structures — one per backend process. Each PGPROC is a relatively large structure containing transaction state, lock information, and various metadata. When a new backend connects, ProcArrayAdd() is called to register it in the global process array.
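Critically, each PGPROC has a field pgxactoff that stores the backend's offset in the ProcArray. ProcArrayAdd keeps the array sorted by process number, so after inserting the new entry it renumbers every entry that follows it, executing this assignment once per shifted entry: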
allProcs[procno].pgxactoff = index;
This seemingly simple assignment becomes pathological at scale because:
- PGPROC structures are large (hundreds of bytes each), so a 4KB page holds only a handful of them and consecutive entries quickly land on different pages
- Process numbers (procno) are assigned non-sequentially after initial warmup — backends that disconnect and reconnect get reused slots scattered across the array
- With huge_pages=off, the standard 4KB page size means the PGPROC array spans thousands of pages
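A toy model can show how a per-connection cost of this kind adds up to a quadratic total. The sketch assumes that ProcArrayAdd keeps a sorted array of process numbers and rewrites pgxactoff for every entry from the insertion point onward. It is not PostgreSQL code: ToyProc, ToyProcArray, and both functions are hypothetical, the 800-byte struct size is an assumption, and handing out process numbers in descending order is chosen only to force the worst case where every insert lands at the front.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_PROCS 4096

/* Hypothetical stand-in for PGPROC; the 800-byte size is an assumption. */
typedef struct ToyProc
{
    int  pgxactoff;
    char other_state[796];
} ToyProc;

/* Hypothetical stand-in for the sorted array of process numbers. */
typedef struct ToyProcArray
{
    int numProcs;
    int pgprocnos[TOY_MAX_PROCS];
} ToyProcArray;

static ToyProc toy_all_procs[TOY_MAX_PROCS];

/* Insert procno in sorted position; return how many pgxactoff fields were written. */
static long toy_proc_array_add(ToyProcArray *arr, int procno)
{
    int  index = 0;
    long writes = 0;

    assert(arr->numProcs < TOY_MAX_PROCS);
    assert(procno >= 0 && procno < TOY_MAX_PROCS);

    /* Find the first entry with a larger procno. */
    while (index < arr->numProcs && arr->pgprocnos[index] < procno)
        index++;

    /* Shift the tail up by one slot and place the new entry. */
    memmove(&arr->pgprocnos[index + 1], &arr->pgprocnos[index],
            (size_t) (arr->numProcs - index) * sizeof(int));
    arr->pgprocnos[index] = procno;
    arr->numProcs++;

    /* Renumber the new entry and everything after it: one struct write each. */
    for (; index < arr->numProcs; index++)
    {
        toy_all_procs[arr->pgprocnos[index]].pgxactoff = index;
        writes++;
    }
    return writes;
}

/* Total pgxactoff writes for n connects, with procnos handed out high-to-low or low-to-high. */
static long toy_total_writes(int n, int descending)
{
    static ToyProcArray arr;
    long total = 0;

    assert(n >= 0 && n <= TOY_MAX_PROCS);
    arr.numProcs = 0;
    for (int i = 0; i < n; i++)
        total += toy_proc_array_add(&arr, descending ? n - 1 - i : i);
    return total;
}
```

In this model, front inserts cost n(n+1)/2 writes in total, so going from 1024 to 2048 connections raises the count from 524800 to 2098176, about four times as many. Appending in ascending order costs one write per connect. Where a real server falls between those extremes depends on how slots are handed out, which the toy does not try to reproduce.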
The Degradation Pattern
Maksim Melnikov's investigation reveals a crucial secondary finding: ICT degrades across repeated pgbench iterations without server restart, eventually stabilizing at worse values than the first run. This is a classic symptom of TLB thrashing and page fault amplification:
- First iteration: PGPROC slots are allocated sequentially, so pgxactoff writes hit consecutive memory pages, giving good spatial locality
- Subsequent iterations: After backends disconnect and reconnect, PGPROC slot reuse creates a random access pattern across the array
- Minor page faults: Each access to a scattered allProcs[procno].pgxactoff can trigger a minor page fault when the page holding it is resident but not yet mapped in the accessing backend's page table; a TLB miss on its own only costs a page-table walk, so scattered accesses pay for both
The perf data confirms that ProcArrayAdd is dominated by minor page fault overhead on the allProcs[procno].pgxactoff = index line.
Proposed Solution: Separate pgxactoff Array
The patch (0001-This-patch-reduce-connection-init-close-time.patch) extracts the pgxactoff field from the PGPROC structure into a separate dense shared-memory array indexed by process number:
ProcGlobal->pgxactoffs[procno] = index; // instead of allProcs[procno].pgxactoff
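The difference between the two layouts can be sketched with a pair of toy types. Everything here is hypothetical: ToyProc and ToyProcGlobal only stand in for PGPROC and ProcGlobal, the 796 bytes of filler are an assumed size, and the arrays are embedded directly rather than allocated in shared memory as the real ones are.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_TOTAL_PROCS 1024

/* Hypothetical stand-in for PGPROC; the filler size is an assumption. */
typedef struct ToyProc
{
    char other_state[796];  /* stands in for lock and transaction fields */
    int  pgxactoff;         /* current layout: the offset lives inside the struct */
} ToyProc;

/* Hypothetical stand-in for ProcGlobal holding both layouts side by side. */
typedef struct ToyProcGlobal
{
    ToyProc allProcs[TOY_TOTAL_PROCS];
    int     pgxactoffs[TOY_TOTAL_PROCS];  /* patched layout: dense side array */
} ToyProcGlobal;

static ToyProcGlobal toy_global;

/* Current layout: each write lands inside a different large struct. */
static void set_xactoff_in_struct(int procno, int index)
{
    toy_global.allProcs[procno].pgxactoff = index;
}

/* Patched layout: writes for neighbouring procnos are adjacent in memory. */
static void set_xactoff_in_dense_array(int procno, int index)
{
    toy_global.pgxactoffs[procno] = index;
}

/* Byte distance between the locations written for procno and procno + 1. */
static size_t stride_in_struct(int procno)
{
    assert(procno >= 0 && procno + 1 < TOY_TOTAL_PROCS);
    return (size_t) ((char *) &toy_global.allProcs[procno + 1].pgxactoff -
                     (char *) &toy_global.allProcs[procno].pgxactoff);
}

static size_t stride_in_dense_array(int procno)
{
    assert(procno >= 0 && procno + 1 < TOY_TOTAL_PROCS);
    return (size_t) ((char *) &toy_global.pgxactoffs[procno + 1] -
                     (char *) &toy_global.pgxactoffs[procno]);
}
```

The stride in the struct layout equals sizeof(ToyProc), while the dense array advances by sizeof(int), so a run of neighbouring process numbers stays within far fewer cache lines and pages.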
Why This Helps
- Cache line density: Instead of touching one int field buried inside a large PGPROC struct (causing an entire cache line load for 4 bytes of useful data), the dense array packs ~16 pgxactoff values per cache line
- Reduced page footprint: For 16384 connections, the pgxactoff array is only 16384 * 4 = 64KB, which fits in just 16 pages (4KB) or a single huge page (2MB), versus the PGPROC array, which at hundreds of bytes per entry spans several megabytes and thousands of 4KB pages
- TLB friendliness: Fewer pages means fewer TLB entries and fewer page-table mappings to establish, which reduces the minor fault cascade
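The page-count arithmetic behind these points can be checked with a small helper. The 4-byte offset and the 4KB and 2MB page sizes come from the text above. The 800-byte PGPROC size is an assumption chosen from the "hundreds of bytes" range, and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define SMALL_PAGE_BYTES   ((size_t) 4096)
#define HUGE_PAGE_BYTES    ((size_t) 2 * 1024 * 1024)
#define XACTOFF_BYTES      ((size_t) 4)    /* one int per backend */
#define ASSUMED_PROC_BYTES ((size_t) 800)  /* assumption, not the real sizeof(PGPROC) */

/* Number of pages needed to hold the given number of bytes (rounds up). */
static size_t pages_needed(size_t bytes, size_t page_size)
{
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

/* Pages occupied by an array of nprocs elements of elem_bytes each. */
static size_t array_pages(size_t nprocs, size_t elem_bytes, size_t page_size)
{
    return pages_needed(nprocs * elem_bytes, page_size);
}
```

With 16384 backends, the dense array needs 16 small pages or 1 huge page. Under the 800-byte assumption the struct array needs 3200 small pages or 7 huge pages.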
Results
| Clients | Without Patch (warmup) | With Patch (warmup) | Improvement |
|---|---|---|---|
| 512 | ~500ms | ~215ms | 2.3x |
| 1024 | ~1000ms | ~920ms | 1.1x |
| 2048 | ~2240ms | ~1800ms | 1.2x |
| 4096 | ~6140ms | ~3740ms | 1.6x |
| 8192 | ~18840ms | ~8100ms | 2.3x |
Crucially, the patch eliminates the degradation between iterations — performance remains stable across runs rather than worsening.
Key Technical Debate
Huge Pages Question
Matthias van de Meent raises the critical observation that this entire investigation uses huge_pages=off, and PostgreSQL is generally not optimized for small-page configurations. With huge pages (2MB pages on x86-64):
- The PGPROC array for 16384 backends (several MB at hundreds of bytes per entry) would use far fewer page table entries
- TLB coverage would be vastly better
- Minor page faults would be dramatically reduced
The question of whether this optimization matters in production (where huge_pages=on is recommended) remains open — no data with huge pages has been presented.
Measurement Methodology Concerns
Matthias initially questions whether the quadratic behavior is actually in PostgreSQL or in pgbench itself:
- pgbench's "init connection time" measures wall-clock time from process start to all-connections-established
- Thread spawning, OS scheduling, and synchronization overhead could contribute O(n) per-thread costs
- Alexander confirms the quadratic behavior persists regardless of thread count (tested with 128 and 1024 threads)
Patch Review Issues
Matthias identifies several technical issues with the patch:
- Alignment: The pgxactoffs array allocation doesn't account for alignment requirements when TotalProcs * sizeof(statusFlags) isn't a multiple of sizeof(int)
- Indirection cost: Previously pgxactoff was a direct offset from the PGPROC pointer; now it requires a separate pointer dereference through ProcGlobal->pgxactoffs
- API design: Suggests macro definitions like ProcGetXactOff(procno) and ProcGetMyXactOff() to avoid redundant procno-from-PGPROC calculations
- Code style: The add_size/mul_size pattern should be used for shared memory size calculations
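The alignment and size-calculation points can be illustrated with a standalone sketch. It assumes the int array is placed directly after a block of one-byte status flags, which is one reading of the review comment rather than the patch's confirmed layout. The toy_ helpers are hypothetical stand-ins that mimic the overflow-checked add_size/mul_size pattern and abort where the real functions would report an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy overflow-checked addition, modelled on the add_size pattern. */
static size_t toy_add_size(size_t a, size_t b)
{
    if (b > SIZE_MAX - a)
        abort();
    return a + b;
}

/* Toy overflow-checked multiplication, modelled on the mul_size pattern. */
static size_t toy_mul_size(size_t a, size_t b)
{
    if (a != 0 && b > SIZE_MAX / a)
        abort();
    return a * b;
}

/* Round size up to the next multiple of alignment (a power of two). */
static size_t toy_align_up(size_t size, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (size + alignment - 1) & ~(alignment - 1);
}

/* Byte offset where an int array may start after total_procs one-byte flags. */
static size_t toy_pgxactoffs_offset(size_t total_procs)
{
    size_t flags_bytes = toy_mul_size(total_procs, sizeof(uint8_t));

    return toy_align_up(flags_bytes, sizeof(int));
}

/* Total bytes for the flags, the alignment padding, and the int array. */
static size_t toy_flags_and_offsets_size(size_t total_procs)
{
    size_t size = toy_pgxactoffs_offset(total_procs);

    return toy_add_size(size, toy_mul_size(total_procs, sizeof(int)));
}
```

With 1027 entries the flags occupy 1027 bytes, so on a platform with 4-byte int the array has to start at offset 1028. Without the rounding step it would start on an odd address.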
The Fast Connection Rate Patch
An auxiliary issue surfaces: at very high connection rates (many thousands of connections on a fast multi-core server), the kernel's socket backlog can overflow, producing "Resource temporarily unavailable" errors. The 0001-Fix-fast-connection-rate-issue.patch works around this by adjusting kernel parameters and/or pgbench behavior, though this patch is not the focus of discussion.
Open Questions
- Does the quadratic behavior persist with huge_pages=on? This is the critical missing data point
- What is the regression cost of the additional indirection for pgxactoff access in hot paths like GetSnapshotData()?
- Is the root cause actually in ProcArrayAdd, or elsewhere? The connection path involves fork(), shared memory attachment, catalog access, and authentication, all of which could have O(n) components
- Would a connection pooler (PgBouncer, built-in) make this moot for real workloads? 16384 direct connections is extreme for production