2026-06-04 · claude-opus-4-6

Init Connection Time Grows Quadratically: Technical Analysis

Core Problem

The thread investigates a surprising performance characteristic: PostgreSQL's initial connection time (ICT) — the time from pgbench process start until all N connections are established — grows quadratically (O(n²)) with the number of clients, rather than the expected linear O(n) scaling.

The measurements on REL_16_STABLE demonstrate this clearly:

1024 clients: ~435ms
2048 clients: ~1062ms (expected ~870ms if linear)
4096 clients: ~3284ms (expected ~1740ms if linear)
8192 clients: ~11617ms
16384 clients: ~43391ms

The curve fits roughly y ≈ 0.0002x² (y in milliseconds, x the number of clients), indicating that each additional connection becomes progressively more expensive as total connections increase.
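The quadratic claim can be checked directly against the five measurements listed above. The following is a minimal sketch, not code from the thread: the function names are illustrative, and the fit is an ordinary least-squares fit of y = a·x² with no intercept. A pure quadratic would multiply ICT by 4 for every doubling of clients, a linear curve by 2.

```c
#include <assert.h>
#include <stddef.h>

/* Measurements quoted above: client count and init connection time in ms. */
static const double clients[] = {1024, 2048, 4096, 8192, 16384};
static const double ict_ms[]  = {435, 1062, 3284, 11617, 43391};
#define N_POINTS (sizeof(clients) / sizeof(clients[0]))

/*
 * Least-squares coefficient a for the model y = a * x^2 (no intercept).
 * Minimizing sum((y - a*x^2)^2) over a gives a = sum(x^2 * y) / sum(x^4).
 */
static double
fit_quadratic_coeff(const double *x, const double *y, size_t n)
{
    double num = 0.0;
    double den = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
    {
        double x2 = x[i] * x[i];

        num += x2 * y[i];
        den += x2 * x2;
    }
    return num / den;
}

/* Slowdown factor y[i + 1] / y[i] for one doubling of the client count. */
static double
doubling_ratio(const double *y, size_t i)
{
    return y[i + 1] / y[i];
}
```

With these inputs the fitted coefficient comes out near 0.00016, which rounds to the 0.0002 quoted above, and the doubling ratios climb from about 2.4 at the low end to about 3.7 at the high end, moving toward the factor of 4 that a pure quadratic would show.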

Architectural Context

The PGPROC Array and ProcArrayAdd

PostgreSQL maintains a shared-memory array of PGPROC structures — one per backend process. Each PGPROC is a relatively large structure containing transaction state, lock information, and various metadata. When a new backend connects, ProcArrayAdd() is called to register it in the global process array.

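Critically, each PGPROC has a field pgxactoff which stores the backend's offset in the ProcArray. ProcArrayAdd inserts the new backend into the array and then rewrites pgxactoff for every entry that moved as a result, executing the following once per shifted entry: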

allProcs[procno].pgxactoff = index;

This seemingly simple assignment becomes pathological at scale because:

PGPROC structures are large (~hundreds of bytes each), meaning they span many memory pages
Process numbers (procno) are assigned non-sequentially after initial warmup — backends that disconnect and reconnect get reused slots scattered across the array
With huge_pages=off, the standard 4KB page size means the PGPROC array spans thousands of pages
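To make the cost of that assignment concrete, here is a simplified model of the insertion step. It is a sketch, not PostgreSQL source: it assumes the dense array is kept sorted by procno and that every entry behind the insertion point gets its pgxactoff rewritten. All names carrying a Sketch suffix, and MAX_PROCS, are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_PROCS 64            /* hypothetical capacity for the sketch */

typedef struct ProcSketch
{
    int pgxactoff;              /* position of this proc in pgprocnos[] */
} ProcSketch;

typedef struct ProcArraySketch
{
    int        numProcs;
    int        pgprocnos[MAX_PROCS];    /* kept sorted by procno */
    ProcSketch allProcs[MAX_PROCS];     /* indexed by procno */
} ProcArraySketch;

/*
 * Insert procno into the sorted array and return how many existing
 * entries had their pgxactoff rewritten because they were shifted.
 */
static int
proc_array_add_sketch(ProcArraySketch *arr, int procno)
{
    int index;
    int rewritten = 0;

    assert(arr->numProcs < MAX_PROCS);
    assert(procno >= 0 && procno < MAX_PROCS);

    /* Find the insertion point that keeps pgprocnos[] sorted. */
    for (index = 0; index < arr->numProcs; index++)
    {
        if (arr->pgprocnos[index] > procno)
            break;
    }

    /* Open a gap by shifting the tail one position to the right. */
    memmove(&arr->pgprocnos[index + 1], &arr->pgprocnos[index],
            (size_t) (arr->numProcs - index) * sizeof(int));

    arr->pgprocnos[index] = procno;
    arr->allProcs[procno].pgxactoff = index;
    arr->numProcs++;

    /* Each shifted entry now sits one slot later: one write per entry. */
    for (index++; index < arr->numProcs; index++)
    {
        arr->allProcs[arr->pgprocnos[index]].pgxactoff = index;
        rewritten++;
    }
    return rewritten;
}

/* Total rewrites when n backends connect in the given procno order. */
static int
total_rewrites(const int *procnos, int n)
{
    ProcArraySketch arr;
    int total = 0;

    memset(&arr, 0, sizeof(arr));
    for (int i = 0; i < n; i++)
        total += proc_array_add_sketch(&arr, procnos[i]);
    return total;
}
```

Under this model the connection order decides the cost. Backends arriving in ascending procno order cause no rewrites, while descending order costs 0 + 1 + ... + (n - 1) = n(n - 1)/2 rewrites, each landing in a different PGPROC. That is one way a per-connection O(n) step could add up to a quadratic total.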

The Degradation Pattern

Maksim Melnikov's investigation reveals a crucial secondary finding: ICT degrades across repeated pgbench iterations without server restart, eventually stabilizing at worse values than the first run. This is a classic symptom of TLB thrashing and page fault amplification:

First iteration: PGPROC slots are allocated sequentially, so pgxactoff writes hit consecutive memory pages — good spatial locality
Subsequent iterations: After backends disconnect and reconnect, PGPROC slot reuse creates a random access pattern across the array
Minor page faults: Each access to a scattered allProcs[procno].pgxactoff can trigger a minor page fault when the accessing backend's page table has no entry yet for that shared-memory page; TLB misses on the scattered pages add further cost but do not by themselves cause faults

The perf data confirms that ProcArrayAdd is dominated by minor page fault overhead on the allProcs[procno].pgxactoff = index line.

Proposed Solution: Separate pgxactoff Array

The patch (0001-This-patch-reduce-connection-init-close-time.patch) extracts the pgxactoff field from the PGPROC structure into a separate dense shared-memory array indexed by process number:

ProcGlobal->pgxactoffs[procno] = index;  // instead of allProcs[procno].pgxactoff
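The difference between the two layouts can be sketched as follows. This is an illustration only, not the patch: the struct names, the 828-byte stand-in for the rest of PGPROC, and SKETCH_NPROCS are all assumptions chosen to make the stride visible.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NPROCS 8         /* hypothetical, tiny on purpose */

/* Before the patch: the offset is one field inside a large struct. */
typedef struct ProcEmbedded
{
    char other_state[828];      /* stand-in for the rest of PGPROC */
    int  pgxactoff;
} ProcEmbedded;

/* After the patch: offsets live in their own array, indexed by procno. */
typedef struct ProcGlobalSketch
{
    int pgxactoffs[SKETCH_NPROCS];
} ProcGlobalSketch;

/* Byte position of procno's offset, relative to the start of allProcs. */
static size_t
embedded_field_pos(int procno)
{
    assert(procno >= 0 && procno < SKETCH_NPROCS);
    return (size_t) procno * sizeof(ProcEmbedded)
        + offsetof(ProcEmbedded, pgxactoff);
}

/* Byte position of procno's offset, relative to the start of pgxactoffs. */
static size_t
dense_field_pos(int procno)
{
    assert(procno >= 0 && procno < SKETCH_NPROCS);
    return (size_t) procno * sizeof(int);
}

/* The two write paths side by side. */
static void
set_xactoff_embedded(ProcEmbedded *allProcs, int procno, int index)
{
    allProcs[procno].pgxactoff = index;
}

static void
set_xactoff_dense(ProcGlobalSketch *pg, int procno, int index)
{
    pg->pgxactoffs[procno] = index;
}
```

In the embedded layout, the offsets of neighbouring procnos sit a full struct apart. In the dense layout they sit sizeof(int) apart, so a run of updates stays within a few cache lines and pages.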

Why This Helps

Cache line density: Instead of touching one int field buried inside a large PGPROC struct (an entire cache line load for 4 bytes of useful data), the dense array packs 16 pgxactoff values into each 64-byte cache line
Reduced page footprint: For 16384 connections, the pgxactoff array is only 16384 * 4 bytes = 64KB, which fits in just 16 pages (4KB) or a single huge page (2MB), whereas the PGPROC array, at hundreds of bytes per entry, runs to several megabytes and thousands of 4KB pages
TLB friendliness: Fewer pages means fewer TLB entries needed, eliminating the minor fault cascade
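The footprint figures above are plain arithmetic and can be reproduced with a few helpers. This is a sketch under stated assumptions: PGPROC_BYTES_GUESS is a hypothetical stand-in for sizeof(PGPROC), which varies by version and platform, and the cache line and page sizes are the usual x86-64 values.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES   64u                   /* assumed cache line size */
#define SMALL_PAGE_BYTES   4096u                 /* 4KB page */
#define HUGE_PAGE_BYTES    (2u * 1024u * 1024u)  /* 2MB huge page */
#define PGPROC_BYTES_GUESS 832u                  /* hypothetical sizeof(PGPROC) */

/* Number of fixed-size units (pages or cache lines) needed to hold bytes. */
static size_t
units_needed(size_t bytes, size_t unit_bytes)
{
    assert(unit_bytes > 0);
    return (bytes + unit_bytes - 1) / unit_bytes;
}

/* Footprint of the dense int array for nprocs backends. */
static size_t
dense_array_bytes(size_t nprocs)
{
    return nprocs * sizeof(int);
}

/* Footprint of an array of full-size structs for nprocs backends. */
static size_t
struct_array_bytes(size_t nprocs)
{
    return nprocs * PGPROC_BYTES_GUESS;
}
```

For 16384 backends and a 4-byte int, the dense array needs 16 small pages or a single huge page, matching the figures above. The assumed 832-byte struct array needs 3328 small pages or 7 huge pages.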

Results

Clients	Without Patch (warmup)	With Patch (warmup)	Improvement
512	~500ms	~215ms	2.3x
1024	~1000ms	~920ms	1.1x
2048	~2240ms	~1800ms	1.2x
4096	~6140ms	~3740ms	1.6x
8192	~18840ms	~8100ms	2.3x

Crucially, the patch eliminates the degradation between iterations — performance remains stable across runs rather than worsening.

Key Technical Debate

Huge Pages Question

Matthias van de Meent raises the critical observation that this entire investigation uses huge_pages=off, and PostgreSQL is generally not optimized for small-page configurations. With huge pages (2MB pages on x86-64):

The PGPROC array for 16384 backends (several megabytes at hundreds of bytes per entry) would need only a handful of page table entries instead of thousands
TLB coverage would be vastly better
Minor page faults would be dramatically reduced

The question of whether this optimization matters in production (where huge_pages=on is recommended) remains open — no data with huge pages has been presented.

Measurement Methodology Concerns

Matthias initially questions whether the quadratic behavior is actually in PostgreSQL or in pgbench itself:

pgbench's "init connection time" measures wall-clock time from process start to all-connections-established
Thread spawning, OS scheduling, and synchronization overhead could contribute O(n) per-thread costs
Alexander confirms the quadratic behavior persists regardless of thread count (tested with 128 and 1024 threads)

Patch Review Issues

Matthias identifies several technical issues with the patch:

Alignment: The pgxactoffs array allocation doesn't account for alignment requirements when TotalProcs * sizeof(statusFlags) isn't a multiple of sizeof(int)
Indirection cost: Previously pgxactoff was a direct offset from the PGPROC pointer; now it requires a separate pointer dereference through ProcGlobal->pgxactoffs
API design: Suggests macro definitions like ProcGetXactOff(procno) and ProcGetMyXactOff() to avoid redundant procno-from-PGPROC calculations
Code style: The add_size/mul_size pattern should be used for shared memory size calculations
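The alignment and size-calculation points can be illustrated together. The sketch below assumes the pgxactoffs array is carved out directly after a one-byte-per-entry statusFlags array, which is how the review comment reads. The helpers are stand-ins for PostgreSQL's add_size() and mul_size() that assert on overflow instead of raising an error, and every name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for add_size(): overflow-checked addition. */
static size_t
add_size_sketch(size_t a, size_t b)
{
    assert(a <= SIZE_MAX - b);
    return a + b;
}

/* Stand-in for mul_size(): overflow-checked multiplication. */
static size_t
mul_size_sketch(size_t a, size_t b)
{
    assert(b == 0 || a <= SIZE_MAX / b);
    return a * b;
}

/* Round len up to the next multiple of align (a power of two). */
static size_t
align_up(size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return add_size_sketch(len, align - 1) & ~(align - 1);
}

/*
 * Bytes needed for a one-byte-per-entry flags array followed by an int
 * array, with the int array starting on a sizeof(int) boundary.
 * *pgxactoffs_start receives the byte offset at which the int array begins.
 */
static size_t
dense_arrays_size(size_t total_procs, size_t *pgxactoffs_start)
{
    size_t size = mul_size_sketch(total_procs, sizeof(uint8_t));

    size = align_up(size, sizeof(int));
    *pgxactoffs_start = size;
    return add_size_sketch(size, mul_size_sketch(total_procs, sizeof(int)));
}
```

With five entries and a 4-byte int, the flags occupy 5 bytes, so the int array has to start at byte 8 rather than byte 5. Without the rounding step it would begin at a misaligned address.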

The Fast Connection Rate Patch

An auxiliary issue surfaces: at very high connection rates (many thousands of connections on a fast multi-core server), the kernel's socket backlog can overflow, producing "Resource temporarily unavailable" errors. The 0001-Fix-fast-connection-rate-issue.patch works around this by adjusting kernel parameters and/or pgbench behavior, though this patch is not the focus of discussion.

Open Questions

Does the quadratic behavior persist with huge_pages=on? This is the critical missing data point
What is the regression cost of the additional indirection for pgxactoff access in hot paths like GetSnapshotData()?
Is the root cause actually in ProcArrayAdd, or elsewhere? The connection path involves fork(), shared memory attachment, catalog access, and authentication — all of which could have O(n) components
Would a connection pooler (PgBouncer, built-in) make this moot for real workloads? 16384 direct connections is extreme for production
