Injection Points: Switching Wait/Wakeup from Latches to Atomics
Core Problem
The injection_points testing infrastructure in PostgreSQL provides a mechanism for test code to synchronize backends at specific execution points — one backend waits at an injection point while another wakes it up. The existing implementation uses condition variables (which internally rely on latches) for this synchronization.
The fundamental limitation: condition variables and latches require a PGPROC slot. A backend only has a PGPROC slot after it has been assigned one from the proc array — which happens after authentication and initial setup. This means the injection_points wait/wakeup mechanism cannot be used in:
- Postmaster context — the postmaster never has a standard PGPROC slot or access to DSM segments in the same way backends do.
- Pre-authentication code paths — before a backend has acquired its PGPROC slot.
- Post-ProcKill scenarios — once a backend has released its PGPROC slot during shutdown (the immediate motivating case from the ProcKill thread).
This creates a significant gap in test coverage: any code that runs outside the PGPROC-attached lifecycle of a backend cannot be tested with injection point synchronization.
Proposed Solution
Michael Paquier's patch replaces the condition variable mechanism with atomic counters and a polling loop that uses exponential backoff.
Architecture Change
Before:
- Wait side: ConditionVariableWait() on inj_state->wait_point (requires latch/PGPROC)
- Wakeup side: Increment wait_counts[index] under spinlock, then ConditionVariableBroadcast()
After:
- Wait side: Polling loop checking pg_atomic_read_u32(&inj_state->wait_counts[index]), with escalating sleep (10μs → 100ms max)
- Wakeup side: pg_atomic_fetch_add_u32(&inj_state->wait_counts[index], 1)
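The "After" scheme can be sketched in portable C. This is a minimal illustration, not the patch's code: C11 atomics stand in for pg_atomic_read_u32() and pg_atomic_fetch_add_u32(), nanosleep() stands in for whatever sleep primitive the patch uses, and the InjState layout, MAX_WAITS, the helper names, the doubling factor, and the "compare against the counter value seen on entry" rule are all assumptions made for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for nanosleep() */
#endif

#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define MAX_WAITS     8         /* hypothetical number of wait slots */
#define MIN_DELAY_US  10L       /* first sleep: 10 microseconds */
#define MAX_DELAY_US  100000L   /* cap: 100 milliseconds */

/* Stand-in for the shared-memory state; only the counters are modeled. */
typedef struct
{
    _Atomic uint32_t wait_counts[MAX_WAITS];
} InjState;

/* Next sleep length: double the previous one, clamped to the cap. */
static long
next_delay_us(long delay_us)
{
    delay_us *= 2;
    return delay_us > MAX_DELAY_US ? MAX_DELAY_US : delay_us;
}

static void
sleep_us(long us)
{
    struct timespec ts;

    ts.tv_sec = us / 1000000L;
    ts.tv_nsec = (us % 1000000L) * 1000L;
    nanosleep(&ts, NULL);
}

/* Wait side, step 1: remember the counter value before waiting. */
static uint32_t
inj_wait_begin(InjState *state, int index)
{
    assert(index >= 0 && index < MAX_WAITS);
    return atomic_load(&state->wait_counts[index]);
}

/*
 * Wait side, step 2: poll until the counter differs from the remembered
 * value, sleeping a little longer after each miss. Returns the number of
 * sleeps performed. The real loop would also call CHECK_FOR_INTERRUPTS().
 */
static int
inj_wait_until_changed(InjState *state, int index, uint32_t seen)
{
    long delay_us = MIN_DELAY_US;
    int  sleeps = 0;

    while (atomic_load(&state->wait_counts[index]) == seen)
    {
        sleep_us(delay_us);
        sleeps++;
        delay_us = next_delay_us(delay_us);
    }
    return sleeps;
}

/* Wakeup side: one atomic increment, with no latch or PGPROC involved. */
static void
inj_wakeup(InjState *state, int index)
{
    assert(index >= 0 && index < MAX_WAITS);
    atomic_fetch_add(&state->wait_counts[index], 1);
}
```

Splitting the wait into "remember the counter" and "poll until it changes" means a wakeup that arrives between the two steps is not lost: the waiter sees the changed counter on its first check and returns without sleeping.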
Key Design Characteristics
- Exponential backoff: Starting at 10μs and maxing at 100ms. This balances CPU usage (no tight spin loop) against responsiveness (fast machines see wakeups within microseconds to low milliseconds).
- CHECK_FOR_INTERRUPTS() in the wait loop: Required for tests like the autovacuum test in test_misc that depend on signal processing during waits.
- Context-independence: Atomic operations work without any PGPROC, DSM, or latch infrastructure; they operate directly on shared memory words.
- Loss of wait event visibility: Without a PGPROC, there's no way to report wait events in pg_stat_activity. The proposed workaround is emitting a LOG entry when entering the wait state, with TAP tests polling server logs instead.
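To make the backoff numbers concrete, the schedule can be tabulated. The summary only states the 10μs start and the 100ms cap; the growth factor of 2 used here is an assumption, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_DELAY_US 10u       /* first sleep: 10 microseconds */
#define MAX_DELAY_US 100000u   /* cap: 100 milliseconds */

static_assert(MIN_DELAY_US < MAX_DELAY_US, "cap must exceed the first delay");

/* Sleep length of the n-th sleep (n = 0 is the first), assuming doubling. */
static uint32_t
backoff_delay_us(unsigned n)
{
    uint64_t d = MIN_DELAY_US;

    while (n-- > 0 && d < MAX_DELAY_US)
        d *= 2;
    return (uint32_t) (d > MAX_DELAY_US ? MAX_DELAY_US : d);
}

/* Total time spent sleeping after the given number of sleeps. */
static uint64_t
backoff_total_us(unsigned sleeps)
{
    uint64_t total = 0;

    for (unsigned i = 0; i < sleeps; i++)
        total += backoff_delay_us(i);
    return total;
}
```

Under the doubling assumption the sleeps run 10, 20, 40, ... up to 81920μs for the 14th sleep, and the 15th is clamped to the 100ms cap. The first 14 sleeps add up to 10 × (2^14 − 1) = 163830μs, so a waiter reaches the slowest polling rate after roughly 164ms. A wakeup that arrives earlier than that is noticed within at most about 82ms, and usually much sooner.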
Technical Tradeoffs
Responsiveness vs. Universality
The core tradeoff: latches provide near-immediate, event-driven wakeup (SetLatch signals the waiting process, which notices through a self-pipe or signalfd), while atomic polling introduces a bounded delay of up to one sleep interval. Robert Haas initially pushed back precisely because this goes against the general PostgreSQL design principle of replacing polling loops with event-driven wakeups.
However, this is test infrastructure only — it never runs in production. The 100ms maximum delay is acceptable for test synchronization where the alternative is having no synchronization mechanism at all.
Atomicity of Wakeup vs. Slot Identity
Andrey Borodin raised a subtle correctness concern: in the wakeup path, the slot index is determined under the spinlock, but the atomic increment happens after the lock is released. Between these two operations, the slot's identity could theoretically change (if the injection point is detached and re-attached to a different name). While Borodin acknowledges this isn't a problem in "correctly written tests," it's a defensive programming concern. Moving the atomic increment back under the lock would eliminate this window.
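The window Borodin describes can be reproduced deterministically in a small model. This is a sketch under stated assumptions, not the patch's code: an atomic_flag spin lock stands in for the PostgreSQL spinlock, and the InjState layout, the slot naming scheme, and every function name are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define MAX_WAITS 8    /* hypothetical number of wait slots */
#define NAME_LEN  64   /* hypothetical maximum name length */

typedef struct
{
    atomic_flag      lock;                       /* stand-in for the spinlock */
    char             name[MAX_WAITS][NAME_LEN];  /* "" means the slot is free */
    _Atomic uint32_t wait_counts[MAX_WAITS];
} InjState;

static void
inj_init(InjState *s)
{
    atomic_flag_clear(&s->lock);
    memset(s->name, 0, sizeof(s->name));
    for (int i = 0; i < MAX_WAITS; i++)
        atomic_init(&s->wait_counts[i], 0);
}

static void inj_lock(InjState *s)   { while (atomic_flag_test_and_set(&s->lock)) ; }
static void inj_unlock(InjState *s) { atomic_flag_clear(&s->lock); }

/* Hand slot `index` to `name`; models a detach followed by a re-attach. */
static void
set_slot_name(InjState *s, int index, const char *name)
{
    assert(index >= 0 && index < MAX_WAITS);
    inj_lock(s);
    strncpy(s->name[index], name, NAME_LEN - 1);
    s->name[index][NAME_LEN - 1] = '\0';
    inj_unlock(s);
}

/* Caller must hold the lock. Returns the slot index, or -1 if not found. */
static int
find_slot(const InjState *s, const char *name)
{
    for (int i = 0; i < MAX_WAITS; i++)
        if (s->name[i][0] != '\0' && strcmp(s->name[i], name) == 0)
            return i;
    return -1;
}

/* Order described in the text, step 1: look up the slot under the lock. */
static int
wakeup_split_lookup(InjState *s, const char *name)
{
    inj_lock(s);
    int index = find_slot(s, name);
    inj_unlock(s);
    return index;
}

/* Order described in the text, step 2: increment after the lock is gone. */
static void
wakeup_split_bump(InjState *s, int index)
{
    atomic_fetch_add(&s->wait_counts[index], 1);
}

/* Suggested order: lookup and increment inside one critical section. */
static int
wakeup_locked(InjState *s, const char *name)
{
    inj_lock(s);
    int index = find_slot(s, name);
    if (index >= 0)
        atomic_fetch_add(&s->wait_counts[index], 1);
    inj_unlock(s);
    return index;
}
```

If slot 0 is renamed from "point-a" to "point-b" between wakeup_split_lookup() and wakeup_split_bump(), the increment lands on a counter that now belongs to "point-b", so a waiter on the new name is woken by a wakeup meant for the old one. With wakeup_locked() the rename can only happen entirely before or entirely after the wakeup, so the increment always goes to the slot that matched the name.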
Additional Motivation: Corruption Testing
Andrey Borodin also points to a secondary use case: testing postmaster death scenarios. When the postmaster is killed with kill -9 while a backend waits on a ConditionVariable, the LWLock release cascade can cause the checkpointer to flush dirty buffers that haven't been WAL-logged, which is actual data corruption. With atomics, the wait doesn't hold any LWLocks, so crash scenarios can be tested without the test harness itself inducing corruption.
Backpatch Considerations
Michael proposes backpatching to v17, the release in which injection points were introduced. This is aggressive for an infrastructure change, but it is motivated by wanting future test patches that depend on this capability to be backpatchable too.