Technical Analysis: Failure in test_slru for host gokiburi (REL_16_STABLE only)
Core Problem
This thread identifies a subtle bug in the LWLock initialization logic of the test_slru test module (under src/test/modules) that manifests only under EXEC_BACKEND mode on the REL_16_STABLE branch. The failure is an assertion violation indicating that a lock the code expects to be held is either not properly initialized or not actually acquired.
The EXEC_BACKEND Context
The EXEC_BACKEND compile flag is critical to understanding this bug. On Unix systems, PostgreSQL normally uses fork() to create child processes, so each child inherits the parent's memory space, including all initialized shared memory structures. Under EXEC_BACKEND (which mimics the Windows process model and can optionally be enabled on Unix for testing), child processes are created via fork() followed by exec(), so they must re-attach to shared memory and re-discover all shared structures from scratch. This means:
- LWLocks initialized in the postmaster's address space are not automatically available to child backends
- The SLRU module's lock must be properly registered in a shared-memory location that survives the exec boundary
- Any initialization that happens only in the postmaster but not in re-attached backends will silently fail under EXEC_BACKEND
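The three points above can be illustrated with a toy model. This is not PostgreSQL code: it assumes a process is just a struct of private state plus a pointer to a shared segment, and every name in it (SharedSegment, ProcessImage, spawn_fork, spawn_exec) is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Toy stand-in for a shared memory segment; it survives both spawn styles. */
typedef struct SharedSegment {
    int lock_tranche_id;        /* nonzero once the lock has been initialized */
} SharedSegment;

/* Toy stand-in for one process's private address space. */
typedef struct ProcessImage {
    SharedSegment *shmem;       /* attachment to the shared segment */
    bool tranche_registered;    /* process-local bookkeeping */
    int cached_tranche_id;      /* process-local copy of the tranche id */
} ProcessImage;

/* Postmaster-side setup: initialize shared state and local bookkeeping. */
static void postmaster_init(ProcessImage *pm, SharedSegment *seg)
{
    seg->lock_tranche_id = 42;
    pm->shmem = seg;
    pm->tranche_registered = true;
    pm->cached_tranche_id = seg->lock_tranche_id;
}

/* fork(): the child starts as a copy of the parent's private memory. */
static ProcessImage spawn_fork(const ProcessImage *parent)
{
    return *parent;
}

/* fork()+exec(): private memory starts from zero; only the shared
 * segment is re-attached. */
static ProcessImage spawn_exec(const ProcessImage *parent)
{
    ProcessImage child;

    memset(&child, 0, sizeof(child));
    child.shmem = parent->shmem;
    return child;
}
```

In this model the forked child still sees tranche_registered set, while the exec'd child sees only what lives in the shared segment and has to rebuild everything else itself.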
The Specific Failure
The backtrace shows:
TRAP: failed Assert("LWLockHeldByMe(TestSLRULock)"), File: "test_slru.c", Line: 124
And the server log shows:
ERROR: lock <unassigned:0> is not held
The <unassigned:0> designation is the smoking gun: it indicates the LWLock tranche has not been properly initialized from the perspective of the child backend process. The lock structure exists in shared memory, but the backend doesn't recognize it as a valid, initialized lock.
Root Cause Analysis
Michael Paquier identifies the problem as the module calling LWLockInitialize() more times than necessary, which corrupts internal lock state. In the EXEC_BACKEND path, the initialization sequence differs:
- Without EXEC_BACKEND: fork() inherits the already-initialized lock state from the postmaster, so the lock is initialized once and works correctly.
- With EXEC_BACKEND: The child process re-attaches to shared memory. If LWLockInitialize() is called again (e.g., in a shmem startup hook that runs in each backend), it can reset the lock's internal state, including clearing the wait queue or resetting the state that records whether the lock is held, while other backends may be interacting with it.
The key architectural issue is the distinction between:
- One-time initialization (setting up the lock structure in shared memory at postmaster startup)
- Per-backend attachment (finding the lock in shared memory without reinitializing it)
The test_slru module appears to conflate these two operations in v16, calling LWLockInitialize() in a code path that executes in each backend under EXEC_BACKEND, rather than only in the postmaster during initial shared memory setup.
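A minimal sketch of how such a conflation can break a lock, using a deliberately simplified model rather than PostgreSQL's actual LWLock. ToyLock, toy_lock_initialize, toy_lock_acquire, and buggy_backend_startup are all hypothetical names, and the single held_exclusive flag stands in for the real lock's shared state.

```c
#include <stdbool.h>

/* Simplified lock living in shared memory (not PostgreSQL's LWLock). */
typedef struct ToyLock {
    int tranche;            /* 0 means "never initialized" in this model */
    bool held_exclusive;    /* set while some backend holds the lock */
} ToyLock;

/* Initialization resets every field, as a fresh lock requires. */
static void toy_lock_initialize(ToyLock *lock, int tranche)
{
    lock->tranche = tranche;
    lock->held_exclusive = false;
}

/* Acquire succeeds only if nobody currently holds the lock. */
static bool toy_lock_acquire(ToyLock *lock)
{
    if (lock->held_exclusive)
        return false;
    lock->held_exclusive = true;
    return true;
}

/* Buggy startup hook: runs in every backend and always reinitializes
 * the shared lock, even if another backend is already using it. */
static void buggy_backend_startup(ToyLock *shared_lock, int tranche)
{
    toy_lock_initialize(shared_lock, tranche);
}
```

If backend A runs the hook and acquires the lock, and backend B then runs the same hook, the shared lock looks free again and B can acquire it too. In this toy model mutual exclusion is lost without any error at the point of the second initialization.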
Why It Suddenly Started Failing
Michael notes the host was upgraded to the latest Debian. This likely changed:
- The libc version (affecting memory layout, alignment, or timing)
- The kernel version (affecting process scheduling)
- Possibly the compiler version (affecting struct layout or optimization)
Any of these could change the race window or memory layout enough to expose a latent bug that was previously masked by lucky timing. The host is aarch64 (evident from the backtrace showing /lib/aarch64-linux-gnu/libc.so.6), which adds another dimension: ARM's weaker memory ordering model compared to x86 can expose initialization races more readily.
Why Only REL_16_STABLE
This suggests the initialization logic was refactored or fixed in later branches (v17+), likely as part of broader SLRU or shared memory infrastructure improvements. The test_slru module in v16 may use an older initialization pattern that was subsequently corrected.
Proposed Solution
Michael's attached patch (referenced but not shown in the email) reportedly fixes the issue by ensuring LWLockInitialize() is called exactly once, during initial shared memory creation in the postmaster, and not redundantly in child backends that re-attach under EXEC_BACKEND. This is the standard pattern used by all core SLRU instances (pg_subtrans, pg_multixact, etc.) and the test module simply wasn't following it correctly.
Architectural Significance
This bug highlights a common pitfall in PostgreSQL extension development: the EXEC_BACKEND code path is rarely tested on Unix (it's primarily a Windows concern), yet extensions that use shared memory must handle it correctly. The test_slru module, being a test harness, ironically had this bug because it wasn't subjected to the same rigor as production SLRU users.
The fix pattern is well-established: use IsUnderPostmaster or check whether shared memory is being created vs. attached, and only call LWLockInitialize() in the creation path.
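The creation-versus-attach guard can be sketched as follows. This is a self-contained toy, not the patch from the thread: toy_shmem_init_struct is a hypothetical helper loosely modeled on the found flag that PostgreSQL's ShmemInitStruct() reports, and the slot table, ToyLock, and toy_module_shmem_startup are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef struct ToyLock {
    int tranche;
    bool held_exclusive;
} ToyLock;

/* One named slot of toy "shared memory". */
typedef struct ToySlot {
    const char *name;
    ToyLock lock;
    bool in_use;
} ToySlot;

#define TOY_MAX_SLOTS 8
static ToySlot toy_shmem[TOY_MAX_SLOTS];

/* Look up or allocate a named slot; *found reports whether it existed. */
static ToyLock *toy_shmem_init_struct(const char *name, bool *found)
{
    size_t i;

    /* Attach path: the slot already exists, so hand it back untouched. */
    for (i = 0; i < TOY_MAX_SLOTS; i++) {
        if (toy_shmem[i].in_use && strcmp(toy_shmem[i].name, name) == 0) {
            *found = true;
            return &toy_shmem[i].lock;
        }
    }

    /* Creation path: claim the first free slot. */
    *found = false;
    for (i = 0; i < TOY_MAX_SLOTS; i++) {
        if (!toy_shmem[i].in_use) {
            toy_shmem[i].in_use = true;
            toy_shmem[i].name = name;
            return &toy_shmem[i].lock;
        }
    }
    return NULL;    /* toy segment is full */
}

/* Startup hook that initializes the lock only on the creation path. */
static ToyLock *toy_module_shmem_startup(int tranche)
{
    bool found;
    ToyLock *lock = toy_shmem_init_struct("toy_slru_lock", &found);

    if (lock != NULL && !found) {
        lock->tranche = tranche;
        lock->held_exclusive = false;
    }
    return lock;
}
```

Calling toy_module_shmem_startup() a second time returns the same lock with its tranche and held state intact, which is the property a re-attaching backend needs.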