PoC: VALGRIND_MAKE_MEM_NOACCESS for dynamic shared memory

First seen: 2026-05-04 08:18:31+00:00 · Messages: 3 · Participants: 2

Latest Update

2026-05-06 · opus 4.7

VALGRIND_MAKE_MEM_NOACCESS for Dynamic Shared Memory: Closing a Gap in PostgreSQL's Memory Instrumentation

The Core Problem

PostgreSQL has long relied on Valgrind (and similar tools) as a safety net for catching memory errors — buffer overflows, use-after-free, uninitialized reads. This works well for private (backend-local) memory because palloc()/pfree() and the MemoryContext machinery are instrumented with VALGRIND_MAKE_MEM_* client requests. Chunks that are freed or lie in red zones between allocations are marked NOACCESS, so any stray pointer that wanders into them generates a diagnostic.

Dynamic shared memory (DSM), and specifically allocations carved out of a shm_toc (table-of-contents) segment, have no such instrumentation. That gap became painfully visible in a recent bug where a btree parallel-scan computed an incorrect shared-memory size, producing a buffer overflow that manifested only as a confusing downstream crash (fixed in commit 748d871b7c). Valgrind was silent because the corrupt writes landed inside a region it considered fully accessible.

This matters architecturally because parallel query, parallel index build, and an increasing number of subsystems use shm_toc to lay out per-operation shared state. A single sizing mistake in any *_estimate() function can silently clobber a neighboring TOC entry, and the resulting failure mode (a worker reading half-corrupted state) is precisely the kind of heisenbug that Valgrind exists to make loud.

The Proposed Solution

Tomas Vondra's PoC is intentionally minimal: in shm_toc.c, after each TOC entry is allocated, it marks a trailing red zone (32 bytes in the PoC, though he suggests this should be larger) as NOACCESS via the Valgrind client request API. Any overflow past the nominal end of an entry then trips Valgrind exactly as it would in palloc-land.
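A minimal sketch of the red-zone idea, assuming a toy forward bump allocator and not the real shm_toc code. The names toy_toc, toy_toc_allocate, and TOY_REDZONE are hypothetical and exist only for illustration; VALGRIND_MAKE_MEM_NOACCESS is the real Memcheck client request, replaced here by a no-op stub when valgrind/memcheck.h is not installed.

```c
#include <assert.h>
#include <stddef.h>

/* Use the real client request when the Valgrind headers are available. */
#if defined(__has_include)
#if __has_include(<valgrind/memcheck.h>)
#include <valgrind/memcheck.h>
#endif
#endif
#ifndef VALGRIND_MAKE_MEM_NOACCESS
/* Fallback stub so the sketch builds without Valgrind installed. */
#define VALGRIND_MAKE_MEM_NOACCESS(addr, len) ((void) (addr), (void) (len))
#endif

#define TOY_REDZONE 32          /* hypothetical; matches the PoC's 32 bytes */

/* Hypothetical stand-in for a shm_toc segment: a simple bump allocator. */
typedef struct toy_toc
{
    char   *base;               /* start of the segment */
    size_t  total;              /* segment size in bytes */
    size_t  used;               /* bytes handed out, red zones included */
} toy_toc;

/* Allocate nbytes and poison TOY_REDZONE bytes right after the chunk. */
static void *
toy_toc_allocate(toy_toc *toc, size_t nbytes)
{
    char   *chunk;

    assert(toc->used <= toc->total);
    if (nbytes > toc->total - toc->used ||
        TOY_REDZONE > toc->total - toc->used - nbytes)
        return NULL;            /* not enough room for chunk plus red zone */

    chunk = toc->base + toc->used;
    toc->used += nbytes + TOY_REDZONE;

    /* Under Memcheck, any access to these bytes is now reported. */
    VALGRIND_MAKE_MEM_NOACCESS(chunk + nbytes, TOY_REDZONE);
    return chunk;
}
```

With this layout, two 40-byte allocations end up 72 bytes apart, and a write to byte 40 of the first one would be flagged under Memcheck where it would otherwise silently corrupt the second.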

The reproducer from the btree bug, run against the patched code, produces a clean "invalid write of size N" report instead of a corrupted-state crash — confirming the approach works in principle.

Key Technical Tensions

1. Cross-process visibility of Valgrind markings

The critical subtlety — and the piece Tomas initially got wrong — is that Valgrind's memory state is per-process, not per-mapping. Even though the underlying DSM segment is shared by the leader and all parallel workers, each process maintains its own shadow memory describing accessibility. A NOACCESS marking applied by the leader is invisible to workers.

Andres Freund (committer, deep expertise in parallelism and low-level infrastructure) pinpointed this immediately: the fix is to apply the markings in every attaching process, most naturally inside shm_toc_attach() (or a helper invoked from there). This is a non-trivial design requirement because at attach time a worker doesn't inherently know the length of each entry; it sees only the TOC's (key, offset) pairs. Tomas acknowledges this will likely require adding a len field to each TOC entry so that the marking loop can run in the worker without recomputing sizes from subsystem-specific *_estimate() code. That's a small but real change to the in-memory layout of shm_toc internals.
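The attach-time marking loop could look roughly like the following sketch. It assumes a hypothetical len field in each entry and assumes the creator reserved TOY_REDZONE bytes after every entry; toy_toc, toy_toc_entry, and toy_toc_mark_redzones are illustrative names, not the real shm_toc structures. Each process (leader and every worker) would call the function once after mapping the segment, because Memcheck's accessibility state is private to the calling process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__has_include)
#if __has_include(<valgrind/memcheck.h>)
#include <valgrind/memcheck.h>
#endif
#endif
#ifndef VALGRIND_MAKE_MEM_NOACCESS
/* Fallback stub so the sketch builds without Valgrind installed. */
#define VALGRIND_MAKE_MEM_NOACCESS(addr, len) ((void) (addr), (void) (len))
#endif

#define TOY_REDZONE     32      /* hypothetical red-zone size */
#define TOY_MAX_ENTRIES 8       /* hypothetical fixed capacity */

typedef struct toy_toc_entry
{
    uint64_t key;               /* lookup key */
    size_t   offset;            /* offset of the chunk within the segment */
    size_t   len;               /* hypothetical: length of the chunk */
} toy_toc_entry;

typedef struct toy_toc
{
    size_t        nentry;
    toy_toc_entry entry[TOY_MAX_ENTRIES];
} toy_toc;

/*
 * Mark the red zone behind every entry as NOACCESS in the calling process.
 * Returns the number of bytes marked.
 */
static size_t
toy_toc_mark_redzones(const toy_toc *toc, char *segment_base)
{
    size_t  marked = 0;
    size_t  i;

    assert(toc->nentry <= TOY_MAX_ENTRIES);
    for (i = 0; i < toc->nentry; i++)
    {
        char   *redzone = segment_base + toc->entry[i].offset + toc->entry[i].len;

        VALGRIND_MAKE_MEM_NOACCESS(redzone, TOY_REDZONE);
        marked += TOY_REDZONE;
    }
    return marked;
}
```

The loop needs nothing beyond what is stored in the TOC itself, which is the point of recording len: a worker does not have to call back into subsystem-specific sizing code.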

2. mprotect() as an alternative — and why it loses

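mprotect() is attractive because its effect is shared across processes (it's a kernel-level property of the VMA). But it operates on whole pages, so it cannot guard a red zone of a few dozen bytes between adjacent TOC entries without also blocking access to the legitimate data that shares those pages.

So mprotect() is strictly worse for the dominant failure mode this patch targets. Valgrind markings, despite the cross-process limitation, are a better tool.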
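One well-known constraint behind this conclusion is that mprotect() changes protection on whole pages only. The sketch below is plain arithmetic with no system calls: it computes which page-aligned range would have to be protected to cover a small red zone. The helper toy_pages_covering and the 4096-byte page size used in the usage note are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* A page-aligned byte range within a segment. */
typedef struct toy_page_span
{
    size_t  start;              /* offset of the first protected byte */
    size_t  len;                /* number of protected bytes */
} toy_page_span;

/*
 * Smallest page-aligned range covering [offset, offset + len).
 * page_size must be a power of two.
 */
static toy_page_span
toy_pages_covering(size_t offset, size_t len, size_t page_size)
{
    toy_page_span span;
    size_t  end;

    assert(len > 0);
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    span.start = offset & ~(page_size - 1);
    end = (offset + len + page_size - 1) & ~(page_size - 1);
    span.len = end - span.start;
    return span;
}
```

For a 32-byte red zone at offset 4000 with 4096-byte pages, the covering range is the entire first page, so 4064 bytes of ordinary data would become inaccessible along with the red zone. A Memcheck marking has byte granularity and avoids that collateral damage.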

3. The BUFFERALIGN padding problem

Andres flagged a subtler correctness-of-instrumentation issue. shm_toc_allocate() already does:

nbytes = BUFFERALIGN(nbytes);  /* ALIGNOF_BUFFER = 32 */

The alignment is not cosmetic: it exists because atomic operations need wider-than-MAXALIGN alignment, and there's no formal minimum defined. The consequence is that every TOC entry already has up to 31 bytes of slack at its tail. A 32-byte red zone appended after that slack does catch gross overflows but misses small ones (1–31 bytes past the logical end), because those land inside the alignment padding, which Valgrind sees as valid.

To catch small overflows, the instrumentation would need to mark the alignment padding itself as NOACCESS, then mark only the caller-requested length as accessible. That requires tracking the requested length separately from the allocated length — another argument for storing len (the pre-alignment size) in the TOC entry.
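The arithmetic can be made concrete with a small sketch. TOY_BUFFERALIGN mimics rounding up to a 32-byte boundary; toy_layout, toy_compute_layout, and toy_overflow_caught are hypothetical helpers that model which overflows would be reported, depending on whether the alignment padding is poisoned along with the red zone.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ALIGNOF_BUFFER 32   /* same value as ALIGNOF_BUFFER above */
#define TOY_REDZONE        32   /* hypothetical red-zone size */

/* Round len up to the next multiple of TOY_ALIGNOF_BUFFER. */
#define TOY_BUFFERALIGN(len) \
    (((size_t) (len) + (TOY_ALIGNOF_BUFFER - 1)) & \
     ~((size_t) (TOY_ALIGNOF_BUFFER - 1)))

typedef struct toy_layout
{
    size_t  requested;          /* bytes the caller asked for */
    size_t  aligned;            /* bytes reserved after alignment */
    size_t  padding;            /* aligned - requested, 0..31 */
    size_t  noaccess_len;       /* padding + red zone, if both are poisoned */
} toy_layout;

static toy_layout
toy_compute_layout(size_t requested)
{
    toy_layout layout;

    layout.requested = requested;
    layout.aligned = TOY_BUFFERALIGN(requested);
    layout.padding = layout.aligned - requested;
    layout.noaccess_len = layout.padding + TOY_REDZONE;
    return layout;
}

/*
 * Would a write running `overflow` bytes past the requested end touch a
 * NOACCESS byte?  With poison_padding = 0 only the red zone after the
 * aligned size is poisoned; with 1 the padding is poisoned as well.
 */
static int
toy_overflow_caught(const toy_layout *layout, size_t overflow,
                    int poison_padding)
{
    if (overflow == 0)
        return 0;
    if (poison_padding)
        return 1;
    return overflow > layout->padding;
}
```

For a 40-byte request the reserved size is 64 bytes, leaving 24 bytes of padding. An 8-byte overflow goes unnoticed when only the red zone is poisoned, and is reported once the 56-byte span starting at the requested end is marked NOACCESS.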

4. Scope: DSM only, or regular shmem too?

Tomas raised whether the "main" shared memory area allocated at postmaster startup deserves the same treatment. His own intuition: bugs there tend to crash quickly because so many processes touch that memory, so the ROI is lower than for DSM. But a clean Valgrind report is always more actionable than a crash, so extending this is plausible future work.

5. The debug_parallel_query = regress escape hatch

Because the cross-process limitation means workers won't report, coverage under the buildfarm's Valgrind animals is degraded for parallel paths. However, debug_parallel_query = regress forces the leader to execute what would normally run in workers in the same process, which means the leader's Valgrind state does see the NOACCESS markings. This makes the PoC immediately useful for the regression suite even before per-process attach-time marking is implemented — a pragmatic interim story.

Implications

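The patch as evolved through this short exchange points toward a modest but meaningful change in shm_toc's on-segment representation: each TOC entry would record its requested (pre-alignment) length, so that every attaching process can mark the alignment padding and the trailing red zone as NOACCESS in its own Valgrind state.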

None of this is ABI-visible to extensions that use shm_toc via the documented API, but any code poking at raw TOC layout would need adjustment.

Who Carries Weight Here

Andres Freund's input is authoritative on two fronts: (a) he is a committer with deep ownership of the parallel infrastructure and shared memory subsystems, and (b) he speaks from direct experience ("I can tell you from experience that no, it's not necessarily quickly caught") when rebutting Tomas's guess that main-shmem bugs crash fast enough not to need instrumentation. His BUFFERALIGN observation is the kind of detail that separates instrumentation that "works on the demo" from instrumentation that actually catches the bugs people write. Tomas, also a committer, is driving the implementation and is receptive to the feedback, so the thread converges quickly on a design instead of bogging down in disagreement.