Umbra: A Remap-Aware smgr Prototype on PostgreSQL Master
Executive Summary
Umbra is a proof-of-concept alternative storage manager (smgr) for PostgreSQL that decouples logical page identity from physical page placement. Its primary goal is to eliminate the ordinary checkpoint-boundary full-page-image (FPI) path by maintaining an explicit logical-to-physical block mapping (lblk → pblk), so that WAL records remap metadata instead of a full 8kB page image for the first update to a page after a checkpoint.
The prototype demonstrates 2–7× WAL volume reduction and significant throughput recovery (up to 166% improvement over md + fpw=on) on TPC-C workloads at high concurrency, while preserving crash-recovery semantics.
The Technical Problem
Why Full-Page Writes Are Expensive
PostgreSQL's crash recovery depends on full-page writes (FPW): after a checkpoint, the first modification to any page logs a complete 8kB page image into WAL. This provides a known-good baseline for redo in case of torn pages. The costs compound at scale:
- WAL volume amplification: Every first-dirty page after checkpoint generates an 8kB image regardless of actual change size
- WAL write/sync pressure: Larger WAL means more I/O bandwidth consumed by WAL writes and fsyncs
- Data-file write amplification: The overwrite-in-place model means the same physical location is repeatedly written
- Checkpoint-interval repetition: These costs recur every checkpoint interval for write-heavy workloads
At 1000 concurrent TPC-C terminals, the prototype shows md + fpw=on generating 4.9–6.6× more WAL than Umbra, with throughput degrading to 275K–322K tpmC versus Umbra's 674K–729K tpmC.
The Core Insight
If logical block identity is separated from physical placement, an update after checkpoint can be written to a new physical location while the old physical page is preserved as the recovery baseline. WAL then needs only:
- The old physical block number (recovery baseline)
- The new physical block number (WAL-owned target)
- A compact delta (the actual page modification)
This eliminates the need for an inline full-page image while preserving deterministic crash recovery.
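As a rough illustration of why this shrinks WAL, the sketch below compares the per-page payload of an inline image with that of remap metadata plus a delta. The 16-byte remap header (four 32-bit fields) and the 120-byte delta are assumptions chosen for illustration, not figures from the patch. Ordinary record headers, FPI hole removal, and compression are ignored.

```python
BLCKSZ = 8192            # PostgreSQL default page size
REMAP_HEADER_BYTES = 16  # assumption: four uint32 fields, no padding

def wal_bytes_with_fpi(first_dirty_pages: int) -> int:
    """Payload when every first-dirty page carries a full-page image."""
    return first_dirty_pages * BLCKSZ

def wal_bytes_with_remap(first_dirty_pages: int, delta_bytes: int) -> int:
    """Payload when each first-dirty page logs remap metadata plus a delta."""
    return first_dirty_pages * (REMAP_HEADER_BYTES + delta_bytes)

pages, delta = 10_000, 120   # hypothetical workload
fpi = wal_bytes_with_fpi(pages)
remap = wal_bytes_with_remap(pages, delta)
print(f"FPI: {fpi} bytes, remap: {remap} bytes, ratio: {fpi / remap:.1f}x")
```

The ratio printed here covers only first-dirty records. The measured reduction reported for the prototype is over total WAL, which also includes records that are not first-dirty-after-checkpoint updates, so it is much smaller.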
Architectural Design
Layer Decomposition
Umbra operates entirely at the smgr layer with six distinct pieces:
- Upper PostgreSQL layers: Continue using logical block numbers unchanged
- Metadata fork: Per-relation internal fork containing MAP superblock (512B) and MAP pages (fixed-entry arrays of pblkno values)
- Runtime access split: umbra.c owns mapped-fork semantics; MAP subsystem owns lookup/state; umfile.c owns physical file I/O
- WAL ownership boundary: Fork-level superblock facts, physical-page lifecycle transitions, and mapping changes are all redo-visible through WAL
- Redo replay semantics: Redo reconstructs the correct mapping view before replaying page contents
- Background maintenance: mapwriter handles MAP-page flush/preallocation; mapcompactor handles reclaim/compaction
MAP Metadata Layout
The metadata fork uses a proportional group layout:
- Block 0: MAP superblock (512B packed sector with magic, version, per-fork allocator frontiers, logical EOF, CRC)
- Blocks 1+: Repeated groups of [1 FSM map page][1 VM map page][8192 MAIN map pages]
Each MAP page is a simple uint32[MAP_ENTRIES_PER_PAGE] array where each entry stores one physical block number. 0xFFFFFFFF means unmapped.
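One way to read this layout is as an address calculation from (fork, logical block) to a MAP page and an entry slot. The sketch below is a minimal model under stated assumptions: MAP_ENTRIES_PER_PAGE is taken to be 8192 / 4 = 2048 (a page with no header), the pages inside a group are taken to appear in the order listed above, and each group is taken to hold the next map page of every fork. The function name locate_map_entry is hypothetical and the patch may index differently.

```python
BLCKSZ = 8192
MAP_ENTRIES_PER_PAGE = BLCKSZ // 4   # assumption: headerless page of uint32 entries
MAIN_PAGES_PER_GROUP = 8192          # from the group layout described above
GROUP_PAGES = 1 + 1 + MAIN_PAGES_PER_GROUP  # FSM + VM + MAIN map pages
UNMAPPED = 0xFFFFFFFF                # sentinel for "no physical block"

# Offset of each fork's first map page inside a group (assumed order).
FORK_OFFSET = {"fsm": 0, "vm": 1, "main": 2}

def locate_map_entry(fork: str, lblk: int) -> tuple[int, int]:
    """Return (metadata-fork block number, entry index) for a logical block."""
    map_page = lblk // MAP_ENTRIES_PER_PAGE   # which map page of this fork
    entry = lblk % MAP_ENTRIES_PER_PAGE       # slot inside that page
    if fork == "main":
        group, within = divmod(map_page, MAIN_PAGES_PER_GROUP)
    else:
        group, within = map_page, 0           # one FSM/VM map page per group
    # Block 0 is the superblock, so groups start at block 1.
    return 1 + group * GROUP_PAGES + FORK_OFFSET[fork] + within, entry

print(locate_map_entry("main", 0))
print(locate_map_entry("main", MAIN_PAGES_PER_GROUP * MAP_ENTRIES_PER_PAGE))
```

Under these assumptions one group covers 8192 × 2048 MAIN blocks, and the second call above lands in the first MAIN map page of the second group.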
WAL/Redo Model
Two complementary WAL mechanisms:
Block-reference remap metadata (BKPBLOCK_HAS_REMAP):
- Full header: old_pblkno, new_pblkno, logical_nblocks, next_free_pblkno
- Attached to ordinary WAL block records when automatic checkpoint-boundary remap is triggered
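To make the size of this metadata concrete, the following sketch packs the four fields of the full header into a fixed-width record. The field widths (uint32 each), the field order, and the absence of padding are assumptions for illustration. The helper names are hypothetical and do not come from the patch.

```python
import struct

# Assumption: four uint32 fields in the order listed above, native byte order
# with no alignment padding (PostgreSQL WAL is not endian-portable).
REMAP_HEADER = struct.Struct("=IIII")

def pack_remap_header(old_pblkno, new_pblkno, logical_nblocks, next_free_pblkno):
    """Serialize the remap metadata that rides on a WAL block reference."""
    return REMAP_HEADER.pack(old_pblkno, new_pblkno, logical_nblocks, next_free_pblkno)

def unpack_remap_header(buf):
    """Decode the header back into named fields for redo."""
    old, new, nblocks, next_free = REMAP_HEADER.unpack(buf)
    return {
        "old_pblkno": old,
        "new_pblkno": new,
        "logical_nblocks": nblocks,
        "next_free_pblkno": next_free,
    }

buf = pack_remap_header(7, 42, 100, 43)
print(len(buf), unpack_remap_header(buf))
```

Under these assumptions the header is 16 bytes, compared with 8192 bytes for an uncompressed page image.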
Umbra rmgr records (RM_UMBRA_ID):
- XLOG_UMBRA_MAP_SET: Single mapping establishment/switch
- XLOG_UMBRA_RANGE_REMAP/RANGE_REMAP_COMPACT: Batch first-born mappings
- XLOG_UMBRA_SKIP_WAL_DENSE_MAP: Skip-WAL relation dense identity anchors
- XLOG_UMBRA_RECLAIM_UNLINK: Physical segment deletion after compaction
Redo Cases
The most architecturally significant case is remap-without-image replay:
- Temporarily install old_pblkno as current mapping
- Read the old physical page through that mapping view
- Lock and dirty the buffer (applying WAL delta)
- Flush the buffer to ensure the old baseline is durable
- Switch mapping to new_pblkno
- Bump next_free_pblkno
This makes delta replay deterministic without requiring a full-page image.
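The replay sequence can be modeled with a toy in-memory store, shown below. This is a sketch under stated assumptions, not the patch's redo code: the class and function names are hypothetical, buffer locking is omitted, and the description above does not say which physical block the flush writes to. The sketch assumes the replayed image lands on new_pblkno so that old_pblkno is never overwritten, consistent with the idea that the old page is preserved as the recovery baseline.

```python
class ToyStore:
    """Toy model: physical pages keyed by pblkno, one mapping per lblk."""
    def __init__(self):
        self.physical = {}         # pblkno -> page contents (bytes)
        self.mapping = {}          # lblk -> pblkno
        self.next_free_pblkno = 0

def replay_remap_without_image(store, lblk, old_pblkno, new_pblkno, apply_delta):
    store.mapping[lblk] = old_pblkno              # 1. install old mapping
    page = store.physical[store.mapping[lblk]]    # 2. read through that view
    page = apply_delta(page)                      # 3. apply the WAL delta
    store.physical[new_pblkno] = page             # 4. flush (assumed target: new block)
    store.mapping[lblk] = new_pblkno              # 5. switch mapping
    # 6. bump the allocator frontier past the new block
    store.next_free_pblkno = max(store.next_free_pblkno, new_pblkno + 1)

store = ToyStore()
store.physical[5] = b"AAAA"                       # baseline written before the crash
delta = lambda page: page[:1] + b"Z" + page[2:]   # hypothetical one-byte change

# Replaying the same record twice yields the same state, because the
# baseline on the old physical block is never modified.
replay_remap_without_image(store, 0, 5, 9, delta)
first = (dict(store.physical), dict(store.mapping), store.next_free_pblkno)
replay_remap_without_image(store, 0, 5, 9, delta)
second = (dict(store.physical), dict(store.mapping), store.next_free_pblkno)
print(first == second)
```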
Foreground/Background Split
Foreground policy: Always allocate new physical pages monotonically; never reuse old pages in the hot path; never perform synchronous reclaim.
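A minimal sketch of this foreground policy follows. The names MonotonicAllocator and remap are hypothetical, and concurrency control, WAL logging, and persistence of the frontier are left out. The point is only that the hot path hands out the next free physical block and defers the old block to background reclaim.

```python
class MonotonicAllocator:
    """Hands out physical block numbers in strictly increasing order."""
    def __init__(self, next_free_pblkno: int = 0):
        self.next_free_pblkno = next_free_pblkno

    def allocate(self) -> int:
        pblk = self.next_free_pblkno
        self.next_free_pblkno += 1    # the frontier only moves forward
        return pblk

def remap(mapping: dict, retired: list, alloc: MonotonicAllocator, lblk: int) -> int:
    """Point lblk at a fresh physical block; leave the old one for background reclaim."""
    old = mapping.get(lblk)
    new = alloc.allocate()
    mapping[lblk] = new
    if old is not None:
        retired.append(old)           # never reused in the foreground path
    return new

alloc = MonotonicAllocator(next_free_pblkno=10)
mapping, retired = {0: 3}, []
remap(mapping, retired, alloc, 0)
remap(mapping, retired, alloc, 0)
print(mapping, retired, alloc.next_free_pblkno)
```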
Background maintenance:
- mapwriter: MAP-page flush, physical preallocation (fallocate/posix_fallocate)
- mapcompactor: Extent-level live-density analysis, page relocation, reclaim boundary advancement, segment unlink queueing
In-Flight Ownership
Per-MAP-buffer pending bitmaps serialize concurrent remap/relocation of the same logical block. The chosen physical block remains backend-local until WAL insertion commits the owner state. This prevents:
- Concurrent publication of multiple new mappings for the same lblk
- Foreign backends borrowing uncommitted physical targets
- Foreground/background migration conflicts
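The ownership rule can be sketched as a claim-then-publish protocol. In the model below, a set guarded by a lock stands in for the per-MAP-buffer pending bitmap, and a callback stands in for WAL insertion. All names are hypothetical, and the real implementation lives in shared memory rather than in a Python object.

```python
import threading

class PendingClaims:
    """Tracks logical blocks that have a remap or relocation in flight."""
    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()

    def try_claim(self, lblk: int) -> bool:
        with self._lock:
            if lblk in self._pending:
                return False          # someone else owns this lblk right now
            self._pending.add(lblk)
            return True

    def release(self, lblk: int) -> None:
        with self._lock:
            self._pending.discard(lblk)

def remap_block(claims, mapping, lblk, new_pblk, wal_insert) -> bool:
    """Publish a new mapping only after the WAL record has been inserted."""
    if not claims.try_claim(lblk):
        return False
    try:
        # new_pblk is still private to this caller at this point.
        wal_insert(lblk, mapping.get(lblk), new_pblk)
        mapping[lblk] = new_pblk      # now visible to other backends
        return True
    finally:
        claims.release(lblk)

wal_log = []
shared_mapping = {1: 5}
claims = PendingClaims()
ok = remap_block(claims, shared_mapping, 1, 9, lambda *rec: wal_log.append(rec))
print(ok, shared_mapping, wal_log)
```

A second caller that arrives while the claim is held gets False and must retry or back off, which is how the sketch avoids publishing two new mappings for the same lblk.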
Key Design Decisions and Tradeoffs
1. Conservative Space Cleanup
Physical frontier moves monotonically forward. Immediate reuse would push allocator complexity back into foreground paths. Reclaim is background-only, constrained by live-mapping scans, reclaim boundary, and checkpoint semantics.
2. Double-Buffering
MAP metadata has its own shared-memory buffer cache separate from PostgreSQL's normal buffer pool. This keeps Umbra-specific complexity (remap, allocator, metadata lifecycle) inside Umbra rather than leaking into generic buffer ownership. The authors acknowledge this is a design tradeoff, not a claim of optimality.
3. smgr as the Integration Layer
The prototype chose smgr rather than table AM or a new storage engine abstraction. The rationale: physical placement is the storage manager's responsibility, and code above smgr should not know physical block numbers.
4. WAL-Owned First-Born Protocol
MAIN fork pages use page-WAL-owned first-born mapping (the page WAL record commits the birth claim). FSM/VM use explicit producer-path publication instead, because their growth patterns are more structured.
5. Hint-FPI Path Preserved
XLOG_FPI_FOR_HINT keeps PostgreSQL's existing hint-image rule without Umbra remap. Optimizing this would require separate checksum/torn-page protection design.
Critical Technical Observations
Verification State
The patch series has a known per-patch verification boundary:
- P1–P5: All four matrix items pass (md make check, md recovery, umbra make check, umbra recovery)
- P6: umbra recovery does NOT pass (intentional: P6 establishes the WAL/redo state machine, and P7 closes the remap loop)
- P7–P9: All four matrix items pass
Performance Context
The numbers should be read as directional PoC signals:
- At 200 terminals (checksum=off): Umbra achieves 995K tpmC vs md+fpw=on's 641K (+55%)
- WAL reduction scales with concurrency: 2× at 10 terminals → 6.5× at 1000 terminals
- Umbra does not fully reach the md + fpw=off upper bound, suggesting remaining system costs beyond FPW
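The percentage quoted for the 200-terminal case can be checked directly from the two throughput figures. The helper below is a small arithmetic sketch with a hypothetical function name.

```python
def pct_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

# 200 terminals, checksum=off: Umbra 995K tpmC vs md + fpw=on 641K tpmC
gain = pct_improvement(995, 641)
print(round(gain))  # → 55
```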
Open Engineering Work
- AIO integration: an adaptation exists, but not a complete native rewrite
- CREATE DATABASE WAL_LOG: falls back to FILE_COPY when Umbra is enabled
- Primary/standby physical-page alignment: local recovery coverage exists but not systematically closed
- Superblock shared-entry replacement: currently allocate/free, not LRU eviction
- Range-born/batch mapping publish: no generic upper-layer interface yet
Community Reception and Discussion Points
Interface Complexity Concern (Álvaro Herrera)
The most substantive technical pushback came from Álvaro Herrera, who argued that the fork system itself is the wrong abstraction and that Umbra's internal metadata fork creates "yet another layer to paper over that failed abstraction." He pointed to BRIN's revmap evacuate problem as evidence that a proper owner-defined fork facility would be more fundamental.
The author's response acknowledged the fork abstraction question but argued that:
- lblk→pblk mapping is inherently a storage-manager physical placement concern, not AM content
- Even with owner-defined forks, this metadata would likely still be smgr-owned
- The separate MAP cache follows naturally from translation metadata being different from relation page content
Submission Format Issues (Tom Lane, Bruce Momjian)
The initial submission attempts failed due to mailing list format requirements. The eventual successful format was a tar.gz archive containing all 10 patch files as a single attachment.
Commit Message Requirements (Robert Haas)
Robert Haas noted that each patch needs real commit messages explaining its purpose in detail, not just subject lines.
Implementation Scale
The patch adds approximately 22,588 lines across 132 files:
- ~2,659 lines in umbra.c (smgr implementation)
- ~2,613 lines in umfile.c (physical file layer)
- ~1,789 lines in mapsuper.c (superblock management)
- ~1,547 lines in map.c (mapping lookup/allocation)
- ~1,063 lines in mapbgproc.c (background maintenance)
- ~744 lines added to xloginsert.c (remap-aware WAL assembly)
- ~560 lines added to xlogutils.c (redo-side remap interpretation)
- 22 new TAP recovery tests covering crash consistency, 2PC, truncate, remap redo, torn-page recovery, compactor relocation, and more
AI Assistance Transparency
The author explicitly states that AI coding assistants (Codex) were used extensively for implementation work, while core architecture, boundary definitions, and state-machine reasoning come from the author's own design work. The implementation heavily depends on the prior PG12 shadow prototype and existing PostgreSQL subsystem patterns.