2026-06-04 · claude-opus-4-6

Umbra: A Remap-Aware smgr Prototype on PostgreSQL Master

Executive Summary

Umbra is a proof-of-concept alternative storage manager (smgr) implementation for PostgreSQL that decouples logical page identity from physical page placement. Its primary goal is to eliminate the ordinary checkpoint-boundary full-page-write (FPW) path by maintaining an explicit logical-to-physical block mapping (lblk → pblk), allowing WAL to record remap metadata instead of full 8kB page images for first-dirty-after-checkpoint updates.

The prototype demonstrates 2–7× WAL volume reduction and significant throughput recovery (up to 166% improvement over md + fpw=on) on TPC-C workloads at high concurrency, while preserving crash-recovery semantics.

The Technical Problem

Why Full-Page Writes Are Expensive

PostgreSQL's crash recovery depends on full-page writes (FPW): after a checkpoint, the first modification to any page logs a complete 8kB page image into WAL. This provides a known-good baseline for redo in case of torn pages. The costs compound at scale:

WAL volume amplification: Every first-dirty page after checkpoint generates an 8kB image regardless of actual change size
WAL write/sync pressure: Larger WAL means more I/O bandwidth consumed by WAL writes and fsyncs
Data-file write amplification: The overwrite-in-place model means the same physical location is repeatedly written
Checkpoint-interval repetition: These costs recur every checkpoint interval for write-heavy workloads
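
As a rough illustration of the first cost, the sketch below compares WAL payload with and without full-page images over one checkpoint interval. The page count, updates per page, and 100-byte delta are made-up inputs, not measurements from the prototype; record headers, image compression, and Umbra's own remap metadata are all ignored.

```python
PAGE_SIZE = 8192  # PostgreSQL's default block size, in bytes


def wal_payload_bytes(pages: int, updates_per_page: int,
                      delta_bytes: int, full_page_writes: bool) -> int:
    """Rough WAL payload for one checkpoint interval (record headers ignored).

    With full-page writes on, the first update to each page logs a whole page
    image; every later update logs only its delta.
    """
    if updates_per_page < 1:
        return 0
    if full_page_writes:
        per_page = PAGE_SIZE + (updates_per_page - 1) * delta_bytes
    else:
        per_page = updates_per_page * delta_bytes
    return pages * per_page


# Hypothetical workload: 10,000 pages, 20 small updates each per interval.
with_fpw = wal_payload_bytes(10_000, 20, 100, full_page_writes=True)
without_fpw = wal_payload_bytes(10_000, 20, 100, full_page_writes=False)
print(f"amplification: {with_fpw / without_fpw:.2f}x")
```

In this toy model the amplification shrinks as pages receive more updates per interval, since the one-time 8kB image is spread over more deltas.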

At 1000 concurrent TPC-C terminals, the prototype shows md + fpw=on generating 4.9–6.6× more WAL than Umbra, with throughput degrading to 275K–322K tpmC versus Umbra's 674K–729K tpmC.

The Core Insight

If logical block identity is separated from physical placement, an update after checkpoint can be written to a new physical location while the old physical page is preserved as the recovery baseline. WAL then needs only:

The old physical block number (recovery baseline)
The new physical block number (WAL-owned target)
A compact delta (the actual page modification)

This eliminates the need for an inline full-page image while preserving deterministic crash recovery.
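
The idea can be sketched as a toy in-memory model. This is only an illustration of remap-on-first-dirty under simplifying assumptions; the class and method names (ToyRelation, RemapRecord, first_dirty_update) are hypothetical and do not come from the patch.

```python
from dataclasses import dataclass, field


@dataclass
class RemapRecord:
    lblk: int
    old_pblk: int   # preserved on disk as the recovery baseline
    new_pblk: int   # freshly allocated target for the updated page
    delta: bytes    # the actual page modification


@dataclass
class ToyRelation:
    mapping: dict = field(default_factory=dict)   # lblk -> pblk
    next_free_pblk: int = 0                       # monotonic allocator frontier
    remapped: set = field(default_factory=set)    # lblks remapped since checkpoint

    def extend(self) -> int:
        """Add one logical block and give it its first physical block."""
        lblk = len(self.mapping)
        self.mapping[lblk] = self.next_free_pblk
        self.next_free_pblk += 1
        return lblk

    def checkpoint(self) -> None:
        self.remapped.clear()

    def first_dirty_update(self, lblk: int, delta: bytes):
        """Return a RemapRecord for the first update after a checkpoint, else None."""
        if lblk in self.remapped:
            return None  # later updates would log a plain delta (not modelled)
        old = self.mapping[lblk]
        new = self.next_free_pblk
        self.next_free_pblk += 1
        self.mapping[lblk] = new
        self.remapped.add(lblk)
        return RemapRecord(lblk, old, new, delta)


rel = ToyRelation()
rel.extend()
rel.extend()
rel.checkpoint()
rec = rel.first_dirty_update(0, b"set x=1")
```

The record carries two block numbers and the delta, while the page at old_pblk stays untouched and can serve as the redo baseline.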

Architectural Design

Layer Decomposition

Umbra operates entirely at the smgr layer with six distinct pieces:

Upper PostgreSQL layers: Continue using logical block numbers unchanged
Metadata fork: Per-relation internal fork containing MAP superblock (512B) and MAP pages (fixed-entry arrays of pblkno values)
Runtime access split: umbra.c owns mapped-fork semantics; MAP subsystem owns lookup/state; umfile.c owns physical file I/O
WAL ownership boundary: Fork-level superblock facts, physical-page lifecycle transitions, and mapping changes are all redo-visible through WAL
Redo replay semantics: Redo reconstructs the correct mapping view before replaying page contents
Background maintenance: mapwriter handles MAP-page flush/preallocation; mapcompactor handles reclaim/compaction

MAP Metadata Layout

The metadata fork uses a proportional group layout:

Block 0: MAP superblock (512B packed sector with magic, version, per-fork allocator frontiers, logical EOF, CRC)
Blocks 1+: Repeated groups of [1 FSM map page][1 VM map page][8192 MAIN map pages]

Each MAP page is a simple uint32[MAP_ENTRIES_PER_PAGE] array where each entry stores one physical block number. 0xFFFFFFFF means unmapped.
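
A minimal sketch of how a MAIN-fork logical block could be located in this layout. It assumes the default 8kB page, 4-byte entries filling the whole page with no header (so 2048 entries per page), and groups ordered exactly as listed above; the real MAP_ENTRIES_PER_PAGE value and the FSM/VM indexing rules are not stated in this summary, and main_map_location is a hypothetical helper.

```python
PAGE_SIZE = 8192
ENTRY_SIZE = 4                                   # one uint32 pblkno per entry
MAP_ENTRIES_PER_PAGE = PAGE_SIZE // ENTRY_SIZE   # assumption: no page header
MAIN_PAGES_PER_GROUP = 8192
GROUP_SIZE = 1 + 1 + MAIN_PAGES_PER_GROUP        # FSM + VM + MAIN map pages
UNMAPPED = 0xFFFFFFFF                            # sentinel for "no physical block"


def main_map_location(lblk: int) -> tuple[int, int]:
    """Return (metadata-fork block, entry index) for a MAIN-fork logical block."""
    map_page = lblk // MAP_ENTRIES_PER_PAGE
    group, page_in_group = divmod(map_page, MAIN_PAGES_PER_GROUP)
    # Block 0 is the superblock; each group starts with its FSM and VM map pages.
    block = 1 + group * GROUP_SIZE + 2 + page_in_group
    return block, lblk % MAP_ENTRIES_PER_PAGE


def lookup(map_page: list, entry: int):
    """Translate one entry, returning None when the slot is unmapped."""
    pblk = map_page[entry]
    return None if pblk == UNMAPPED else pblk
```

Under these assumptions one group of MAIN map pages covers 8192 × 2048 logical blocks, so a second group is only reached by very large relations.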

WAL/Redo Model

Two complementary WAL mechanisms:

Block-reference remap metadata (BKPBLOCK_HAS_REMAP):

Full header: old_pblkno, new_pblkno, logical_nblocks, next_free_pblkno
Attached to ordinary WAL block records when automatic checkpoint-boundary remap is triggered
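
To give a sense of scale, the four fields of the full header can be packed as below. The field widths (four uint32 values), their order, and the little-endian encoding are assumptions made for illustration; the actual on-WAL layout is defined by the patch and is not reproduced here.

```python
import struct

# Assumed layout: four little-endian uint32 fields, in the order listed above.
REMAP_HEADER = struct.Struct("<IIII")
FIELDS = ("old_pblkno", "new_pblkno", "logical_nblocks", "next_free_pblkno")


def pack_remap_header(old_pblkno, new_pblkno, logical_nblocks, next_free_pblkno):
    """Serialize the remap header fields into a fixed-size byte string."""
    return REMAP_HEADER.pack(old_pblkno, new_pblkno, logical_nblocks, next_free_pblkno)


def unpack_remap_header(buf: bytes) -> dict:
    """Parse a header produced by pack_remap_header back into named fields."""
    return dict(zip(FIELDS, REMAP_HEADER.unpack(buf)))


raw = pack_remap_header(5, 9, 1, 10)
hdr = unpack_remap_header(raw)
```

With this assumed encoding the header is 16 bytes, compared with 8192 bytes for an uncompressed full-page image.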

Umbra rmgr records (RM_UMBRA_ID):

XLOG_UMBRA_MAP_SET: Single mapping establishment/switch
XLOG_UMBRA_RANGE_REMAP / RANGE_REMAP_COMPACT: Batch first-born mappings
XLOG_UMBRA_SKIP_WAL_DENSE_MAP: Skip-WAL relation dense identity anchors
XLOG_UMBRA_RECLAIM_UNLINK: Physical segment deletion after compaction

Redo Cases

The most architecturally significant case is remap-without-image replay:

Temporarily install old_pblkno as current mapping
Read the old physical page through that mapping view
Lock and dirty the buffer (applying WAL delta)
Flush the buffer to ensure the old baseline is durable
Switch mapping to new_pblkno
Bump next_free_pblkno

This makes delta replay deterministic without requiring a full-page image.
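
The steps above can be walked through with a toy model in which dicts stand in for the mapping and the data file. Two points are interpretations rather than facts from the source: the flush step is modelled as marking the old baseline durable, and the modified buffer is written out only under the new mapping. All names here are hypothetical.

```python
def replay_remap_without_image(mapping, disk, durable, next_free, lblk, hdr, apply_delta):
    """Toy redo of one remap record that carries no full-page image.

    mapping: dict lblk -> pblk; disk: dict pblk -> bytes;
    durable: set of pblks treated as synced. Returns the new allocator frontier.
    """
    mapping[lblk] = hdr["old_pblkno"]             # 1. install the old mapping
    buf = bytearray(disk[mapping[lblk]])          # 2. read the old page via that view
    apply_delta(buf)                              # 3. dirty the buffer with the delta
    durable.add(hdr["old_pblkno"])                # 4. baseline durable (assumed meaning)
    mapping[lblk] = hdr["new_pblkno"]             # 5. switch to the new block
    next_free = max(next_free, hdr["next_free_pblkno"])   # 6. bump the frontier
    # Assumption: the dirty buffer is written out under the new mapping,
    # so the old physical page is never overwritten.
    disk[mapping[lblk]] = bytes(buf)
    return next_free


def set_prefix(buf):
    """Stand-in for a WAL delta: overwrite the first three bytes."""
    buf[0:3] = b"new"


disk = {5: b"old page"}
mapping, durable = {}, set()
hdr = {"old_pblkno": 5, "new_pblkno": 9, "logical_nblocks": 1, "next_free_pblkno": 10}
frontier = replay_remap_without_image(mapping, disk, durable, 0, 0, hdr, set_prefix)
```

Because the page at old_pblkno is only read, running the same replay a second time starts from the same baseline and produces the same result, which is the property the delta-only record relies on.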

Foreground/Background Split

Foreground policy: Always allocate new physical pages monotonically; never reuse old pages in the hot path; never perform synchronous reclaim.

Background maintenance:

mapwriter: MAP-page flush, physical preallocation (fallocate/posix_fallocate)
mapcompactor: Extent-level live-density analysis, page relocation, reclaim boundary advancement, segment unlink queueing
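
A minimal sketch of what extent-level live-density analysis could look like, assuming an extent is a fixed run of physical pages and that a page is live when some logical block currently maps to it. The extent size, the threshold, and both function names are hypothetical.

```python
from collections import Counter


def extent_live_density(mapping: dict, frontier: int, extent_pages: int) -> list:
    """Fraction of physical pages in each extent that are still mapped.

    mapping is lblk -> pblk; frontier is the next free pblk. A partial last
    extent is measured against the full extent size for simplicity.
    """
    live = Counter(pblk // extent_pages for pblk in mapping.values())
    n_extents = -(-frontier // extent_pages)   # ceiling division
    return [live.get(e, 0) / extent_pages for e in range(n_extents)]


def compaction_candidates(density: list, threshold: float) -> list:
    """Extents sparse enough that relocating their live pages may pay off."""
    return [e for e, d in enumerate(density) if d < threshold]


# Four logical blocks whose pages sit in two 4-page extents.
density = extent_live_density({0: 0, 1: 5, 2: 6, 3: 7}, frontier=8, extent_pages=4)
sparse = compaction_candidates(density, threshold=0.5)
```

Relocating the few live pages out of a sparse extent is what would let the reclaim boundary advance and the underlying segment eventually be unlinked.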

In-Flight Ownership

Per-MAP-buffer pending bitmaps serialize concurrent remap/relocation of the same logical block. The chosen physical block remains backend-local until WAL insertion commits the owner state. This prevents:

Concurrent publication of multiple new mappings for the same lblk
Foreign backends borrowing uncommitted physical targets
Foreground/background migration conflicts
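
The claim-then-publish discipline can be sketched with a single-threaded toy. A real implementation would need locking or atomic operations on shared memory, which this example omits; PendingBitmap, remap, and the wal_insert callback are hypothetical names.

```python
class PendingBitmap:
    """One pending bit per entry of a MAP page (single-threaded toy)."""

    def __init__(self, nentries: int):
        self.nentries = nentries
        self.bits = 0

    def try_claim(self, slot: int) -> bool:
        """Set the pending bit; return False if another remap holds it."""
        if not 0 <= slot < self.nentries:
            raise IndexError(slot)
        mask = 1 << slot
        if self.bits & mask:
            return False
        self.bits |= mask
        return True

    def release(self, slot: int) -> None:
        self.bits &= ~(1 << slot)


def remap(bitmap, mapping, slot, new_pblk, wal_insert) -> bool:
    """Publish a new mapping for one slot, or back off if one is in flight."""
    if not bitmap.try_claim(slot):
        return False
    try:
        # new_pblk stays private to this caller until the WAL insert succeeds.
        wal_insert(slot, mapping.get(slot), new_pblk)
        mapping[slot] = new_pblk
    finally:
        bitmap.release(slot)
    return True
```

A second caller that finds the bit set returns without touching the mapping, so at most one new mapping per logical block is ever published at a time.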

Key Design Decisions and Tradeoffs

1. Conservative Space Cleanup

The physical frontier moves monotonically forward. Immediate reuse would push allocator complexity back into foreground paths. Reclaim is background-only, constrained by live-mapping scans, the reclaim boundary, and checkpoint semantics.

2. Double-Buffering

MAP metadata has its own shared-memory buffer cache separate from PostgreSQL's normal buffer pool. This keeps Umbra-specific complexity (remap, allocator, metadata lifecycle) inside Umbra rather than leaking into generic buffer ownership. The author acknowledges this is a design tradeoff, not a claim of optimality.

3. smgr as the Integration Layer

The prototype chose smgr rather than table AM or a new storage engine abstraction. The rationale: physical placement is the storage manager's responsibility, and code above smgr should not know physical block numbers.

4. WAL-Owned First-Born Protocol

MAIN fork pages use page-WAL-owned first-born mapping (the page WAL record commits the birth claim). FSM/VM use explicit producer-path publication instead, because their growth patterns are more structured.

5. Hint-FPI Path Preserved

XLOG_FPI_FOR_HINT keeps PostgreSQL's existing hint-image rule without Umbra remap. Optimizing this would require separate checksum/torn-page protection design.

Critical Technical Observations

Verification State

The patch series has a known per-patch verification boundary:

P1–P5: All four matrix items pass (md make check, md recovery, umbra make check, umbra recovery)
P6: umbra recovery does not pass (intentional: P6 establishes the WAL/redo state machine; P7 closes the remap loop)
P7–P9: All four matrix items pass

Performance Context

The numbers should be read as directional PoC signals:

At 200 terminals (checksum=off): Umbra achieves 995K tpmC vs 641K for md + fpw=on (+55%)
WAL reduction scales with concurrency: 2× at 10 terminals → 6.5× at 1000 terminals
Umbra does not fully reach the md + fpw=off upper bound, suggesting remaining system costs beyond FPW
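
The percentage quoted for the 200-terminal run follows directly from the two throughput figures. The helper below is just that arithmetic; pct_gain is a hypothetical name, and the inputs are the rounded tpmC values given above, in thousands.

```python
def pct_gain(new: float, base: float) -> float:
    """Relative improvement of new over base, in percent."""
    return (new / base - 1.0) * 100.0


# 200 terminals, checksum=off: Umbra vs md + fpw=on, in thousands of tpmC.
gain_200 = pct_gain(995, 641)
print(f"{gain_200:.1f}%")
```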

Open Engineering Work

AIO integration: an adaptation exists, but not a complete native rewrite
CREATE DATABASE WAL_LOG: falls back to FILE_COPY when Umbra is enabled
Primary/standby physical-page alignment: local recovery coverage exists, but the issue is not systematically closed
Superblock shared-entry replacement: currently allocate/free, not LRU eviction
Range-born/batch mapping publish: no generic upper-layer interface yet

Community Reception and Discussion Points

Interface Complexity Concern (Álvaro Herrera)

The most substantive technical pushback came from Álvaro Herrera, who argued that the fork system itself is the wrong abstraction and that Umbra's internal metadata fork creates "yet another layer to paper over that failed abstraction." He pointed to BRIN's revmap evacuate problem as evidence that a proper owner-defined fork facility would be more fundamental.

The author's response acknowledged the fork abstraction question but argued that:

lblk→pblk mapping is inherently a storage-manager physical placement concern, not AM content
Even with owner-defined forks, this metadata would likely still be smgr-owned
The separate MAP cache follows naturally from translation metadata being different from relation page content

Submission Format Issues (Tom Lane, Bruce Momjian)

The initial submission attempts failed due to mailing list format requirements. The eventual successful format was a tar.gz archive containing all 10 patch files as a single attachment.

Commit Message Requirements (Robert Haas)

Robert Haas noted that each patch needs real commit messages explaining its purpose in detail, not just subject lines.

Implementation Scale

The patch adds approximately 22,588 lines across 132 files:

~2,659 lines in umbra.c (smgr implementation)
~2,613 lines in umfile.c (physical file layer)
~1,789 lines in mapsuper.c (superblock management)
~1,547 lines in map.c (mapping lookup/allocation)
~1,063 lines in mapbgproc.c (background maintenance)
~744 lines added to xloginsert.c (remap-aware WAL assembly)
~560 lines added to xlogutils.c (redo-side remap interpretation)
22 new TAP recovery tests covering crash consistency, 2PC, truncate, remap redo, torn-page recovery, compactor relocation, and more

AI Assistance Transparency

The author explicitly states that AI coding assistants (Codex) were used extensively for implementation work, while core architecture, boundary definitions, and state-machine reasoning come from the author's own design work. The implementation heavily depends on the prior PG12 shadow prototype and existing PostgreSQL subsystem patterns.

[PoC] Umbra: a remap-aware smgr prototype on PostgreSQL master
