Two issues leading to discrepancies in FSM data on the standby server

First seen: 2026-03-20 01:32:20+00:00 · Messages: 11 · Participants: 4

Latest Update

2026-05-06 · opus 4.7

Two FSM Discrepancy Bugs on Standby: Lost Hint Writes and Wrong Free-Space Accounting

Architectural Context

The Free Space Map (FSM) is a non-WAL-logged auxiliary fork that tracks approximate free space in heap (and some index) pages. Because it's a hint structure — reconstructable and tolerant of inaccuracy — PostgreSQL deliberately avoids the cost of WAL-logging its updates. Instead, on the primary, FSM is updated opportunistically during DML and vacuum. On the standby, FSM updates are derived from replay of heap WAL records (e.g., XLOG_HEAP_INSERT, XLOG_HEAP2_VISIBLE) via XLogRecordPageWithFreeSpace().

This design creates an asymmetry: the primary's FSM is never WAL-logged directly, and the standby must synthesize FSM state from observing heap-record replay. The correctness assumption is that FSM can be "wrong" without causing corruption; at worst, inserters waste time probing pages that lack space. Neither bug in this thread causes corruption, but both make the standby's FSM wrong enough to produce user-visible pathologies after failover.

Bug 1: MarkBufferDirtyHint Silently Drops FSM Updates During Recovery (with checksums)

Mechanism

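XLogRecordPageWithFreeSpace() is the standby-side entry point for propagating observed free-space changes into the FSM. Since commit 96ef3b8 it uses MarkBufferDirtyHint() rather than the original MarkBufferDirty() (as introduced in e981653). Crucially, when data checksums (or wal_log_hints) are enabled and the server is in recovery, MarkBufferDirtyHint() leaves the buffer clean: protecting a hint-only change against torn writes would require emitting a full-page image, and a standby cannot write WAL. The change then exists only in shared buffers and is lost if the buffer is evicted.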

The reasoning is sound for heap hint bits (PD_ALL_VISIBLE, HEAP_XMIN_COMMITTED, etc.), where torn writes of a checksummed page would render the page unreadable and the hint bit information can be re-derived from CLOG/visibility. But it was applied uniformly to FSM updates — and the tradeoff is wrong there.
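The difference between the two dirtying calls during recovery can be pictured with a toy model. This is a minimal sketch, not PostgreSQL source: ToyBuffer and both toy_ functions are hypothetical names, and the real condition is simplified to "hint changes need a full-page image (checksums or wal_log_hints) and the server is in recovery".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a shared buffer; not a PostgreSQL type. */
typedef struct ToyBuffer
{
    bool dirty;     /* will the page be written out on eviction? */
} ToyBuffer;

/* Unconditional variant: the change is always scheduled for write-back. */
void
toy_mark_buffer_dirty(ToyBuffer *buf)
{
    assert(buf != NULL);
    buf->dirty = true;
}

/*
 * Hint variant (simplified assumption): if hint changes would need a
 * full-page image but the server is in recovery and cannot write WAL,
 * the dirty flag is left untouched.
 * Returns true if the buffer is dirty afterwards.
 */
bool
toy_mark_buffer_dirty_hint(ToyBuffer *buf, bool hints_need_fpi,
                           bool in_recovery)
{
    assert(buf != NULL);
    if (hints_need_fpi && in_recovery)
        return buf->dirty;  /* change stays in memory only */
    buf->dirty = true;
    return true;
}
```

In this model, calling toy_mark_buffer_dirty_hint() on a clean buffer with both flags set leaves it clean, which is the case the thread is concerned with.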

Why the Tradeoff Inverts for FSM

FSM pages are read with RBM_ZERO_ON_ERROR. A corrupt (torn) FSM page is simply zeroed on read and treated as "no free space known", and the FSM naturally rebuilds from subsequent updates. So the torn-page risk that motivated the hint-bit restriction does not apply to FSM. By suppressing the dirty mark, MarkBufferDirtyHint() in recovery actively loses legitimate FSM updates: an update applied to a clean FSM buffer (e.g., from an XLOG_HEAP2_VISIBLE replay after the page was last written out) exists only in memory, and the buffer can be evicted without being written, discarding the update entirely.
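The zero-on-error read behaviour can be sketched as follows. This is an illustrative model under stated assumptions: ToyReadMode, toy_read_page() and TOY_PAGE_SIZE are hypothetical, and the page_is_valid flag stands in for whatever header and checksum verification the real read path performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 64    /* hypothetical; real pages are BLCKSZ bytes */

typedef enum ToyReadMode
{
    TOY_READ_NORMAL,        /* a damaged page makes the read fail */
    TOY_READ_ZERO_ON_ERROR  /* a damaged page is returned as all zeroes */
} ToyReadMode;

/*
 * Copy a page image from "disk" into "dst". Returns false only when the
 * read has to fail; in zero-on-error mode a damaged page never fails.
 */
bool
toy_read_page(const unsigned char *disk, bool page_is_valid,
              ToyReadMode mode, unsigned char *dst)
{
    assert(disk != NULL && dst != NULL);
    if (page_is_valid)
    {
        memcpy(dst, disk, TOY_PAGE_SIZE);
        return true;
    }
    if (mode == TOY_READ_ZERO_ON_ERROR)
    {
        memset(dst, 0, TOY_PAGE_SIZE);  /* reads as "no free space known" */
        return true;
    }
    return false;
}
```

A torn page read in the zero-on-error mode costs only the free-space knowledge that page held, which is why the torn-write argument carries less weight for the FSM than for heap pages.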

The Pathological Scenario

Alexey's reproducer exploits a table with fillfactor < 80:

  1. Primary inserts hit a page; RelationGetBufferForTuple()'s 80% threshold suppresses FSM updates on the primary until much later.
  2. On the standby, the first replay touching an FSM page writes a dirty FPI (so the page is persisted once).
  3. Subsequent XLOG_HEAP2_VISIBLE replays attempt to refresh FSM entries (ab7dbd681 added FSM updates here to address the problem Alvaro reported in a 2018 thread), but MarkBufferDirtyHint() no-ops under checksums.
  4. The FSM buffer is evicted clean; the leaf block on disk retains stale (over-optimistic) free-space values.
  5. After failover/promotion, autovacuum rolls these stale leaves up into the FSM upper/root tree, and new inserters consult the now-authoritative-but-wrong FSM. Each insert probes pages lacking space, causing the observed insertion latency explosion.
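Steps 2 through 4 can be replayed in a small simulation. This is a sketch under simplifying assumptions, not the reproducer from the thread: ToyFsmSlot and the toy_ functions are hypothetical, one FSM entry stands in for a whole FSM page, the free-space figures are made up, and the page is assumed to be written out once between the FPI and the later update.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-entry model of an FSM page: cached copy plus disk copy. */
typedef struct ToyFsmSlot
{
    int  on_disk;   /* free bytes recorded in the persisted page */
    int  in_cache;  /* free bytes recorded in the buffered page */
    bool dirty;     /* does eviction write in_cache back to on_disk? */
} ToyFsmSlot;

/* Step 2: the first touch arrives as a full-page image and is dirtied. */
void
toy_replay_fpi(ToyFsmSlot *slot, int free_bytes)
{
    assert(slot != NULL);
    slot->in_cache = free_bytes;
    slot->dirty = true;
}

/*
 * Step 3: a later replay refreshes the entry. With use_hint_dirty the
 * dirty flag is left alone, modelling a checksummed standby before the fix.
 */
void
toy_replay_update(ToyFsmSlot *slot, int free_bytes, bool use_hint_dirty)
{
    assert(slot != NULL);
    slot->in_cache = free_bytes;
    if (!use_hint_dirty)
        slot->dirty = true;
}

/* Step 4: eviction writes the page only if dirty; a re-read sees the disk copy. */
void
toy_evict(ToyFsmSlot *slot)
{
    assert(slot != NULL);
    if (slot->dirty)
        slot->on_disk = slot->in_cache;
    slot->in_cache = slot->on_disk;
    slot->dirty = false;
}

/* Runs steps 2-4 and returns the free space that survives on disk. */
int
toy_scenario(bool use_hint_dirty)
{
    ToyFsmSlot slot = {0, 0, false};

    toy_replay_fpi(&slot, 6000);                    /* page looks mostly empty */
    toy_evict(&slot);                               /* persisted once */
    toy_replay_update(&slot, 100, use_hint_dirty);  /* page is now nearly full */
    toy_evict(&slot);
    return slot.on_disk;
}
```

With use_hint_dirty set, the on-disk entry keeps the optimistic 6000 bytes; with it cleared, the refreshed value of 100 reaches disk. The first outcome is the stale leaf that step 5 propagates after promotion.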

The Fix

Replace MarkBufferDirtyHint() with MarkBufferDirty() in XLogRecordPageWithFreeSpace(), with a comment explaining why the usual torn-page concern does not apply (FSM uses RBM_ZERO_ON_ERROR).

The Mirroring Subtlety (Korotkov/Plageman exchange)

Alexander Korotkov initially argued that FPIs from the primary give torn-page protection for standby FSM modifications because standby FSM changes "mirror" primary changes. Melanie Plageman corrected this: the conditions under which the FSM is updated on the primary (during INSERT, on rejection, at the >80% threshold) and on the standby (during replay, when action == BLK_NEEDS_REDO && freespace < BLCKSZ/5) are not identical. The standby may update FSM entries for pages whose FSM the primary never touched, so these standby-local updates have no corresponding primary FPI to protect them. This is precisely why relying on hint-bit semantics is unsafe, and also why RBM_ZERO_ON_ERROR is the real justification for the fix, not FPI mirroring.
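The standby-side condition quoted above is simple enough to write out. The sketch below is a minimal illustration: ToyRedoAction, TOY_BLCKSZ and toy_standby_updates_fsm() are hypothetical names mirroring the quoted expression, and the default 8 kB block size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLCKSZ 8192     /* assumes the default 8 kB block size */

/* Hypothetical mirror of the possible replay outcomes for a block. */
typedef enum ToyRedoAction
{
    TOY_BLK_NEEDS_REDO,     /* the record's change had to be applied */
    TOY_BLK_DONE,           /* the page already contained the change */
    TOY_BLK_RESTORED,       /* the page was restored from a full-page image */
    TOY_BLK_NOTFOUND        /* the block no longer exists */
} ToyRedoAction;

/* True when replay would push the page's free space into the FSM. */
bool
toy_standby_updates_fsm(ToyRedoAction action, size_t freespace)
{
    assert(freespace <= TOY_BLCKSZ);
    return action == TOY_BLK_NEEDS_REDO && freespace < TOY_BLCKSZ / 5;
}
```

With 8 kB blocks the integer division gives a cutoff of 1638 bytes, so only pages that are more than roughly 80% full trigger an update. Nothing in this predicate depends on whether the primary touched the same FSM page.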

Bug 2: PageGetFreeSpace vs PageGetHeapFreeSpace in heap_xlog_visible

Commit ab7dbd681 (the fix for Alvaro's 2018 report) added an FSM update to heap_xlog_visible replay but called PageGetFreeSpace() instead of PageGetHeapFreeSpace(). The latter additionally reports zero free space when the page already holds MaxHeapTuplesPerPage line pointers and none of them is unused, which can happen on pages with many HOT-pruning redirect slots. Every other caller of XLogRecordPageWithFreeSpace() uses PageGetHeapFreeSpace(); this is the lone inconsistency.

The consequence is FSM entries advertising free space on pages that cannot actually accept a new tuple because they're out of line pointers. Inserters pick the page, fail, and must update the FSM themselves — an efficiency bug, not a correctness one.
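The gap between the two free-space functions can be shown with a toy page. This is a sketch rather than the PostgreSQL implementation: ToyPage and the toy_ functions are hypothetical, the line-pointer bookkeeping is reduced to two counters, and the constants assume 8 kB pages (4-byte line pointers, 291 tuples per page at most).

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ITEM_ID_SIZE 4                  /* one line pointer, in bytes */
#define TOY_MAX_HEAP_TUPLES_PER_PAGE 291    /* assumes 8 kB pages */

/* Hypothetical, heavily reduced view of a heap page. */
typedef struct ToyPage
{
    size_t lower;       /* end of the line-pointer array */
    size_t upper;       /* start of tuple data */
    int    nline;       /* line pointers allocated */
    int    nunused;     /* how many of those are unused */
} ToyPage;

/* Generic variant: the hole size minus room for one new line pointer. */
size_t
toy_page_get_free_space(const ToyPage *page)
{
    assert(page != NULL && page->upper >= page->lower);
    size_t space = page->upper - page->lower;

    if (space < TOY_ITEM_ID_SIZE)
        return 0;
    return space - TOY_ITEM_ID_SIZE;
}

/* Heap variant: also reports 0 when no line pointer can be handed out. */
size_t
toy_page_get_heap_free_space(const ToyPage *page)
{
    size_t space = toy_page_get_free_space(page);

    if (space > 0 &&
        page->nline >= TOY_MAX_HEAP_TUPLES_PER_PAGE &&
        page->nunused == 0)
        return 0;
    return space;
}
```

For a page whose line-pointer array is full with no unused slot, the generic function still reports the size of the hole while the heap variant reports zero. That difference is exactly the over-advertised space this bug records in the FSM.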

This bug is moot on master: commit a881cc9 removed heap_xlog_visible entirely, consolidating visibility-map updates into XLOG_HEAP2_PRUNE_VACUUM_CLEANUP whose handler already uses PageGetHeapFreeSpace(). The fix is therefore a back-branch-only patch (14–18) and Alexey agreed to move it to a separate thread/CF entry.

Key Design Observations

  1. Hint-bit machinery is not one-size-fits-all. The choice between MarkBufferDirty and MarkBufferDirtyHint depends not just on "is this WAL-logged" but on how corruption of the target page is handled on read. FSM's RBM_ZERO_ON_ERROR semantics make the conservative hint-bit path actively harmful.

  2. Commit archaeology matters. The switch in 96ef3b8 from MarkBufferDirty to MarkBufferDirtyHint was made without a comment, and its rationale only became clear via a README update in a follow-up commit (9df56f6). Andrey Borodin's insistence on commenting the revert is a direct response to that earlier invisibility of intent.

  3. Index FSM is different. Andrey noticed indexes don't use XLogRecordPageWithFreeSpace() at all during replay; they rely on index vacuum to rebuild FSM on the standby. This is consistent but means index FSM on freshly-promoted standbys can also be stale — a design point, not a bug.

  4. Back-patch scope. Bug 1 is user-visible (insertion slowdown after promotion) but not a data-corruption issue, so the back-patch decision rested on impact severity. Consensus (Korotkov, Plageman) landed on back-patching to all supported branches.

Patch Disposition