Two FSM Discrepancy Bugs on Standby: Lost Hint Writes and Wrong Free-Space Accounting
Architectural Context
The Free Space Map (FSM) is a non-WAL-logged auxiliary fork that tracks approximate free space in heap (and some index) pages. Because it's a hint structure — reconstructable and tolerant of inaccuracy — PostgreSQL deliberately avoids the cost of WAL-logging its updates. Instead, on the primary, FSM is updated opportunistically during DML and vacuum. On the standby, FSM updates are derived from replay of heap WAL records (e.g., XLOG_HEAP_INSERT, XLOG_HEAP2_VISIBLE) via XLogRecordPageWithFreeSpace().
This design creates an asymmetry: the primary's FSM is never WAL-logged directly, and the standby must synthesize FSM state from observing heap-record replay. The correctness assumption is that FSM can be "wrong" without causing corruption — at worst, inserters waste time probing pages that lack space. The two bugs in this thread both violate that assumption badly enough to produce user-visible pathologies after failover.
Bug 1: MarkBufferDirtyHint Silently Drops FSM Updates During Recovery (with checksums)
Mechanism
XLogRecordPageWithFreeSpace() is the standby-side entry point for propagating observed free-space changes into the FSM. Since commit 96ef3b8 it uses MarkBufferDirtyHint() rather than the original MarkBufferDirty() (as introduced in e981653). Crucially:
MarkBufferDirtyHint() is a no-op during recovery when data_checksums is enabled. The rationale, documented in the README update of commit 9df56f6, is torn-page safety: hint bits changed on the standby cannot be safely flushed without a covering FPI, because no new WAL is generated on the standby and a torn write of a checksummed page is unrecoverable.
The reasoning is sound for heap hint bits (PD_ALL_VISIBLE, HEAP_XMIN_COMMITTED, etc.), where torn writes of a checksummed page would render the page unreadable and the hint bit information can be re-derived from CLOG/visibility. But it was applied uniformly to FSM updates — and the tradeoff is wrong there.
Why the Tradeoff Inverts for FSM
FSM pages are read with RBM_ZERO_ON_ERROR. A corrupt (torn) FSM page is simply zeroed on read and treated as "no free space known" — the FSM naturally rebuilds from subsequent updates. So the torn-page risk that motivated the hint-bit restriction does not apply to FSM. By suppressing the dirty mark, MarkBufferDirtyHint() in recovery actively loses legitimate FSM updates: between the FPI that first brings the FSM page into the standby's buffer cache and the next modification (e.g., from an XLOG_HEAP2_VISIBLE replay), the page can be evicted clean, discarding the intervening update entirely.
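The zero-on-error behaviour can be illustrated with a toy model. This is a minimal sketch, not PostgreSQL code: the page layout, the additive checksum, and every toy_* name are assumptions made for illustration. It shows only the idea that a page failing verification is zeroed and read as "no free space known" instead of raising an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64            /* hypothetical tiny page, not BLCKSZ */

typedef struct ToyPage {
    uint8_t  slots[TOY_PAGE_SIZE];  /* stand-in for FSM free-space categories */
    uint32_t checksum;              /* checksum stored with the page image */
} ToyPage;

/* Simple additive checksum; a stand-in, NOT PostgreSQL's page checksum. */
static uint32_t toy_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}

static void toy_seal(ToyPage *p)
{
    p->checksum = toy_checksum(p->slots, TOY_PAGE_SIZE);
}

/*
 * Models a zero-on-error read: if verification fails, zero the page and
 * carry on. Returns true when the page verified, false when it was zeroed.
 */
static bool toy_read_zero_on_error(ToyPage *p)
{
    assert(p != NULL);
    if (toy_checksum(p->slots, TOY_PAGE_SIZE) == p->checksum)
        return true;
    memset(p->slots, 0, TOY_PAGE_SIZE);
    toy_seal(p);
    return false;
}

/*
 * Simulates a torn write: only the first half of the new image reaches
 * "disk" while the stored checksum still describes the old image.
 * Returns the first slot as seen by a reader afterwards.
 */
static uint8_t toy_slot_after_torn_write(uint8_t old_value, uint8_t new_value)
{
    ToyPage p;
    memset(p.slots, old_value, TOY_PAGE_SIZE);
    toy_seal(&p);
    memset(p.slots, new_value, TOY_PAGE_SIZE / 2);  /* torn: half written */
    (void) toy_read_zero_on_error(&p);
    return p.slots[0];
}
```

In this model a torn page costs only the free-space knowledge stored on it, which later updates can rebuild. A heap page in the same situation has no such fallback.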
The Pathological Scenario
Alexey's reproducer exploits a table with fillfactor < 80:
- Primary inserts hit a page; RelationGetBufferForTuple()'s 80% threshold suppresses FSM updates on the primary until much later.
- On the standby, the first replay touching an FSM page writes a dirty FPI (so the page is persisted once).
- Subsequent XLOG_HEAP2_VISIBLE replays (ab7dbd681 added FSM updates here to address the 2018 thread Alvaro reported) attempt to refresh FSM entries, but MarkBufferDirtyHint() no-ops under checksums.
- The FSM buffer is evicted clean; the leaf block on disk retains stale (over-optimistic) free-space values.
- After failover/promotion, autovacuum rolls these stale leaves up into the FSM upper/root tree, and new inserters consult the now-authoritative-but-wrong FSM. Each insert probes pages lacking space, causing the observed insertion latency explosion.
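The lost-update step in the scenario above can be sketched as a toy simulation. It assumes a single buffer holding a single free-space value. All toy_* names are hypothetical, and the two marking functions only imitate the behaviour the text attributes to MarkBufferDirty() and MarkBufferDirtyHint(); they are not the real buffer manager.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model: one FSM leaf value, one buffer copy, one on-disk copy. */
typedef struct ToyFsmBuffer {
    uint8_t disk_value;  /* free-space category persisted on disk */
    uint8_t buf_value;   /* free-space category in the shared buffer */
    bool    dirty;       /* must eviction write the buffer back? */
} ToyFsmBuffer;

/* Unconditional dirty mark (imitates MarkBufferDirty). */
static void toy_mark_dirty(ToyFsmBuffer *b)
{
    assert(b != NULL);
    b->dirty = true;
}

/*
 * Hint-style dirty mark (imitates MarkBufferDirtyHint as described in the
 * text): silently skipped during recovery when checksums are enabled.
 */
static void toy_mark_dirty_hint(ToyFsmBuffer *b, bool in_recovery,
                                bool data_checksums)
{
    assert(b != NULL);
    if (in_recovery && data_checksums)
        return;
    b->dirty = true;
}

/* Eviction writes the buffer back only if it was marked dirty. */
static void toy_evict(ToyFsmBuffer *b)
{
    if (b->dirty)
    {
        b->disk_value = b->buf_value;
        b->dirty = false;
    }
}

/*
 * Replay one free-space update on a clean buffer holding `stale`, then
 * evict it. Returns what is left on disk.
 */
static uint8_t toy_disk_value_after_replay(bool in_recovery,
                                           bool data_checksums,
                                           bool use_hint_mark,
                                           uint8_t stale, uint8_t fresh)
{
    ToyFsmBuffer b = { stale, stale, false };

    b.buf_value = fresh;  /* the replayed FSM update */
    if (use_hint_mark)
        toy_mark_dirty_hint(&b, in_recovery, data_checksums);
    else
        toy_mark_dirty(&b);
    toy_evict(&b);
    return b.disk_value;
}
```

With in_recovery and data_checksums both true, the hint-style path leaves the stale value on disk, while the unconditional path persists the fresh one. That difference is the whole of Bug 1 in miniature.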
The Fix
Replace MarkBufferDirtyHint() with MarkBufferDirty() in XLogRecordPageWithFreeSpace(), with a comment explaining why the usual torn-page concern does not apply (FSM uses RBM_ZERO_ON_ERROR).
The Mirroring Subtlety (Korotkov/Plageman exchange)
Alexander Korotkov initially argued that FPIs from the primary give torn-page protection for standby FSM modifications because standby FSM changes "mirror" primary changes. Melanie Plageman corrected this: FSM update conditions on primary (during-INSERT, on rejection, at >80% threshold) and on standby (during replay, when action == BLK_NEEDS_REDO && freespace < BLCKSZ/5) are not identical. The standby may update FSM entries for pages whose FSM the primary never touched, so these standby-local updates have no corresponding primary FPI to protect them. This is precisely why relying on hint-bit semantics is unsafe — and also why ZERO_ON_ERROR is the real justification for the fix, not FPI mirroring.
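The standby-side condition quoted above can be written as a small predicate. This is a sketch under stated assumptions: the enum and function names are hypothetical stand-ins, and only the condition itself (action == BLK_NEEDS_REDO && freespace < BLCKSZ/5) comes from the thread. The primary-side conditions are deliberately not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the redo action returned when replaying a block. */
typedef enum ToyRedoAction {
    TOY_BLK_NEEDS_REDO,  /* changes must be applied to the page */
    TOY_BLK_DONE,        /* page already up to date */
    TOY_BLK_RESTORED,    /* page restored from a full-page image */
    TOY_BLK_NOTFOUND     /* block no longer exists */
} ToyRedoAction;

/*
 * Does replay of a heap record push a free-space value into the FSM?
 * Mirrors the condition quoted in the text: the record was actually
 * redone and the page has less than a fifth of a block free.
 */
static bool toy_standby_records_fsm(ToyRedoAction action, size_t freespace,
                                    size_t blcksz)
{
    assert(blcksz > 0);
    return action == TOY_BLK_NEEDS_REDO && freespace < blcksz / 5;
}
```

Under this predicate a block restored from a full-page image does not trigger an FSM update, whereas a block that is redone normally can. In this model the standby's FSM writes are decided by replay state and page fullness, not by anything the primary did to its own FSM.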
Bug 2: PageGetFreeSpace vs PageGetHeapFreeSpace in heap_xlog_visible
Commit ab7dbd681 (the fix for Alvaro's 2018 report) added an FSM update to heap_xlog_visible replay but called PageGetFreeSpace() instead of PageGetHeapFreeSpace(). The latter additionally reports zero free space when the line-pointer array has reached MaxHeapTuplesPerPage and no line pointer is free for reuse, which can happen on pages with many HOT-pruning redirect slots. Every other caller of XLogRecordPageWithFreeSpace() uses PageGetHeapFreeSpace(); this is the lone inconsistency.
The consequence is FSM entries advertising free space on pages that cannot actually accept a new tuple because they're out of line pointers. Inserters pick the page, fail, and must update the FSM themselves — an efficiency bug, not a correctness one.
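The difference between the two functions can be sketched with a toy model. It assumes a page described by three numbers only; the toy_* names are hypothetical, and the constant 291 is the MaxHeapTuplesPerPage value for the default 8 kB block size, hard-coded here as an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_HEAP_TUPLES_PER_PAGE 291  /* assumed: default 8 kB blocks */
#define TOY_ITEMID_SIZE 4                 /* size of one line pointer */

/*
 * Imitates PageGetFreeSpace: the hole between the line-pointer array and
 * the tuple data, minus room for one new line pointer.
 */
static size_t toy_page_free_space(size_t hole_bytes)
{
    if (hole_bytes < TOY_ITEMID_SIZE)
        return 0;
    return hole_bytes - TOY_ITEMID_SIZE;
}

/*
 * Imitates PageGetHeapFreeSpace: same as above, but reports zero when the
 * line-pointer array is full and none of its entries can be reused.
 */
static size_t toy_heap_free_space(size_t hole_bytes, int line_pointers,
                                  int unused_pointers)
{
    size_t space = toy_page_free_space(hole_bytes);

    assert(unused_pointers >= 0 && unused_pointers <= line_pointers);
    if (space > 0 &&
        line_pointers >= TOY_MAX_HEAP_TUPLES_PER_PAGE &&
        unused_pointers == 0)
        return 0;
    return space;
}
```

For a page with 1000 bytes in the hole and a full line-pointer array with nothing reusable, the first function reports 996 bytes while the second reports 0. An FSM fed by the first will keep steering inserters to a page that cannot take a tuple.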
This bug is moot on master: commit a881cc9 removed heap_xlog_visible entirely, consolidating visibility-map updates into XLOG_HEAP2_PRUNE_VACUUM_CLEANUP whose handler already uses PageGetHeapFreeSpace(). The fix is therefore a back-branch-only patch (14–18) and Alexey agreed to move it to a separate thread/CF entry.
Key Design Observations
- Hint-bit machinery is not one-size-fits-all. The choice between MarkBufferDirty and MarkBufferDirtyHint depends not just on "is this WAL-logged" but on how corruption of the target page is handled on read. FSM's RBM_ZERO_ON_ERROR semantics make the conservative hint-bit path actively harmful.
- Commit archaeology matters. The switch in 96ef3b8 from MarkBufferDirty to MarkBufferDirtyHint was made without a comment, and its rationale only became clear via a README update in a follow-up commit (9df56f6). Andrey Borodin's insistence on commenting the revert is a direct response to that earlier invisibility of intent.
- Index FSM is different. Andrey noticed indexes don't use XLogRecordPageWithFreeSpace() at all during replay; they rely on index vacuum to rebuild FSM on the standby. This is consistent but means index FSM on freshly-promoted standbys can also be stale; this is a design point, not a bug.
- Back-patch scope. Bug 1 is user-visible (insertion slowdown after promotion) but not a data-corruption issue, so the back-patch decision rested on impact severity. Consensus (Korotkov, Plageman) landed on back-patching to all supported branches.
Patch Disposition
- Patch 0001 (MarkBufferDirty fix): accepted, to be committed with back-patches by Korotkov.
- Patch 0002 (PageGetHeapFreeSpace fix): split off to a separate bugs thread / CF entry, targeting 14–18 only.