GSoC 2026 Proposal: B-tree Index Bloat Reduction via Sibling Page Merging
Context and Motivation
PostgreSQL's B-tree (nbtree) implementation is well known for being asymmetric with respect to space reclamation: splits are eager and cheap, but merges effectively do not exist. When tuples are deleted (or HOT-pruned out of index relevance via VACUUM's index-vacuum phase), a B-tree page can become nearly empty, yet the page is never combined with a sparsely populated sibling. The only way such a page can be reclaimed is if it becomes completely empty, at which point _bt_pagedel() marks it half-dead and ultimately unlinks it from the tree (see nbtpage.c). This produces the classic "index bloat" pathology where a B-tree that once held, say, 80% full pages degrades to 10–20% full pages after a workload of random deletes, and REINDEX (or REINDEX CONCURRENTLY) is the only remedy.
Prior discussion on -hackers (e.g., Peter Geoghegan's long-running work on bottom-up index deletion in PG14, and earlier threads on "B-tree page merging" by Simon Riggs and others) has repeatedly identified sibling merging as desirable but very hard to make concurrency-safe. The hard part is not the data movement — it is reconciling an in-progress merge with the existing scan concurrency protocol, which relies on right-links (the Lehman & Yao / Lanin & Shasha invariants as adapted by PostgreSQL) and on the fact that pages only ever disappear after becoming empty and half-dead.
This proposal by Salma El-Sayed is a GSoC 2026 project that tries to tackle exactly that: introduce a real sibling-merge operation with a concurrency scheme that keeps forward and backward scans correct.
Proposed Design
Merge mechanics
Two sibling leaf pages L (left) and R (right) are merge candidates when both are sparsely populated (e.g., <10% full) and size(L) + size(R) ≤ fillfactor. The proposed algorithm:
- Copy L's tuples into R.
- Mark R with a new flag
BTP_MERGED(R now contains L's contents in addition to its own). - Mark L with a new flag
BTP_MERGED_AWAY(logically gone, physically retained). - Unlink L from its parent immediately; defer fixing L's left sibling's right-link to VACUUM.
- Once no snapshot predating the merge can still reference L, VACUUM transitions L to
HALF_DEADand reclaims it through the existing page-deletion path.
This is essentially a deferred-reclamation scheme analogous to how _bt_pagedel already works, but generalized from "empty page" to "page whose contents now live in its right sibling."
Concurrent scan correctness
The correctness argument hinges on two new flag states observed by an in-flight scan:
- Forward scan on L after the merge: sees
BTP_MERGED_AWAY, follows the right-link to R. If the scan had already consumed some of L before the merge, it uses its saved high-key from L to binary-search into R and skip already-returned tuples. This is a reasonable extension of the existing "recheck via high key on page-split" logic in_bt_readpage/_bt_steppage. - Backward scan on R after the merge: sees
BTP_MERGED, reads R (which now includes L's data), and then must not also read L. The proposal adds per-scan state: if the scan has ever observedBTP_MERGED, it knows any subsequentBTP_MERGED_AWAYpage to its left is already subsumed and should be skipped. If it has never observedBTP_MERGED, then arriving at aBTP_MERGED_AWAYpage means the scan crossed R before the merge and still owes the user L's tuples, so it must read L's retained copy.
This is the most novel (and most fragile) part of the proposal. It essentially turns the scan into a small state machine that reasons about "did I cross the merge boundary before or after it happened?" The correctness of this depends critically on L's page image remaining unchanged between BTP_MERGED_AWAY being set and the page actually being reclaimed — otherwise the late-arriving backward scan would read inconsistent data.
Triggering policy
Two options are floated:
- (a) Full-index scan finding all merge candidates (aggressive, expensive on huge indexes).
- (b) Bounded scan via a user-callable function, e.g.
bt_merge_pages(index regclass, min_fillfactor float8, max_fillfactor float8, max_pages int4).
Option (b) mirrors the philosophy of pg_repack and pgstattuple — give the DBA a scalpel rather than bundling everything into autovacuum. This is a pragmatic choice for a GSoC-scoped deliverable and avoids the very contentious question of autovacuum integration.
Flag allocation
Two new bits in BTPageOpaqueData.btpo_flags are requested: BTP_MERGED and BTP_MERGED_AWAY. Currently btpo_flags uses BTP_LEAF, BTP_ROOT, BTP_DELETED, BTP_META, BTP_HALF_DEAD, BTP_SPLIT_END, BTP_HAS_GARBAGE, BTP_INCOMPLETE_SPLIT, BTP_HAS_FULLXID — nine bits out of 16, so bits are physically available, but every new flag bit is scrutinized heavily because btpo_flags is part of the on-disk format.
Architectural Concerns the Proposal Does Not Yet Address
An experienced nbtree reviewer (Peter Geoghegan is the de facto gatekeeper here) would likely raise:
- WAL logging and crash recovery. The proposal is silent on how the merge is journaled. It must be a single atomic WAL record covering L, R, and the parent (similar to
xl_btree_unlink_page), or the invariants break on replay. Standby conflict handling (ResolveRecoveryConflictWithSnapshot) also needs to know aboutBTP_MERGED_AWAY. - Parent-level concurrency. Unlinking L from the parent immediately (step 3) means the parent is modified while L is still readable. Any concurrent descent that landed in the parent before the merge but is about to descend to L must still reach L correctly — today this is guaranteed because pages are never removed from the parent until they are empty and half-dead. Changing that ordering is the crux of why sibling merging has historically been deemed hard.
- Interaction with
_bt_moveright. The right-link chasing protocol assumes the right-link always points to a live page or a page being deleted.BTP_MERGED_AWAYneeds to be a valid link target with semantics equivalent to "keep going right." - Unique index / posting-list considerations. Deduplicated posting tuples (PG13+) and unique-index insertion's cross-page visibility checks both assume specific page stability properties.
- VACUUM's reliance on recentGlobalXmin vs. snapshot horizons. The "once all transactions active at merge time have committed" criterion must be expressed in terms of
GlobalVisHorizonsemantics; the proposal uses informal wording. - Why not defer to
REINDEX CONCURRENTLY? The strongest pushback any such project faces is "we already have a tool for this." The proposal would benefit from data showing merge-based reclamation is meaningfully cheaper than periodic reindexing in realistic workloads.
Status of the Thread
As of this snapshot the thread contains only the student's own design-and-questions message, with no community replies yet. The substantive review — which will determine whether this design is workable or needs a fundamental rethink — has not happened. The linked introduction email suggests a mentor is engaged, but on-list feedback from nbtree experts (Peter Geoghegan, Heikki Linnakangas, Robert Haas) is still pending.