[GSoC 2026] - B-tree Index Bloat Reduction - Introduction

2026-05-06 · opus 4.7

Thread Overview

This is a GSoC 2026 introductory email from a student contributor (Salma El-Sayed) announcing her project to tackle B-tree index bloat reduction via page merging. No community replies are present in the thread as captured — this is effectively a one-way announcement of scope and intent. The technical substance therefore lies in the proposal and what it implies for the nbtree code path, not in any dialogue.

The Core Technical Problem

PostgreSQL's B-tree (nbtree) implementation has an asymmetric space-management story:

Page splits happen eagerly: when a leaf overflows on insert, _bt_split in nbtinsert.c allocates a new page and redistributes tuples.
Page deletion happens only in the narrow case handled by _bt_pagedel / _bt_unlink_page during VACUUM, and only when a page becomes completely empty. Half-dead pages and the two-stage deletion protocol (introduced to handle concurrent scans safely) are documented in nbtree/README.

There is no mechanism to merge two underfull sibling pages into one. Consequently, workloads dominated by non-HOT UPDATE and DELETE produce leaf pages that each retain one or a handful of live tuples but remain linked in the tree indefinitely. In production this manifests as indexes with 90%+ bloat where VACUUM (even aggressively) cannot reclaim space, and the only remediation is REINDEX CONCURRENTLY — expensive and operationally disruptive.

This is a well-known architectural gap. Classic B-link-tree literature (Lehman & Yao, and the Lanin-Shasha refinements PostgreSQL's nbtree is loosely based on) describes merge/rebalance operations, but PostgreSQL has historically skipped them because the concurrency proofs are significantly harder than for split/delete, and because the existing approach (full-page deletion + eventual reuse via the FSM) was deemed "good enough" for append-mostly workloads.

Why Merging Is Hard in nbtree Specifically

The proposal glosses over — but correctly identifies — the hard subsystems that must be touched:

Concurrency with the B-link protocol. nbtree relies on right-links (BTPageOpaque->btpo_next) so that readers descending the tree can always recover if a page has split underneath them. A merge operation moves tuples leftward and then unlinks the right page. Concurrent scanners holding a pin on the right sibling must not see corrupt state or miss tuples. The existing half-dead / unlink protocol in _bt_mark_page_halfdead and _bt_unlink_page solves this for empty pages by guaranteeing there are no live tuples to miss; merging violates that precondition.
Parent downlink maintenance. Deleting a page removes one downlink from the parent; merging requires deleting one downlink and possibly updating the high-key (separator) of the surviving page. If the parent is itself underfull after the downlink removal, merging may need to cascade upward — a fundamentally different invariant than the current code maintains.
WAL and crash recovery. A new WAL record type (likely XLOG_BTREE_MERGE) is needed. It must be idempotent on redo and must correctly handle partial merges if a crash occurs between the leaf-level merge and the parent-level downlink update. The existing btree_xlog_unlink_page logic is a useful template but is not sufficient.
Reverse scans. nbtree supports backward index scans via btpo_prev. Merging must update prev-links on the right-right neighbor atomically (under WAL) so that backward traversal does not land on a deleted page.
Unique-index and deduplication interactions. Deduplicated posting-list tuples (Peter Geoghegan's work, PG 13) complicate tuple counting: a page may look "sparse" in tuple count but be near-full in bytes, or vice versa. The merge decision heuristic must use byte-level fill factor, not tuple counts.
VACUUM integration. The natural home for merging is btvacuumscan / btvacuumpage, which already walks pages in physical order and holds the cleanup lock. The student's framing ("VACUUM can only delete pages that are completely empty") correctly identifies the integration point.

Proposed Approach — Assessment

The proposal states: "identify adjacent sparse pages that share the same parent, consolidate their tuples into a single page, and update the parent node's downlinks accordingly."

This is the textbook approach, and the "same parent" restriction is a pragmatic simplification that avoids cross-subtree rebalancing. However, several design decisions are left open:

Trigger policy: threshold-based (e.g., both pages <40% full) vs. opportunistic during VACUUM vs. a new maintenance command. This has large performance implications — aggressive merging can thrash with subsequent inserts that re-split.
Locking order: left-then-right is the nbtree convention; the merge must respect this to avoid deadlock with concurrent splits (which also lock left-to-right).
Rightmost page handling: the rightmost leaf has special status (P_RIGHTMOST) and its high key semantics differ.
Interaction with _bt_getbuf / FSM: the freed page must go through the same recycling protocol (with xid horizon) that _bt_unlink_page uses, otherwise concurrent scanners can follow a stale right-link into a reused page.

The student's tooling branch (visualizing nbtree before/after merge) is a sensible starting point — being able to inspect page-level state is essential for debugging concurrency bugs in this code.

Strategic Assessment

The scope described ("finalize analysis, identify impacted areas: WAL, reverse/forward scans, RECOVERY, testing") is realistic for a GSoC timeline in that analysis and a prototype are achievable. The student's own acknowledgment that it "might not get committed during GSoC2026" is calibrated correctly — page merging in nbtree has been discussed on -hackers periodically for years without landing, and a committer (most likely Peter Geoghegan, the de facto nbtree maintainer) would need to be deeply involved for this to ever reach master. Notably, Peter Geoghegan is not listed among the mentors (Kirk Wolak, Andreas, Pavlo Golub, Nikolay Samokhvalov, Andrei Lepikhov), which means the patch will need to win over the actual nbtree gatekeeper separately.

The mentor list is interesting: Pavlo Golub and Nikolay Samokhvalov are associated with the Postgres.ai / performance-tuning community where index bloat is a front-and-center operational pain point, which explains the project's motivation. Andrei Lepikhov has committer-adjacent experience on planner/executor work. None of the named mentors are primary nbtree hackers, which is a structural risk for the project.

Outlook

For this effort to succeed beyond GSoC, the design document will need to engage with prior -hackers discussions on B-tree rebalancing and with Peter Geoghegan's published reasoning for why deduplication + bottom-up index deletion (PG 14) were preferred over merging as bloat mitigations. Those features already absorb much of the motivating workload; the incremental case for merging must be made on top of them, not as if they didn't exist.