Randomize B-Tree page split location to avoid oscillating patterns

First seen: 2026-04-27 16:24:13+00:00 · Messages: 5 · Participants: 3

Latest Update

2026-05-07 · opus 4.7

Randomizing B-Tree Split Locations to Defeat "Waves of Misery"

The Core Problem

This thread addresses a well-documented pathology in B-Tree indexes known as "Waves of Misery" (Glombiewski, Seeger, Graefe, BTW 2019). The phenomenon arises from the interaction between two seemingly innocuous facts:

  1. B-Tree page splits (in PostgreSQL, and most B-Tree implementations) choose a deterministic split point — typically the location that best satisfies a fillfactor-derived space-balance criterion.
  2. When inserts follow approximately the same key distribution as the existing data (e.g., UUIDv4 insertions into a UUID-keyed index), page splits become temporally correlated. A page that splits at time t produces two pages that, because they received keys from the same distribution at the same rate, will both become full and split again at approximately time t + Δ. This correlation propagates, producing visible oscillations in split rate, index growth, and — critically — buffer contention and I/O.

The fillfactor merely phase-shifts the waves; it cannot dampen them, because it doesn't break the symmetry that causes sibling pages to fill at the same rate. The canonical cure discussed in the literature is to inject entropy into the split location itself, so that sibling pages end up with slightly different remaining capacities and thus desynchronize over successive generations of splits.
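
To make the mechanism concrete, the following is a small, self-contained simulation in C. It is illustrative only, not PostgreSQL code; the page capacity, fillfactor, insert rate, and the ±10% split perturbation are arbitrary values chosen for the sketch. It models uniformly distributed inserts into a set of pages that all start at the same fill level and compares a fixed 50/50 split with a randomly perturbed split fraction.

    /*
     * Toy simulation of "waves of misery" (illustrative only, not PostgreSQL
     * code): pages all start at the same fill level, inserts are spread
     * uniformly, and a full page splits either exactly in half or at a
     * randomly perturbed fraction.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_CAPACITY     200               /* tuples per page */
    #define INITIAL_PAGES     1000              /* pages right after bulk build */
    #define FILLFACTOR        0.90              /* every page starts 90% full */
    #define TICKS             200               /* time buckets to simulate */
    #define INSERTS_PER_TICK  20000
    #define MAX_PAGES         (INITIAL_PAGES * 64)

    static int fill[MAX_PAGES];                 /* tuple count per page */
    static int npages;

    static void
    simulate(int randomize, const char *label)
    {
        srand(42);
        npages = INITIAL_PAGES;
        for (int i = 0; i < npages; i++)
            fill[i] = (int) (PAGE_CAPACITY * FILLFACTOR);

        printf("%s\n", label);
        for (int t = 0; t < TICKS; t++)
        {
            long    splits = 0;

            for (int n = 0; n < INSERTS_PER_TICK; n++)
            {
                int     p = rand() % npages;    /* uniform key distribution */

                if (++fill[p] >= PAGE_CAPACITY && npages < MAX_PAGES)
                {
                    /* split: keep "frac" of the tuples, move the rest right */
                    double  frac = 0.5;

                    if (randomize)              /* arbitrary +/- 10% jitter */
                        frac = 0.4 + 0.2 * ((double) rand() / RAND_MAX);

                    int     keep = (int) (PAGE_CAPACITY * frac);

                    fill[npages++] = PAGE_CAPACITY - keep;
                    fill[p] = keep;
                    splits++;
                }
            }
            if (t % 5 == 0)
                printf("tick %3d: %6ld splits\n", t, splits);
        }
    }

    int
    main(void)
    {
        simulate(0, "deterministic 50/50 splits (synchronized waves)");
        simulate(1, "randomized split fraction (waves spread out)");
        return 0;
    }

The deterministic run produces split counts that arrive in bursts on a regular cadence; in the perturbed run the bursts widen and flatten over successive generations, which is exactly the desynchronization the thread is after.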

Why this matters architecturally: oscillating split rates translate directly into bursts of WAL generation, buffer eviction pressure, FPI writes, and lock contention on inner pages. Smooth, uncorrelated split behavior is a prerequisite for predictable tail latency under steady-state insert workloads — particularly with high-entropy keys like UUIDs, which are increasingly common.

The v1 Patch and Its Flaw

Dmitry's initial patch randomizes the chosen split location by perturbing lowsplit/highsplit within the existing split interval (roughly ±20% around the fillfactor-optimal point), working directly off the delta-sorted candidate list in nbtsplitloc.c.

Peter Geoghegan's objection is the central technical insight of the thread: the v1 approach conflates space-balance optimization with suffix-truncation optimization, and undermines the latter. PostgreSQL's nbtsplitloc.c is not merely a space balancer — since the nbtree "suffix truncation" work (commit dd299df8, PG 12), it actively searches for split points that allow the most aggressive truncation of the new high key, shortening pivot tuples in internal pages. This is critical for fan-out and cache efficiency. By picking a random offset from the delta-sorted candidate list, v1 will frequently override a split point that truncates many attributes in favor of one that truncates fewer — a clear regression.

The Correct Design: Two-Pass Selection

Peter's counter-proposal, which Dmitry ultimately accepts, is a two-pass algorithm:

  1. First pass: gather all candidate split points that are equally good under the existing criteria — same number of suffix attributes truncated, all within a space-utilization tolerance of the optimal point.
  2. Second pass: randomly pick among those equivalence-class members.

The elegance of this design is that it makes randomness orthogonal to penalty: every candidate selected has identical truncation quality and approximately identical space balance, so the random choice cannot regress either dimension. Over time, space utilization still averages out to the configured leaf fillfactor, and suffix truncation remains maximally aggressive.
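
To make the shape of that selection concrete, here is a small, runnable C toy. Everything in it is an invented stand-in: the SplitCandidate fields, the tolerance value, and the candidate data are placeholders for the real nbtsplitloc.c structures, not the patch itself. The point it demonstrates is only that the random pick happens strictly inside an equivalence class that is already best on truncation penalty and within tolerance on space balance.

    /*
     * Toy two-pass split selection (illustrative only): pass 1 narrows the
     * candidates to those with the best truncation penalty that also lie
     * within a space-balance tolerance of the optimal point; pass 2 picks
     * uniformly at random among those equally good candidates.
     */
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct
    {
        int     offset;         /* candidate split location on the page */
        double  delta;          /* distance from the space-optimal point */
        int     penalty;        /* cost of the resulting high key; lower is better */
    } SplitCandidate;

    static int
    choose_split_two_pass(const SplitCandidate *cand, int ncand, double tolerance)
    {
        int     ties[64];
        int     nties = 0;
        int     best_penalty = INT_MAX;

        /* Pass 1a: best truncation penalty among candidates near the optimum. */
        for (int i = 0; i < ncand; i++)
            if (cand[i].delta <= tolerance && cand[i].penalty < best_penalty)
                best_penalty = cand[i].penalty;

        /* Pass 1b: collect every candidate that is equally good on both axes. */
        for (int i = 0; i < ncand && nties < 64; i++)
            if (cand[i].delta <= tolerance && cand[i].penalty == best_penalty)
                ties[nties++] = i;

        /* Pass 2: a random pick here cannot regress truncation or space balance. */
        return cand[ties[rand() % nties]].offset;
    }

    int
    main(void)
    {
        /*
         * Invented candidates around the middle of a 100-tuple page; the
         * optimal point (delta 0) is always present, so nties >= 1.
         */
        SplitCandidate cand[] = {
            {48, 0.04, 2}, {49, 0.02, 1}, {50, 0.00, 1},
            {51, 0.02, 1}, {52, 0.04, 3}, {53, 0.06, 1},
        };

        srand(2026);
        for (int i = 0; i < 5; i++)
            printf("chosen split offset: %d\n",
                   choose_split_two_pass(cand, 6, 0.05));
        return 0;
    }

Note that when every candidate carries the same penalty, as in a single-column unique index, the equivalence class spans the whole interval and the choice degenerates to a uniform random pick, matching the case discussed below.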

Peter also notes an important asymmetry in Dmitry's follow-up idea of simply widening LEAF_SPLIT_DISTANCE: with a random pick inside the interval, a wider interval cannot improve the expected truncation quality, but it does admit choices that sit further from the space-optimal point.

So widening the interval is at best neutral for truncation and actively bad for space balance — it cannot substitute for the two-pass design. In the degenerate case of a single-column unique index (where suffix truncation is uniform across all split points), the two-pass scheme reduces naturally to "pick any split point in the interval uniformly at random," which is exactly what's wanted.

Andres's Orthogonal Observation: I/O Under Exclusive Lock

Andres Freund raises a largely independent but thematically related concern: during a split, nbtree may perform I/O (writing out a victim buffer, extending the relation) while holding an exclusive lock on the page being split. Under contention, this is a direct contributor to worst-case latency spikes — exactly the kind of tail-latency pathology that the split-randomization work is trying to mitigate from a different angle.

His suggested fix is a classic lock-release/reacquire pattern:

  1. Release the exclusive page lock.
  2. Acquire a fresh empty page (doing any required I/O without the hot lock held).
  3. Reacquire the page lock.
  4. Recheck whether the split is still necessary (another backend may have split it, or the page state may have changed).
  5. Perform the split against the pre-acquired page.

This is non-trivial because nbtree's split protocol is tightly coupled to buffer pins and lock coupling with the parent; the recheck step must handle the case where the page no longer needs splitting and release the pre-allocated page cleanly. But architecturally it's the right direction — splits should never do blocking I/O under a contended buffer content lock.
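
A freestanding sketch of that control flow is below. It is not nbtree code: the toy page/mutex model and the allocate_spare_page() helper are invented for illustration, and a real patch would have to integrate with the buffer manager, WAL logging, and the parent-level lock coupling mentioned above.

    /*
     * Toy illustration of the release / allocate / reacquire / recheck
     * pattern, with a mutex standing in for the buffer content lock.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define PAGE_CAPACITY 4                 /* tiny, so splits happen quickly */

    typedef struct
    {
        pthread_mutex_t lock;               /* stands in for the content lock */
        int             ntuples;
    } ToyPage;

    /* Simulates the expensive part: victim write-out / relation extension. */
    static ToyPage *
    allocate_spare_page(void)
    {
        ToyPage    *spare = calloc(1, sizeof(ToyPage));

        usleep(1000);                       /* pretend this is blocking I/O */
        pthread_mutex_init(&spare->lock, NULL);
        return spare;
    }

    static void
    insert_tuple(ToyPage *page)
    {
        pthread_mutex_lock(&page->lock);

        if (page->ntuples >= PAGE_CAPACITY)
        {
            ToyPage    *spare;

            /* 1. Release the hot lock before doing any blocking work. */
            pthread_mutex_unlock(&page->lock);

            /* 2. Acquire a fresh empty page with no contended lock held. */
            spare = allocate_spare_page();

            /* 3. Reacquire the page lock. */
            pthread_mutex_lock(&page->lock);

            /* 4. Recheck: another backend may have split the page meanwhile. */
            if (page->ntuples >= PAGE_CAPACITY)
            {
                /* 5. Perform the split against the pre-acquired page. */
                int     moved = page->ntuples / 2;

                spare->ntuples = moved;
                page->ntuples -= moved;
                /* the spare is simply leaked here; a real split links it in */
                printf("split: %d kept, %d moved to the new right page\n",
                       page->ntuples, moved);
            }
            else
            {
                /* Split no longer needed: hand the spare page back cleanly. */
                pthread_mutex_destroy(&spare->lock);
                free(spare);
            }
        }

        page->ntuples++;                    /* toy model: tuple lands here */
        pthread_mutex_unlock(&page->lock);
    }

    int
    main(void)
    {
        ToyPage     page = {PTHREAD_MUTEX_INITIALIZER, 0};

        for (int i = 0; i < 12; i++)
            insert_tuple(&page);
        printf("final tuples on the original page: %d\n", page.ntuples);
        return 0;
    }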

Performance Measurement Difficulty

A recurring theme is that the benefit of this work is not throughput but latency variance. Dmitry's experimental setup (RAM disk / unlogged tables) was sensitive enough to make the oscillation visible in split-count graphs but not sensitive enough to exhibit latency impact, because the I/O amplification that turns oscillating splits into user-visible pain was absent. Peter explicitly predicts no absolute throughput gain and directs attention to p99/p99.9 latency metrics. This is an important methodological point for anyone reviewing or extending this work: benchmark design must include a storage path where buffer eviction and WAL fsync pressure are real.

Implications for nbtsort.c

Peter's point that nbtsort.c focuses purely on free-space balance means that bulk-built indexes (CREATE INDEX, REINDEX) are maximally susceptible to waves of misery — every leaf page starts at exactly fillfactor%, so post-build insertions hit a perfectly synchronized wavefront. Any serious deployment of this feature also needs to randomize initial fill in the sort build path, or the benefit will be delayed by one full wave period after every reindex.
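
One conceivable shape for that, sketched below purely as an illustration: jitter each leaf's fill target around the configured fillfactor during the bulk build, so that average utilization is unchanged but no two freshly built pages start with identical headroom. leaf_fill_target() and its jitter parameter are invented names, not nbtsort.c code.

    /*
     * Hedged sketch (not nbtsort.c): compute a per-page fill target that is
     * randomly perturbed around the configured fillfactor.
     */
    #include <stdio.h>
    #include <stdlib.h>

    /*
     * Bytes of tuple data to place on one leaf page during a bulk build.
     * "jitter" is an invented knob: the fraction of the page size by which
     * the target may deviate in either direction.
     */
    static int
    leaf_fill_target(int pagesize, int fillfactor, double jitter)
    {
        double  target = pagesize * fillfactor / 100.0;
        double  offset = (2.0 * rand() / RAND_MAX - 1.0) * jitter * pagesize;

        return (int) (target + offset);
    }

    int
    main(void)
    {
        srand(7);
        /* e.g. 8 kB pages, fillfactor 90, +/- 2.5% jitter */
        for (int i = 0; i < 8; i++)
            printf("page %d: fill to %d of 8192 bytes\n",
                   i, leaf_fill_target(8192, 90, 0.025));
        return 0;
    }

Averaged over many pages the index still honors the fillfactor setting; only the per-page headroom is desynchronized, which is what breaks the post-build wavefront.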