2026-05-27 · claude-opus-4-6

Support Loser Tree for K-Way Merge in PostgreSQL

Core Problem

PostgreSQL's external sort implementation uses a binary heap (priority queue) to perform k-way merging of sorted runs during tuplesort. When work_mem is insufficient to sort an entire relation in memory, the data is divided into sorted runs which must be merged. The current heap-based merge has time complexity O(n log k) where n is total tuples and k is the number of runs (tapes).

The loser tree (also called "tree of losers") is an alternative tournament-tree data structure that maintains the same O(n log k) asymptotic complexity but achieves fewer tuple comparisons per element extraction in practice. This matters because:

Tuple comparison cost is non-trivial — especially for text/collation-dependent sorts, multi-key sorts, or abbreviated keys that fall through to full comparison
External merge is often the bottleneck — when work_mem is small relative to data size, the merge phase dominates total sort time
The reported speedup of 3–13% on realistic workloads (sorting 20M rows by md5 hash text) is significant for a core operation

How the Two Data Structures Differ

Binary Heap (Current Implementation)

tuplesort_heap_replace_top() performs a sift-down after replacing the root
At each level: 2 comparisons (compare left child vs right child, then winner vs parent)
Can early-return if parent is less than both children (already valid heap)
Best case: O(1) when data has many equal keys (parent always wins)

Loser Tree (Proposed)

tuplesort_loser_tree_adjust() propagates the new element up from leaf to root
At each level: 1 comparison (compare with the stored "loser" at that internal node)
Cannot early-return — must always traverse to the root
Best case is still O(log k) comparisons, but each level costs exactly 1 comparison vs heap's 2

The Tradeoff

For high-cardinality data (many distinct values), the loser tree wins because:

Heap rarely gets early returns when keys are diverse
Loser tree does half the comparisons per level (1 vs 2)

For low-cardinality data (many duplicate keys), the heap can win because:

It frequently early-returns near the top of the tree
Loser tree must always traverse the full height

The author's analysis for small k values:

k=2: heap=1 comparison, loser tree=1 (identical)
k=3: heap=2, loser tree=[1,2] (loser tree never worse)
k=4: heap=[2,3], loser tree=2 (loser tree never worse)
k<5: loser tree is never worse than heap for any input distribution

Patch Architecture (v2)

v2-0001: O(n) Heap Build

Replaces the current O(n log n) heap construction (repeated insertions) with a bottom-up tuplesort_heap_build() that constructs a valid heap in O(n). This is a standalone improvement to beginmerge() that also simplifies the code path for loser tree initialization.

v2-0002: Loser Tree Implementation

Adds a loser tree array to the tuplesort state (currently allocated as MAXORDER * 2 entries, author notes MAXORDER entries would suffice with refactoring)
Introduces tuplesort_loser_tree_adjust() as the core operation
Controlled by enable_loser_tree GUC (default OFF in v2)
Passes check-world

Key Design Decisions and Open Questions

1. Selection Strategy

The most contentious question: how to decide which algorithm to use?

Options discussed:

Always use loser tree — simplest, but regresses on low-cardinality inputs
GUC control — enable_loser_tree proposed, but criticized as unusable by end users who don't know their data characteristics
Optimizer statistics (n_distinct) — proposed by Sami Imseih, but rejected by John Naylor based on historical precedent that cardinality stats are often wildly wrong, and underestimation (the common case) would just default to heap anyway
Heuristic based on k — for k<5, loser tree is never worse; could unconditionally use it for small k
Fast-path optimization — John Naylor's suggestion: check if current key equals previous key to get early-return benefits for low-cardinality without losing loser tree benefits for high-cardinality

2. Memory Accounting

Open question whether USEMEM() should be called for the loser array allocation, which is O(k) additional memory.

3. Module Extraction

Late suggestion (from Xuneng Zhou, May 2026) to extract the loser tree as a standalone reusable module (like binaryheap.c), referencing Goetz Graefe's 2025 ACM paper on tree-of-losers priority queues.

Technical Skepticism and Barriers

John Naylor (committer with sorting expertise) raises significant concerns:

Historical pattern: Multiple proposals have improved sorting on some inputs while regressing on low-cardinality — this is a known trap
GUC skepticism: If the patch author cannot define when to enable the feature, users certainly cannot
Need for structured testing: Comprehensive benchmarks across varied workloads, not just the favorable case
Small-k irrelevance: When k is small, the merge step is likely dominated by sorting individual runs, so the improvement matters less precisely where the loser tree is provably better

Andreas Karlsson suggests focusing on finding and quantifying regressions before proceeding, and investigating whether pre-sorted input is a regression case.

Current Status

The thread appears stalled as of early 2026. The author has not yet:

Produced comprehensive benchmarks showing regression boundaries
Implemented a heuristic that doesn't require user intervention
Addressed the fast-path suggestion for equal keys
Demonstrated that the improvement matters for the cases where k is large (where the algorithmic advantage is greatest but regressions may also be largest)

The May 2026 message from Xuneng Zhou suggests renewed interest via the Graefe paper, potentially opening a path to a more complete implementation as a standalone priority queue module.

Support loser tree for k-way merge

Latest Update