Support Loser Tree for K-Way Merge in PostgreSQL
Core Problem
PostgreSQL's external sort implementation uses a binary heap (priority queue) to perform k-way merging of sorted runs during tuplesort. When work_mem is insufficient to sort an entire relation in memory, the data is divided into sorted runs which must be merged. The current heap-based merge has time complexity O(n log k) where n is total tuples and k is the number of runs (tapes).
The loser tree (also called "tree of losers") is an alternative tournament-tree data structure that maintains the same O(n log k) asymptotic complexity but achieves fewer tuple comparisons per element extraction in practice. This matters because:
- Tuple comparison cost is non-trivial — especially for text/collation-dependent sorts, multi-key sorts, or abbreviated keys that fall through to full comparison
- External merge is often the bottleneck — when
work_memis small relative to data size, the merge phase dominates total sort time - The reported speedup of 3–13% on realistic workloads (sorting 20M rows by md5 hash text) is significant for a core operation
How the Two Data Structures Differ
Binary Heap (Current Implementation)
tuplesort_heap_replace_top()performs a sift-down after replacing the root- At each level: 2 comparisons (compare left child vs right child, then winner vs parent)
- Can early-return if parent is less than both children (already valid heap)
- Best case: O(1) when data has many equal keys (parent always wins)
Loser Tree (Proposed)
tuplesort_loser_tree_adjust()propagates the new element up from leaf to root- At each level: 1 comparison (compare with the stored "loser" at that internal node)
- Cannot early-return — must always traverse to the root
- Best case is still O(log k) comparisons, but each level costs exactly 1 comparison vs heap's 2
The Tradeoff
For high-cardinality data (many distinct values), the loser tree wins because:
- Heap rarely gets early returns when keys are diverse
- Loser tree does half the comparisons per level (1 vs 2)
For low-cardinality data (many duplicate keys), the heap can win because:
- It frequently early-returns near the top of the tree
- Loser tree must always traverse the full height
The author's analysis for small k values:
- k=2: heap=1 comparison, loser tree=1 (identical)
- k=3: heap=2, loser tree=[1,2] (loser tree never worse)
- k=4: heap=[2,3], loser tree=2 (loser tree never worse)
- k<5: loser tree is never worse than heap for any input distribution
Patch Architecture (v2)
v2-0001: O(n) Heap Build
Replaces the current O(n log n) heap construction (repeated insertions) with a bottom-up tuplesort_heap_build() that constructs a valid heap in O(n). This is a standalone improvement to beginmerge() that also simplifies the code path for loser tree initialization.
v2-0002: Loser Tree Implementation
- Adds a loser tree array to the tuplesort state (currently allocated as
MAXORDER * 2entries, author notesMAXORDERentries would suffice with refactoring) - Introduces
tuplesort_loser_tree_adjust()as the core operation - Controlled by
enable_loser_treeGUC (default OFF in v2) - Passes
check-world
Key Design Decisions and Open Questions
1. Selection Strategy
The most contentious question: how to decide which algorithm to use?
Options discussed:
- Always use loser tree — simplest, but regresses on low-cardinality inputs
- GUC control —
enable_loser_treeproposed, but criticized as unusable by end users who don't know their data characteristics - Optimizer statistics (n_distinct) — proposed by Sami Imseih, but rejected by John Naylor based on historical precedent that cardinality stats are often wildly wrong, and underestimation (the common case) would just default to heap anyway
- Heuristic based on k — for k<5, loser tree is never worse; could unconditionally use it for small k
- Fast-path optimization — John Naylor's suggestion: check if current key equals previous key to get early-return benefits for low-cardinality without losing loser tree benefits for high-cardinality
2. Memory Accounting
Open question whether USEMEM() should be called for the loser array allocation, which is O(k) additional memory.
3. Module Extraction
Late suggestion (from Xuneng Zhou, May 2026) to extract the loser tree as a standalone reusable module (like binaryheap.c), referencing Goetz Graefe's 2025 ACM paper on tree-of-losers priority queues.
Technical Skepticism and Barriers
John Naylor (committer with sorting expertise) raises significant concerns:
- Historical pattern: Multiple proposals have improved sorting on some inputs while regressing on low-cardinality — this is a known trap
- GUC skepticism: If the patch author cannot define when to enable the feature, users certainly cannot
- Need for structured testing: Comprehensive benchmarks across varied workloads, not just the favorable case
- Small-k irrelevance: When k is small, the merge step is likely dominated by sorting individual runs, so the improvement matters less precisely where the loser tree is provably better
Andreas Karlsson suggests focusing on finding and quantifying regressions before proceeding, and investigating whether pre-sorted input is a regression case.
Current Status
The thread appears stalled as of early 2026. The author has not yet:
- Produced comprehensive benchmarks showing regression boundaries
- Implemented a heuristic that doesn't require user intervention
- Addressed the fast-path suggestion for equal keys
- Demonstrated that the improvement matters for the cases where k is large (where the algorithmic advantage is greatest but regressions may also be largest)
The May 2026 message from Xuneng Zhou suggests renewed interest via the Graefe paper, potentially opening a path to a more complete implementation as a standalone priority queue module.