Support loser tree for k-way merge

First seen: 2025-12-03 11:48:10+00:00 · Messages: 13 · Participants: 6

Latest Update

2026-05-27 · claude-opus-4-6

Support Loser Tree for K-Way Merge in PostgreSQL

Core Problem

PostgreSQL's external sort implementation uses a binary heap (priority queue) to perform k-way merging of sorted runs during tuplesort. When work_mem is insufficient to sort an entire relation in memory, the data is divided into sorted runs which must be merged. The current heap-based merge has time complexity O(n log k) where n is total tuples and k is the number of runs (tapes).

The loser tree (also called "tree of losers") is an alternative tournament-tree data structure that maintains the same O(n log k) asymptotic complexity but achieves fewer tuple comparisons per element extraction in practice. This matters because:

  1. Tuple comparison cost is non-trivial — especially for text/collation-dependent sorts, multi-key sorts, or abbreviated keys that fall through to full comparison
  2. External merge is often the bottleneck — when work_mem is small relative to data size, the merge phase dominates total sort time
  3. The reported speedup of 3–13% on realistic workloads (sorting 20M rows by md5 hash text) is significant for a core operation

How the Two Data Structures Differ

Binary Heap (Current Implementation)

Loser Tree (Proposed)

The Tradeoff

For high-cardinality data (many distinct values), the loser tree wins because:

For low-cardinality data (many duplicate keys), the heap can win because:

The author's analysis for small k values:

Patch Architecture (v2)

v2-0001: O(n) Heap Build

Replaces the current O(n log n) heap construction (repeated insertions) with a bottom-up tuplesort_heap_build() that constructs a valid heap in O(n). This is a standalone improvement to beginmerge() that also simplifies the code path for loser tree initialization.

v2-0002: Loser Tree Implementation

Key Design Decisions and Open Questions

1. Selection Strategy

The most contentious question: how to decide which algorithm to use?

Options discussed:

2. Memory Accounting

Open question whether USEMEM() should be called for the loser array allocation, which is O(k) additional memory.

3. Module Extraction

Late suggestion (from Xuneng Zhou, May 2026) to extract the loser tree as a standalone reusable module (like binaryheap.c), referencing Goetz Graefe's 2025 ACM paper on tree-of-losers priority queues.

Technical Skepticism and Barriers

John Naylor (committer with sorting expertise) raises significant concerns:

  1. Historical pattern: Multiple proposals have improved sorting on some inputs while regressing on low-cardinality — this is a known trap
  2. GUC skepticism: If the patch author cannot define when to enable the feature, users certainly cannot
  3. Need for structured testing: Comprehensive benchmarks across varied workloads, not just the favorable case
  4. Small-k irrelevance: When k is small, the merge step is likely dominated by sorting individual runs, so the improvement matters less precisely where the loser tree is provably better

Andreas Karlsson suggests focusing on finding and quantifying regressions before proceeding, and investigating whether pre-sorted input is a regression case.

Current Status

The thread appears stalled as of early 2026. The author has not yet:

The May 2026 message from Xuneng Zhou suggests renewed interest via the Graefe paper, potentially opening a path to a more complete implementation as a standalone priority queue module.