Try a presorted outer path when referenced by an ORDER BY prefix

First seen: 2026-04-03 11:35:34+00:00 · Messages: 2 · Participants: 1

Latest Update

2026-05-11 · opus 4.7

Overview

Andrei Lepikhov proposes extending PostgreSQL's join planning to opportunistically generate pre-sorted outer paths for NestLoop joins when the outer relation's columns appear as a prefix of query_pathkeys (the ORDER BY / GROUP BY / DISTINCT ordering demand). The motivation is a common "top-N with details" pattern: a driving table joined via several LEFT JOINs to property/lookup tables, then ORDER BY driver.col LIMIT k. Today the planner frequently sorts after all the joins, even though the sort key is functionally determined by the outer side alone.

The Architectural Problem

PostgreSQL's match_unsorted_outer considers an outer path's existing pathkeys when building mergejoin candidates, and considers arbitrary outer orderings only when an index or other cheap sorted path already exists. For NestLoop, the outer ordering is preserved to the join output (since NestLoop emits outer tuples in outer order, with inner matches appended), so a sort on the outer side alone is sufficient to satisfy a top-level ORDER BY whose keys come only from the outer relation.

Two facts combine to make this valuable:

  1. Pathkey preservation through NestLoop + LEFT JOIN. A left outer NestLoop preserves outer ordering in the result, so ORDER BY outer.col can be satisfied below the join. The upper Sort becomes redundant.
  2. Bounded-sort propagation under LIMIT. If a Limit knows its tuple bound, ExecSetTupleBound recursively informs the underlying Sort, which switches to a top-N heapsort (bounded in memory and CPU). Today this propagation stops at joins because there is no clear semantics for "how many outer tuples do I need to produce N join rows." For LEFT OUTER joins, however, each outer tuple produces at least one output row, so the bound can be pushed into the outer subtree safely.

The payoff is large: instead of sorting the fully materialized join output (possibly spilling to disk), you sort only the outer relation with a bounded heap of size N, then do N index/hash lookups into the inner sides.

Why an index isn't the answer

A reviewer might naturally respond "just build an index on popularity." Lepikhov preempts this:

This last case is the strongest architectural justification: without this patch, partitioned/sharded foreign-table configurations cannot exploit the top-N-with-details pattern at all.

Patch Structure

The posted sketch has three parts:

1. Executor: extending ExecSetTupleBound

The existing bound-propagation machinery stops at join nodes. The patch teaches it to recurse into the outer child of a join when that is semantically sound (outer joins, or inner joins where each outer row produces ≥1 row — though the initial patch appears focused on the OUTER JOIN case). This is described as necessary because there is "no proper hook in this subsystem," i.e., extensions cannot plug into bound propagation; the change must live in core.

2. Planner: match_unsorted_outer modification

When building NestLoop paths, if (a) the outer relation's Vars form a prefix of query_pathkeys and (b) no existing sorted path on that prefix is present, synthesize an explicit Sort atop the outer path and generate a NestLoop that inherits those pathkeys. This lets upper-level path generation (grouping, DISTINCT, final ORDER BY) see the join as already sorted and skip the redundant top-level Sort.

3. Regression test adjustments

Plan shapes change in a number of existing regression queries — used by Lepikhov as evidence that the optimization enables better MergeJoin and aggregation plans beyond the motivating top-N case.

Cost Model Concern

Lepikhov flags a honesty issue in the cost model: when costing the outer Sort, he uses the LIMIT tuple bound (top-N heapsort cost), even though at plan time the bound is only propagated through OUTER join semantics. This makes the pre-sorted path look artificially cheap versus alternatives. He dismisses it as low-risk because it concerns in-memory sorts, but this is exactly the kind of local shortcut that can cause systemic plan regressions — and the May follow-up vindicates that concern.

The May 2026 Regression: Planning-Time Blowup

Five weeks later Lepikhov reports a serious real-world regression. On a large query:

Presorted enabled Disabled
Planning time 3194.9 ms 1328.9 ms
Execution time 17.7 ms 11.8 ms
Planner memory 136930 kB 136225 kB

Planning time more than doubled, execution got worse, and memory grew. The culprit is relation_can_be_sorted_early, which the patch calls to check whether a given relation's target can plausibly produce the needed sort prefix. Its cost profile:

  1. Walks rel->reltarget->exprs.
  2. For each expression, calls find_ec_member_matching_expr, which linearly scans the equivalence class's ec_members.
  3. Falls through to find_computable_ec_member, which walks the EC again using parser-level machinery to detect whether the key is computable from the relation.

For queries with CASE expressions as sort keys spanning multiple tables, no ec_member exactly matches any reltarget entry, so every call runs the full slow path and returns false. The patch invokes this for many (join-rel, pathkey-prefix) combinations, compounding quadratically with query size.

Proposed fix: cache useful pathkeys per RelOptInfo

Lepikhov's remediation is architectural rather than micro-optimizing: compute once, at RelOptInfo creation, which query_pathkeys prefixes this relation could satisfy early — a useful_query_pathkeys cache analogous to the existing useful_pathkeys logic for indexes (build_index_pathkeys, truncate_useless_pathkeys). Then match_unsorted_outer consults the cache in O(1) instead of re-deriving it for every join pair.

This is the right shape of fix: the EC membership graph is stable across the join search, so per-relation memoization is natural. The parallel with how index pathkeys are pre-computed is strong precedent.

Key Insights and Implications

Prior Art and Relationship

The patch explicitly builds on two prior threads:

Lepikhov positions this patch as the general solution rather than a point fix for one plan shape.

Status

As of the thread's latest message, the patch is in rework: Lepikhov has acknowledged the planning-time regression, identified its cause in repeated relation_can_be_sorted_early calls, and proposed caching as the structural fix. No committer feedback appears in the provided messages; the thread is effectively a self-authored design evolution at this point.