Is there value in having optimizer stats for joins/foreign keys?

First seen: 2025-12-01 20:10:25+00:00 · Messages: 21 · Participants: 5

Latest Update

2026-05-06 · opus 4.7

Join Statistics for the PostgreSQL Planner: Closing the Cross-Table Correlation Gap

The Architectural Problem

PostgreSQL's cardinality estimator operates under an implicit independence assumption when reasoning about joins. For a query like:

SELECT * FROM movie_keyword mk JOIN keyword k ON mk.keyword_id = k.id
WHERE k.keyword = 'character-name-in-title';

the optimizer estimates the join result by composing three independent pieces of information:

  1. The selectivity of k.keyword = 'X' on keyword (from pg_statistic MCVs/histogram on keyword.keyword).
  2. The join selectivity 1/max(ndistinct(k.id), ndistinct(mk.keyword_id)) (the textbook formula, refined by get_foreign_key_join_selectivity() when an FK exists).
  3. The cardinality of movie_keyword.

The structural defect: there is no statistical object that connects the skewed frequency distribution of mk.keyword_id with particular values of k.keyword. A foreign key constraint tells the planner that every mk row matches exactly one k row, but it says nothing about the fact that the single row with keyword = 'character-name-in-title' happens to be referenced 41,840 times while other keywords are referenced once. The concrete consequence on JOB 16b is a 1230× underestimation at the innermost join, which cascades up the plan and causes the planner to chain seven nested loops where hash/merge joins would be appropriate. Alexandra Wang's measurement that simply SET enable_nestloop = off nearly halves total JOB runtime (140s → 87s) demonstrates this is a systemic, not a corner-case, issue.
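The composition the planner performs can be sketched numerically. The figures below approximate JOB's table sizes but are illustrative, not pulled from a real pg_statistic:

```python
# Sketch of how the planner composes the three independent estimates above,
# using rough JOB-like numbers. All figures are illustrative; the point is
# the shape of the computation, not the exact statistics.

N_MK = 4_523_930        # rows in movie_keyword (approximate JOB size)
N_K = 134_170           # rows in keyword
NDV_K_ID = 134_170      # ndistinct of k.id (primary key, so = row count)
NDV_MK_KEYID = 134_170  # ndistinct of mk.keyword_id (illustrative)

# 1. Selectivity of k.keyword = 'X': one row out of N_K (keyword text is unique)
sel_filter = 1.0 / N_K

# 2. Textbook join selectivity: 1 / max(ndistinct on either side)
sel_join = 1.0 / max(NDV_K_ID, NDV_MK_KEYID)

# 3. Cross-product cardinality, scaled by both selectivities
est_rows = N_MK * N_K * sel_join * sel_filter

print(f"planner estimate: {est_rows:.0f} rows")  # ~34 rows
# The actual result for 'character-name-in-title' is ~41,840 rows. The
# independence assumption cannot see that this one keyword is referenced
# tens of thousands of times while most others are referenced once.
```

Note how the per-table ndistinct values cancel the skew entirely: any equality filter on keyword yields the same join estimate regardless of how often that keyword is actually referenced.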

This thread is about whether and how to collect cross-table statistics that capture the joint distribution of the join result, in order to repair this estimate.

Design Space Explored

1. Corey Huinker's Initial Framing: FK-Anchored pg_statistic Rows

Corey opened the thread with a pg_statistic-shaped proposal: add a starefrelid column so each entry in a pg_statistic-like structure describes the distribution of a remote column as observed through the join. In his "toy/color" example, color.color_family's MCV list would be re-weighted by how often each color is referenced from toy, and rows in color that no toy references would be filtered out entirely. He explicitly distinguished this from extended statistics: "those stats aren't really a correlation or a dependency, they're just plain old attribute stats" — viewed through the lens of the referencing table.

The open problem he identified up-front is collection: per-table independent samples cannot be composed into a correct join sample; in the worst case you must sample both sides and actually perform the join.

2. Tom Lane's Position: Declarative via CREATE STATISTICS

Tom's intervention, brief but weighty given his committer status and his role in shaping the extended-stats syntax, was decisive: automated collection on every FK is unlikely to pay off, and the CREATE STATISTICS grammar was deliberately designed to accommodate declarative user-specified joins. This effectively rules out the auto-create-on-FK direction as a blocker-free path and aligns the thread with the extended-statistics framework rather than a parallel pg_statistic-style catalog.

3. Tomas Vondra's Framework: Treat the Join as a Relation

Tomas — the principal author of extended statistics and therefore the most authoritative voice on catalog shape here — pushed a cleaner conceptual model: a join result is just a relation, so its statistics should live in pg_statistic_ext_data using the existing stats kinds (m for MCV, f for functional dependencies, d for ndistinct). The only genuinely new information is the definition of the virtual relation (which rels and which join conditions), which belongs in pg_statistic_ext.

He repeatedly pushed back on restrictions that would paint the design into a corner: the two-way-only catalog shape, the FK requirement, and the PoC's bespoke stats kind.

Tomas also made a subtle point about upstream uses of join statistics that are easy to miss: SELECT t1.a, t2.b, COUNT(*) ... GROUP BY 1, 2 requires an ndistinct estimate on the join result to size the aggregate, so even stats kinds that seem unhelpful for estimating the join node itself have consumers above it.

4. Andrei Lepikhov's VIEW Generalization (Rejected)

Andrei proposed generalizing further: let users declare extended statistics on arbitrary relational subtrees (a VIEW) and have the planner match query fragments against them. Tomas rejected this on two grounds. First, for every supported operator we need both (a) an efficient sampling procedure and (b) a selectivity application path; you can't simply execute an arbitrary VIEW to build stats, because if the VIEW's plan relies on bad estimates it may not even complete — the very pathology the stats are meant to repair. Second, Andrei's alternate syntax ON (t1.x, t2.y, t3.z) FROM t1, t2, t3 is ambiguous: the system would have to try every possible equijoin pairing. Tomas's counter-example — (t1.a = t2.a) AND (t1.b = t2.b) cannot be distinguished from (t1.a = t2.b) AND (t1.b = t2.a) without explicit join conditions — is dispositive. The thread converged on the full-syntax form:

CREATE STATISTICS s (mcv) ON t1.c, t2.d
  FROM t1 JOIN t2 ON (t1.a = t2.a AND t1.b = t2.b);

The Sampling Problem and Its Solution

The collection problem is where this feature has historically stalled. Independent per-table samples do not compose into a correct join sample — a small sample from each side almost never produces matching tuples, so the join result is empty or wildly unrepresentative. This is why Tomas had "procrastinated" for years on the catalog question.
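A quick simulation makes the non-composability concrete. The key-space and sample sizes below are arbitrary illustrations: with a sample of s rows from each side over N distinct keys, the expected number of matching pairs is roughly s*s/N, a tiny sliver of either sample.

```python
# Why independent per-table samples do not compose into a join sample.
# Numbers are illustrative; 30,000 is the default ANALYZE row sample.
import random

random.seed(42)
N = 1_000_000   # distinct join-key values
s = 30_000      # per-table sample size

left = set(random.sample(range(N), s))
right = set(random.sample(range(N), s))
matches = len(left & right)

print(f"expected ~{s * s / N:.0f} matches, observed {matches}")
# Of 30,000 sampled rows per side, only ~900 pair up. Worse, those pairs are
# a sample of the *key space*, not of the skewed join result, so even they
# are unusable for estimating join-result frequencies.
```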

The answer both Tomas and Alexandra converged on is Leis et al.'s index-based join sampling (CIDR 2017): sample tuples from one (leading) side and use an index on the other side's join key to probe and complete the join tuple. This produces a statistically correct sample of the join result with cost proportional to the sample size rather than the relation size. The implication for the feature: join stats are feasible only when appropriate indexes exist on the probe side(s). Since FK-referenced columns always have unique indexes, FK joins are the natural first target, but the mechanism is not FK-specific.
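A minimal sketch of the index-based idea, with a Python dict standing in for the unique index on the probe side and synthetic skew standing in for movie_keyword (all table contents here are made up):

```python
# Leis et al.-style index-based join sampling: uniformly sample the leading
# (fact) side, then complete each sampled tuple with an index probe on the
# other side's join key. Cost scales with the sample, not the relations.
import random
from collections import Counter

random.seed(7)

# Synthetic skewed fact table: key 0 is referenced 40,000 times, the rest once.
movie_keyword = [0] * 40_000 + list(range(1, 10_000))
# Dimension table; the dict plays the role of the unique index on keyword.id.
keyword = {k: ("character-name-in-title" if k == 0 else f"kw-{k}")
           for k in range(10_000)}

# 1. Uniform sample of the leading side.
sample = random.sample(movie_keyword, 5_000)
# 2. Index probe to complete each join tuple.
join_sample = [(kid, keyword[kid]) for kid in sample]

# An MCV built on this sample reflects the *join result's* skew.
mcv = Counter(text for _, text in join_sample).most_common(1)
print(mcv)  # 'character-name-in-title' dominates, roughly 80% of the sample
```

Because the sample is drawn from the referencing side, the referenced row's frequency in the sample is proportional to its frequency in the join result, which is exactly the information the single-table statistics lack.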

Corey raised the concrete implementation question: how does ANALYZE actually perform the probes? The two options he enumerated — SPI with an EphemeralNamedRelation ("yuck") versus direct index lookups akin to the RI trigger machinery — point at real engineering work. The RI-style approach is preferable and is what Alexandra's v1 patch eventually adopts.

A secondary question Tomas raised and left open: what triggers collection? ANALYZE is per-table, but a join stat depends on two tables. Should ANALYZE on either side rebuild it? Only the "primary"? A new ANALYZE option? Corey's answer ("just the main one. Madness lies in the alternative") is pragmatic but clearly provisional.

Evolution of the Patch: PoC → v1

Alexandra's PoC (December 2025 / January 2026) and v1 (April 2026) differ substantially, and the diff reflects exactly the review points above:

| Aspect | PoC | v1 |
| --- | --- | --- |
| Stats kind | New 'c' (join MCV) | Reused existing 'm' |
| Storage | New stxdjoinmcv field | Existing stxdmcv field |
| Catalog shape | stxotherrel Oid, stxjoinkeys int2vector; 2-way hardcoded | stxjoinrels (array of Oids), stxkeyrefs (maps each key to its source rel), stxjoinconds (join conditions); N-way ready |
| Sampling | Piggybacks on per-column MCV of the primary table's join key | Index-based sampling per Leis et al. |
| Auto-create on FK | 0002 patch | Dropped |
| Scope | 2-way, FK, equality, IN | 2-way, equality, IN (FK no longer required) |

The PoC's MCV-piggybacking collection is worth dwelling on because it illustrates why Tomas pushed back. It reused the already-computed single-column MCV of the primary side's join key, then looked up each MCV value in the other table to fetch filter-column values. This is cheap (no real join) but only correct under a strong assumption: the primary side's MCV values for the join key are representative of the join result's distribution of filter values. That holds for FK joins where the join-key frequencies directly encode the weighting, but fails for less-dependent joins and cannot be extended to ndistinct or functional-dependency stats. v1's switch to index-based sampling decouples collection from single-table ANALYZE assumptions and is what makes the feature generalizable.

Measurement Discipline

Tomas specifically asked for estimate accuracy measurements, not just runtime, arguing that a plan could be fast by luck today but brittle under small workload shifts. Alexandra's v1 response delivers this via per-query q-error, defined as max(est/act, act/est).

The unchanged cases are diagnostic: LIKE patterns, no filter on keyword, or values below the MCV threshold — i.e., exactly the cases an MCV-based join stat can't help, confirming the patch is not accidentally changing estimates it shouldn't. The "cross-DB ANALYZE noise is ~30% of queries, up to 19×" calibration is a rare but correct methodological touch: without it, q-error deltas could be dismissed as sampling noise.

Runtime: median warm 114s → 74s (1.54×), which is notably better than enable_nestloop = off (87s), because selective nested loops remain optimal when cardinalities are correctly small.
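For reference, the metric itself is trivial to compute; the numbers below reuse the JOB 16b figures quoted earlier in this summary:

```python
# q-error as used in Alexandra's v1 measurements: the multiplicative factor
# by which an estimate is off, symmetric in over- and underestimation.

def q_error(estimated: float, actual: float) -> float:
    """max(est/act, act/est); 1.0 means a perfect estimate."""
    if estimated <= 0 or actual <= 0:
        raise ValueError("q-error is defined for positive cardinalities")
    return max(estimated / actual, actual / estimated)

# JOB 16b innermost join, pre-patch: ~34 estimated vs ~41,840 actual rows.
print(f"{q_error(34, 41_840):.0f}x")  # prints 1231x, the ~1230x underestimate
```

Its symmetry is the point: a 10× overestimate and a 10× underestimate both score 10, so aggregate q-error cannot be gamed by errors that cancel.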

Open Design Questions at Thread's End

  1. Collection trigger and cost accounting. ANALYZE on which side(s)? How is the index-probe I/O budgeted? Alvaro Herrera's (offline) suggestion of a dedicated autovacuum option is one direction.
  2. N-way collection. v1's catalog supports it; the sampling algorithm supports it (chain index probes); but collection code is still 2-way.
  3. Expanding predicate scope. LIKE, range, and BETWEEN predicates are not covered. Corey flagged element_mcv for arrays as a future concern. These have to be tackled one operator at a time per Tomas's two-part rule (sampleable + applicable in selectivity).
  4. Interaction with eqjoinsel_inner. The orthogonal improvement of using per-table extended MCVs in the existing join estimator didn't pay off on JOB, but may pay off on other workloads. It remains a live question whether to pursue it before or after join stats land.
  5. Catalog column semantics of stxkeyrefs. Tying each stxkeys entry to its source relation via a parallel array is Alexandra's current answer to Corey's question about how to describe columns across N tables; whether the representation scales cleanly to expressions (not just Vars) is untested.

Why This Matters

The JOB benchmark is, empirically, a catalog of exactly the cardinality failures that occur in real analytic workloads: data skew, correlated attributes across dimension and fact tables, and deep join trees where early estimation errors compound. PostgreSQL's 1.6× slowdown relative to nestloop=off on this workload is not an abstract concern — it's what users hit whenever they run a real star-schema query. Prior attempts have died in the catalog-design and sampling phases; this thread's contribution is a concrete path through both, grounded in peer-reviewed sampling theory and validated by q-error measurements on a standard benchmark. The remaining work is largely scope expansion and productionization rather than open research.