Is there value in having optimizer stats for joins/foreignkeys?

First seen: 2025-12-01 20:10:25+00:00 · Messages: 36 · Participants: 7

Latest Update

2026-06-01 · claude-opus-4-6

Monthly Summary: Join Statistics for the PostgreSQL Planner (May 2026)

Overview

May 2026 saw the "join statistics" patch transition from conceptual design convergence to active committer-level review, with rapid iteration through patch versions v5–v7 and substantive architectural feedback from Tom Lane. The thread matured from "is this feasible?" to "what exactly should the catalog look like and what are the remaining blockers before commit?"

Key Developments

Patch Stabilization (v5–v7)

Alexandra posted three rapid revisions addressing bugs identified by Jian He (DDL interactions with join statistics) and CI platform instabilities:

Tom Lane's Architectural Review (Major Milestone)

Tom provided the first detailed committer-level design review since the thread's early exchanges:

  1. Scope endorsement: Explicitly blessed 2-way-only and index-sampling-only restrictions for v1, provided the catalog is N-way ready. This removes ambiguity about commit requirements.
  2. Catalog simplification: Proposed including the anchor relation's OID as the first element of stxjoinrels[] to eliminate special-casing in expression varno resolution.
  3. Eliminate stxkeys[]: Floated replacing the dual-track stxkeys[]/stxexprs representation with expression trees for all target columns (simple Vars as Var nodes). This would also eliminate stxkeyrefs[].
  4. Permissions gap flagged: Current same-owner requirement is unworkable; a concrete permissions model is a hard requirement before commit.
  5. Documentation requested as a design-clarification exercise.

Permissions Model Convergence

Both Tom and Tomas weighed in with concrete proposals:

New Reviewer: Lifecycle Semantics Questions (Chengpeng Yan)

Three substantive questions that sharpened previously-open issues:

  1. Index dependency contradiction: CREATE STATISTICS blocks DROP INDEX via catalog dependency, but ANALYZE re-discovers indexes at runtime. Is the index part of the persistent contract or merely a creation-time proof?
  2. Cross-table staleness: Join stats owned by the anchor relation become stale when the probed table changes. Concrete failure scenario for the ANALYZE-trigger problem.
  3. MCV frequency semantics for non-unique joins: When one anchor row produces multiple joined rows (non-FK joins), raw_sel normalization diverges. Correctness concern for non-FK joins not previously articulated.

Tomas Vondra: N-way Is Important

Tomas strengthened his position that N-way support is not merely nice-to-have, citing fact-table-joining-two-correlated-dimensions as a concrete case that 2-way stats cannot handle. He endorses enforcing index requirements at DDL time rather than silently failing during ANALYZE.

Current State

The patch is architecturally sound but has several concrete blockers before commit consideration:

The measurement results remain strong: geometric mean q-error improvement from 103× to 4.2× on the keyword/movie_keyword join, JOB median warm runtime 114s → 74s (1.54× speedup).

History (1 prior analysis)
2026-06-01 · claude-opus-4-6

Incremental Update: Catalog Unification Discussion, pg_stats_ext View Decisions, and Corey's "Optimizer View" Architecture Debate

Summary

The thread moved into two parallel discussions: (1) a concrete implementation question about how to handle the pg_stats_ext view after removing stxkeys[], and (2) a renewed architectural debate initiated by Corey Huinker advocating for an "optimizer view" approach where join statistics would be anchored to pg_class objects rather than extended statistics catalog entries. Tomas Vondra pushed back on the view approach, and Tom Lane weighed in on preserving column ordering and pg_upgrade concerns.

Key Technical Developments

1. Alexandra Confirms stxkeys[] Removal as Prerequisite Patch

Alexandra is implementing the stxkeys[] elimination (representing all target columns as Var nodes in stxexprs) as a separate prerequisite commit before the main join stats patch. This is the first concrete confirmation that Tom Lane's "radical proposal" from the prior round is being adopted, and as a standalone refactoring rather than bundled into the join stats work.

2. pg_stats_ext View: Unify vs. Preserve Backward Compatibility

Alexandra asked whether to (a) keep attnames and exprs as separate view columns (reconstructing attnames from Var nodes), or (b) unify into a single exprs column. Tomas favored unification — maintaining the split requires extra work for no benefit. Tom Lane added a nuance: the current implementation already reorders columns (expressions get shoved to the end), so any rewrite should preserve user-written column order, and tools parsing the view closely should "fail noisily rather than perhaps silently mis-associate stats with columns." Tom also flagged pg_upgrade test coverage as important given that extended stats are now transferred during upgrades.

3. Corey's "Optimizer View" Architecture (Substantial New Proposal)

Corey articulated a comprehensive alternative architecture where:

  • Join statistics would be anchored to a pg_class entry (a new relkind or a view-like object)
  • The join definition lives in the view/relation definition
  • Per-column stats go in pg_statistic (no catalog changes needed)
  • Extended stats go in pg_statistic_ext (no catalog changes needed)
  • Export/import works "for free" because existing infrastructure handles relations

He further proposed a "third way" where statistics objects themselves become pg_class objects, with stxkeys/stxexprs becoming pg_attribute rows, making per-column stat targets adjustable via ALTER STATISTICS foo ALTER COLUMN.

The key motivating argument: a view-based approach gives you per-column weighted statistics (e.g., the MCV of B.name filtered/weighted by how often B rows actually join to A) using the existing pg_statistic infrastructure, plus the ability to declare arbitrary extended statistics subsets on the join result.

4. Tomas Vondra's Rebuttal of the View Approach

Tomas pushed back on several grounds:

  • It "uses views as a workaround to store the join definition, nothing else"
  • The hard problems (sampling, query matching) are identical regardless of catalog representation
  • It's "weird to require creating a new relation just for this"
  • CREATE STATISTICS was already envisioned to cover joins

However, he acknowledged it might be "independently useful" and asked Corey to elaborate on the concrete wins. He also noted that the catalog/grammar work is the minor part of this patch — sampling and plan-time matching are what matter.

5. Corey's DB2 Precedent Correction

Corey corrected the earlier assumption that DB2 lacked extended statistics when they introduced view-based join stats. DB2's RUNSTATS supports multi-column distribution statistics (equivalent to our CREATE STATISTICS), so their choice of view-based join stats was made alongside extended stats capability, not in its absence. This strengthens the view approach's credibility as a deliberate design choice rather than a workaround.

6. Corey's Sampling Simplification Argument

Corey suggested the sampling implementation could be simplified: replace the anchor table with an EphemeralNamedRelation of the sampled rows and "just run that, letting SPI figure out which indexes can be used." This contrasts with Alexandra's current direct-index-probe approach and suggests a potentially simpler (if less performant) implementation path.

7. Vancouver Unconference Reference

Multiple participants reference discussions from a recent PGCon/Vancouver unconference on this topic, suggesting offline consensus-building that isn't fully captured in the mailing list. Corey notes "Vancouver was a whirlwind of ideas around this topic."