Extended statistics improvement: multi-column MCV missing values

First seen: 2026-05-18 16:09:03+00:00 · Messages: 2 · Participants: 2

Latest Update

2026-05-20 · claude-opus-4-6

Extended Statistics Improvement: Multi-Column MCV Missing Values

Core Problem

PostgreSQL's extended statistics system has a significant blind spot in how it handles multi-column Most Common Values (MCV) lists when a queried value combination is absent from the list. Currently, when the planner looks up a combination like (col_a=1, col_b=1, col_c=1) in a multi-column MCV list and doesn't find it, it completely discards the MCV information and falls back to multiplying individual per-column selectivities under an independence assumption. This is architecturally wasteful because the MCV list implicitly encodes an upper bound on the frequency of any missing combination.

Why This Matters Architecturally

The independence assumption is PostgreSQL's most naive estimation strategy for multi-column predicates. The entire purpose of extended statistics (introduced in PG10+) is to overcome this assumption by capturing cross-column correlations. However, the current implementation only leverages MCV data for hits — combinations that appear in the list. For misses, the system behaves as if the extended statistics don't exist at all.

In the demonstrated test case, this leads to an estimate of 5,909 rows versus an actual of 0 rows — a ~6000x overestimate. This causes the planner to choose an expensive sequential/index scan with filtering rather than a targeted bitmap index scan, resulting in a 155x performance difference (14.6ms vs 0.094ms).

The pathological scenario is specifically constructed but represents a real-world pattern: when individual column values are very common (appearing as MCVs individually) but their combination is rare or nonexistent due to anti-correlation. This is precisely the scenario extended statistics are designed to handle, yet they currently fail silently.

Proposed Solutions

Solution 1: Cap Selectivity Using Last MCV Item Frequency

Mechanism: If a value combination is absent from the multi-column MCV list, bound its selectivity estimate by the frequency of the least-common item in the MCV list (the item at index = MCV_list_size - 1).

Rationale: By construction, the MCV list contains the N most frequent combinations. Any combination not in the list must have a frequency ≤ the least frequent item in the list. This is a strict mathematical bound, not a heuristic.

Result in test case: Estimate drops from 5,909 to 117 rows (frequency 0.001167 × 100,000), which leads to a bitmap index scan plan.

Tradeoffs:

Solution 2: Mirror Single-Column Non-MCV Estimation Using ndistinct

Mechanism: Apply the same logic PostgreSQL uses for single-column values not in the MCV list: distribute the remaining probability mass (1 - Σ MCV frequencies) uniformly across the remaining distinct value combinations (ndistinct - MCV_list_size).

Formula: P(missing combination) = (1 - sum(MCV_frequencies)) / (ndistinct_combinations - MCV_list_size)

Result in test case: Estimate drops to 7 rows ((1 - 0.3946) / (8511 - 100) × 100,000 ≈ 7), which is much closer to the actual 0.

Tradeoffs:

Key Design Considerations

Interaction with Existing Statistics Infrastructure

The ndistinct extended statistics (CREATE STATISTICS (ndistinct)) already exist as a separate statistic type. Solution 2 can leverage these when both MCV and ndistinct are defined on the same column set. However, requiring users to create both statistics types is an ergonomic concern — ideally the MCV statistic should be self-contained.

Partial Column Matches

The proposal mentions needing ndistinct for "partial or entire column match." This refers to cases where a query constrains only a subset of the MCV's columns (e.g., WHERE col_a=1 AND col_b=1 when the MCV covers (col_a, col_b, col_c)). Computing the appropriate ndistinct for subsets adds complexity.

AND-Clause Limitation

As noted by the respondent, the upper-bound approach (Solution 1) only applies to conjunctive (AND) clauses. For OR-clauses or more complex Boolean expressions over the MCV columns, different reasoning would be needed.

Complexity vs. Benefit Tradeoff

The respondent raises an important architectural concern: while estimate_num_groups can compute ndistinct without extended statistics, the added code complexity of integrating this into the MCV miss path needs justification. The simpler cap (Solution 1) may provide most of the practical benefit with minimal code changes.

Implementation Implications

The relevant code paths are in src/backend/statistics/mcv.c (MCV selectivity estimation) and src/backend/statistics/extended_stats.c. The key function is likely mcv_clauselist_selectivity() which currently returns early without contribution when no matching MCV item is found. The patch would need to:

  1. Detect the "miss" case after scanning the MCV list
  2. For Solution 1: return min(independent_estimate, last_mcv_frequency)
  3. For Solution 2: compute (1 - mcv_totalfreq) / (ndistinct - mcv_nitems) using available ndistinct information

Storage changes (for embedded ndistinct) would affect pg_statistic_ext_data and the MCV serialization format in src/backend/statistics/mcv.c.