Extended Statistics Improvement: Multi-Column MCV Missing Values — May 2026 Summary
Problem Statement
PostgreSQL's extended statistics system wastes multi-column MCV information when a queried value combination is absent from the MCV list. Currently, on a "miss," the planner discards all MCV data and falls back to naive independence assumption (multiplying per-column selectivities). This defeats the purpose of extended statistics, which exist precisely to capture cross-column correlations.
The demonstrated pathology: individual column values are very common but their combination is rare/nonexistent due to anti-correlation. In the test case, this produces a ~6000x overestimate (5,909 estimated vs 0 actual rows) and a 155x performance regression (14.6ms vs 0.094ms) due to wrong plan choice.
Two Proposed Solutions
Solution 1: Cap at Last MCV Item Frequency
- If a combination is absent from the MCV list, its true frequency must be ≤ the least-common item in the list (mathematical invariant, not a heuristic)
- O(1) computation, no additional storage, no format changes
- Test case: estimate drops from 5,909 → 117 rows, enabling bitmap index scan
- Only applies to AND-clauses (conjunctions)
Solution 2: Distribute Remaining Mass Using ndistinct
- Formula:
P(miss) = (1 - Σ MCV_frequencies) / (ndistinct_combinations - MCV_list_size) - Mirrors PostgreSQL's single-column non-MCV estimation logic
- Test case: estimate drops to ~7 rows (much closer to actual 0)
- Requires ndistinct information — either from existing
CREATE STATISTICS (ndistinct)or embedded directly in the MCV structure - Risk of underestimation if ndistinct estimate is inaccurate
Key Development: v1 Patch Submitted
Enrique Sánchez submitted a 4-commit patch series implementing Solution 1 (the MCV frequency cap). This moves the thread from design discussion to active code review.
Patch Details
- Implements the cap for AND-clauses involving
=,IN, andANYoperators - Split into 4 commits for reviewability
- Proposes indirect OR-clause improvement via inclusion-exclusion:
P(A OR B) = P(A) + P(B) - P(A AND B)— improving the overlap term benefits OR estimates without separate logic
Solution 2 Implementation Strategy (Planned Follow-up)
Enrique outlines two approaches and favors embedding ndistinct directly in the MCV struct:
- External: Check for existing ndistinct extended statistic on same column set
- Embedded (preferred): Store ndistinct within MCV structure itself, requiring a bump to
mcvlist->type(serialization version)
The rationale for embedding: negligible storage cost relative to MCV list size, makes MCV self-contained, avoids requiring users to create separate ndistinct statistics.
Affected Code Paths
src/backend/statistics/mcv.c— MCV selectivity estimation, serialization formatsrc/backend/statistics/extended_stats.c— extended statistics infrastructure- Key function:
mcv_clauselist_selectivity()— currently returns without contribution on miss - Storage impact (Solution 2 only):
pg_statistic_ext_dataformat, MCV serialization
Status at Month End
The development strategy is sound: commit the mathematically safe upper bound (Solution 1) first as a minimal, correct fix, then layer the more accurate ndistinct-based estimation (Solution 2) as a follow-up. The v1 patch awaits community review.