2026-06-04 · claude-opus-4-6

Extended Statistics Improvement: Multi-Column MCV Missing Values — May 2026 Summary

Problem Statement

PostgreSQL's extended statistics system wastes multi-column MCV information when a queried value combination is absent from the MCV list. Currently, on a "miss," the planner discards all MCV data and falls back to naive independence assumption (multiplying per-column selectivities). This defeats the purpose of extended statistics, which exist precisely to capture cross-column correlations.

The demonstrated pathology: individual column values are very common but their combination is rare/nonexistent due to anti-correlation. In the test case, this produces a ~6000x overestimate (5,909 estimated vs 0 actual rows) and a 155x performance regression (14.6ms vs 0.094ms) due to wrong plan choice.

Two Proposed Solutions

Solution 1: Cap at Last MCV Item Frequency

If a combination is absent from the MCV list, its true frequency must be ≤ the least-common item in the list (mathematical invariant, not a heuristic)
O(1) computation, no additional storage, no format changes
Test case: estimate drops from 5,909 → 117 rows, enabling bitmap index scan
Only applies to AND-clauses (conjunctions)

Solution 2: Distribute Remaining Mass Using ndistinct

Formula: P(miss) = (1 - Σ MCV_frequencies) / (ndistinct_combinations - MCV_list_size)
Mirrors PostgreSQL's single-column non-MCV estimation logic
Test case: estimate drops to ~7 rows (much closer to actual 0)
Requires ndistinct information — either from existing CREATE STATISTICS (ndistinct) or embedded directly in the MCV structure
Risk of underestimation if ndistinct estimate is inaccurate

Key Development: v1 Patch Submitted

Enrique Sánchez submitted a 4-commit patch series implementing Solution 1 (the MCV frequency cap). This moves the thread from design discussion to active code review.

Patch Details

Implements the cap for AND-clauses involving =, IN, and ANY operators
Split into 4 commits for reviewability
Proposes indirect OR-clause improvement via inclusion-exclusion: P(A OR B) = P(A) + P(B) - P(A AND B) — improving the overlap term benefits OR estimates without separate logic

Solution 2 Implementation Strategy (Planned Follow-up)

Enrique outlines two approaches and favors embedding ndistinct directly in the MCV struct:

External: Check for existing ndistinct extended statistic on same column set
Embedded (preferred): Store ndistinct within MCV structure itself, requiring a bump to mcvlist->type (serialization version)

The rationale for embedding: negligible storage cost relative to MCV list size, makes MCV self-contained, avoids requiring users to create separate ndistinct statistics.

Affected Code Paths

src/backend/statistics/mcv.c — MCV selectivity estimation, serialization format
src/backend/statistics/extended_stats.c — extended statistics infrastructure
Key function: mcv_clauselist_selectivity() — currently returns without contribution on miss
Storage impact (Solution 2 only): pg_statistic_ext_data format, MCV serialization

Status at Month End

The development strategy is sound: commit the mathematically safe upper bound (Solution 1) first as a minimal, correct fix, then layer the more accurate ndistinct-based estimation (Solution 2) as a follow-up. The v1 patch awaits community review.

2026-06-04 · claude-opus-4-6

Incremental Update: v2 Patch Submitted, Scope Narrowed Based on Review Feedback

What's New

Two new reviewers have provided substantive design feedback that has influenced the patch direction, and a v2 patch has been submitted with significantly reduced scope.

Key Technical Developments

1. Bug Identified in v1: Column Coverage Check Was Incorrect (Zsolt Parragi)

Zsolt identified a concrete bug in v1: the patch checked whether the number of clauses matched the number of MCV dimensions, but this doesn't ensure all dimensions are actually covered. A query like WHERE a = 0 AND a = ANY(ARRAY[0,9]) AND b = ANY(ARRAY[1,2,3]) has 3 clauses matching a 3-column MCV on (a, b, c), but column c is unconstrained and column a is referenced twice. The v1 patch would incorrectly apply the cap in this case, producing a significant underestimate (~5,200 vs actual ~30,000) — worse than the master estimate (~11,500).

This validates the design concern raised by chengpeng_yan: the patch must verify that all MCV dimensions have a corresponding equality clause, not just count clauses.

2. Scope Reduction: IN/ANY and OR Deferred (chengpeng_yan's influence)

chengpeng_yan argued that IN/ANY handling has semantic complexity (duplicate array elements, NULLs, empty arrays) that deserves separate treatment. Enrique agreed and removed IN/ANY from v2, narrowing to only:

Equality (=)
IS NULL
Boolean operations (= TRUE, = FALSE)

3. Expression Filtering Requirement Identified

Enrique noted that clauses containing expressions (e.g., mod(a, 7) = 0) must be excluded since they don't represent single-value lookups in the MCV. This is a new validation requirement not discussed previously.

4. Design Decision: ndistinct Embedding Deferred

chengpeng_yan explicitly argued against embedding ndistinct inside the MCV structure in the first patch, suggesting instead using matching ndistinct statistics when they already exist. Enrique accepted this position — the v2 patch uses only the cap (Solution 1), with ndistinct integration planned as a follow-up.

5. Array Deduplication Non-Goal

Enrique demonstrated that PostgreSQL already doesn't deduplicate IN arrays (e.g., a IN (2, 2) estimates 200 rows instead of 100), so the patch should be consistent with existing behavior rather than trying to fix this orthogonal issue.

v2 Patch Summary

The v2 patch covers only the minimal case: full-dimensional top-level AND equality miss with the least-MCV-frequency cap. The ndistinct-based estimation is not yet included but promised as an imminent follow-up.

Extended statistics improvement: multi-column MCV missing values

Latest Update