2026-05-06 · opus 4.7

Wrong Results in `remove_useless_groupby_columns()` — Collation and Opfamily Mismatches

Core Problem

remove_useless_groupby_columns() is a planner optimization in src/backend/optimizer/plan/planner.c that exploits functional dependencies proven by unique indexes: if a base relation has a NOT NULL unique index on columns (c1, ..., ck) and all those columns appear in the GROUP BY list, then any other GROUP BY columns from the same relation are functionally dependent on the unique key and can be dropped. Removing columns from GROUP BY shrinks the sort/hash key and can enable additional plan shapes.

The bug Richard Guo identifies is that the matching between index key columns and GROUP BY columns is done by attno alone. This ignores two crucial semantic dimensions of equality in PostgreSQL:

Collation: For collatable types (text/varchar/etc.), "uniqueness" is collation-dependent when the collation is nondeterministic. Under ICU's und-u-ks-level2 (case-insensitive), 'foo' and 'FOO' are equal; under "C" they are distinct. A unique index built under "C" proves no uniqueness for the column's own collation ci.
Operator family (opfamily): The equality semantics of a unique index are those of its index opclass's opfamily. GROUP BY, in contrast, uses the SortGroupClause.eqop, which is normally the type's default equality (via the default btree/hash opfamily). When these differ — the canonical example being record_image_ops (bytewise) vs. the default record_ops (logical per-field equality) — the index's notion of distinctness is strictly finer than GROUP BY's, so rows the index considers distinct may collapse into one group.

Why this produces wrong results

Both reproducers exploit the same pattern:

Two rows are distinct under the unique index's equality, so the index is legal.
The two rows are equal under GROUP BY's equality, so they form a single group.
The planner, trusting the index, declares the extra GROUP BY column b functionally determined by a and drops it.
At execution, the two rows merge into one group, but b — which took different values in those rows — is no longer aggregated/tracked, so one row's b value is silently lost.

This is a correctness bug, not merely a missed optimization: the query returns fewer rows than required by the grouping semantics.

Architectural Significance

This is part of a broader class of bugs Richard has been mining (he opens with "I seem to have fallen into a rabbit-hole around equality-relation issues"), where planner code written in an era of deterministic collations and implicit single-opfamily-per-type assumptions has not been systematically audited for:

Nondeterministic collations (added in PG12)
Multiple opfamilies per type (especially record_image_ops, array equality variants, JSON equality etc.)

Any planner proof based on "column X uniquely determines column Y" must be checked against the specific equality operator that the consuming clause (GROUP BY, DISTINCT, join, semijoin) will use. Unique indexes, EquivalenceClasses, and functional-dependency reasoning all need opfamily + collation qualifiers.

Similar bugs have been found in:

innerrel_is_unique() / unique-join detection
DISTINCT removal
Self-join elimination
EquivalenceClass construction with nondeterministic collations

Proposed Fix

Richard's patch (attached but not quoted in the email) conceptually must extend the matching loop inside remove_useless_groupby_columns() so that an index column i matches a GROUP BY item only when:

index->indexkeys[i] == var->varattno (existing check), and
The index column's collation equals the Var's collation (or the column is non-collatable), and
The index column's opfamily contains the SortGroupClause.eqop — i.e., the GROUP BY equality operator is a member of the same opfamily that the unique index relies upon for its uniqueness guarantee.

Condition (3) is the standard PostgreSQL idiom for "these two equality notions agree," used in many places (e.g., op_in_opfamily, get_mergejoin_opfamilies). Condition (2) mirrors what was done elsewhere for nondeterministic collations: compare indcollation[i] against the Var's varcollid.

Key Tradeoffs and Design Considerations

Conservative vs. precise: The simplest fix is to bail out whenever collation/opfamily doesn't match, losing the optimization for those cases. A more aggressive fix could still use the index when the GROUP BY equality is known to be a refinement of the index equality — but in general that is not safe (and for the wrong-results case here, the relationship is the opposite: index equality is finer than GROUP BY equality).
Backpatching: Because this is a wrong-results bug (data loss in query output), it is a backpatch candidate to all supported branches, subject to whether the fix is minimal enough. The nondeterministic-collation aspect only applies from PG12; the opfamily aspect (with record_image_ops) is older.
Test coverage: The two reproducers Richard provides are excellent regression test material and target the two independent failure axes.

Technical Insights

The SortGroupClause.eqop field is the authoritative source of GROUP BY equality semantics and must drive any functional-dependency reasoning — attno alone is never sufficient in a system with pluggable opfamilies.
record_image_ops is a recurring trap because it is the canonical example of a non-default opfamily on a commonly used type, and its bytewise equality is strictly finer than logical record equality (numeric values with different display scales being the classic trigger).
Nondeterministic collations break a long-standing planner assumption that "same attno ⇒ same equality." Each site making that assumption must be audited.

Wrong results in remove_useless_groupby_columns()

Latest Update

Wrong Results in `remove_useless_groupby_columns()` — Collation and Opfamily Mismatches

Core Problem

Why this produces wrong results

Architectural Significance

Proposed Fix

Key Tradeoffs and Design Considerations

Technical Insights

Wrong results in remove_useless_groupby_columns()

Latest Update

Wrong Results in remove_useless_groupby_columns() — Collation and Opfamily Mismatches

Core Problem

Why this produces wrong results

Architectural Significance

Proposed Fix

Key Tradeoffs and Design Considerations

Technical Insights

Wrong Results in `remove_useless_groupby_columns()` — Collation and Opfamily Mismatches