Eager aggregation, take 3

First seen: 2024-03-04 08:27:24+00:00 · Messages: 131 · Participants: 11

Latest Update

2026-06-01 · claude-opus-4-6

Eager Aggregation in PostgreSQL: Deep Technical Analysis

Core Problem

Eager aggregation is a query optimization technique that partially pushes a GROUP BY aggregation past a join operation, performing partial aggregation on one side of the join before the join executes, then finalizing the aggregation after all relations are joined. The key insight is that if partial aggregation dramatically reduces the row count of one join input, the join itself becomes much cheaper, potentially yielding a significantly better overall plan.

The canonical example: given SELECT G, AGG(A) FROM R1 JOIN R2 ON J GROUP BY G, if R1 has millions of rows but only a few distinct values of the join key, partially aggregating R1 before the join can reduce millions of rows to a handful, making the subsequent join orders of magnitude faster.

This technique was formally described in the 1995 VLDB paper "Eager Aggregation and Lazy Aggregation" and had two prior unsuccessful implementation attempts in PostgreSQL (2017 by Antonin Houska, 2022 reopened and closed again). Both fell into the "rebase-feedback-rebase" death spiral.

Architectural Challenges

1. Correctness of the Transformation

The central theoretical question: under what conditions can partial aggregation be commuted with a join without changing query semantics?

The proof (developed through extensive discussion between Richard Guo and Robert Haas) establishes that the transformation GROUP BY G, AGG(A) on (R1 JOIN R2) equals GROUP BY G, AGG(agg_A) on ((GROUP BY G1, AGG(A) AS agg_A on R1) JOIN R2) when:

  1. AGG is decomposable — it supports partial/finalize two-stage computation
  2. G1 includes all columns from R1 in both G and J — the grouping keys for the partial aggregation must cover both the original grouping expressions and all join-relevant columns
  3. Operator compatibility — the grouping operator must be compatible with the join operator (equality implies image equality)

The crucial invariant is the "same destiny" principle: all rows within a partial group must either all match or all fail to match any given row on the other side of the join. This is guaranteed by including all join-relevant columns in the partial grouping keys.

2. Outer Join Safety

Partial aggregation cannot be pushed to the nullable side of an outer join. If it were, NULL-extended rows produced by the outer join would not be available during partial aggregation, potentially changing grouping behavior. The patch uses Tom Lane's outer-join-aware-Var infrastructure to detect when columns are nullable by an outer join above the target relation.

A bug discovered post-commit (May 2026) revealed that anti-joins and semi-joins were not properly handled. Pushing partial aggregation past an anti-join is semantically incorrect because the aggregate of unmatched rows changes when pre-aggregation reduces duplicates that would have been eliminated by the anti-join.

3. The RelOptInfo Design Controversy

The most architecturally significant debate concerned how to represent grouped paths within the planner's data structures. Three competing approaches were discussed:

Approach A: Separate RelOptInfos per aggregation location — One RelOptInfo for PartialAgg(t1 JOIN t2) and another for t1 JOIN PartialAgg(t2). Tom Lane initially favored this because it preserves the invariant that all paths in a RelOptInfo produce the same row set. However, Richard demonstrated this leads to 2^(n-1) RelOptInfos for n relations — exponential explosion.

Approach B: GroupPathInfo annotation (like ParamPathInfo) — Annotate each path with metadata about where partial aggregation occurs. Richard explored this but concluded it was unnecessary because, unlike parameterized paths where fewer required outer rels means fewer join restrictions, the location of partial aggregation imposes no such constraints on further planning.

Approach C (adopted): Push only to lowest level — A heuristic that avoids the problem entirely by only pushing partial aggregation to the lowest applicable join level. This ensures all grouped paths for the same grouped relation produce the same row set, eliminating the need for special comparison logic or multiple RelOptInfos.

Tom Lane's key objection: "a partial-aggregation path does not generate the same data as an unaggregated path, no matter how fuzzy you are willing to be about the concept. So I'm having a very hard time accepting that it ought to be part of the same RelOptInfo." This was resolved by the final design using a separate grouped_rel field on each RelOptInfo, effectively creating companion RelOptInfos without exponential growth.

4. Cost Estimation Concerns

Robert Haas repeatedly raised the concern that PostgreSQL's aggregate cardinality estimates are the weakest part of its statistics system. After aggregation, MCVs become useless, and the distribution of output columns is entirely unknown. This means:

The pragmatic resolution: benchmark results (TPC-DS at scale 10) showed significant improvements (3-4x on q4, q11) with no notable regressions, suggesting the cost model works well enough in practice despite theoretical concerns.

5. Planning Overhead

The patch inherently multiplies planning work by exploring partial aggregation at each join level. Several mechanisms limit this:

TPC-DS query 64 (38-table join) showed the worst planning time increase: 7.2s → 9.4s. For typical queries, planning overhead is negligible.

6. Memory Blowout Risk

Aggregates like string_agg() and array_agg() accumulate unbounded transition state data. If partial aggregation creates many groups, materializing all their transition states (e.g., in a hash table) could exhaust memory.

The solution evolved through discussion: rather than hard-coding excluded functions, the patch reuses the aggtransspace catalog field. A negative value of aggtransspace indicates that the transition state can grow unboundedly. This preserves PostgreSQL's extensibility model — authors of custom aggregates can mark their functions appropriately.

Key Design Decisions

Adopted Heuristics

  1. Always push to lowest applicable level: Eliminates exponential path explosion while still finding beneficial plans
  2. Minimum group size threshold: Prunes obviously unprofitable cases early
  3. Image equality requirement: Uses BTEQUALIMAGE_PROC to ensure grouping operators don't merge rows that join operators would distinguish (prevents issues with types like NUMERIC where 0 ≠ 0.0 in byte representation)
  4. RLS safety check: Prevents pushing non-leakproof aggregates past security quals

What Was NOT Implemented

Post-Commit Issues

After the October 2025 commit:

  1. Non-deterministic collation bug (March 2026): Was checking data type's default collation rather than the expression's actual collation
  2. Volatile function bug (March 2026): Pushing aggregates with volatile functions below a join alters their execution count
  3. Anti-join/Semi-join bug (May 2026): The "same destiny" invariant is violated for anti-joins because pre-aggregation changes which rows are "unmatched"

Performance Results

TPC-DS Scale Factor 1 (with nestloop disabled due to CTE estimation issues):

The dramatic improvements in q4/q11 demonstrate that eager aggregation can be transformative for star-schema queries where dimension table joins inflate row counts before aggregation.