Technical Analysis: Set calcSumX2 = true in numeric_(poly_)deserialize
Core Problem
This thread addresses a subtle but semantically important bug in PostgreSQL's aggregate deserialization functions for variance/standard deviation aggregates. The issue lies at the intersection of the aggregate state machine lifecycle and how external consumers (extensions) may interact with deserialized aggregate state.
Background: PostgreSQL's Aggregate Combine/Serialize/Deserialize Pipeline
PostgreSQL's parallel aggregation framework splits aggregate computation into phases:
- Transition (
transfn): accumulates values into internal state - Serialize (
serialfn): converts internal state tobyteafor inter-process transfer - Deserialize (
deserialfn): reconstructs internal state frombytea - Combine (
combinefn): merges two partial states
For variance-family aggregates (var_pop, var_samp, stddev_pop, stddev_samp, regr_*), the internal state tracks three quantities:
- N (count)
- sumX (sum of values)
- sumX2 (sum of squared values) — needed for variance computation
The calcSumX2 flag in the aggregate state structure controls whether do_numeric_accum() and do_int128_accum() will accumulate the squared-value sum. This flag exists because simpler aggregates (like avg) share the same state structure but don't need sumX2.
The Bug
Both numeric_poly_deserialize() and numeric_deserialize() construct their output state with calcSumX2 = false:
/* numeric_poly_deserialize */
result = makeInt128AggStateCurrentContext(false);
/* numeric_deserialize */
result = makeNumericAggStateCurrentContext(false);
This is semantically incorrect for variance/stddev aggregates. Within PostgreSQL's internal parallel aggregation pipeline, the bug is latent because:
- Deserialized state flows directly into
numeric_(poly_)combine, which does arithmetic on the sumX2 field directly without consulting thecalcSumX2flag - The combine function itself creates fresh state with
calcSumX2 = true(the correct value)
However, for external consumers — particularly extensions that deserialize aggregate state and then feed additional values through the transition function — the bug is active: do_numeric_accum() / do_int128_accum() check state->calcSumX2 before accumulating squared values, so deserialized state will silently stop tracking sumX2 while N and sumX continue updating correctly. This produces incorrect variance/stddev results.
Architectural Significance
This is an example of a broader class of issues in PostgreSQL: functions that are "correct" only within the specific call graph the core executor uses, but violate semantic contracts that extensions reasonably rely on. The serialize/deserialize API is public and documented — extensions should be able to round-trip aggregate state through external storage and resume accumulation.
The fix also relates to the principle of least surprise: if numeric_combine already sets calcSumX2 = true (acknowledging the semantic meaning), the deserialize functions should be consistent.
Proposed Fix
The patch is minimal — a two-line change:
/* numeric_poly_deserialize: change false → true */
result = makeInt128AggStateCurrentContext(true);
/* numeric_deserialize: change false → true */
result = makeNumericAggStateCurrentContext(true);
Risk Assessment
- Zero risk to in-tree behavior: The combine functions don't read the flag, so the change is invisible to the existing parallel aggregation pipeline.
- Correctness for extensions: Enables the valid use case of deserializing state and continuing to accumulate via transition functions.
- Consistency: Aligns with
numeric_combine's existing practice of settingcalcSumX2 = true.
Design Considerations
-
Why was it
falseoriginally? Likely an oversight during the original implementation of parallel aggregation (circa PostgreSQL 9.6). The developer may have usedfalseas a default without considering that the deserialized state's flag should match the aggregate's semantic requirements. -
Should the serialize format encode the flag? A more robust approach might serialize the
calcSumX2flag itself, but this would require a format change and is overkill given that the deserialize functions are already specific to variance-family aggregates (they're only registered as deserializers for those aggregates). -
Could this break anything? Setting the flag to
truein deserialized state means that if someone were to call the transition function on deserialized state, it would now attempt to accumulate sumX2. Since the deserialized state already contains a valid sumX2 value (it was serialized from a complete state), this is correct behavior.
Open Questions
- Whether the committers consider this worth backpatching (it affects extensions, not core behavior)
- Whether documentation should note the expected lifecycle of deserialized aggregate state for extension authors