Deep Technical Analysis: Fix Overflow and Underflow in regr_r2() (and regr_intercept())
The Core Problem
PostgreSQL's statistical regression aggregate functions operate on float8 (IEEE 754 double-precision) values internally. When input data has very large or very small magnitudes, intermediate products in the final-stage computations can overflow to infinity or underflow to zero, even when the mathematically correct result is perfectly finite and well-defined.
The specific function initially reported is regr_r2(), which computes the coefficient of determination (R²). Its formula involves:
R² = (Sxy * Sxy) / (Sxx * Syy)
where Sxy, Sxx, and Syy are sums of squares/cross-products accumulated by float8_regr_accum(). When inputs are scaled near the extremes of double-precision range (e.g., values around 1e-100), the products Sxy * Sxy and Sxx * Syy can individually underflow to zero, producing 0/0 = NaN, even though the ratio itself is finite. The thread's motivating example demonstrates perfectly correlated data returning NaN instead of 1.0.
This is architecturally significant because:
-
Consistency: The sister function
corr()already received an analogous fix in commit6498287696dfor PostgreSQL 19, using a numerically stabilized computation.regr_r2()computes essentially the same quantity (R² = corr²) but lacked the same protection. -
SQL Standard compliance:
regr_r2()is a SQL-standard aggregate. ReturningNaNfor valid, perfectly correlated input violates reasonable expectations of correctness. -
Broader pattern: The investigation revealed the same class of vulnerability in
regr_intercept(), which computes(Sy - Sx * Sxy / Sxx) / N— the three-term productSx * Sxy / Sxxis equally susceptible to intermediate overflow/underflow.
Evolution of the Solution
regr_r2(): Direct Stabilization (v1 → v2)
v1 (Chengpeng Yan): Proposed extracting the stabilized denominator computation from corr() into a shared helper function, with regr_r2() calling it as a fallback when the direct product overflows or underflows.
v2 (Dean Rasheed): Refined the approach significantly:
- Rejected the helper function — the shared code is only a few lines, and introducing a helper for two call sites adds indirection without meaningful benefit. The fix is kept local to
float8_regr_r2(). - Removed the premature
Sxy == 0optimization — the original patch added a fast-path return of0.0whenSxy == 0, but Dean correctly identified this as an optimization for an extremely rare case that adds a branch to the common code path. The general fallback code handles it fine. - Improved test reliability — the original overflow test case (
1e154 * g) only produced the expected result of1due to reducedextra_float_digitssettings. Dean substituted1e100 + g * 1e95, which triggers the overflow on HEAD but produces1reliably across platforms andextra_float_digitssettings.
The stabilization strategy mirrors corr(): when the direct computation yields zero or infinity, fall back to a formulation that avoids computing the raw products, instead working with ratios that stay in representable range.
regr_intercept(): The frexp/ldexp Technique
Tom Lane's audit of other float8_regr_accum-consuming aggregates identified regr_intercept() as vulnerable. The expression:
(Sy - Sx * Sxy / Sxx) / N
contains the sub-expression Sx * Sxy / Sxx where all three terms can be extremely large or small simultaneously.
Key design debate — evaluation order:
- Tom's initial suggestion: Parenthesize as
Sx * (Sxy / Sxx)to computeSxy / Sxxfirst, since both sums-of-products are likely similarly scaled. - Dean's counterexample: Demonstrated a case where
Sxy / Sxxoverflows (Sxx ~ 1e-319,Sxy ~ 1e-9→ ratio ~1e310) while the current unparenthesized order works fine. - Dean's alternative:
Sxy * (Sx / Sxx)— sinceSxandSxxtrack similar magnitudes (both grow with x-values), their ratio is unlikely to overflow. But acknowledged this may also have counterexamples.
Tom's log-space idea: Compute exp(log(Sx) + log(Sxy) - log(Sxx)) to avoid intermediate overflow entirely. Rejected as requiring extensive special-case handling for zero and negative values of Sx and Sxy.
Final solution — frexp/ldexp decomposition: Dean proposed the elegant approach of using C99's frexp() and ldexp() to separate mantissa and exponent:
offset = Sx * Sxy / Sxx;
if (offset == 0 || isinf(offset))
{
// Decompose into mantissa × 2^exponent
m_Sx = frexp(Sx, &n_Sx);
m_Sxy = frexp(Sxy, &n_Sxy);
m_Sxx = frexp(Sxx, &n_Sxx);
// Compute in mantissa space (all values in [0.5, 1.0))
m_offset = m_Sx * m_Sxy / m_Sxx;
// Reconstruct with combined exponent
n_offset = n_Sx + n_Sxy - n_Sxx;
offset = ldexp(m_offset, n_offset);
}
PG_RETURN_FLOAT8((Sy - offset) / N);
This is effectively manual extended-exponent arithmetic. The mantissas from frexp() are in [0.5, 1.0), so their product and quotient cannot overflow or underflow. The exponent arithmetic happens in integer space. ldexp() then reconstructs the result, which may itself overflow/underflow only if the true mathematical result is unrepresentable — not due to intermediate computation artifacts.
Key design considerations:
- Portability:
frexp()andldexp()are C99 standard, and PostgreSQL already usesldexp()inpg_prng.c. - Inf/NaN guard: Tom noted that
frexp()behavior on Inf/NaN inputs may be implementation-defined, but this only matters in the fallback path (where inputs shouldn't be Inf/NaN since they come from accumulation of finite values). - Final subtraction overflow: Dean analyzed whether
Sy - offsetcould overflow before dividing byN, concluding it cannot in practice — for non-degenerate (non-horizontal/vertical) regression lines,SxxandSyygrow quadratically relative toSxandSy, so they overflow first, and the intercept cannot exceed approximately1e169.
Versioning and Backpatch Decision
There is consensus that this is technically a bug fix but should not be back-patched:
- No prior complaints about the behavior suggest it rarely affects users in practice.
- The analogous
corr()fix (6498287696d) was applied to HEAD only and is new as of v19. - Consistency argues for shipping the
regr_r2()fix alongside thecorr()fix in v19. - Tom Lane suggested getting RMT (Release Management Team) concurrence for including it in v19, which Dean pursued.
Architectural Significance
This thread illustrates a recurring challenge in PostgreSQL's floating-point aggregate implementations: the final-stage computation (performed in float8_* finalize functions) operates on accumulated statistics that can span the full double-precision range. While the accumulation itself uses Kahan-style or Youngs-Cramer algorithms for numerical stability of summation, the ratio computations in the finalize step have historically been naive.
The frexp/ldexp technique demonstrated for regr_intercept() is particularly noteworthy as a general pattern for avoiding intermediate overflow/underflow in any product-of-ratios computation — it's simpler and more robust than log-space computation, handles sign correctly automatically (frexp preserves sign), and requires no special-casing for zero/negative values beyond what frexp already provides.