[PATCH] Fix overflow and underflow in regr_r2()

First seen: 2026-04-27 11:18:38+00:00 · Messages: 10 · Participants: 3

Latest Update

2026-05-18 · claude-opus-4-6

Deep Technical Analysis: Fix Overflow and Underflow in regr_r2() (and regr_intercept())

The Core Problem

PostgreSQL's statistical regression aggregate functions operate on float8 (IEEE 754 double-precision) values internally. When input data has very large or very small magnitudes, intermediate products in the final-stage computations can overflow to infinity or underflow to zero, even when the mathematically correct result is perfectly finite and well-defined.

The specific function initially reported is regr_r2(), which computes the coefficient of determination (R²). Its formula involves:

R² = (Sxy * Sxy) / (Sxx * Syy)

where Sxy, Sxx, and Syy are sums of squares/cross-products accumulated by float8_regr_accum(). When inputs are scaled near the extremes of double-precision range (e.g., values around 1e-100), the products Sxy * Sxy and Sxx * Syy can individually underflow to zero, producing 0/0 = NaN, even though the ratio itself is finite. The thread's motivating example demonstrates perfectly correlated data returning NaN instead of 1.0.

This is architecturally significant because:

  1. Consistency: The sister function corr() already received an analogous fix in commit 6498287696d for PostgreSQL 19, using a numerically stabilized computation. regr_r2() computes essentially the same quantity (R² = corr²) but lacked the same protection.

  2. SQL Standard compliance: regr_r2() is a SQL-standard aggregate. Returning NaN for valid, perfectly correlated input violates reasonable expectations of correctness.

  3. Broader pattern: The investigation revealed the same class of vulnerability in regr_intercept(), which computes (Sy - Sx * Sxy / Sxx) / N — the three-term product Sx * Sxy / Sxx is equally susceptible to intermediate overflow/underflow.

Evolution of the Solution

regr_r2(): Direct Stabilization (v1 → v2)

v1 (Chengpeng Yan): Proposed extracting the stabilized denominator computation from corr() into a shared helper function, with regr_r2() calling it as a fallback when the direct product overflows or underflows.

v2 (Dean Rasheed): Refined the approach significantly:

The stabilization strategy mirrors corr(): when the direct computation yields zero or infinity, fall back to a formulation that avoids computing the raw products, instead working with ratios that stay in representable range.

regr_intercept(): The frexp/ldexp Technique

Tom Lane's audit of other float8_regr_accum-consuming aggregates identified regr_intercept() as vulnerable. The expression:

(Sy - Sx * Sxy / Sxx) / N

contains the sub-expression Sx * Sxy / Sxx where all three terms can be extremely large or small simultaneously.

Key design debate — evaluation order:

Tom's log-space idea: Compute exp(log(Sx) + log(Sxy) - log(Sxx)) to avoid intermediate overflow entirely. Rejected as requiring extensive special-case handling for zero and negative values of Sx and Sxy.

Final solution — frexp/ldexp decomposition: Dean proposed the elegant approach of using C99's frexp() and ldexp() to separate mantissa and exponent:

offset = Sx * Sxy / Sxx;
if (offset == 0 || isinf(offset))
{
    // Decompose into mantissa × 2^exponent
    m_Sx  = frexp(Sx,  &n_Sx);
    m_Sxy = frexp(Sxy, &n_Sxy);
    m_Sxx = frexp(Sxx, &n_Sxx);

    // Compute in mantissa space (all values in [0.5, 1.0))
    m_offset = m_Sx * m_Sxy / m_Sxx;
    // Reconstruct with combined exponent
    n_offset = n_Sx + n_Sxy - n_Sxx;
    offset = ldexp(m_offset, n_offset);
}
PG_RETURN_FLOAT8((Sy - offset) / N);

This is effectively manual extended-exponent arithmetic. The mantissas from frexp() are in [0.5, 1.0), so their product and quotient cannot overflow or underflow. The exponent arithmetic happens in integer space. ldexp() then reconstructs the result, which may itself overflow/underflow only if the true mathematical result is unrepresentable — not due to intermediate computation artifacts.

Key design considerations:

Versioning and Backpatch Decision

There is consensus that this is technically a bug fix but should not be back-patched:

Architectural Significance

This thread illustrates a recurring challenge in PostgreSQL's floating-point aggregate implementations: the final-stage computation (performed in float8_* finalize functions) operates on accumulated statistics that can span the full double-precision range. While the accumulation itself uses Kahan-style or Youngs-Cramer algorithms for numerical stability of summation, the ratio computations in the finalize step have historically been naive.

The frexp/ldexp technique demonstrated for regr_intercept() is particularly noteworthy as a general pattern for avoiding intermediate overflow/underflow in any product-of-ratios computation — it's simpler and more robust than log-space computation, handles sign correctly automatically (frexp preserves sign), and requires no special-casing for zero/negative values beyond what frexp already provides.