Question about hashed ScalarArrayOpExpr equality semantics

First seen: 2026-05-11 04:34:38+00:00 · Messages: 2 · Participants: 2

Latest Update

2026-05-14 · claude-opus-4-6

Deep Technical Analysis: Hashed ScalarArrayOpExpr Equality Semantics

Core Problem

This thread surfaces a subtle but architecturally significant semantic gap in PostgreSQL's hashed evaluation path for ScalarArrayOpExpr (SAOP). The issue concerns three-valued logic correctness when a user-defined equality operator marked as HASHES returns NULL for non-NULL inputs.

Background: ScalarArrayOpExpr Evaluation Paths

ScalarArrayOpExpr implements expressions like x = ANY(array) and x <> ALL(array). PostgreSQL has two evaluation strategies:

  1. Linear path: Iterates over every array element, calling the comparison operator for each. Faithfully implements SQL three-valued logic — if any comparison returns NULL (rather than true/false), the overall result can be NULL (UNKNOWN) rather than a definite true or false.

  2. Hashed path (introduced in PostgreSQL 14, commit by David Rowley): When the array has ≥ MIN_ARRAY_SIZE_FOR_HASHED_SAOP (9) elements and the operator is hashable, PostgreSQL builds a hash table from the array elements and does a hash probe. This is an O(1) lookup per scalar instead of O(n).

The problem: these two paths can produce different results for the same inputs when the equality operator returns NULL for non-NULL arguments.

The Semantic Contract of HASHES

The crux of the question is: what invariants does the HASHES marking imply about an operator's behavior?

PostgreSQL's hash infrastructure (hash joins, hash indexes, hash aggregation, and the hashed SAOP path) all fundamentally depend on the assumption that if two values hash to the same bucket and the equality operator is called, it returns a definitive boolean — either true or false. A hash table probe cannot meaningfully represent a "maybe matches" state. The hash table either finds a match or it doesn't.

The existing code in test_opexpr_is_hashable() (in nodeSubplan.c or the SAOP evaluation code) already documents this assumption explicitly:

"it can't yield NULL for non-null inputs, either (see nodeSubplan.c). However, hash indexes and hash joins assume that too."

This means the documented contract is that declaring an operator as HASHES (or placing it in a hash operator class) carries an implicit guarantee that the operator will never return NULL for non-NULL inputs. This is a stronger requirement than mere strictness (which only guarantees NULL output when at least one input is NULL).

The Demonstrated Discrepancy

The reporter constructs a deliberately pathological operator === that:

With 8 array elements (linear path): 42 === ANY(ARRAY[1,2,3,4,5,6,7,42]) returns NULL because the comparison 42 === 42 yields NULL, and SQL three-valued ANY semantics propagate that as UNKNOWN.

With 9 array elements (hashed path): 42 === ANY(ARRAY[1,2,3,4,5,6,7,8,42]) returns FALSE because the hash probe either finds a match (true) or doesn't (false) — there's no mechanism to propagate a NULL comparison result from within the hash table.

Architectural Implications

This is fundamentally not a bug but rather a case of user-violated contract. The analysis breaks down as follows:

  1. The hashed path is correct in its assumptions. The existing code comment explicitly states that hashable operators must not return NULL for non-NULL inputs. The hash table implementation is architecturally incapable of representing three-valued logic outcomes — this is inherent to hash-based evaluation, not a design oversight.

  2. PostgreSQL does not enforce this contract at CREATE OPERATOR time. When a user declares HASHES on an operator, PostgreSQL trusts the declaration. There is no runtime validation that the operator actually satisfies the required invariants. This is consistent with PostgreSQL's general approach to operator class declarations — they are trusted metadata.

  3. The linear path is "accidentally" more correct only because it evaluates each comparison individually and can observe NULLs. But the linear path is not the normative behavior — both paths should produce the same result for well-behaved operators, and the contract defines what "well-behaved" means.

  4. Hash joins and hash indexes have the same vulnerability. If you used this pathological operator in a hash join or hash index, you would get similarly incorrect results. The SAOP hashed path is not special in this regard.

Potential Improvements (Not Proposed, But Implied)

While the thread does not propose a patch, the discussion implicitly raises the question of whether PostgreSQL should:

Summary

The observed behavior is a consequence of a violated operator contract, not a bug. The HASHES declaration carries an implicit but explicitly documented invariant that the operator must return a definite boolean for non-NULL inputs. PostgreSQL's hash-based evaluation paths (including hashed SAOP, hash joins, and hash indexes) all depend on this invariant. The linear and hashed SAOP paths will produce identical results for any operator that correctly satisfies its declared contract.