Make transformAExprIn() Return a Flattened Bool Expression Directly
Core Problem
In PostgreSQL's query transformation pipeline, the function transformAExprIn() (located in src/backend/parser/parse_expr.c) is responsible for converting SQL IN clauses into internal expression trees. When an IN list contains a mix of constant values and non-constant expressions (e.g., column references from other tables), the function currently produces a nested binary tree of OR expressions rather than a flat list.
For example, given:
SELECT * FROM t1, t2, t3 WHERE t1.id IN (1, 2, t2.id, t3.id, 3);
The current behavior produces a nested structure like:
OR(OR(ScalarArrayOp(t1.id, {1,2,3}), OpExpr(t1.id = t2.id)), OpExpr(t1.id = t3.id))
This is a right-deep (or left-deep) binary tree of BoolExpr OR nodes. PostgreSQL's internal BoolExpr node is designed to hold a flat list of arguments — makeBoolExpr() takes a List *args — but transformAExprIn() constructs the result by progressively wrapping previous results in new OR nodes, creating unnecessary nesting.
Why This Matters Architecturally
The Parse-Analyze → Planner Pipeline
PostgreSQL processes queries through distinct phases: parsing → analysis/transformation → rewriting → planning → execution. The transformation phase (where transformAExprIn() lives) produces the query tree that feeds into the planner. The planner's preprocess_qual_conditions() function does flatten nested OR/AND expressions via pull_ors()/pull_ands() in prepqual.c, so the final plan is identical regardless of whether the input is nested or flat.
However, producing the flat structure directly in the parser/analyzer has several advantages:
-
Reduced planner work: The planner doesn't need to traverse and rebuild the expression tree to flatten it. While this is not expensive for small IN lists, it represents unnecessary work that scales with the size of the IN clause.
-
Cleaner intermediate representation: Any code that inspects the raw parse tree (query rewrite rules, pg_stat_statements normalization, debug output, extensions using post-parse hooks) sees a canonical flat structure.
-
Principle of least surprise: BoolExpr nodes are semantically flat lists. Producing a binary tree of BoolExpr nodes during transformation is an implementation artifact, not intentional structure.
-
Consistency: Other parts of the parser already produce flat BoolExpr lists (e.g., when transforming explicit
a OR b OR csyntax). The IN-clause handling is an outlier.
Proposed Solution
V1 Patch (ChangAo Chen)
The original patch modifies transformAExprIn() to collect all the transformed sub-expressions (both the ScalarArrayOpExpr for constants and individual OpExpr nodes for non-constant elements) into a single flat list, then constructs a single BoolExpr OR node with all elements as direct children.
The key transformation changes from the iterative approach:
// Pseudo-code of current behavior:
result = NULL;
foreach(element in rexprs) {
cmp = make_op_expr(...);
if (result == NULL)
result = cmp;
else
result = makeBoolExpr(OR, list_make2(result, cmp));
}
// Combine with ScalarArrayOp for constants
if (result && have_constants)
result = makeBoolExpr(OR, list_make2(scalar_array_op, result));
To:
// Pseudo-code of proposed behavior:
flat_list = NIL;
if (have_constants)
flat_list = lappend(flat_list, scalar_array_op);
foreach(element in rexprs) {
cmp = make_op_expr(...);
flat_list = lappend(flat_list, cmp);
}
if (list_length(flat_list) > 1)
result = makeBoolExpr(OR, flat_list);
else
result = linitial(flat_list);
V2 Patch (Jian Guo)
The refined v2 patch keeps all core logic inside the existing foreach(l, rexprs) loop to minimize code intrusion and preserve the function's structural integrity. It also adds test cases covering:
- Single-element IN lists (no OR needed)
- Multi-element constant-only IN lists
- Multi-element with mixed constants and column references
- Row-expression IN lists (
(a, b) IN ((1, 2), (3, 4))) - Various combinations of the above
Technical Details
How transformAExprIn() Works
The function handles two distinct cases:
-
Scalar IN:
expr IN (val1, val2, ...)— splits the right-hand side into constants (grouped into aScalarArrayOpExprfor efficient evaluation) and non-constants (each becoming an individualOpExpr). -
Row IN:
(expr1, expr2) IN ((v1, v2), (v3, v4))— each tuple comparison becomes aRowCompareExpror conjunction ofOpExprnodes.
The patch primarily affects case 1 when there's a mix of constants and non-constants, which is when nesting occurs in the current code.
Performance Implications
The optimization is essentially zero-cost at parse time — collecting items into a list and building a single BoolExpr is actually simpler than the current approach of wrapping in nested BoolExprs. The savings come later:
pull_ors()in the planner won't need to recursively flatten the expression- Any tree-walking code (expression cost estimation, clause selectivity) processes a shorter tree
- Memory allocation is reduced (fewer intermediate BoolExpr nodes)
The savings are modest for typical queries but the patch represents a correctness-of-representation improvement that eliminates an unnecessary normalization step.
No Behavioral Change
As demonstrated by the thread's EXPLAIN output, the final execution plan is identical before and after the patch. The planner already handles both representations. This is purely an internal representation optimization that produces a more canonical form earlier in the pipeline.
Potential Concerns
-
Regression risk: Any code that assumes a specific nesting structure from
transformAExprIn()could break. However, since the planner already flattens these, downstream code should handle flat lists. -
View/rule storage: Since the raw parse tree is stored for views and rules, this changes what gets stored. However, the flat form is equally valid and actually preferable.
-
pg_dump compatibility: Expression deparsing should handle flat OR lists correctly since they're already produced by other code paths.