[PATCH] Add contrib/anyarray: intarray-style operations and indexes for any array type

First seen: 2026-05-18 10:17:39+00:00 · Messages: 1 · Participants: 1

Latest Update

2026-05-20 · claude-opus-4-6

Technical Analysis: contrib/anyarray — Generalizing intarray to Any Array Type

Core Problem

PostgreSQL's contrib/intarray module has long been one of the most popular extensions for users who need set-style operations and boolean query matching on arrays of integers. However, it is hardcoded to int4[], leaving users of bigint[], uuid[], text[], and other array types without equivalent functionality. Users must either:

  1. Cast their data to int4[] (lossy, impossible for non-numeric types)
  2. Write application-level filtering (no index support)
  3. Maintain third-party forks that duplicate intarray's logic per type

This patch proposes a new contrib/anyarray module that generalizes all of intarray's capabilities to any element type with a default btree opclass, using PostgreSQL's polymorphic type system and type cache infrastructure.

Architectural Design

Module Structure (4,657 lines added across 21 files)

The implementation is cleanly layered into five C source files:

File Responsibility
anyarray.c Core infrastructure: type metadata caching via fn_extra, generic datum comparator
anyarray_op.c Set-style operations: sort, uniq, idx, subarray, intersect, union, difference
anyarray_bool.c The anyquery type: parser, evaluator, I/O functions
anyarray_gin.c GIN opclasses with per-concrete-type wrappers
anyarray_gist.c Polymorphic GiST opclass using signature-based bit vectors

Type Metadata Caching Pattern

The central architectural decision is how to discover and cache element-type-specific operations at runtime. The patch uses fcinfo->flinfo->fn_extra to store an AnyArrayTypeInfo struct that holds:

This is looked up via lookup_type_cache() on first call and reused across subsequent invocations of the same function at the same call site. The design correctly handles the case where element types change between calls (e.g., in a polymorphic function called with different array types), by checking meta->element_type == element_type and refreshing if needed.

The anyquery Type — Lazy-Parsing Boolean Queries

One of the most interesting design decisions is the anyquery type. Unlike intarray's query_int which stores integer leaves, anyquery stores text tokens that are parsed lazily through the element type's input function at @@-evaluation time. This enables:

-- Same anyquery value works against different element types
SELECT arr @@ '1 & 2'::anyquery FROM int8_table;
SELECT arr @@ 'apple & banana'::anyquery FROM text_table;

The on-disk format is a postfix (reverse Polish notation) item array plus a string heap:

[varlena header][size][str_off][str_len][items...][padding][strings...]

Each item is either a VAL (with payload = offset into string heap) or an OPR (AND/OR/NOT). The left field on binary operators stores the relative offset to the left operand in the postfix array, enabling tree traversal without recursion depth proportional to expression size (though the evaluator still recurses).

Parser: A recursive-descent parser with standard precedence (NOT > AND > OR) and parenthesized sub-expressions. Includes check_stack_depth() for safety against deeply nested inputs. Uses the soft-error (errsave/ereturn) pattern for input validation, which is modern PostgreSQL style.

Evaluator: Binary search on a sorted copy of the array elements. Parsed datum values are cached per VAL item to avoid re-parsing the same token when evaluating OR branches.

GiST Signature Index

The GiST opclass (anyarray_gist_ops) uses the classic signature-based approach from intarray's gist__intbig_ops:

  1. Compress: Hash each array element into a fixed-size bit vector (default 252 bytes = 2016 bits)
  2. Union: Bitwise OR of children's signatures
  3. ALLISTRUE optimization: When all bits are set, store only a flag (no bit vector)
  4. Consistent: For containment/overlap queries, check whether the query elements' bits are set in the key's signature

Key design choices:

The GiST approach is fully polymorphic because the hash function is discovered dynamically from the type cache. This is a significant advantage over GIN.

GIN Opclasses — The fn_expr Limitation

The GIN implementation reveals an important PostgreSQL internals limitation: GIN's extractQuery support function does not receive fn_expr, so when the query datum is an anyquery (which carries only text tokens), the element type needed to parse those tokens cannot be discovered dynamically.

The workaround is to provide one GIN opclass per concrete element type:

Each is a thin C wrapper that hard-codes the element OID and delegates to shared logic. This is pragmatic but doesn't scale — users with other element types must add their own wrappers or rely solely on GiST.

For the standard array operators (strategies 1-4: overlap, contains, contained-by, equal), the GIN opclass mirrors core's array_ops behavior. Strategy 5 (@@) extracts all VAL tokens from the anyquery, parses them via the element type's input function, and uses them as GIN keys.

The consistent function for @@ maps each VAL in postfix order to its corresponding check[] entry and recursively evaluates the boolean tree. For queries like !foo that have no required values, GIN_SEARCH_MODE_ALL is used (full index scan).

Technical Concerns and Potential Issues

1. Operator Name Conflicts

The patch defines operators #, &, |, - on anyarray/anyelement. Since these are polymorphic types, there's a real risk of ambiguity with other extensions or even future core additions. The intarray precedent helps, but the broader applicability here (any array type) increases collision risk.

2. Extension Not Marked Trusted

The extension is trusted = false because it defines internal storage types for the GiST key. This limits adoption in hosted/managed PostgreSQL environments where users lack superuser access.

3. NULL and Multi-Dimensional Array Restrictions

All operations reject NULLs and multi-dimensional arrays via ANYARRAY_CHECK_ARRAY. This mirrors intarray but may surprise users who expect graceful handling of NULLs.

4. GIN Scalability

The per-type GIN wrapper approach requires N opclasses for N element types. This is a known limitation documented in the patch but represents a maintenance burden. A potential future fix would require changes to the GIN AM itself to propagate fn_expr or indexed-column type information into extractQuery.

5. Memory Management in Boolean Evaluation

The do_boolop function deconstructs the array, sorts it, then parses each query token on demand. For large arrays and complex queries, this could be expensive. The parsed-datum cache (ctx.parsed[]) mitigates repeated parsing but doesn't address the O(n log n) sort cost per row in sequential scans.

6. GiST picksplit Quality

The Hamming-weight-based picksplit is acknowledged as non-optimal. The comment notes that "fancier Guttman-style splits used by intbig can be added later." For production use with high-cardinality arrays, this could lead to suboptimal tree structure and more false positives.

7. Included regression.diffs and regression.out

The patch accidentally includes src/test/regress/regression.diffs (empty file) and src/test/regress/regression.out (containing a local test failure message). These should be removed before any commit consideration.

Comparison with Existing Approaches

Feature intarray Built-in GIN array_ops anyarray (this patch)
Element types int4 only Any Any with btree opclass
Boolean queries query_int (@@) No anyquery (@@)
GiST index Yes (signature) No Yes (signature, polymorphic)
GIN index Yes (int4-specific) Yes (no @@) Yes (per-type, adds @@)
Set operations Yes (int4) No Yes (polymorphic)

Assessment

This is a substantial, well-structured first submission that addresses a genuine gap in PostgreSQL's extension ecosystem. The code follows established PostgreSQL contrib conventions (Makefile + meson.build, pgindent typedefs, SGML documentation, regression tests with cross-AM consistency checks). The architectural decisions are sound — particularly the lazy-parsing anyquery design and the polymorphic GiST approach.

However, as a first posting with no community responses yet, it faces the typical challenges for new contrib modules: justifying inclusion in core contrib vs. recommending it as an external extension on PGXN, addressing the operator namespace pollution concerns, and potentially requiring changes to the GIN AM to eliminate the per-type wrapper limitation.