Support LIKE with Nondeterministic Collations
Core Problem
PostgreSQL's nondeterministic collations (introduced by Peter Eisentraut) enable case-insensitive, accent-insensitive, and other Unicode-aware string comparisons through ICU. However, the LIKE operator was explicitly blocked for nondeterministic collations with a "not supported" error. The fundamental challenge is that nondeterministic collations break a key assumption of the traditional LIKE algorithm: that string equality implies equal length. With nondeterministic collations, 'foo' = '.foo.' can be true (under punctuation-ignoring collations), and U&'e\0301' = 'é' (NFD vs NFC forms). This makes character-by-character matching invalid.
Architectural Significance
LIKE is one of the most frequently used pattern-matching operators in SQL, and nondeterministic collations are PostgreSQL's mechanism for proper Unicode-aware text processing. The inability to combine them was a significant functional gap. Furthermore, the \d command in psql and internal catalog queries use LIKE patterns against system catalogs — any regression in LIKE optimization directly impacts the usability of the entire system.
Algorithm Design
Traditional LIKE (Deterministic)
The existing algorithm in like_match.c works character-by-character: _ advances one character in both pattern and text, % triggers scanning with an optimization that compares the first literal byte after % to avoid unnecessary recursion.
New Algorithm (Nondeterministic)
The SQL standard provides clear semantics: partition the pattern at wildcard characters, then verify that the input string can be partitioned into substrings where each fixed segment equals the corresponding pattern segment under the collation, _ matches exactly one character (codepoint), and % matches any sequence.
The implementation works by:
- When encountering a
%, extracting the next literal substring (up to the next wildcard or end of pattern) - Iterating through all possible lengths of the input text trying
pg_strncoll()to find a prefix that equals the pattern substring under the collation - Recursing for remaining pattern segments
This yields O(n²) worst-case complexity (acknowledged by Heikki Linnakangas), compared to O(n) for deterministic LIKE.
Key Optimization: End-of-Pattern Shortcut
Jian He identified that when the literal substring after % is the final element of the pattern (no more wildcards follow), the algorithm can compare suffixes of the remaining text against the pattern substring directly, avoiding full recursion. Peter Eisentraut implemented this as a "Shortcut: If this is the end of the pattern..." optimization.
The _ (Underscore) Semantic Debate
A significant design discussion arose around what _ means with nondeterministic collations. Two possible definitions:
- Single codepoint (chosen):
_advances exactly one Unicode codepoint in the input - Grapheme cluster / equivalent-to-single-character:
_matches any substring that is "equal to" some single character under the collation
The team chose definition (1) because:
- It aligns with the SQL standard's explicit definition
- Definition (2) would make
_effectively equivalent to%in some cases - Definition (2) would be computationally intractable (checking equality against all possible single characters)
- It's consistent with how all other PostgreSQL string functions (trim, overlay, position) define "character"
This creates counterintuitive results: U&'e\0301' LIKE 'é' is true but U&'e\0301' LIKE '_' is false, because the input is two codepoints while _ matches exactly one. Peter correctly noted this is a fundamental limitation of equating "character" with "codepoint" rather than "grapheme cluster."
Indexing Implications
The patch explicitly disables index optimization for LIKE with nondeterministic collations (match_pattern_prefix returns NIL). Daniel Verite noted that ICU's sort key prefix bounds could theoretically enable index scans for col LIKE 'prefix%' patterns by using U+FFFF as an upper bound. Jeff Davis raised a valid concern: this trick assumes a relationship between string prefixes and collation ordering that isn't guaranteed for arbitrary collation definitions (e.g., hash-based orderings). This optimization was left for future work.
Regression Impact
A post-commit regression was identified in May 2026 by Jelte Fennema-Nio: the change to match_pattern_prefix (which returns NIL for nondeterministic collations) inadvertently affected cases where indexes should still be usable, making psql \d and catalog queries significantly slower. This suggests the condition if (expr_coll && !get_collation_isdeterministic(expr_coll)) may be overly broad or triggered in unexpected contexts.
Bug Fixes During Development
- First-byte optimization bypass: The existing code had a shortcut comparing the first byte after
%to avoid recursion — this is invalid for nondeterministic collations where multi-codepoint sequences can match. Fixed by always recursing. - Backslash at pattern end: Jacob Champion's libfuzzer found that a trailing backslash caused
NextByte()to be called twice, wrappingp1lento INT_MAX and walking off the buffer. Fixed with an ERROR case. - CHECK_FOR_INTERRUPTS: Heikki noted that
repeat('x', 100000) LIKE '%xxxy%'was uninterruptible. Added interrupt checks in the inner loop. - LIKE_FALSE vs LIKE_ABORT: Peter noted ongoing uncertainty about the correct return value in recursive matching contexts — LIKE_ABORT propagates upward to stop further backtracking, while LIKE_FALSE allows the caller to try other partitions.
Collation Resolution
Jian He identified a missing test case: when columns have conflicting collations, the system correctly raises ERROR 42P22 ("could not determine which collation to use for LIKE") with a hint to use COLLATE explicitly. This is handled by GenericMatchText and is existing behavior that the patch preserves.