Technical Analysis: LIKE Fails to Match Literal Backslashes with Nondeterministic Collations
Core Problem
PostgreSQL's LIKE operator produces incorrect results when matching patterns containing literal backslashes (\\) against columns using nondeterministic collations. This is a correctness bug — the fundamental contract of pattern matching is violated.
Background: Nondeterministic Collations and LIKE
Nondeterministic collations (introduced in PostgreSQL 12) allow comparisons that are not purely byte-by-byte — for example, case-insensitive or accent-insensitive matching via ICU. When LIKE operates under such collations, it cannot use the standard byte-comparison fast path in like_match.c. Instead, it must transform the pattern and delegate substring comparisons to the collation-aware comparison infrastructure.
The support for LIKE with nondeterministic collations was added in commit 85b7efa1cd. That commit introduced a code path that pre-processes the LIKE pattern, extracting literal segments between wildcards (%, _) and comparing them using the collation's comparison function rather than raw byte equality.
The Bug
In the pattern pre-processing logic, when the code encounters a backslash (\), it's supposed to treat the next character as a literal (the standard SQL escape mechanism). The pattern \\ means "a literal backslash character." However, the buggy code was incorrectly escaping (consuming) backslashes without properly including the escaped character in the output buffer. This meant that a pattern like 'foo\\bar' (matching the literal string foo\bar) would fail to match correctly because the literal backslash was being dropped or mishandled during pattern transformation.
The bug has existed since the original nondeterministic collation LIKE support was introduced, meaning all versions with this feature have been affected.
Proposed Solutions
v1: Direct Character Inclusion (Initial Patch)
Nitin Motiani's first patch took the minimal approach: when encountering a literal \ in the internal pattern (represented as \\), always include the character after the backslash in the final buffer. The patch relied on the fact that end-of-string checking was already handled in an earlier code block that processes escape characters.
Criticism: Zsolt Parragi pointed out that this approach is fragile — it relies on implicit ordering guarantees from the outer loop without documenting them. The correctness depends on non-local reasoning about the control flow.
v2: afterescape Flag Pattern (Revised Patch)
The revised approach mirrors the existing do_like_escape function in the same file (like_match.c), which uses an explicit afterescape boolean flag. This is a well-established pattern within PostgreSQL's LIKE implementation:
- When a
\is encountered, setafterescape = trueand continue to the next iteration - On the next iteration, if
afterescapeis true, treat the current character as a literal regardless of what it is, then reset the flag
This makes the state machine explicit and self-documenting.
Multibyte Encoding Concern
The v2 patch also addresses a subtle correctness issue with multibyte encodings. The original byte-by-byte copy could malfunction if an encoding uses \ (0x5C) as a non-first byte in a multi-byte character sequence. The fix uses NextChar to properly advance past the full multi-byte character and then copies all bytes of that character.
The author noted that CopyAdvChar (a macro that combines advancing and copying) exists in do_like_escape but isn't defined for all code path variants (the file uses macro-based code generation for different encoding scenarios). Rather than defining new macro variants, the patch uses NextChar followed by an explicit memcpy of the character's bytes.
Architectural Implications
This bug highlights a recurring challenge in PostgreSQL's collation infrastructure: when new collation semantics are added, all pattern-matching code paths must be carefully audited. The LIKE implementation uses preprocessor-based code generation (like_match.c is included multiple times with different macro definitions) which makes it particularly easy to miss edge cases in one variant.
The fix is confined to the nondeterministic collation code path and does not affect the standard deterministic LIKE, which handles backslash escaping correctly through a different mechanism.
Open Questions
- Whether to define
CopyAdvCharfor the nondeterministic collation case to maintain consistency withdo_like_escape - Whether additional edge cases exist in the nondeterministic LIKE path (this bug's long lifespan suggests insufficient test coverage for this code path)