Wrong results with equality search using trigram index and non-deterministic collation

First seen: 2024-09-17 06:00:18+00:00 · Messages: 5 · Participants: 2

Latest Update

2026-05-06 · opus 4.7

Wrong Results with Trigram Index and Non-Deterministic Collations

Core Problem

This thread exposes a correctness bug at the intersection of two PostgreSQL features that were not designed with awareness of each other:

  1. Non-deterministic collations (added in PG12 via ICU), which allow strings with different byte sequences to compare as equal. In the reporter's example, the faux_cn collation declares l = r, so 'right' = 'light' must return true.
  2. The pg_trgm extension, which implements GIN indexes over trigrams (three-character substrings) for fast LIKE, similarity, and equality searches.

The bug manifests as a query returning different results depending on plan choice:

This is a fundamental invariant violation in PostgreSQL: an index must never change query semantics. As Laurenz Albe bluntly states, "An index is not allowed to change the semantics of a query." The optimizer is entitled to assume that any plan over any valid index yields the same result set as a seq scan with the relevant qual evaluated.

Why the Bug Exists

pg_trgm extracts trigrams from strings using byte-level (or at best, default-collation-aware) processing. Under a non-deterministic collation:

The equality operator = is collation-sensitive and declares these equal, but the GIN index's extractValue / extractQuery support functions produce different trigram sets that have no nontrivial overlap sufficient to recheck equality. The index recheck then filters based on... what the recheck qual does, which itself may or may not honor the collation correctly once the rows are fetched. The net effect: the index acts as a lossy filter that pre-discards rows the collation says are equal.

Additionally, internally pg_trgm compares trigrams with btint4cmp() (treating them as 32-bit integers built from 3 bytes + possibly a hash fallback). This is byte-wise and oblivious to collation.

Proposed Solutions and Their Tradeoffs

Four distinct approaches surface in the thread:

1. Refuse to use the index for equality with non-deterministic collations

Laurenz's first instinct. The planner would suppress index paths when the operator's collation is non-deterministic. Problem: It only patches equality; LIKE and % similarity would still silently give wrong answers under any planner that includes a referenced patch ([1]) which makes pg_trgm honor column collation for LIKE matching. Also, the optimizer has no particularly clean hook to say "this index is valid for some collations but not others" — Laurenz's second message asks exactly this question.

2. Refuse to CREATE the index on non-deterministic-collation columns

Geidav proposes this. Problems Laurenz identifies:

Laurenz counter-proposes downgrading to a WARNING at CREATE INDEX time (or even a broader warning that "pg_trgm does not use inferred column collations but always uses the default database collation").

3. Make trigram extraction collation-aware

The ambitious fix, which Geidav's PoC patch attempts. This builds on top of the patch series at the referenced thread (db087c3e-230e-4119-8a03-8b5d74956bc2@gmail.com), which teaches pg_trgm to use the inferred column collation during trigram extraction rather than always falling back to the database default.

The PoC replaces btint4cmp() — used to compare stored/query trigrams inside the GIN internals — with a collation-aware string comparison function.

4. Collation-aware hashing for trigrams

Geidav identifies a deep problem with approach 3: pg_trgm's compact_trigram() falls back to a 32-bit hash when the three consecutive characters exceed 3 bytes (i.e., for most multibyte text — which is precisely where non-deterministic collations matter most). A plain hash does not preserve collation equivalence classes: two characters that collate equal will hash differently.

The proposed remedy is a collation-aware hash akin to hashtext(), which already maps collation-equivalent values to the same hash. The GIN comparison function would then need to distinguish "plain" trigrams from "hash" trigrams (since they live in the same 32-bit slot today). Storing everything as hashes would break show_trgm() and other user-visible functions that display the actual trigrams.

Key Technical Insights

The "index must preserve semantics" invariant

The thread crystallizes why this is a genuine bug rather than a quirk. PostgreSQL's planner freely chooses between scan types based on cost, and users expect result equivalence. Any extension providing an index AM / opclass inherits this contract.

Non-deterministic collations interact poorly with lossy indexing

GIN trigram indexes are fundamentally lossy at extraction time: the index stores features (trigrams) that are a projection of the input. For that to be correct under equality, the feature extraction must be collation-invariant in the right direction: if a = b under the collation, extract(a) ⊇ recheck_set(b) in a way that the recheck can then confirm. Byte-wise trigram extraction does not satisfy this under collations that equate distinct byte sequences.

The hashed-trigram escape hatch is the real obstacle

Even if 3-ASCII-byte trigrams are handled, multibyte trigrams hash into a namespace with no equivalence-class structure. This is the core engineering problem blocking a clean fix, and is why Geidav's PoC only works for the simple ASCII example.

Planner API gap

Laurenz's question — "how can you tell the optimizer to consider a certain index only for certain collations?" — reveals that there is no fine-grained mechanism. Opclasses are either collation-sensitive or not; there is no "valid only for deterministic collations" flag. Adding one would be a modest but real planner/catalog change.

Participant Assessment

No committer has weighed in on the archived portion of the thread; this remains at the design-discussion stage.

Status and Path Forward

As of the last message, the situation is:

The sustainable fix almost certainly requires (a) the prerequisite patch that makes pg_trgm honor column collation, (b) a collation-aware hash for compacted trigrams with a tag bit distinguishing them from literal trigrams, and (c) a GIN comparator that dispatches appropriately. This is a meaningful but tractable amount of engineering in contrib/pg_trgm.