2026-05-27 · claude-opus-4-6

RFC: Boyer-Moore-Horspool Optimization for LIKE '%pattern%' Searches

Technical Problem

PostgreSQL's LIKE '%pattern%' queries perform substring searches using a relatively naive character-by-character scanning approach within the pattern matching engine (src/backend/utils/adt/like.c and related files). When users write queries such as:

SELECT * FROM t WHERE col LIKE '%longsubstring%';

The internal matching logic iterates through the haystack without leveraging advanced substring search algorithms. Meanwhile, PostgreSQL's strpos() function (and the underlying text_position infrastructure in src/backend/utils/adt/varlena.c) already employs a Boyer-Moore-like algorithm for raw substring searches.

The core question is: why doesn't LIKE benefit from the same optimized substring search that strpos() already uses?

Why This Matters Architecturally

LIKE is one of the most commonly used text predicates — it appears in virtually every application's query patterns. Even small constant-factor improvements compound across workloads.
The gap between LIKE and strpos() is a known inconsistency — the optimizer can sometimes rewrite LIKE '%pat%' into index or strpos-based approaches, but this doesn't always happen (especially with non-C collations or when wildcards and single-character matchers _ appear internally).
Pattern matching complexity — The LIKE engine handles % (match any sequence) and _ (match single character) with backtracking. When the pattern is simply %literal%, it degenerates to a pure substring search, but the general LIKE machinery doesn't currently detect and fast-path this case optimally.

Proposed Approach

The author proposes integrating Boyer-Moore-Horspool (BMH) into LIKE's substring matching for the common %literal% case. BMH is a simplification of Boyer-Moore that uses only the bad-character heuristic (skipping based on the last character mismatch), which:

Has O(n/m) best-case performance for text of length n and pattern of length m
Requires O(σ) preprocessing (where σ is alphabet size) to build the skip table
Is simpler to implement than full Boyer-Moore (no good-suffix rule)

Key Design Considerations

When BMH helps:

Longer literal patterns (the skip distance increases with pattern length)
Large alphabets (more distinct characters = better average skip)
Patterns without internal wildcards (% or _)

When BMH may hurt:

Very short patterns (1-3 characters) where preprocessing cost dominates
Small alphabets or pathological inputs (e.g., DNA sequences: ACGT)
Patterns with internal wildcards that prevent simple substring extraction

Collation/Encoding Complications:

BMH operates on bytes, but PostgreSQL's LIKE must respect collation semantics
Multi-byte encodings (UTF-8) complicate the skip table — you can't simply index by byte value without risking incorrect skips across character boundaries
Non-deterministic collations make byte-level comparison invalid

Prior Art Within PostgreSQL

PostgreSQL's text_position_setup() / text_position_next() in varlena.c already implements a Boyer-Moore-inspired search for strpos() and position(). The relevant question is whether this infrastructure can be reused or adapted for the LIKE engine. The challenge is that like.c uses a character-at-a-time state machine approach (via LIKE_TRUE/LIKE_FALSE/LIKE_ABORT returns and the MatchText macro) that doesn't easily decompose into "find this literal substring" sub-problems without refactoring.

Open Technical Questions

Pattern decomposition: Can LIKE patterns be analyzed at planning time to extract literal segments that could use BMH? (e.g., %foo%bar% → search for "foo", then search for "bar" after the match)
Threshold selection: At what pattern length does BMH's preprocessing cost break even? The author's benchmarks suggest gains for longer patterns but doesn't quantify the crossover point.
SIMD alternatives: Modern CPUs support SSE4.2 PCMPESTRI/PCMPESTRM instructions that can search for substrings in 16-byte chunks. These may outperform BMH for the common case without requiring preprocessing. The pg_lfind infrastructure shows PostgreSQL is willing to use SIMD in specific cases.
Integration point: Should this be done in the LIKE operator itself, or should the optimizer more aggressively rewrite LIKE '%literal%' into strpos(col, 'literal') > 0 at plan time?

Assessment

This is an early-stage RFC from a newcomer seeking community guidance. No patch has been submitted. The thread represents a feasibility inquiry rather than a concrete proposal. The historical resistance to this optimization stems from:

The complexity of correctly handling all LIKE semantics (escape characters, wildcards, collations)
The risk of regressions on short patterns that dominate real workloads
The existence of alternative approaches (GIN/GiST trigram indexes via pg_trgm, rewriting to strpos())
The relatively small payoff for the common case (most LIKE patterns are short)

For this to move forward, the community would likely want to see: (1) a concrete patch with regression tests, (2) benchmarks showing no regression for short patterns, (3) correct handling of multibyte encodings, and (4) a clear argument for why this is better than the optimizer rewriting to strpos() or users employing pg_trgm.

RFC: Boyer-Moore-Horspool optimization for LIKE '%pattern%' searches

Latest Update