RFC: Boyer-Moore-Horspool Optimization for LIKE '%pattern%' Searches
Technical Problem
PostgreSQL's LIKE '%pattern%' queries perform substring searches using a relatively naive character-by-character scanning approach within the pattern matching engine (src/backend/utils/adt/like.c and related files). When users write queries such as:
SELECT * FROM t WHERE col LIKE '%longsubstring%';
The internal matching logic iterates through the haystack without leveraging advanced substring search algorithms. Meanwhile, PostgreSQL's strpos() function (and the underlying text_position infrastructure in src/backend/utils/adt/varlena.c) already employs a Boyer-Moore-like algorithm for raw substring searches.
The core question is: why doesn't LIKE benefit from the same optimized substring search that strpos() already uses?
Why This Matters Architecturally
-
LIKE is one of the most commonly used text predicates — it appears in virtually every application's query patterns. Even small constant-factor improvements compound across workloads.
-
The gap between LIKE and strpos() is a known inconsistency — the optimizer can sometimes rewrite
LIKE '%pat%'into index or strpos-based approaches, but this doesn't always happen (especially with non-C collations or when wildcards and single-character matchers_appear internally). -
Pattern matching complexity — The LIKE engine handles
%(match any sequence) and_(match single character) with backtracking. When the pattern is simply%literal%, it degenerates to a pure substring search, but the general LIKE machinery doesn't currently detect and fast-path this case optimally.
Proposed Approach
The author proposes integrating Boyer-Moore-Horspool (BMH) into LIKE's substring matching for the common %literal% case. BMH is a simplification of Boyer-Moore that uses only the bad-character heuristic (skipping based on the last character mismatch), which:
- Has O(n/m) best-case performance for text of length n and pattern of length m
- Requires O(σ) preprocessing (where σ is alphabet size) to build the skip table
- Is simpler to implement than full Boyer-Moore (no good-suffix rule)
Key Design Considerations
When BMH helps:
- Longer literal patterns (the skip distance increases with pattern length)
- Large alphabets (more distinct characters = better average skip)
- Patterns without internal wildcards (
%or_)
When BMH may hurt:
- Very short patterns (1-3 characters) where preprocessing cost dominates
- Small alphabets or pathological inputs (e.g., DNA sequences: ACGT)
- Patterns with internal wildcards that prevent simple substring extraction
Collation/Encoding Complications:
- BMH operates on bytes, but PostgreSQL's LIKE must respect collation semantics
- Multi-byte encodings (UTF-8) complicate the skip table — you can't simply index by byte value without risking incorrect skips across character boundaries
- Non-deterministic collations make byte-level comparison invalid
Prior Art Within PostgreSQL
PostgreSQL's text_position_setup() / text_position_next() in varlena.c already implements a Boyer-Moore-inspired search for strpos() and position(). The relevant question is whether this infrastructure can be reused or adapted for the LIKE engine. The challenge is that like.c uses a character-at-a-time state machine approach (via LIKE_TRUE/LIKE_FALSE/LIKE_ABORT returns and the MatchText macro) that doesn't easily decompose into "find this literal substring" sub-problems without refactoring.
Open Technical Questions
-
Pattern decomposition: Can LIKE patterns be analyzed at planning time to extract literal segments that could use BMH? (e.g.,
%foo%bar%→ search for "foo", then search for "bar" after the match) -
Threshold selection: At what pattern length does BMH's preprocessing cost break even? The author's benchmarks suggest gains for longer patterns but doesn't quantify the crossover point.
-
SIMD alternatives: Modern CPUs support SSE4.2
PCMPESTRI/PCMPESTRMinstructions that can search for substrings in 16-byte chunks. These may outperform BMH for the common case without requiring preprocessing. Thepg_lfindinfrastructure shows PostgreSQL is willing to use SIMD in specific cases. -
Integration point: Should this be done in the LIKE operator itself, or should the optimizer more aggressively rewrite
LIKE '%literal%'intostrpos(col, 'literal') > 0at plan time?
Assessment
This is an early-stage RFC from a newcomer seeking community guidance. No patch has been submitted. The thread represents a feasibility inquiry rather than a concrete proposal. The historical resistance to this optimization stems from:
- The complexity of correctly handling all LIKE semantics (escape characters, wildcards, collations)
- The risk of regressions on short patterns that dominate real workloads
- The existence of alternative approaches (GIN/GiST trigram indexes via
pg_trgm, rewriting tostrpos()) - The relatively small payoff for the common case (most LIKE patterns are short)
For this to move forward, the community would likely want to see: (1) a concrete patch with regression tests, (2) benchmarks showing no regression for short patterns, (3) correct handling of multibyte encodings, and (4) a clear argument for why this is better than the optimizer rewriting to strpos() or users employing pg_trgm.