RFC: Boyer-Moore-Horspool optimization for LIKE '%pattern%' searches

First seen: 2026-05-26 09:54:21+00:00 · Messages: 1 · Participants: 1

Latest Update

2026-05-27 · claude-opus-4-6

RFC: Boyer-Moore-Horspool Optimization for LIKE '%pattern%' Searches

Technical Problem

PostgreSQL's LIKE '%pattern%' queries perform substring searches using a relatively naive character-by-character scanning approach within the pattern matching engine (src/backend/utils/adt/like.c and related files). When users write queries such as:

SELECT * FROM t WHERE col LIKE '%longsubstring%';

The internal matching logic iterates through the haystack without leveraging advanced substring search algorithms. Meanwhile, PostgreSQL's strpos() function (and the underlying text_position infrastructure in src/backend/utils/adt/varlena.c) already employs a Boyer-Moore-like algorithm for raw substring searches.

The core question is: why doesn't LIKE benefit from the same optimized substring search that strpos() already uses?

Why This Matters Architecturally

  1. LIKE is one of the most commonly used text predicates — it appears in virtually every application's query patterns. Even small constant-factor improvements compound across workloads.

  2. The gap between LIKE and strpos() is a known inconsistency — the optimizer can sometimes rewrite LIKE '%pat%' into index or strpos-based approaches, but this doesn't always happen (especially with non-C collations or when wildcards and single-character matchers _ appear internally).

  3. Pattern matching complexity — The LIKE engine handles % (match any sequence) and _ (match single character) with backtracking. When the pattern is simply %literal%, it degenerates to a pure substring search, but the general LIKE machinery doesn't currently detect and fast-path this case optimally.

Proposed Approach

The author proposes integrating Boyer-Moore-Horspool (BMH) into LIKE's substring matching for the common %literal% case. BMH is a simplification of Boyer-Moore that uses only the bad-character heuristic (skipping based on the last character mismatch), which:

Key Design Considerations

When BMH helps:

When BMH may hurt:

Collation/Encoding Complications:

Prior Art Within PostgreSQL

PostgreSQL's text_position_setup() / text_position_next() in varlena.c already implements a Boyer-Moore-inspired search for strpos() and position(). The relevant question is whether this infrastructure can be reused or adapted for the LIKE engine. The challenge is that like.c uses a character-at-a-time state machine approach (via LIKE_TRUE/LIKE_FALSE/LIKE_ABORT returns and the MatchText macro) that doesn't easily decompose into "find this literal substring" sub-problems without refactoring.

Open Technical Questions

  1. Pattern decomposition: Can LIKE patterns be analyzed at planning time to extract literal segments that could use BMH? (e.g., %foo%bar% → search for "foo", then search for "bar" after the match)

  2. Threshold selection: At what pattern length does BMH's preprocessing cost break even? The author's benchmarks suggest gains for longer patterns but doesn't quantify the crossover point.

  3. SIMD alternatives: Modern CPUs support SSE4.2 PCMPESTRI/PCMPESTRM instructions that can search for substrings in 16-byte chunks. These may outperform BMH for the common case without requiring preprocessing. The pg_lfind infrastructure shows PostgreSQL is willing to use SIMD in specific cases.

  4. Integration point: Should this be done in the LIKE operator itself, or should the optimizer more aggressively rewrite LIKE '%literal%' into strpos(col, 'literal') > 0 at plan time?

Assessment

This is an early-stage RFC from a newcomer seeking community guidance. No patch has been submitted. The thread represents a feasibility inquiry rather than a concrete proposal. The historical resistance to this optimization stems from:

For this to move forward, the community would likely want to see: (1) a concrete patch with regression tests, (2) benchmarks showing no regression for short patterns, (3) correct handling of multibyte encodings, and (4) a clear argument for why this is better than the optimizer rewriting to strpos() or users employing pg_trgm.