Improve the performance of Unicode Normalization Forms.

First seen: 2025-06-02 21:51:08+00:00 · Messages: 41 · Participants: 10

Latest Update

2026-05-06 · opus 4.7

Improving Unicode Normalization Performance in PostgreSQL

The Core Problem

PostgreSQL's normalize() function (and the underlying unicode_normalize() in src/common/unicode_norm.c) implements the four Unicode Normalization Forms (NFC, NFD, NFKC, NFKD). The implementation had two architectural issues accumulated since its introduction:

  1. Divergent frontend/backend code paths. Because libpq/libpgcommon is shared with client tools (e.g. SCRAM/SASLprep via saslprep.c), a perfect hash table was deemed too large for the frontend. So the frontend used bsearch() over a sorted array, while the backend used a perfect hash (generated via PerfectHash.pm, producing unicode_norm_hashfunc.h). This divergence meant the two paths were not equally exercised by tests, and added maintenance burden.
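The old frontend path can be sketched as follows. The entry layout and table contents here are invented for illustration, but the shape — a stored codepoint plus `bsearch()` over a sorted array — is the one the text describes:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry mirroring the sorted-array layout: the codepoint
 * must be stored in each entry so bsearch() can compare against it. */
typedef struct
{
	uint32_t	codepoint;
	uint8_t		comb_class;
} demo_entry;

/* A tiny sorted table standing in for the real decomposition table. */
static const demo_entry demo_table[] = {
	{0x0300, 230}, {0x0301, 230}, {0x0327, 202}, {0x05B0, 10}
};

static int
demo_cmp(const void *key, const void *elem)
{
	uint32_t	cp = *(const uint32_t *) key;
	const demo_entry *e = (const demo_entry *) elem;

	return (cp > e->codepoint) - (cp < e->codepoint);
}

/* O(log n) lookup with data-dependent branches: the shape of the
 * old frontend path. */
static const demo_entry *
demo_lookup(uint32_t cp)
{
	return bsearch(&cp, demo_table,
				   sizeof(demo_table) / sizeof(demo_table[0]),
				   sizeof(demo_table[0]), demo_cmp);
}
```

The log n comparisons branch unpredictably and touch scattered cache lines, which is why this loses to both the perfect hash and the sparse array on the hot per-codepoint path.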

  2. Performance of table lookups and the decomposition algorithm. The per-codepoint lookup was the hot path. Beyond lookup, decompose_code() used recursion at runtime, and unicode_normalize() allocated a uint32[] buffer for combining-class caching proportional to the decomposed length, which is wasteful for short strings and incurs palloc overhead.

Unicode normalization matters architecturally because it sits under SCRAM password handling (frontend and backend) and the normalize()/IS NORMALIZED SQL surface, and is a building block for any future form-insensitive comparison or indexing. It is also code that is executed character-by-character over potentially large strings, so constant factors dominate.

The Solution: Two-Level Sparse Array (Radix-Style) Lookup

Borisov's core design insight is that a range index / two-level sparse array beats both bsearch (log n branches, cache-unfriendly) and perfect hashing (one multiplication + table lookup, but larger tables and a required stored codepoint for verification). The data structure evolved over the thread:

Iteration 1 — Sparse array with generated branch code

A Perl module (initially Ranges.pm, later GenerateSparseArray.pm) emits the sparse-array tables together with a generated branching lookup function.

Crucially, the 32-bit stored codepoint field is removed from the per-entry struct. Perfect hashing had required storing the codepoint to verify collisions; eliminating it both shrinks the dense table and, more importantly, enables deduplication of identical entries (e.g. many codepoints decomposing identically or sharing CCC/flag combinations). The main decomposition table shrank from 6843 → 3957 entries; the decomp codepoint table from 5138 → 4931 uint32s.
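A toy illustration of why dropping the stored codepoint enables deduplication (all names and values below are hypothetical): once entries no longer carry their own codepoint, two codepoints with identical properties can share one dense-table slot, with slot 0 reserved as the "not found" dummy:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical dense table: U+0300 and U+0301 both carry combining
 * class 230, so both index entries point at the same slot. */
typedef struct
{
	uint8_t		comb_class;
	uint8_t		dec_size_flags;
} demo_entry;

static const demo_entry demo_dense[] = {
	{0, 0},						/* slot 0 doubles as the "not found" dummy */
	{230, 0},					/* shared by U+0300 and U+0301 */
	{202, 0},					/* U+0327 */
};

/* Index for the toy range U+0300..U+0327; unset slots default to 0. */
static const uint8_t demo_index[0x28] = {
	[0x00] = 1, [0x01] = 1, [0x27] = 2
};

static uint8_t
demo_ccc(uint32_t cp)
{
	if (cp < 0x300 || cp > 0x327)
		return 0;
	return demo_dense[demo_index[cp - 0x300]].comb_class;
}
```

With a stored codepoint in each entry, the two slots for U+0300 and U+0301 could never have been merged, since the entries would differ in that field.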

Iteration 2 — Replace branches with a second offset table (true radix tree)

After Heikki Linnakangas's review, Borisov replaced the branching lookup function with a pure table lookup:

```c
return table_index[table_offset[cp >> 6] + (cp & 63)];
```

This is O(1) with a single bounds check. As Heikki noted, "Sounds like you re-invented the radix tree." It is in fact a classic 2-level trie keyed by the high and low bits of the codepoint, the same structure used by CPython, ICU, and Tcl for Unicode property lookup. The structural equivalence to src/common/wchar.c's encoding radix trees was also observed, though the lookup here is tighter (no per-level variable stride).
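The two-load lookup can be sketched end-to-end with toy data (names, sizes, and table contents below are illustrative, not the real generated tables — in the patch, the tables are emitted by the Perl generator):

```c
#include <assert.h>
#include <stdint.h>

/* Chunks of 64 codepoints, matching the cp >> 6 / cp & 63 split. */
enum
{
	CHUNK_BITS = 6,
	CHUNK_SIZE = 1 << CHUNK_BITS,
	MAX_CP = 0x400				/* toy ceiling; the real one covers all of Unicode */
};

/* First level: one offset per 64-codepoint chunk.  Chunks with no
 * interesting codepoints all share the zero-filled run at offset 0. */
static uint16_t table_offset[MAX_CP >> CHUNK_BITS];
static uint8_t table_index[2 * CHUNK_SIZE];

static void
demo_init(void)
{
	/* Run at offset CHUNK_SIZE: toy combining classes for U+0300..U+033F. */
	for (int i = 0; i < CHUNK_SIZE; i++)
		table_index[CHUNK_SIZE + i] = 230;
	table_index[CHUNK_SIZE + 0x27] = 202;	/* U+0327 */

	/* Only the chunk containing U+0300 points at the populated run. */
	table_offset[0x300 >> CHUNK_BITS] = CHUNK_SIZE;
}

/* O(1): one bounds check, then two dependent loads. */
static uint8_t
demo_ccc(uint32_t cp)
{
	if (cp >= MAX_CP)
		return 0;
	return table_index[table_offset[cp >> CHUNK_BITS] + (cp & (CHUNK_SIZE - 1))];
}
```

The sharing of the zero run among empty chunks is the same deduplication trick that removing the stored codepoint enabled for the dense entry table.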

Iteration 3 — Algorithmic cleanup of unicode_normalize()

Beyond lookup, v4+ rewrote the normalization driver itself, removing the runtime recursion in decompose_code() and (in v5) replacing the palloc'd combining-class buffer with a stack-based cache.
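One plausible shape for the stack-based combining-class cache — a sketch, not the patch's actual code; names are hypothetical, `malloc()` stands in for palloc(), and the reordering pass is a simplified UAX #15-style bubble over combining classes:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CCC_STACK_SIZE 64		/* covers typical short strings */

/* Cache one combining class per decomposed codepoint on the stack,
 * spilling to the heap only for unusually long inputs; returns the
 * number of adjacent swaps canonical ordering needed, or -1 on OOM. */
static int
demo_reorder(const uint8_t *ccc_in, size_t len)
{
	uint8_t		ccc_stack[CCC_STACK_SIZE];
	uint8_t    *ccc = ccc_stack;
	int			swaps = 0;

	if (len > CCC_STACK_SIZE)
	{
		ccc = malloc(len);
		if (ccc == NULL)
			return -1;
	}
	memcpy(ccc, ccc_in, len);

	/* Bubble pass: nonzero combining classes become nondecreasing,
	 * never reordering across a starter (class 0). */
	for (int done = 0; !done;)
	{
		done = 1;
		for (size_t i = 1; i < len; i++)
		{
			if (ccc[i] != 0 && ccc[i - 1] > ccc[i])
			{
				uint8_t		tmp = ccc[i];

				ccc[i] = ccc[i - 1];
				ccc[i - 1] = tmp;
				swaps++;
				done = 0;
			}
		}
	}

	if (ccc != ccc_stack)
		free(ccc);
	return swaps;
}
```

The point of the pattern is that the common case (a short string) performs zero allocations, which is exactly where the old proportional-length palloc hurt most.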

Performance Results

Three measurement frameworks were used (pgbench driving normalize(), Jeff Davis's ICU-derived 8–9MB corpus, and the historical md5/repeat workload from the original normalization thread). Consolidated numbers across iterations:

| Stage | NFD vs master | NFC vs master | vs ICU |
|---|---|---|---|
| Lookup-only patch (v3) | ~1.7× | ~1.6× | PG still 5–10× slower |
| + algorithm rewrite (v4) | ~5.7× | ~4.9× | PG ≈ ICU on NFD, ~2.5× slower on NFC |
| + stack CCC cache (v5) | ~2× further | ~2× further | comparable to ICU for decomposition |

Jeff Davis's ICU comparison was significant because it reframed the goal: rather than just "faster than before," the aim became "fastest open-source Unicode normalizer." On decomposition (NFD/NFKD) the patched PG slightly beats ICU. On composition (NFC/NFKC) ICU is still ~2× faster — Borisov attributes part of this to ICU being given a caller-sized output buffer while PG pre-scans to compute output length, but the gap is real and a future optimization target.

Key Design Tensions

Frontend binary size vs. code unification

John Naylor initially asked whether frontend and backend could share a single path — good for test coverage. The "small table" variant adds ~7KB to unicode_norm.o and ~1% to libpq, which Naylor judged acceptable "especially considering unicode_norm_srv.o is smaller by 27kB." This was pivotal: it authorized deleting unicode_norm_hashfunc.h (the 3025-line perfect-hash header) and removing the fork in the code, a structural simplification beyond pure performance.

Sentinel value semantics (unresolved)

Jeff Davis flagged that index 0 is overloaded: it means both "not found" and "valid zeroth entry," forcing callers to reserve slot 0 as a dummy. He suggested #define EMPTY 0xFFFF. Borisov was non-committal, arguing "0 is easier to understand." The reviewer assam258 flagged this as the one open design issue — not a bug today, but a foot-gun if new callers are added.
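A minimal sketch of the suggested fix (table contents invented): with a distinct EMPTY sentinel, index 0 becomes a legitimate entry, so no dummy slot has to be reserved:

```c
#include <assert.h>
#include <stdint.h>

/* Jeff Davis's suggestion, sketched: a reserved out-of-band value
 * marks "not found", freeing slot 0 for real data. */
#define EMPTY 0xFFFF

/* Toy index: codepoints 0 and 2 have entries; 1 and 3 do not. */
static const uint16_t demo_index[] = {0, EMPTY, 1, EMPTY};

static int
demo_found(uint32_t cp)
{
	return cp < sizeof(demo_index) / sizeof(demo_index[0]) &&
		demo_index[cp] != EMPTY;
}
```

The cost is one more comparison against a magic constant; the benefit is that callers can no longer confuse "entry at index 0" with "no entry."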

Lookup API shape

Heikki pushed for the generator to emit a lookup(x) function rather than expose the tables, mirroring PerfectHash.pm. This encapsulates the "it's a radix tree" implementation detail from callers. Borisov adopted this in the v2 algorithm restructure.

Packing tricks declined

Heikki suggested packing comb_class (56 distinct values) and dec_size_flags (18 distinct values) into a single byte via a secondary lookup, shrinking pg_unicode_decomposition from 4 → 3 bytes. Borisov tested this and found the auxiliary index almost as large as the savings, because CCC values are scattered. He kept the straightforward 4-byte struct — a judgment call favoring simplicity over 25% table shrinkage.
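The declined idea, sketched with invented pairs: each entry stores one byte indexing a secondary table of distinct (comb_class, dec_size_flags) combinations, shrinking the entry by a byte at the cost of that secondary table — which, per Borisov's measurement, ends up nearly as large as the savings because the CCC values are scattered:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical secondary table of distinct property pairs. */
typedef struct
{
	uint8_t		comb_class;
	uint8_t		dec_size_flags;
} demo_pair;

static const demo_pair demo_pairs[] = {
	{0, 0}, {230, 0}, {202, 0}, {0, 4}
};

/* An entry would store only a one-byte pair index; accessors resolve
 * it through the pair table. */
static uint8_t
demo_ccc_of(uint8_t pair_idx)
{
	return demo_pairs[pair_idx].comb_class;
}

static uint8_t
demo_flags_of(uint8_t pair_idx)
{
	return demo_pairs[pair_idx].dec_size_flags;
}
```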

Related Ideas Raised But Deferred

Nico Williams contributed a notable architectural digression about ZFS's form-insensitive / form-preserving approach: normalize only inside comparison and hashing functions, with an ASCII fast-path cursor that only invokes the slow path when either the current or next byte is non-ASCII, and that only normalizes one character at a time into a small stack buffer. Jeff Davis responded that PostgreSQL is already form-preserving (no auto-normalization), and non-deterministic ICU collations already provide form-insensitive comparison (with "full normalization" handling many-combining-mark cases), though not available at the database level yet. This is adjacent work, not blocking this patch.
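Nico's cursor idea can be sketched as follows (all names are hypothetical, and `demo_slow_eq` is a stub standing in for the real one-character-at-a-time normalizing comparison): stay on a byte-by-byte fast path while both the current and the next byte are ASCII, since an ASCII character can only participate in normalization when followed by a non-ASCII combining mark:

```c
#include <assert.h>
#include <string.h>

/* Placeholder for "normalize one character into a small stack buffer
 * and compare"; a real implementation would normalize here. */
static int
demo_slow_eq(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}

/* ZFS-style form-insensitive equality with an ASCII fast path. */
static int
demo_form_insensitive_eq(const char *a, const char *b)
{
	while (*a && *b)
	{
		/* The *next* byte matters too: an ASCII char followed by a
		 * non-ASCII byte may still compose under normalization. */
		if ((a[0] & 0x80) || (b[0] & 0x80) ||
			(a[1] & 0x80) || (b[1] & 0x80))
			return demo_slow_eq(a, b);
		if (*a != *b)
			return 0;
		a++;
		b++;
	}
	return *a == *b;
}
```

For mostly-ASCII workloads this never touches normalization tables at all, which is the property that makes the approach attractive for comparison and hashing call sites.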

Jeff also suggested operating directly on UTF-8 without round-tripping through codepoints and returning the input buffer unchanged when no normalization is needed — optimizations consistent with the ZFS approach and explicitly left as future work.

Participant Dynamics

Implications

If committed, this:

  1. Deletes ~3000 lines of generated perfect-hash code and unifies the frontend/backend paths, simplifying future Unicode upgrades.
  2. Establishes GenerateSparseArray.pm as a reusable radix-tree emitter for other Unicode lookups (case mapping, properties) — a generalizable infrastructure piece.
  3. Brings PostgreSQL's normalization performance within striking distance of ICU without taking a runtime dependency, preserving the ability to run SASLprep in the frontend without linking ICU.
  4. Sets up the follow-on work Jeff and Nico gestured toward: UTF-8-native operation, input-unchanged fast path, and eventual form-insensitive comparison primitives.

The patch was not committed for v19 (frozen March 2026) and is targeted at v20.