Improving Unicode Normalization Performance in PostgreSQL
The Core Problem
PostgreSQL's `normalize()` function (and the underlying `unicode_normalize()` in `src/common/unicode_norm.c`) implements the four Unicode normalization forms (NFC, NFD, NFKC, NFKD). The implementation had two architectural issues that had accumulated since its introduction:
- Divergent frontend/backend code paths. Because `libpq`/`libpgcommon` is shared with client tools (e.g. SCRAM/SASLprep via `saslprep.c`), a perfect hash table was deemed too large for the frontend. So the frontend used `bsearch()` over a sorted array, while the backend used a perfect hash (generated via `PerfectHash.pm`, producing `unicode_norm_hashfunc.h`). This divergence meant the two paths were not equally exercised by tests, and it added maintenance burden.
- Performance of table lookups and the decomposition algorithm. The per-codepoint lookup was the hot path. Beyond lookup, `decompose_code()` used recursion at runtime, and `unicode_normalize()` allocated a `uint32[]` buffer for combining-class caching proportional to the decomposed length, which is wasteful for short strings and incurs `palloc` overhead.
Unicode normalization matters architecturally because it sits under SCRAM password handling (frontend and backend), the `normalize()`/`IS NORMALIZED` SQL surface, and is a building block for any future form-insensitive comparison or indexing. It is also code that executes character-by-character over potentially large strings, so constant factors dominate.
The Solution: Two-Level Sparse Array (Radix-Style) Lookup
Borisov's core design insight is that a range index / two-level sparse array beats both bsearch (log n branches, cache-unfriendly) and perfect hashing (one multiplication + table lookup, but larger tables and a required stored codepoint for verification). The data structure evolved over the thread:
Iteration 1 — Sparse array with generated branch code
A Perl module (initially `Ranges.pm`, later `GenerateSparseArray.pm`) emits:
- A compact `uint16` sparse array indexing into a dense entry array.
- Generated C code that performs a cascade of range comparisons to map a codepoint to a sparse-array offset.
Crucially, the 32-bit stored-codepoint field is removed from the per-entry struct. Perfect hashing had required storing the codepoint to verify hash hits; eliminating it both shrinks the dense table and, more importantly, enables deduplication of identical entries (e.g. many codepoints decompose identically or share CCC/flag combinations). The main decomposition table shrank from 6843 → 3957 entries; the decomposition codepoint table from 5138 → 4931 `uint32`s.
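To make the deduplication concrete, here is a minimal sketch, with an entirely hypothetical index function and table contents (the real generated tables are far larger): once no codepoint is stored per entry, two codepoints whose properties coincide can map to the same dense slot.

```c
#include <stdint.h>

/*
 * Toy 4-byte entry in the spirit of pg_unicode_decomposition; the
 * field names and contents here are illustrative only.
 */
typedef struct
{
    uint8_t  comb_class;      /* canonical combining class */
    uint8_t  dec_size_flags;  /* decomposition length + flags */
    uint16_t dec_index;       /* offset into the codepoint table */
} toy_entry;

static const toy_entry toy_dense[] = {
    {0, 0, 0},                /* slot 0: dummy / "no data" */
    {230, 0, 0},              /* slot 1: shared by several codepoints */
};

/*
 * Hypothetical sparse index: U+0300 and U+0301 (both CCC 230, no
 * decomposition in this toy model) share the same dense slot.
 */
static uint16_t
toy_entry_index(uint32_t cp)
{
    if (cp == 0x0300 || cp == 0x0301)
        return 1;
    return 0;
}
```

With a stored codepoint in each entry, these two slots could not have been merged, since the entries would differ in that field.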
Iteration 2 — Replace branches with a second offset table (true radix tree)
After Heikki Linnakangas's review, Borisov replaced the branching lookup function with a pure table lookup:
```c
return table_index[table_offset[cp >> 6] + (cp & 63)];
```
This is O(1) with a single bounds check. As Heikki noted, "Sounds like you re-invented the radix tree." It is in fact a classic 2-level trie keyed by the high and low bits of the codepoint, the same structure used by CPython, ICU, and Tcl for Unicode property lookup. The structural equivalence to src/common/wchar.c's encoding radix trees was also observed, though the lookup here is tighter (no per-level variable stride).
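A minimal self-contained sketch of this two-level lookup, using tiny hypothetical tables (the real generated arrays in `unicode_norm.c` cover the full codepoint range): the high bits select a 64-entry leaf via the offset table, and the low 6 bits index within it; chunks with no data can all share leaf 0.

```c
#include <stdint.h>

/* Dense leaf storage: leaf 0 (indexes 0-63) is all "not found". */
static const uint16_t toy_index[128] = {
    [64 + ('A' & 63)] = 42,   /* hypothetical: 'A' maps to entry 42 */
};

/* One offset per 64-codepoint chunk of the supported range. */
static const uint16_t toy_offset[2] = {0, 64};

static uint16_t
toy_lookup(uint32_t cp)
{
    if (cp >= 2 * 64)          /* single bounds check */
        return 0;
    return toy_index[toy_offset[cp >> 6] + (cp & 63)];
}
```

The hot path is two dependent loads and no data-dependent branches, which is what makes it beat both `bsearch()` (log n unpredictable branches) and a perfect hash (multiply, load, then a codepoint comparison to verify).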
Iteration 3 — Algorithmic cleanup of unicode_normalize()
Beyond lookup, v4+ rewrote the normalization driver:
- Removed runtime recursion in `decompose_code()`. Decomposition chains are fully expanded at build time by the Perl generator, so the runtime just copies a flat codepoint list.
- Stack buffer for small strings. The combining-class cache was changed from `uint32[decomposed_len]` (heap) to `uint8[512]` on the stack, falling back to the heap only for longer strings. CCC values fit in a byte, so this is both smaller and avoids `palloc` for the common case.
- Compatibility decomposition separated from the canonical decomposition table, avoiding dead lookups for NFC/NFD.
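The stack-buffer-with-heap-fallback pattern can be sketched as follows; this is a simplified stand-in (plain `malloc` instead of `palloc`, and a dummy workload instead of real CCC lookups), with the buffer size chosen to mirror the `uint8[512]` idea:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CCC_STACK_SIZE 512      /* mirrors the patch's stack cache size */

/*
 * Count nonzero combining classes in a pre-decomposed sequence.
 * Short inputs use the stack buffer; only longer ones allocate.
 */
static size_t
count_nonzero_ccc(const uint8_t *ccc_in, size_t len)
{
    uint8_t  stackbuf[CCC_STACK_SIZE];
    uint8_t *ccc = stackbuf;
    size_t   nonzero = 0;

    if (len > CCC_STACK_SIZE)
        ccc = malloc(len);      /* heap fallback (error check elided) */

    memcpy(ccc, ccc_in, len);   /* stand-in for per-codepoint lookups */
    for (size_t i = 0; i < len; i++)
        if (ccc[i] != 0)
            nonzero++;

    if (ccc != stackbuf)
        free(ccc);
    return nonzero;
}
```

Since most real-world strings decompose to well under 512 codepoints, the common case never touches the allocator at all.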
Performance Results
Three measurement frameworks were used (pgbench driving normalize(), Jeff Davis's ICU-derived 8–9MB corpus, and the historical md5/repeat workload from the original normalization thread). Consolidated numbers across iterations:
| Stage | NFD vs master | NFC vs master | vs ICU |
|---|---|---|---|
| Lookup-only patch (v3) | ~1.7× | ~1.6× | PG still 5–10× slower |
| + algorithm rewrite (v4) | ~5.7× | ~4.9× | PG ≈ ICU on NFD, ~2.5× slower on NFC |
| + stack CCC cache (v5) | ~2× further | ~2× further | comparable to ICU for decomposition |
Jeff Davis's ICU comparison was significant because it reframed the goal: rather than just "faster than before," the aim became "fastest open-source Unicode normalizer." On decomposition (NFD/NFKD) the patched PG slightly beats ICU. On composition (NFC/NFKC) ICU is still ~2× faster — Borisov attributes part of this to ICU being given a caller-sized output buffer while PG pre-scans to compute output length, but the gap is real and a future optimization target.
Key Design Tensions
Frontend binary size vs. code unification
John Naylor initially asked whether frontend and backend could share a single path — good for test coverage. The "small table" variant adds ~7KB to unicode_norm.o and ~1% to libpq, which Naylor judged acceptable "especially considering unicode_norm_srv.o is smaller by 27kB." This was pivotal: it authorized deleting unicode_norm_hashfunc.h (the 3025-line perfect-hash header) and removing the fork in the code, a structural simplification beyond pure performance.
Sentinel value semantics (unresolved)
Jeff Davis flagged that index 0 is overloaded: it means both "not found" and "a valid zeroth entry," forcing callers to reserve slot 0 as a dummy. He suggested `#define EMPTY 0xFFFF`. Borisov was non-committal, arguing "0 is easier to understand." Reviewer assam258 flagged this as the one open design issue — not a bug today, but a foot-gun if new callers are added.
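The distinction can be illustrated with a toy index table, using hypothetical names and values (`TOY_EMPTY` stands in for the suggested `EMPTY` define; nothing here is the patch's actual code): with an out-of-band sentinel, index 0 becomes an ordinary, usable entry instead of a reserved dummy.

```c
#include <stdint.h>

#define TOY_EMPTY 0xFFFF        /* out-of-band "not found" sentinel */

static const uint16_t toy_index[4] = {
    TOY_EMPTY,                  /* cp 0: no entry */
    0,                          /* cp 1: legitimately the zeroth entry */
    TOY_EMPTY,                  /* cp 2: no entry */
    2,                          /* cp 3: entry 2 */
};

static int
toy_has_entry(uint32_t cp)
{
    return cp < 4 && toy_index[cp] != TOY_EMPTY;
}
```

Under the current scheme, `toy_index[1] == 0` would be indistinguishable from "no entry," which is exactly why slot 0 of the dense table has to be sacrificed.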
Lookup API shape
Heikki pushed for the generator to emit a lookup(x) function rather than expose the tables, mirroring PerfectHash.pm. This encapsulates the "it's a radix tree" implementation detail from callers. Borisov adopted this in the v2 algorithm restructure.
Packing tricks declined
Heikki suggested packing comb_class (56 distinct values) and dec_size_flags (18 distinct values) into a single byte via a secondary lookup, shrinking pg_unicode_decomposition from 4 → 3 bytes. Borisov tested this and found the auxiliary index almost as large as the savings, because CCC values are scattered. He kept the straightforward 4-byte struct — a judgment call favoring simplicity over 25% table shrinkage.
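For reference, the declined packing would have looked roughly like this sketch (table contents and names are invented; the point is the mechanism): with only ~56 distinct CCC values and ~18 distinct size/flag values, every occurring pair fits in a secondary table of at most 256 rows, so the per-entry pair collapses to one byte.

```c
#include <stdint.h>

typedef struct
{
    uint8_t comb_class;
    uint8_t dec_size_flags;
} toy_pair;

/* Secondary table of the distinct (CCC, flags) pairs that occur. */
static const toy_pair toy_pairs[] = {
    {0, 0},                     /* hypothetical pair 0 */
    {230, 2},                   /* hypothetical pair 1 */
    {220, 1},                   /* hypothetical pair 2 */
};

/* Each main-table entry would store only a one-byte pair index. */
static uint8_t
toy_unpack_ccc(uint8_t packed)
{
    return toy_pairs[packed].comb_class;
}
```

Borisov's measurement showed the secondary table plus the extra indirection roughly cancelled the one-byte-per-entry saving, which is why the plain 4-byte struct survived.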
Related Ideas Raised But Deferred
Nico Williams contributed a notable architectural digression about ZFS's form-insensitive / form-preserving approach: normalize only inside comparison and hashing functions, with an ASCII fast-path cursor that invokes the slow path only when the current or next byte is non-ASCII, and that normalizes only one character at a time into a small stack buffer. Jeff Davis responded that PostgreSQL is already form-preserving (no auto-normalization), and that non-deterministic ICU collations already provide form-insensitive comparison (with "full normalization" handling many-combining-mark cases), though these are not yet available at the database level. This is adjacent work, not blocking this patch.
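The fast-path cursor Nico described can be sketched as below; the slow path is stubbed out as a counter, and the function name is illustrative. The look-ahead at the next byte matters because a non-ASCII combining mark can follow an ASCII base character, so an ASCII byte alone is not proof that it can be copied through unexamined.

```c
#include <stddef.h>

/*
 * Count how many input bytes would need slow-path handling under a
 * ZFS-style cursor: a byte is "fast" only if both it and its
 * successor are ASCII; otherwise one character would be normalized
 * into a small stack buffer (stubbed here as a counter).
 */
static size_t
count_slow_path_bytes(const unsigned char *s, size_t len)
{
    size_t  slow = 0;

    for (size_t i = 0; i < len; i++)
    {
        unsigned char cur = s[i];
        unsigned char next = (i + 1 < len) ? s[i + 1] : 0;

        if (cur >= 0x80 || next >= 0x80)
            slow++;             /* would normalize one character here */
        /* else: ASCII byte with an ASCII successor, copy through */
    }
    return slow;
}
```

On mostly-ASCII data the loop degenerates to a straight copy, which is the entire appeal of the approach.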
Jeff also suggested operating directly on UTF-8 without round-tripping through codepoints and returning the input buffer unchanged when no normalization is needed — optimizations consistent with the ZFS approach and explicitly left as future work.
Participant Dynamics
- Alexander Borisov (author) — the patch series follows his prior case-mapping work. He demonstrates strong ownership, empirically testing reviewer suggestions rather than just accepting them (e.g. rejecting the byte-packing proposal after measurement).
- Jeff Davis (committer) — the primary technical reviewer. His ICU benchmark reframed the bar, and his API critique (sentinel semantics, single-step lookup) drove real refactoring.
- Heikki Linnakangas (committer) — engaged late but substantively; his feedback precipitated the shift from branch-based to table-based (true radix) lookup, the most impactful structural change after the initial submission.
- John Naylor (committer) — early gatekeeping on the frontend/backend size tradeoff.
- Tom Lane (committer) — administrative guidance only, no technical review.
- Victor Yegorov — independent reviewer who marked Ready for Committer in Sept 2025.
- assam258 — late-stage reviewer who did coverage analysis (95.4% line coverage on modified code) and found three uncovered edge cases in `canonical_reorder()`, for Hangul, and in the stack-buffer overflow path; proposed concrete regression tests.
- Michael Paquier (committer) — declined v19 for timing but signaled interest for v20; asked for a repeatable benchmark generator script rather than pre-built corpora.
Implications
If committed, this:
- Deletes ~3000 lines of generated perfect-hash code and unifies the frontend/backend paths, simplifying future Unicode upgrades.
- Establishes `GenerateSparseArray.pm` as a reusable radix-tree emitter for other Unicode lookups (case mapping, properties) — a generalizable infrastructure piece.
- Brings PostgreSQL's normalization performance within striking distance of ICU without taking a runtime dependency, preserving the ability to run SASLprep in the frontend without linking ICU.
- Sets up the follow-on work Jeff and Nico gestured toward: UTF-8-native operation, input-unchanged fast path, and eventual form-insensitive comparison primitives.
The patch was not committed for v19 (frozen March 2026) and is targeted at v20.