Improve the performance of Unicode Normalization Forms.

First seen: 2025-06-02 21:51:08+00:00 · Messages: 41 · Participants: 10

Latest Update

2026-05-06 · opus 4.7

Improving Unicode Normalization Performance in PostgreSQL

The Core Problem

PostgreSQL's normalize() function (and the underlying unicode_normalize() in src/common/unicode_norm.c) implements the four Unicode Normalization Forms (NFC, NFD, NFKC, NFKD). The implementation had two architectural issues accumulated since its introduction:

  1. Divergent frontend/backend code paths. Because libpq/libpgcommon is shared with client tools (e.g. SCRAM/SASLprep via saslprep.c), a perfect hash table was deemed too large for the frontend. So the frontend used bsearch() over a sorted array, while the backend used a perfect hash (generated via PerfectHash.pm, producing unicode_norm_hashfunc.h). This divergence meant the two paths were not equally exercised by tests, and added maintenance burden.
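The old frontend path can be sketched as follows. The entry layout and table contents here are invented for illustration, but the shape — a stored codepoint plus `bsearch()` over a sorted array — is the one the text describes:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry mirroring the sorted-array layout: the codepoint
 * must be stored in each entry so bsearch() can compare against it. */
typedef struct
{
	uint32_t	codepoint;
	uint8_t		comb_class;
} demo_entry;

/* A tiny sorted table standing in for the real decomposition table. */
static const demo_entry demo_table[] = {
	{0x0300, 230}, {0x0301, 230}, {0x0327, 202}, {0x05B0, 10}
};

static int
demo_cmp(const void *key, const void *elem)
{
	uint32_t	cp = *(const uint32_t *) key;
	const demo_entry *e = (const demo_entry *) elem;

	return (cp > e->codepoint) - (cp < e->codepoint);
}

/* O(log n) lookup with data-dependent branches: the shape of the
 * old frontend path. */
static const demo_entry *
demo_lookup(uint32_t cp)
{
	return bsearch(&cp, demo_table,
				   sizeof(demo_table) / sizeof(demo_table[0]),
				   sizeof(demo_table[0]), demo_cmp);
}
```

The log n comparisons branch unpredictably and touch scattered cache lines, which is why this loses to both the perfect hash and the sparse array on the hot per-codepoint path.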

  2. Performance of table lookups and the decomposition algorithm. The per-codepoint lookup was the hot path. Beyond lookup, decompose_code() used recursion at runtime, and unicode_normalize() allocated a uint32[] buffer for combining-class caching proportional to the decomposed length, which is wasteful for short strings and incurs palloc overhead.

Unicode normalization matters architecturally because it sits under SCRAM password handling (frontend and backend) and the normalize()/IS NORMALIZED SQL surface, and is a building block for any future form-insensitive comparison or indexing. It is also code that is executed character-by-character over potentially large strings, so constant factors dominate.

The Solution: Two-Level Sparse Array (Radix-Style) Lookup

Borisov's core design insight is that a range index / two-level sparse array beats both bsearch (log n branches, cache-unfriendly) and perfect hashing (one multiplication + table lookup, but larger tables and a required stored codepoint for verification). The data structure evolved over the thread:

Iteration 1 — Sparse array with generated branch code

A Perl module (initially Ranges.pm, later GenerateSparseArray.pm) emits the sparse-array tables together with a generated branching lookup function.

Crucially, the 32-bit stored codepoint field is removed from the per-entry struct. Perfect hashing had required storing the codepoint to verify collisions; eliminating it both shrinks the dense table and, more importantly, enables deduplication of identical entries (e.g. many codepoints decomposing identically or sharing CCC/flag combinations). The main decomposition table shrank from 6843 → 3957 entries; the decomp codepoint table from 5138 → 4931 uint32s.
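A toy illustration of why dropping the stored codepoint enables deduplication (all names and values below are hypothetical): once entries no longer carry their own codepoint, two codepoints with identical properties can share one dense-table slot, with slot 0 reserved as the "not found" dummy:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical dense table: U+0300 and U+0301 both carry combining
 * class 230, so both index entries point at the same slot. */
typedef struct
{
	uint8_t		comb_class;
	uint8_t		dec_size_flags;
} demo_entry;

static const demo_entry demo_dense[] = {
	{0, 0},						/* slot 0 doubles as the "not found" dummy */
	{230, 0},					/* shared by U+0300 and U+0301 */
	{202, 0},					/* U+0327 */
};

/* Index for the toy range U+0300..U+0327; unset slots default to 0. */
static const uint8_t demo_index[0x28] = {
	[0x00] = 1, [0x01] = 1, [0x27] = 2
};

static uint8_t
demo_ccc(uint32_t cp)
{
	if (cp < 0x300 || cp > 0x327)
		return 0;
	return demo_dense[demo_index[cp - 0x300]].comb_class;
}
```

With a stored codepoint in each entry, the two slots for U+0300 and U+0301 could never have been merged, since the entries would differ in that field.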

Iteration 2 — Replace branches with a second offset table (true radix tree)

After Heikki Linnakangas's review, Borisov replaced the branching lookup function with a pure table lookup:

```c
return table_index[table_offset[cp >> 6] + (cp & 63)];
```

This is O(1) with a single bounds check. As Heikki noted, "Sounds like you re-invented the radix tree." It is in fact a classic 2-level trie keyed by the high and low bits of the codepoint, the same structure used by CPython, ICU, and Tcl for Unicode property lookup. The structural equivalence to src/common/wchar.c's encoding radix trees was also observed, though the lookup here is tighter (no per-level variable stride).
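The two-load lookup can be sketched end-to-end with toy data (names, sizes, and table contents below are illustrative, not the real generated tables — in the patch, the tables are emitted by the Perl generator):

```c
#include <assert.h>
#include <stdint.h>

/* Chunks of 64 codepoints, matching the cp >> 6 / cp & 63 split. */
enum
{
	CHUNK_BITS = 6,
	CHUNK_SIZE = 1 << CHUNK_BITS,
	MAX_CP = 0x400				/* toy ceiling; the real one covers all of Unicode */
};

/* First level: one offset per 64-codepoint chunk.  Chunks with no
 * interesting codepoints all share the zero-filled run at offset 0. */
static uint16_t table_offset[MAX_CP >> CHUNK_BITS];
static uint8_t table_index[2 * CHUNK_SIZE];

static void
demo_init(void)
{
	/* Run at offset CHUNK_SIZE: toy combining classes for U+0300..U+033F. */
	for (int i = 0; i < CHUNK_SIZE; i++)
		table_index[CHUNK_SIZE + i] = 230;
	table_index[CHUNK_SIZE + 0x27] = 202;	/* U+0327 */

	/* Only the chunk containing U+0300 points at the populated run. */
	table_offset[0x300 >> CHUNK_BITS] = CHUNK_SIZE;
}

/* O(1): one bounds check, then two dependent loads. */
static uint8_t
demo_ccc(uint32_t cp)
{
	if (cp >= MAX_CP)
		return 0;
	return table_index[table_offset[cp >> CHUNK_BITS] + (cp & (CHUNK_SIZE - 1))];
}
```

The sharing of the zero run among empty chunks is the same deduplication trick that removing the stored codepoint enabled for the dense entry table.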

Iteration 3 — Algorithmic cleanup of unicode_normalize()

Beyond lookup, v4+ rewrote the normalization driver itself, removing the runtime recursion in decompose_code() and (in v5) replacing the palloc'd combining-class buffer with a stack-based cache.
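One plausible shape for the stack-based combining-class cache — a sketch, not the patch's actual code; names are hypothetical, `malloc()` stands in for palloc(), and the reordering pass is a simplified UAX #15-style bubble over combining classes:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CCC_STACK_SIZE 64		/* covers typical short strings */

/* Cache one combining class per decomposed codepoint on the stack,
 * spilling to the heap only for unusually long inputs; returns the
 * number of adjacent swaps canonical ordering needed, or -1 on OOM. */
static int
demo_reorder(const uint8_t *ccc_in, size_t len)
{
	uint8_t		ccc_stack[CCC_STACK_SIZE];
	uint8_t    *ccc = ccc_stack;
	int			swaps = 0;

	if (len > CCC_STACK_SIZE)
	{
		ccc = malloc(len);
		if (ccc == NULL)
			return -1;
	}
	memcpy(ccc, ccc_in, len);

	/* Bubble pass: nonzero combining classes become nondecreasing,
	 * never reordering across a starter (class 0). */
	for (int done = 0; !done;)
	{
		done = 1;
		for (size_t i = 1; i < len; i++)
		{
			if (ccc[i] != 0 && ccc[i - 1] > ccc[i])
			{
				uint8_t		tmp = ccc[i];

				ccc[i] = ccc[i - 1];
				ccc[i - 1] = tmp;
				swaps++;
				done = 0;
			}
		}
	}

	if (ccc != ccc_stack)
		free(ccc);
	return swaps;
}
```

The point of the pattern is that the common case (a short string) performs zero allocations, which is exactly where the old proportional-length palloc hurt most.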

Performance Results

Three measurement frameworks were used (pgbench driving normalize(), Jeff Davis's ICU-derived 8–9MB corpus, and the historical md5/repeat workload from the original normalization thread). Consolidated numbers across iterations:

| Stage | NFD vs master | NFC vs master | vs ICU |
|---|---|---|---|
| Lookup-only patch (v3) | ~1.7× | ~1.6× | PG still 5–10× slower |
| + algorithm rewrite (v4) | ~5.7× | ~4.9× | PG ≈ ICU on NFD, ~2.5× slower on NFC |
| + stack CCC cache (v5) | ~2× further | ~2× further | comparable to ICU for decomposition |

Jeff Davis's ICU comparison was significant because it reframed the goal: rather than just "faster than before," the aim became "fastest open-source Unicode normalizer." On decomposition (NFD/NFKD) the patched PG slightly beats ICU. On composition (NFC/NFKC) ICU is still ~2× faster — Borisov attributes part of this to ICU being given a caller-sized output buffer while PG pre-scans to compute output length, but the gap is real and a future optimization target.

Key Design Tensions

Frontend binary size vs. code unification

John Naylor initially asked whether frontend and backend could share a single path — good for test coverage. The "small table" variant adds ~7KB to unicode_norm.o and ~1% to libpq, which Naylor judged acceptable "especially considering unicode_norm_srv.o is smaller by 27kB." This was pivotal: it authorized deleting unicode_norm_hashfunc.h (the 3025-line perfect-hash header) and removing the fork in the code, a structural simplification beyond pure performance.

Sentinel value semantics (unresolved)

Jeff Davis flagged that index 0 is overloaded: it means both "not found" and "valid zeroth entry," forcing callers to reserve slot 0 as a dummy. He suggested #define EMPTY 0xFFFF. Borisov was non-committal, arguing "0 is easier to understand." The reviewer assam258 flagged this as the one open design issue — not a bug today, but a foot-gun if new callers are added.
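A minimal sketch of the suggested fix (table contents invented): with a distinct EMPTY sentinel, index 0 becomes a legitimate entry, so no dummy slot has to be reserved:

```c
#include <assert.h>
#include <stdint.h>

/* Jeff Davis's suggestion, sketched: a reserved out-of-band value
 * marks "not found", freeing slot 0 for real data. */
#define EMPTY 0xFFFF

/* Toy index: codepoints 0 and 2 have entries; 1 and 3 do not. */
static const uint16_t demo_index[] = {0, EMPTY, 1, EMPTY};

static int
demo_found(uint32_t cp)
{
	return cp < sizeof(demo_index) / sizeof(demo_index[0]) &&
		demo_index[cp] != EMPTY;
}
```

The cost is one more comparison against a magic constant; the benefit is that callers can no longer confuse "entry at index 0" with "no entry."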

Lookup API shape

Heikki pushed for the generator to emit a lookup(x) function rather than expose the tables, mirroring PerfectHash.pm. This encapsulates the "it's a radix tree" implementation detail from callers. Borisov adopted this in the v2 algorithm restructure.

Packing tricks declined

Heikki suggested packing comb_class (56 distinct values) and dec_size_flags (18 distinct values) into a single byte via a secondary lookup, shrinking pg_unicode_decomposition from 4 → 3 bytes. Borisov tested this and found the auxiliary index almost as large as the savings, because CCC values are scattered. He kept the straightforward 4-byte struct — a judgment call favoring simplicity over 25% table shrinkage.
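The declined idea, sketched with invented pairs: each entry stores one byte indexing a secondary table of distinct (comb_class, dec_size_flags) combinations, shrinking the entry by a byte at the cost of that secondary table — which, per Borisov's measurement, ends up nearly as large as the savings because the CCC values are scattered:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical secondary table of distinct property pairs. */
typedef struct
{
	uint8_t		comb_class;
	uint8_t		dec_size_flags;
} demo_pair;

static const demo_pair demo_pairs[] = {
	{0, 0}, {230, 0}, {202, 0}, {0, 4}
};

/* An entry would store only a one-byte pair index; accessors resolve
 * it through the pair table. */
static uint8_t
demo_ccc_of(uint8_t pair_idx)
{
	return demo_pairs[pair_idx].comb_class;
}

static uint8_t
demo_flags_of(uint8_t pair_idx)
{
	return demo_pairs[pair_idx].dec_size_flags;
}
```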

Related Ideas Raised But Deferred

Nico Williams contributed a notable architectural digression about ZFS's form-insensitive / form-preserving approach: normalize only inside comparison and hashing functions, with an ASCII fast-path cursor that only invokes the slow path when either the current or next byte is non-ASCII, and that only normalizes one character at a time into a small stack buffer. Jeff Davis responded that PostgreSQL is already form-preserving (no auto-normalization), and non-deterministic ICU collations already provide form-insensitive comparison (with "full normalization" handling many-combining-mark cases), though not available at the database level yet. This is adjacent work, not blocking this patch.
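Nico's cursor idea can be sketched as follows (all names are hypothetical, and `demo_slow_eq` is a stub standing in for the real one-character-at-a-time normalizing comparison): stay on a byte-by-byte fast path while both the current and the next byte are ASCII, since an ASCII character can only participate in normalization when followed by a non-ASCII combining mark:

```c
#include <assert.h>
#include <string.h>

/* Placeholder for "normalize one character into a small stack buffer
 * and compare"; a real implementation would normalize here. */
static int
demo_slow_eq(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}

/* ZFS-style form-insensitive equality with an ASCII fast path. */
static int
demo_form_insensitive_eq(const char *a, const char *b)
{
	while (*a && *b)
	{
		/* The *next* byte matters too: an ASCII char followed by a
		 * non-ASCII byte may still compose under normalization. */
		if ((a[0] & 0x80) || (b[0] & 0x80) ||
			(a[1] & 0x80) || (b[1] & 0x80))
			return demo_slow_eq(a, b);
		if (*a != *b)
			return 0;
		a++;
		b++;
	}
	return *a == *b;
}
```

For mostly-ASCII workloads this never touches normalization tables at all, which is the property that makes the approach attractive for comparison and hashing call sites.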

Jeff also suggested operating directly on UTF-8 without round-tripping through codepoints and returning the input buffer unchanged when no normalization is needed — optimizations consistent with the ZFS approach and explicitly left as future work.

Participant Dynamics

Implications

If committed, this:

  1. Deletes ~3000 lines of generated perfect-hash code and unifies the frontend/backend paths, simplifying future Unicode upgrades.
  2. Establishes GenerateSparseArray.pm as a reusable radix-tree emitter for other Unicode lookups (case mapping, properties) — a generalizable infrastructure piece.
  3. Brings PostgreSQL's normalization performance within striking distance of ICU without taking a runtime dependency, preserving the ability to run SASLprep in the frontend without linking ICU.
  4. Sets up the follow-on work Jeff and Nico gestured toward: UTF-8-native operation, input-unchanged fast path, and eventual form-insensitive comparison primitives.

The patch was not committed for v19 (frozen March 2026) and is targeted at v20.