Interest in Faster Unicode Normalization
Technical Problem
Unicode normalization is a fundamental operation in PostgreSQL's text processing pipeline. It is required for:
- Collation correctness: Ensuring that string comparisons produce semantically correct results regardless of how Unicode codepoints are composed or decomposed.
- The normalize() function and the IS NORMALIZED predicate: Exposed directly to users since PostgreSQL 13.
- ICU integration: ICU provides its own normalizer (NFC, NFD, NFKC, NFKD forms), and ICU-backed collations come into play in comparison, full-text search, and pattern matching.
- Case folding: Used in case-insensitive operations (ILIKE, citext, lower()/upper() with Unicode-aware collations).
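To make the collation-correctness point concrete, here is a minimal sketch showing that two canonically equivalent spellings of "é" differ at the byte level. The names E_ACUTE_NFC, E_ACUTE_NFD, and bytes_equal are hypothetical and exist only for this illustration; they are not PostgreSQL APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* "é" as the single precomposed codepoint U+00E9 (NFC form), in UTF-8. */
const char E_ACUTE_NFC[] = "\xC3\xA9";

/* "é" as "e" (U+0065) plus combining acute accent U+0301 (NFD form), in UTF-8. */
const char E_ACUTE_NFD[] = "e\xCC\x81";

/*
 * Bytewise equality, i.e. what a comparison sees if neither side has
 * been normalized first. Hypothetical helper for illustration only.
 */
bool
bytes_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strlen(a) == strlen(b) && memcmp(a, b, strlen(a)) == 0;
}
```

Both strings render identically, yet bytes_equal(E_ACUTE_NFC, E_ACUTE_NFD) is false: one is two bytes long and the other three. Normalizing both sides to the same form before comparing removes that discrepancy.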
The performance of Unicode normalization matters architecturally because it sits in hot paths for text-heavy workloads: sorting, indexing, and comparing strings can all invoke normalization. PostgreSQL's normalize() is backed by built-in scalar code (src/common/unicode_norm.c) rather than by ICU's unorm2_normalize() and related APIs; both are general-purpose implementations rather than ones tuned for the common case where input is already in NFC form (the most common form of text on the web) or consists entirely of ASCII.
Proposed Solution
Diego Frias presents xxUTF, a library that implements Unicode normalization using SIMD (Single Instruction, Multiple Data) instructions to achieve significant speedups over ICU. The key technical ideas behind SIMD-accelerated normalization are:
- Fast-path ASCII detection: SIMD can check 16-32 bytes at a time to determine if a block is entirely ASCII (all bytes < 0x80), in which case no normalization is needed. This is the overwhelmingly common case for many PostgreSQL deployments.
- Quick-check optimization: Unicode defines "quick check" properties that allow skipping normalization for codepoints that are already in the target form. SIMD can accelerate scanning for codepoints that fail the quick check.
- Vectorized case folding: Simple case mappings (single codepoint → single codepoint) can be applied in bulk using SIMD lookup tables.
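The ASCII fast path in the list above can be sketched in portable C. This is not xxUTF's code: it is a word-at-a-time stand-in that tests 8 bytes per step using a 64-bit mask, where a real SIMD version would test 16 to 32 bytes per step. The function names ascii_prefix_len and is_all_ascii are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the number of leading ASCII bytes in buf.
 * Illustrative sketch only; not taken from xxUTF, ICU, or PostgreSQL.
 */
size_t
ascii_prefix_len(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    assert(buf != NULL || len == 0);

    /* Eight bytes at a time: any byte >= 0x80 has its top bit set. */
    while (len - i >= 8)
    {
        uint64_t chunk;

        memcpy(&chunk, buf + i, sizeof(chunk));   /* safe unaligned load */
        if (chunk & UINT64_C(0x8080808080808080))
            break;
        i += 8;
    }

    /* Finish bytewise: the tail, or the word holding a non-ASCII byte. */
    while (i < len && buf[i] < 0x80)
        i++;

    return i;
}

/* True if the whole buffer is ASCII, so normalization can be skipped. */
bool
is_all_ascii(const unsigned char *buf, size_t len)
{
    return ascii_prefix_len(buf, len) == len;
}
```

A caller could copy the ASCII prefix through unchanged and hand only the remainder to the full normalizer, since ASCII text is unchanged by all four normalization forms. A real implementation would also have to handle a combining mark that directly follows the last ASCII byte, which this sketch ignores.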
Architectural Implications for PostgreSQL
Integrating a SIMD-based normalization library into PostgreSQL would involve several considerations:
- Build system complexity: SIMD code requires compile-time feature detection (SSE2/SSE4.2/AVX2 on x86, NEON on ARM) and potentially runtime dispatch. PostgreSQL already has some precedent for this (e.g., CRC32C acceleration).
- ICU dependency relationship: PostgreSQL currently has both a built-in Unicode normalization path (src/common/unicode_norm.c, which has backed normalize() since PG13 and is available in --without-icu builds) and the ICU path. A SIMD library could potentially replace or supplement either.
- Correctness guarantees: Unicode normalization must be byte-for-byte correct per the Unicode specification. Any alternative implementation needs exhaustive conformance testing against the Unicode Normalization Test suite.
- Maintenance burden: Tracking Unicode version updates (new codepoints, changed properties) is non-trivial. ICU handles this upstream; a custom library would need its own update mechanism.
- Platform portability: PostgreSQL runs on many architectures. A SIMD library needs fallback scalar paths and ideally runtime dispatch.
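As a rough illustration of the feature-detection and fallback concerns above, the sketch below selects between a vector and a scalar ASCII check using compiler-defined macros and a function pointer. It assumes GCC- or Clang-style macros (__SSE2__, __ARM_NEON, __aarch64__), and all function names are hypothetical. It performs compile-time selection only; true runtime dispatch for optional extensions such as AVX2 would add a CPU feature query inside choose_is_ascii().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#elif defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

typedef bool (*is_ascii_fn) (const unsigned char *buf, size_t len);

/* Scalar fallback: always available, and the reference for testing. */
bool
is_ascii_scalar(const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (buf[i] >= 0x80)
            return false;
    return true;
}

#if defined(__SSE2__)
#define HAVE_ASCII_VECTOR 1
bool
is_ascii_vector(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    for (; i + 16 <= len; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i *) (buf + i));

        /* movemask gathers the top bit of each of the 16 bytes. */
        if (_mm_movemask_epi8(v) != 0)
            return false;
    }
    return is_ascii_scalar(buf + i, len - i);
}
#elif defined(__ARM_NEON) && defined(__aarch64__)
#define HAVE_ASCII_VECTOR 1
bool
is_ascii_vector(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    for (; i + 16 <= len; i += 16)
    {
        /* vmaxvq_u8 returns the largest of the 16 bytes. */
        if (vmaxvq_u8(vld1q_u8(buf + i)) >= 0x80)
            return false;
    }
    return is_ascii_scalar(buf + i, len - i);
}
#endif

/* Pick an implementation; a runtime CPU check would go here. */
is_ascii_fn
choose_is_ascii(void)
{
#ifdef HAVE_ASCII_VECTOR
    return is_ascii_vector;
#else
    return is_ascii_scalar;
#endif
}

/* Public entry point: resolves the implementation on first use. */
bool
is_ascii(const unsigned char *buf, size_t len)
{
    static is_ascii_fn impl = NULL;

    assert(buf != NULL || len == 0);
    if (impl == NULL)
        impl = choose_is_ascii();
    return impl(buf, len);
}
```

Keeping the scalar function as the reference makes it easy to test that every vector path returns the same answers, which is the property a portability review would most likely ask for.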
Prior Art and Context
The thread references a previous mailing list discussion (message-id 844d3dd7-2955-4794-95d1-7f4c13cb89fc@gmail.com) which indicates there has been prior interest in faster normalization within the PostgreSQL community. PostgreSQL 17 saw work on improving the built-in Unicode infrastructure (generated tables, fast-path checks), suggesting the community is receptive to performance improvements in this area.
Assessment
This is an initial inquiry/RFC with no patch attached. The thread represents the earliest possible stage of a contribution — gauging community interest before investing in PostgreSQL-specific integration work. Key questions that would need to be addressed before adoption:
- Licensing compatibility (PostgreSQL uses a permissive BSD-like license)
- Whether the gains justify adding a new dependency vs. improving the existing built-in path
- Whether PostgreSQL should invest in its own SIMD normalization or continue relying on ICU
- Conformance test results against Unicode NormalizationTest.txt
- Benchmark methodology and applicability to PostgreSQL's actual usage patterns (short strings, already-NFC text, mixed scripts)
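For the conformance question, a test harness first has to read NormalizationTest.txt. Each data line holds five semicolon-separated columns of space-separated hexadecimal codepoints (source, NFC, NFD, NFKC, NFKD), followed by a comment. The parser below is a minimal sketch of that step; cp_seq, MAX_CPS, parse_norm_test_line, and cp_seq_equal are hypothetical names, and the per-column limit of 32 codepoints is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CPS 32              /* assumed per-column limit for this sketch */

typedef struct
{
    uint32_t    cps[MAX_CPS];
    int         ncps;
} cp_seq;

/*
 * Parse one data line "c1;c2;c3;c4;c5; # comment" into five sequences.
 * Returns 1 on success, 0 for anything else (comment lines, "@Part"
 * headers, truncated or malformed lines).
 */
int
parse_norm_test_line(const char *line, cp_seq cols[5])
{
    const char *p = line;

    assert(line != NULL);
    for (int c = 0; c < 5; c++)
    {
        cols[c].ncps = 0;
        while (*p != ';')
        {
            char       *end;
            unsigned long cp;

            if (*p == ' ')
            {
                p++;
                continue;
            }
            cp = strtoul(p, &end, 16);
            if (end == p || cp > 0x10FFFF || cols[c].ncps == MAX_CPS)
                return 0;       /* no hex digits, out of range, or too long */
            cols[c].cps[cols[c].ncps++] = (uint32_t) cp;
            p = end;
        }
        p++;                    /* step over the ';' that ends the column */
    }
    return 1;
}

/* Compare two codepoint sequences for exact equality. */
int
cp_seq_equal(const cp_seq *a, const cp_seq *b)
{
    if (a->ncps != b->ncps)
        return 0;
    for (int i = 0; i < a->ncps; i++)
        if (a->cps[i] != b->cps[i])
            return 0;
    return 1;
}
```

With the columns parsed, a harness would run the implementation under test on each column and compare against the expected one with cp_seq_equal. For NFC, for example, normalizing columns 1, 2, and 3 must each yield column 2, and normalizing columns 4 and 5 must each yield column 4.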