Interest in Faster Unicode Normalization

First seen: 2026-06-04 00:06:09+00:00 · Messages: 1 · Participants: 1

Latest Update

2026-06-04 · claude-opus-4-6

Technical Problem

Unicode normalization is a fundamental operation in text processing, and PostgreSQL touches it in several places:

  1. Comparison semantics: Canonically equivalent strings (for example, a precomposed character versus a base letter plus a combining mark) compare as equal only if both are normalized to the same form or the collation accounts for the equivalence.
  2. The normalize() and is_normalized() SQL functions: Exposed directly to users since PostgreSQL 13 and implemented by PostgreSQL's built-in normalization code rather than by ICU.
  3. ICU integration: ICU is an optional collation provider. Its collations can treat canonically equivalent sequences as equal (notably nondeterministic collations), independently of the normalize() function.
  4. Case folding: Case-insensitive operations (ILIKE, citext, lower()/upper() with Unicode-aware collations) are commonly combined with normalization so that equivalent strings match.
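
To make the comparison point concrete, here is a minimal sketch using Python's standard unicodedata module as a stand-in. It is not PostgreSQL code; it only shows why two canonically equivalent strings differ bytewise until both are brought to the same normalization form.

```python
import unicodedata

# Two spellings of the same user-perceived character "é".
composed = "\u00e9"        # single precomposed code point
decomposed = "e\u0301"     # base letter followed by a combining acute accent

# A plain comparison sees different code point sequences.
raw_equal = composed == decomposed

# After normalizing both sides to the same form, they match.
nfc_equal = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

# The UTF-8 encodings also differ in length before normalization.
composed_len = len(composed.encode("utf-8"))
decomposed_len = len(decomposed.encode("utf-8"))
```

The same effect is what normalize() and is_normalized() let SQL users control explicitly.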

The performance of Unicode normalization matters architecturally because it can sit in hot paths for text-heavy workloads: sorting, indexing, and comparing strings may all involve normalization, either explicitly through normalize() or implicitly inside the collation provider. General-purpose implementations, such as PostgreSQL's built-in code in src/common/unicode_norm.c or ICU's unorm2_normalize(), must handle arbitrary input. The premise of the thread is that the common cases, input that is already in NFC (the most common form on the web) or that consists entirely of ASCII, leave room for a faster implementation.
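
The common-case argument can be sketched as a layered fast path. The function below is a hypothetical illustration in Python, not code from PostgreSQL, ICU, or xxUTF: it returns early for ASCII-only input (which every normalization form leaves unchanged), then for input that is already NFC, and only falls back to full normalization when neither shortcut applies.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Return text in NFC, skipping work when the input is already normalized."""
    # ASCII-only strings are unchanged by all four normalization forms.
    if text.isascii():
        return text
    # Quick check: non-ASCII text that is already NFC needs no rewriting.
    if unicodedata.is_normalized("NFC", text):
        return text
    # Slow path: full decomposition, reordering, and recomposition.
    return unicodedata.normalize("NFC", text)
```

Only the last branch does real normalization work, so the cost for typical input is dominated by how quickly the two checks can scan the string.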

Proposed Solution

Diego Frias presents xxUTF, a library that implements Unicode normalization using SIMD (Single Instruction, Multiple Data) instructions and is reported to be significantly faster than ICU.
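
As a rough illustration of the data-parallel idea, and not a description of how xxUTF works internally, the sketch below tests eight bytes at a time for non-ASCII content by masking the high bit of every byte in a 64-bit word. Real SIMD code would do the same kind of test on 16-byte or 32-byte registers with a single instruction. The function name and the choice of an ASCII check are assumptions made for the example.

```python
HIGH_BITS = 0x8080808080808080  # top bit of each of the 8 bytes in a word


def is_ascii_wordwise(data: bytes) -> bool:
    """Return True if every byte is below 0x80, checking 8 bytes per step."""
    full = len(data) - (len(data) % 8)
    for i in range(0, full, 8):
        # Load 8 bytes as one integer and test all high bits at once.
        word = int.from_bytes(data[i:i + 8], "little")
        if word & HIGH_BITS:
            return False
    # Handle the remaining 0 to 7 bytes one at a time.
    return all(byte < 0x80 for byte in data[full:])
```

A block that passes this test is already normalized in every form, so a normalizer can copy it through unchanged and reserve table lookups for the blocks that fail.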

Architectural Implications for PostgreSQL

Integrating a SIMD-based normalization library into PostgreSQL would involve several considerations:

  1. Build system complexity: SIMD code requires compile-time feature detection (SSE2/SSE4.2/AVX2 on x86, NEON on ARM) and potentially runtime dispatch. PostgreSQL already has some precedent for this (e.g., CRC32 acceleration).

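  2. ICU dependency relationship: PostgreSQL has a built-in normalization implementation (src/common/unicode_norm.c, originally introduced for SCRAM's SASLprep and exposed through normalize() since PostgreSQL 13) that does not require ICU, while ICU itself is an optional dependency used as a collation provider. A SIMD library could potentially replace or supplement the built-in code.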

  3. Correctness guarantees: Unicode normalization must be byte-for-byte correct per the Unicode specification (UAX #15). Any alternative implementation needs exhaustive conformance testing against the Unicode Consortium's NormalizationTest.txt data file.

  4. Maintenance burden: Tracking Unicode version updates (new codepoints, changed properties) is non-trivial. ICU handles this upstream; a custom library would need its own update mechanism.

  5. Platform portability: PostgreSQL runs on many architectures. A SIMD library needs fallback scalar paths and ideally runtime dispatch.
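
The conformance requirement in item 3 can be sketched as a small checker for lines in the NormalizationTest.txt format (five semicolon-separated columns of hexadecimal code points, with invariants defined in the file header). This is a minimal sketch under stated assumptions: Python's unicodedata stands in for the implementation under test, and the normalize parameter is a hypothetical hook where an alternative normalizer could be plugged in.

```python
import unicodedata


def parse_test_line(line):
    """Split one data line into its five columns, decoded to strings."""
    data = line.split("#", 1)[0]          # drop the trailing comment
    columns = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in columns]


def line_conforms(line, normalize=unicodedata.normalize):
    """Check the per-line invariants for all four normalization forms."""
    c1, c2, c3, c4, c5 = parse_test_line(line)
    expected = {
        "NFC": [(c2, (c1, c2, c3)), (c4, (c4, c5))],
        "NFD": [(c3, (c1, c2, c3)), (c5, (c4, c5))],
        "NFKC": [(c4, (c1, c2, c3, c4, c5))],
        "NFKD": [(c5, (c1, c2, c3, c4, c5))],
    }
    return all(
        normalize(form, source) == want
        for form, groups in expected.items()
        for want, sources in groups
        for source in sources
    )


# Sample lines written in the file's column format.
SAMPLE_CANONICAL = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # D WITH DOT ABOVE"
SAMPLE_COMPAT = "FB01;FB01;FB01;0066 0069;0066 0069; # LATIN SMALL LIGATURE FI"
```

Running every line of the data file through such a check, on both the SIMD path and the scalar fallback, is one way to cover the correctness and portability concerns together.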

Prior Art and Context

The thread references a previous mailing list discussion (message-id 844d3dd7-2955-4794-95d1-7f4c13cb89fc@gmail.com) which indicates there has been prior interest in faster normalization within the PostgreSQL community. PostgreSQL 17 saw work on improving the built-in Unicode infrastructure (generated tables, fast-path checks), suggesting the community is receptive to performance improvements in this area.

Assessment

This is an initial inquiry/RFC with no patch attached. The thread represents the earliest possible stage of a contribution — gauging community interest before investing in PostgreSQL-specific integration work. Key questions that would need to be addressed before adoption: