Technical Analysis: Fix for Hangul Syllable Composition Bug in Unicode Normalization
Core Problem
PostgreSQL's Unicode normalization implementation contains a subtle off-by-one error in its Hangul syllable composition algorithm. The code treats codepoint U+11A7 as a valid Hangul trailing consonant (T jamo), although the trailing consonants that take part in composition start at U+11A8.
Background: Hangul Algorithmic Composition
Unicode defines an algorithmic (rather than table-driven) approach to composing and decomposing Korean Hangul syllables. The key constants are:
- TBase = 0x11A7 — This is set to one less than the actual start of the trailing consonant range
- TCount = 28 — This counts the 27 actual trailing consonants (U+11A8 through U+11C2) plus 1 for the "no trailing consonant" case
The critical insight is that TCount intentionally includes one extra slot (index 0) to represent the absence of a trailing consonant. For a precomposed Hangul syllable s, the S-index is s - SBase (with SBase = 0xAC00), and (s - SBase) % TCount == 0 means the syllable has no trailing consonant (it's an LV syllable, not an LVT syllable).
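The role of index 0 can be illustrated with a minimal C sketch. The constant values come from the Unicode Standard's Hangul algorithm; the function names (is_hangul_syllable, hangul_tindex, trailing_jamo) are hypothetical and are not taken from PostgreSQL's source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Constants from the Unicode Standard's Hangul composition algorithm. */
#define SBASE  0xAC00u		/* first precomposed syllable */
#define SCOUNT 11172u		/* 19 * 21 * 28 precomposed syllables */
#define TBASE  0x11A7u		/* one below the first trailing consonant */
#define TCOUNT 28u			/* 27 trailing consonants + 1 "none" slot */

/* True if cp is a precomposed Hangul syllable (U+AC00..U+D7A3). */
static bool
is_hangul_syllable(uint32_t cp)
{
	return cp >= SBASE && cp < SBASE + SCOUNT;
}

/*
 * Trailing-consonant index of a precomposed syllable:
 * 0 means "no trailing consonant" (LV), 1..27 means LVT.
 * The caller must pass a codepoint accepted by is_hangul_syllable().
 */
static uint32_t
hangul_tindex(uint32_t cp)
{
	return (cp - SBASE) % TCOUNT;
}

/* Trailing jamo for a T-index; only meaningful for indexes 1..27. */
static uint32_t
trailing_jamo(uint32_t tindex)
{
	return TBASE + tindex;
}
```

Under these definitions, U+AC00 has T-index 0 (an LV syllable), while U+AC01 has T-index 1, which maps back to U+11A8, the first real trailing consonant. Index 0 would map to TBase itself, which is why TBase must never be accepted as a T jamo.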
The Bug
PostgreSQL's code incorrectly classifies U+11A7 as a valid T jamo. The check likely uses something like:
if (ch >= TBase && ch < TBase + TCount) /* WRONG */
When it should be:
if (ch > TBase && ch < TBase + TCount) /* CORRECT: strict greater-than */
Or equivalently:
if (ch >= TBase + 1 && ch < TBase + TCount)
Since TBase (0x11A7) itself is not a valid trailing consonant (it is the placeholder one below the actual range), accepting it lets the composition step treat an LV syllable followed by U+11A7 as a composable pair. Assuming the usual formula, result = LV + (ch - TBase), the T-index for U+11A7 is 0, so the "composed" result is the LV syllable itself and U+11A7 is silently dropped. The output then has one codepoint fewer than it should and is no longer canonically equivalent to the input.
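The difference between the two range checks can be sketched in C as follows. This is an illustration under the assumption that the composition step uses the standard LV + T formula; the function names are hypothetical and this is not PostgreSQL's actual code.

```c
#include <stdbool.h>
#include <stdint.h>

#define SBASE  0xAC00u
#define SCOUNT 11172u
#define TBASE  0x11A7u
#define TCOUNT 28u

/* True if cp is a precomposed syllable without a trailing consonant. */
static bool
is_lv_syllable(uint32_t cp)
{
	return cp >= SBASE && cp < SBASE + SCOUNT &&
		(cp - SBASE) % TCOUNT == 0;
}

/* Buggy variant: ">=" lets U+11A7 through with a T-index of 0. */
static bool
compose_lv_t_buggy(uint32_t lv, uint32_t t, uint32_t *result)
{
	if (is_lv_syllable(lv) && t >= TBASE && t < TBASE + TCOUNT)
	{
		*result = lv + (t - TBASE);
		return true;
	}
	return false;
}

/* Fixed variant: strict ">" accepts only U+11A8..U+11C2. */
static bool
compose_lv_t_fixed(uint32_t lv, uint32_t t, uint32_t *result)
{
	if (is_lv_syllable(lv) && t > TBASE && t < TBASE + TCOUNT)
	{
		*result = lv + (t - TBASE);
		return true;
	}
	return false;
}
```

With the buggy check, composing U+AC00 with U+11A7 reports success and returns U+AC00 again, so the caller discards U+11A7. With the strict check the pair is rejected and both codepoints stay in the output, while a genuine pair such as U+AC00 and U+11A8 still composes to U+AC01.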
Real-World Impact
U+11A7 is HANGUL JUNGSEONG O-YAE, an archaic medial vowel added to the Hangul Jamo block in Unicode 5.2, so it rarely appears in real text. This is nonetheless a correctness bug in the composing normalization forms (NFC and NFKC); the decomposing forms (NFD and NFKD) do not run the composition step. It means:
- PostgreSQL's normalize() function could produce incorrect output for the composed forms
- is_normalized() could give incorrect results
- Any collation or comparison relying on normalization could behave incorrectly with affected inputs
Cross-Project Precedent
The patch author notes this same bug has been found and fixed in:
- Rust's unicode-rs crate (documented in a comment at the referenced line)
- JuliaStrings/utf8proc (fixed by the same contributor a few months prior)
This suggests it's a common implementation mistake stemming from the non-obvious design of the TBase/TCount constants in the Unicode specification.
Proposed Solution
The patch (not shown in full detail) likely changes a >= comparison to > (or adjusts the range boundary by 1) in src/common/unicode_norm.c to exclude U+11A7 from being treated as a valid T jamo during the composition phase.
Testing Considerations
The author notes that the current test infrastructure (src/common/norm_test.c) only runs the official Unicode Normalization Test suite (NormalizationTest.txt from unicode.org) and doesn't support custom test cases. The Unicode test suite apparently does not cover this specific edge case (or it would have been caught earlier). The author offers to add custom test infrastructure if desired.
This is an important observation — it suggests PostgreSQL's Unicode normalization testing relies entirely on the upstream conformance test, which may not exercise all implementation-specific edge cases, particularly around boundary conditions in the algorithmic Hangul handling.
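As a rough idea of what custom test cases for this boundary might look like, here is a self-contained, table-driven sketch in C. Everything in it is an assumption made for illustration: hangul_recompose is a Hangul-only stand-in for the real composition step, and the norm_case struct and custom_cases table are hypothetical and do not reflect the layout of norm_test.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SBASE  0xAC00u
#define LBASE  0x1100u
#define VBASE  0x1161u
#define TBASE  0x11A7u
#define LCOUNT 19u
#define VCOUNT 21u
#define TCOUNT 28u
#define SCOUNT (LCOUNT * VCOUNT * TCOUNT)

/* Hangul-only composition pass; composes in place, returns new length. */
static size_t
hangul_recompose(uint32_t *buf, size_t len)
{
	size_t		out = 0;

	for (size_t i = 0; i < len; i++)
	{
		uint32_t	ch = buf[i];

		if (out > 0)
		{
			uint32_t	last = buf[out - 1];

			/* L + V -> LV */
			if (last >= LBASE && last < LBASE + LCOUNT &&
				ch >= VBASE && ch < VBASE + VCOUNT)
			{
				buf[out - 1] = SBASE +
					((last - LBASE) * VCOUNT + (ch - VBASE)) * TCOUNT;
				continue;
			}
			/* LV + T -> LVT; the strict ">" excludes U+11A7 */
			if (last >= SBASE && last < SBASE + SCOUNT &&
				(last - SBASE) % TCOUNT == 0 &&
				ch > TBASE && ch < TBASE + TCOUNT)
			{
				buf[out - 1] = last + (ch - TBASE);
				continue;
			}
		}
		buf[out++] = ch;
	}
	return out;
}

/* Hypothetical test-case layout: input and expected codepoints. */
typedef struct
{
	uint32_t	input[4];
	size_t		input_len;
	uint32_t	expected[4];
	size_t		expected_len;
} norm_case;

static const norm_case custom_cases[] = {
	{{0x1100, 0x1161, 0x11A8}, 3, {0xAC01}, 1},			/* L + V + T */
	{{0xAC00, 0x11A7}, 2, {0xAC00, 0x11A7}, 2},			/* LV + U+11A7 */
	{{0x1100, 0x1161, 0x11A7}, 3, {0xAC00, 0x11A7}, 2}, /* L + V + U+11A7 */
	{{0xAC00, 0x11C2}, 2, {0xAC1B}, 1},					/* last valid T */
};

/* Runs every case and returns how many produced unexpected output. */
static size_t
count_failures(void)
{
	size_t		failures = 0;

	for (size_t i = 0; i < sizeof(custom_cases) / sizeof(custom_cases[0]); i++)
	{
		const norm_case *c = &custom_cases[i];
		uint32_t	buf[4];
		size_t		n;

		memcpy(buf, c->input, sizeof(buf));
		n = hangul_recompose(buf, c->input_len);
		if (n != c->expected_len ||
			memcmp(buf, c->expected, n * sizeof(uint32_t)) != 0)
			failures++;
	}
	return failures;
}
```

A caller would check that count_failures() returns 0. The second and third rows are the ones that target the boundary: they would fail against a composition step that used ">=" in the T-jamo check, since U+11A7 would be absorbed into the preceding syllable.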
Architectural Significance
This bug lives in src/common/unicode_norm.c, which is part of PostgreSQL's common library used by both frontend and backend. It affects:
- The normalize() SQL function
- The is_normalized() SQL function
- Any internal normalization used for ICU integration or collation purposes
- Potentially pg_dump/psql if they perform any normalization on the frontend side
The fix is minimal and low-risk, making it a candidate for backpatching to all supported branches.