2026-06-04 · claude-opus-4-6

Technical Analysis: Fix for Hangul Syllable Composition Bug in Unicode Normalization

Core Problem

PostgreSQL's Unicode normalization implementation contains an off-by-one error in its Hangul syllable composition algorithm. The code treats codepoint U+11A7 as a valid Hangul trailing consonant (T jamo), although U+11A7 sits one below the trailing-consonant range and must never take part in composition.

Background: Hangul Algorithmic Composition

Unicode defines an algorithmic (rather than table-driven) approach to composing and decomposing Korean Hangul syllables. The key constants are:

TBase = 0x11A7: one less than the first trailing consonant, U+11A8
TCount = 28: the 27 actual trailing consonants (U+11A8 through U+11C2) plus 1 for the "no trailing consonant" case

The critical insight is that TCount intentionally includes one extra slot (index 0) to represent the absence of a trailing consonant. For a precomposed Hangul character, the S-index is s = codepoint - SBase, where SBase = 0xAC00. If s % TCount == 0, the syllable has no trailing consonant (it's an LV syllable, not an LVT syllable).
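
The index arithmetic can be sketched in C as follows. This is a minimal illustration, not PostgreSQL's code: the macro and function names are hypothetical, and the constants other than TBase and TCount (SBase, LCount, VCount) are taken from the Hangul composition algorithm in the Unicode Standard.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hangul constants from the Unicode Standard (section 3.12). */
#define SBASE  0xAC00u
#define TBASE  0x11A7u              /* one below the first trailing consonant */
#define LCOUNT 19u
#define VCOUNT 21u
#define TCOUNT 28u                  /* 27 trailing consonants + 1 "none" slot */
#define NCOUNT (VCOUNT * TCOUNT)    /* 588 */
#define SCOUNT (LCOUNT * NCOUNT)    /* 11172 */

/* True if cp is a precomposed Hangul syllable (U+AC00..U+D7A3). */
static bool
is_hangul_syllable(uint32_t cp)
{
    return cp >= SBASE && cp < SBASE + SCOUNT;
}

/*
 * Trailing-consonant index of a precomposed syllable.
 * 0 means "no trailing consonant"; 1..27 map to U+11A8..U+11C2.
 * The caller must pass a precomposed Hangul syllable.
 */
static uint32_t
trailing_index(uint32_t cp)
{
    return (cp - SBASE) % TCOUNT;
}

/* True if cp is an LV syllable, i.e. a syllable whose trailing index is 0. */
static bool
is_lv_syllable(uint32_t cp)
{
    return is_hangul_syllable(cp) && trailing_index(cp) == 0;
}
```

For example, U+AC00 has trailing index 0 and is an LV syllable, while U+AC01 has trailing index 1, which corresponds to TBase + 1 = U+11A8.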

The Bug

PostgreSQL's code incorrectly classifies U+11A7 as a valid T jamo. The check likely uses something like:

if (ch >= TBase && ch < TBase + TCount)  /* WRONG */

When it should be:

if (ch > TBase && ch < TBase + TCount)   /* CORRECT: strict greater-than */

Or equivalently:

if (ch >= TBase + 1 && ch < TBase + TCount)

Since TBase (0x11A7) itself is not a valid trailing consonant (it is the sentinel value just below the actual range), including it lets the composition step fire when U+11A7 directly follows an LV syllable. Assuming the result is computed as LV + (ch - TBase), as in the standard algorithm, the T-index is 0, so the "composed" result is the LV syllable itself and U+11A7 is silently dropped from the output. The pair should have been left uncombined, so normalization alters text it must not touch.
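
To make the effect concrete, the sketch below puts the two checks side by side. It assumes the composed result is computed as start + (code - TBase), as in the standard algorithm; the function names are hypothetical, and this is not a copy of the code in unicode_norm.c.

```c
#include <stdbool.h>
#include <stdint.h>

#define SBASE  0xAC00u
#define SCOUNT 11172u
#define TBASE  0x11A7u
#define TCOUNT 28u

/* True if cp is a precomposed LV syllable (no trailing consonant). */
static bool
is_lv_syllable(uint32_t cp)
{
    return cp >= SBASE && cp < SBASE + SCOUNT && (cp - SBASE) % TCOUNT == 0;
}

/* Buggy variant: the inclusive lower bound accepts TBASE (U+11A7) itself. */
static bool
compose_lv_t_buggy(uint32_t start, uint32_t code, uint32_t *result)
{
    if (is_lv_syllable(start) && code >= TBASE && code < TBASE + TCOUNT)
    {
        *result = start + (code - TBASE);   /* adds 0 when code == TBASE */
        return true;
    }
    return false;
}

/* Fixed variant: the strict lower bound starts the range at U+11A8. */
static bool
compose_lv_t_fixed(uint32_t start, uint32_t code, uint32_t *result)
{
    if (is_lv_syllable(start) && code > TBASE && code < TBASE + TCOUNT)
    {
        *result = start + (code - TBASE);
        return true;
    }
    return false;
}
```

With the inclusive bound, composing U+AC00 with U+11A7 reports success and returns U+AC00, so the U+11A7 is consumed. With the strict bound the pair is left alone, while U+AC00 followed by U+11A8 still composes to U+AC01.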

Real-World Impact

While U+11A7 (HANGUL JUNGSEONG O-YAE, a medial vowel jamo added in Unicode 5.2) rarely appears in real text, this is a correctness bug in the composing normalization forms, NFC and NFKC. It means:

PostgreSQL's normalize() function could produce incorrect output for input in which U+11A7 follows an LV syllable
The IS NORMALIZED predicate could give incorrect results for the same inputs
Any collation or comparison relying on normalization could behave incorrectly with affected inputs

Cross-Project Precedent

The patch author notes this same bug has been found and fixed in:

Rust's unicode-rs crate (documented in a comment at the referenced line)
JuliaStrings/utf8proc (fixed by the same contributor a few months prior)

This suggests it's a common implementation mistake stemming from the non-obvious design of the TBase/TCount constants in the Unicode specification.

Proposed Solution

The patch (not shown in full detail) likely changes a >= comparison to > (or adjusts the range boundary by 1) in src/common/unicode_norm.c to exclude U+11A7 from being treated as a valid T jamo during the composition phase.

Testing Considerations

The author notes that the current test infrastructure (src/common/norm_test.c) only runs the official Unicode Normalization Test suite (NormalizationTest.txt from unicode.org) and doesn't support custom test cases. The Unicode test suite apparently does not cover this specific edge case (or it would have been caught earlier). The author offers to add custom test infrastructure if desired.

This is an important observation — it suggests PostgreSQL's Unicode normalization testing relies entirely on the upstream conformance test, which may not exercise all implementation-specific edge cases, particularly around boundary conditions in the algorithmic Hangul handling.
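
One possible shape for such custom cases is a small table of boundary inputs checked against a composition helper. The sketch below is hypothetical throughout: the struct, the table, and compose_lv_t are illustrative names and do not reflect how norm_test.c is organized.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SBASE  0xAC00u
#define SCOUNT 11172u
#define TBASE  0x11A7u
#define TCOUNT 28u

/* Hypothetical LV+T composition step, using the strict lower bound. */
static bool
compose_lv_t(uint32_t start, uint32_t code, uint32_t *result)
{
    bool is_lv = start >= SBASE && start < SBASE + SCOUNT &&
                 (start - SBASE) % TCOUNT == 0;

    if (is_lv && code > TBASE && code < TBASE + TCOUNT)
    {
        *result = start + (code - TBASE);
        return true;
    }
    return false;
}

/* One hand-written case: should (start, code) compose, and to what? */
typedef struct
{
    uint32_t start;
    uint32_t code;
    bool     composes;
    uint32_t expected;      /* only meaningful when composes is true */
} hangul_case;

static const hangul_case custom_cases[] = {
    {0xAC00, 0x11A7, false, 0},         /* TBASE itself must not compose */
    {0xAC00, 0x11A8, true, 0xAC01},     /* first real trailing consonant */
    {0xAC00, 0x11C2, true, 0xAC1B},     /* last trailing consonant */
    {0xAC00, 0x11C3, false, 0},         /* just past the range */
    {0xAC01, 0x11A8, false, 0},         /* an LVT syllable takes no second T */
};

/* Returns the number of failing cases (0 means all passed). */
static int
run_custom_cases(void)
{
    int failures = 0;

    for (size_t i = 0; i < sizeof(custom_cases) / sizeof(custom_cases[0]); i++)
    {
        const hangul_case *c = &custom_cases[i];
        uint32_t result = 0;
        bool ok = compose_lv_t(c->start, c->code, &result);

        if (ok != c->composes || (ok && result != c->expected))
            failures++;
    }
    return failures;
}
```

Keeping the cases in a table makes it cheap to add further boundary inputs alongside the upstream conformance file rather than replacing it.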

Architectural Significance

This bug lives in src/common/unicode_norm.c, which is part of PostgreSQL's common library used by both frontend and backend. It affects:

The normalize() SQL function
The IS NORMALIZED SQL predicate
Any internal normalization used for ICU integration or collation purposes
Potentially pg_dump/psql if they perform any normalization on the frontend side

The fix is minimal and low-risk, making it a candidate for backpatching to all supported branches.

[PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization
