2026-05-14 · claude-opus-4-6

Remove Invalid SS2/SS3 Handling from EUC-KR Routines

Core Problem

PostgreSQL's EUC-KR encoding implementation carries vestigial SS2 (0x8E) and SS3 (0x8F) single-shift byte handling inherited from shared pg_euc_* helper functions that were originally designed for EUC-JP and EUC-TW — encodings that legitimately designate G2 and G3 character sets. EUC-KR, as defined by KS X 2901 (formerly KS C 5861-1992), designates only G0 (ASCII via KS X 1003) and G1 (KS X 1001 Korean characters). Code sets 2 and 3 are explicitly "undefined" per the standard, with normative text stating they "may be defined and used in the future if necessary."

This creates an internal inconsistency within PostgreSQL's own encoding layer:

pg_euckr_verifychar() (the validation function) already correctly rejects SS2/SS3 — it has no branch for 0x8E/0x8F and only accepts 0x00–0x7F (G0) and 0xA1–0xFE lead bytes (G1).
pg_euckr_mblen(), pg_euckr_dsplen(), and pg_euckr2wchar_with_len() delegate to the shared pg_euc_* helpers which do handle SS2/SS3, reporting multi-byte lengths (2 or 3) for bytes that the verifier rejects.
pg_wchar_table[PG_EUC_KR].maxmblen is set to 3, implying 3-byte sequences exist.
Documentation Table 23.3 lists EUC_KR as "1–3" bytes per character.

The verifier and the length/display functions tell contradictory stories. For valid EUC-KR data this is harmless — SS2/SS3 bytes never pass verification, so the SS2/SS3 branches in mblen/dsplen are dead code for verified input. However, the inconsistency matters in two dimensions:

Correctness of the encoding contract: maxmblen=3 is factually wrong per KS X 2901 and misleads buffer-sizing logic and any consumer of pg_encoding_max_length('EUC_KR').
Pre-existing unverified data on disk: Bytes can reach disk through paths that bypass pg_verify_mbstr (physical replication, pg_upgrade relfilenode preservation, pre-patch data from older versions). For such rows containing 0x8E/0x8F, pg_mblen currently returns 2 or 3 (consuming trailing bytes as part of a phantom multi-byte character); after the patch it returns 1 (treating 0x8E/0x8F as a single invalid byte). This changes the behavior of length(), substring(), LIKE, and similar string functions on such rows.

Architectural Context

PostgreSQL's multi-byte character support is organized around pg_wchar_table[] in src/common/wchar.c, a dispatch table indexed by encoding ID. Each encoding registers callbacks for:

mblen — byte length of a character given its lead byte
dsplen — display width
mb2wchar_with_len — conversion to wide characters
mbverifychar / mbverifystr — validation

The EUC family (EUC-JP, EUC-CN, EUC-KR, EUC-TW) historically shared helper functions (pg_euc_mblen, pg_euc_dsplen, etc.) that implement the full EUC generalization including SS2/SS3 handling. This was a reasonable code-sharing decision when these encodings were first added, but it bakes Japanese/Taiwanese assumptions (JIS X 0201 half-width katakana via SS2, JIS X 0212 via SS3) into encodings where those constructs don't apply.

The patch breaks EUC-KR out of this shared delegation by providing standalone implementations that recognize only G0 (1 byte, ASCII) and G1 (2 bytes, high-bit-set lead + trail). This mirrors the structure already used by UHC (Unified Hangul Code), which is also a 1–2 byte Korean encoding with maxmblen=2 — a consistency win noted by the reviewer.

Proposed Solution

The patch (evolving through v1→v2→v3) makes three surgical changes:

Replace delegating functions: pg_euckr_mblen(), pg_euckr_dsplen(), and pg_euckr2wchar_with_len() get EUC-KR-specific implementations that handle only G0 (1-byte ASCII) and G1 (2-byte KS X 1001). No SS2/SS3 branches.
Set maxmblen to 2 in pg_wchar_table[PG_EUC_KR], down from 3.
Fix documentation Table 23.3 from "1–3" to "1–2" in charset.sgml.

The only user-visible behavioral change for valid data is pg_encoding_max_length('EUC_KR') returning 2 instead of 3.

Patch Evolution

v1: Initial implementation with the three core changes.
v2: Added commit message note about structural similarity to UHC post-patch (suggested by Henson Choi). Added a release-19.sgml hunk for the pg_encoding_max_length change.
v3: Removed the release-19.sgml hunk per Henson's guidance that release notes are curated by the release manager from commit messages, not embedded in patches. The commit message now carries the user-visible change description. The charset.sgml Table 23.3 fix is retained as it documents encoding behavior.

A planned but not-yet-submitted restructuring will split into:

0001: Baseline characterization tests in src/test/regress/sql/euc_kr.sql covering pg_encoding_max_length, G0/ASCII handling, G1/KS X 1001 handling, and SS2 lead-byte rejection. Expected output records current (pre-fix) behavior where pg_encoding_max_length returns 3.
0002: The actual fix. Expected output changes by exactly one line (3→2), providing a clean audit signal that valid data is unaffected.

Key Technical Debates and Design Decisions

Regression Test Coverage (Michael Paquier's concern)

Michael Paquier flagged that the patch had zero regression tests and pointed to src/test/locale/ as potentially impacted infrastructure. SungJun correctly argued that src/test/locale/ exercises collation/ctype (locale-dependent sort order and isalpha-family functions), not the encoding layer (maxmblen, mblen, dsplen). He further noted the practical limitation that src/test/locale/ uses a 1998-era shell harness not integrated into check-world or the buildfarm — tests added there provide no ongoing coverage.

The proposed 0001/0002 split is an elegant response: characterization-first testing makes the behavior delta precisely one line of diff, which is both a regression test and an audit artifact.

Side-Effect Surface Analysis (Henson Choi's P1–P5 framework)

Henson Choi provided a rigorous review framework identifying five verification principles:

P1 (Indirect-entry completeness): pg_wchar_table[] confirmed as the only callback registry into the EUC-KR slots.
P2 (Direct-call review): Exhaustive enumeration of callers of pg_mblen/pg_dsplen/pg_encoding_mblen etc. across varlena.c, oracle_compat.c, formatting.c, varchar.c, like.c, xml.c, jsonfuncs.c, and tsearch modules.
P3 (Verify-bypass): The most architecturally significant concern — bytes reaching mblen/dsplen without having been validated:
- Call-time: copyto.c file_encoding scan, cstring pg_mblen callers in formatting.c/varlena.c.
- Load-time: pg_upgrade relfilenode preservation (bytes copied as-is), physical replication/WAL replay, and pre-patch on-disk data from earlier versions.
P4 (Cross-encoding contamination): Confirming no Japanese-shaped assumption can reach EUC-KR data after the split.
P5 (User-visible width): pg_encoding_max_length change from 3→2 affecting buffer-sizing consumers.

Both SungJun and Henson agreed to work through this list gradually as follow-up rather than blocking the patch, which is pragmatic given that the patch introduces no behavior change for valid data and the verify-bypass paths represent pre-existing data integrity concerns orthogonal to this fix.

EUC-CN Parallel (Noted but Deferred)

Henson identified that EUC-CN (GB 2312) has the exact same structural problem — pg_euccn_* carries SS2/SS3 branches and maxmblen=3 despite CS2/CS3 being undefined. He traced this to the deliberate minimal-fix approach in commit af79c30dc3e ("Fix encoding length for EUC_CN", CVE-2026-2006), which was designed to be back-patchable. The commit message apparently left the door open for a harmonizing cleanup on master. Henson diplomatically suggested this as a natural follow-up for contributors closer to the Chinese encoding ecosystem — a thoughtful nod to the project's collaborative norms around domain expertise and encoding-specific test data.

Standards Verification

A notable aspect of this thread is the care taken to ground the change in normative standards text. SungJun provided a direct URL to the KS X 2901 digital standard on standard.go.kr, navigational instructions to the relevant clause, a verbatim transcription of the code set table, and both Korean original and English translation of the normative text. Henson independently verified the Korean text. This level of standards citation is exemplary for encoding-related patches where implementors may not read the source language.

Assessment

This is a clean, well-motivated correctness patch that aligns PostgreSQL's EUC-KR encoding layer with both the governing standard (KS X 2901) and its own internal verifier. The risk is low — the only behavioral change for valid data is a metadata value (pg_encoding_max_length), and the verify-bypass surface for invalid data is a pre-existing condition. The patch has received substantive review from a domain-knowledgeable reviewer (Henson Choi) and process guidance from a committer (Michael Paquier). The outstanding action item is the 0001/0002 split with regression tests, after which the patch appears ready for committer consideration.

Remove invalid SS2/SS3 handling from EUC-KR routines

Latest Update