Remove invalid SS2/SS3 handling from EUC-KR routines

First seen: 2026-05-12 06:09:49+00:00 · Messages: 7 · Participants: 3

Latest Update

2026-05-14 · claude-opus-4-6

Remove Invalid SS2/SS3 Handling from EUC-KR Routines

Core Problem

PostgreSQL's EUC-KR encoding implementation carries vestigial SS2 (0x8E) and SS3 (0x8F) single-shift byte handling inherited from shared pg_euc_* helper functions that were originally designed for EUC-JP and EUC-TW — encodings that legitimately designate G2 and G3 character sets. EUC-KR, as defined by KS X 2901 (formerly KS C 5861-1992), designates only G0 (ASCII via KS X 1003) and G1 (KS X 1001 Korean characters). Code sets 2 and 3 are explicitly "undefined" per the standard, with normative text stating they "may be defined and used in the future if necessary."

This creates an internal inconsistency within PostgreSQL's own encoding layer:

  1. pg_euckr_verifychar() (the validation function) already correctly rejects SS2/SS3 — it has no branch for 0x8E/0x8F and only accepts 0x00–0x7F (G0) and 0xA1–0xFE lead bytes (G1).
  2. pg_euckr_mblen(), pg_euckr_dsplen(), and pg_euckr2wchar_with_len() delegate to the shared pg_euc_* helpers which do handle SS2/SS3, reporting multi-byte lengths (2 or 3) for bytes that the verifier rejects.
  3. pg_wchar_table[PG_EUC_KR].maxmblen is set to 3, implying 3-byte sequences exist.
  4. Documentation Table 23.3 lists EUC_KR as "1–3" bytes per character.

The verifier and the length/display functions tell contradictory stories. For valid EUC-KR data this is harmless — SS2/SS3 bytes never pass verification, so the SS2/SS3 branches in mblen/dsplen are dead code for verified input. However, the inconsistency matters in two dimensions:

Architectural Context

PostgreSQL's multi-byte character support is organized around pg_wchar_table[] in src/common/wchar.c, a dispatch table indexed by encoding ID. Each encoding registers callbacks for:

The EUC family (EUC-JP, EUC-CN, EUC-KR, EUC-TW) historically shared helper functions (pg_euc_mblen, pg_euc_dsplen, etc.) that implement the full EUC generalization including SS2/SS3 handling. This was a reasonable code-sharing decision when these encodings were first added, but it bakes Japanese/Taiwanese assumptions (JIS X 0201 half-width katakana via SS2, JIS X 0212 via SS3) into encodings where those constructs don't apply.

The patch breaks EUC-KR out of this shared delegation by providing standalone implementations that recognize only G0 (1 byte, ASCII) and G1 (2 bytes, high-bit-set lead + trail). This mirrors the structure already used by UHC (Unified Hangul Code), which is also a 1–2 byte Korean encoding with maxmblen=2 — a consistency win noted by the reviewer.

Proposed Solution

The patch (evolving through v1→v2→v3) makes three surgical changes:

  1. Replace delegating functions: pg_euckr_mblen(), pg_euckr_dsplen(), and pg_euckr2wchar_with_len() get EUC-KR-specific implementations that handle only G0 (1-byte ASCII) and G1 (2-byte KS X 1001). No SS2/SS3 branches.
  2. Set maxmblen to 2 in pg_wchar_table[PG_EUC_KR], down from 3.
  3. Fix documentation Table 23.3 from "1–3" to "1–2" in charset.sgml.

The only user-visible behavioral change for valid data is pg_encoding_max_length('EUC_KR') returning 2 instead of 3.

Patch Evolution

A planned but not-yet-submitted restructuring will split into:

Key Technical Debates and Design Decisions

Regression Test Coverage (Michael Paquier's concern)

Michael Paquier flagged that the patch had zero regression tests and pointed to src/test/locale/ as potentially impacted infrastructure. SungJun correctly argued that src/test/locale/ exercises collation/ctype (locale-dependent sort order and isalpha-family functions), not the encoding layer (maxmblen, mblen, dsplen). He further noted the practical limitation that src/test/locale/ uses a 1998-era shell harness not integrated into check-world or the buildfarm — tests added there provide no ongoing coverage.

The proposed 0001/0002 split is an elegant response: characterization-first testing makes the behavior delta precisely one line of diff, which is both a regression test and an audit artifact.

Side-Effect Surface Analysis (Henson Choi's P1–P5 framework)

Henson Choi provided a rigorous review framework identifying five verification principles:

Both SungJun and Henson agreed to work through this list gradually as follow-up rather than blocking the patch, which is pragmatic given that the patch introduces no behavior change for valid data and the verify-bypass paths represent pre-existing data integrity concerns orthogonal to this fix.

EUC-CN Parallel (Noted but Deferred)

Henson identified that EUC-CN (GB 2312) has the exact same structural problem — pg_euccn_* carries SS2/SS3 branches and maxmblen=3 despite CS2/CS3 being undefined. He traced this to the deliberate minimal-fix approach in commit af79c30dc3e ("Fix encoding length for EUC_CN", CVE-2026-2006), which was designed to be back-patchable. The commit message apparently left the door open for a harmonizing cleanup on master. Henson diplomatically suggested this as a natural follow-up for contributors closer to the Chinese encoding ecosystem — a thoughtful nod to the project's collaborative norms around domain expertise and encoding-specific test data.

Standards Verification

A notable aspect of this thread is the care taken to ground the change in normative standards text. SungJun provided a direct URL to the KS X 2901 digital standard on standard.go.kr, navigational instructions to the relevant clause, a verbatim transcription of the code set table, and both Korean original and English translation of the normative text. Henson independently verified the Korean text. This level of standards citation is exemplary for encoding-related patches where implementors may not read the source language.

Assessment

This is a clean, well-motivated correctness patch that aligns PostgreSQL's EUC-KR encoding layer with both the governing standard (KS X 2901) and its own internal verifier. The risk is low — the only behavioral change for valid data is a metadata value (pg_encoding_max_length), and the verify-bypass surface for invalid data is a pre-existing condition. The patch has received substantive review from a domain-knowledgeable reviewer (Henson Choi) and process guidance from a committer (Michael Paquier). The outstanding action item is the 0001/0002 split with regression tests, after which the patch appears ready for committer consideration.