Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

2026-05-06 · opus 4.7

Core Problem: Structural vs. Semantic Validation Mismatch in Legacy Encodings

This thread surfaces a long-standing asymmetry in PostgreSQL's handling of legacy multibyte encodings (specifically EUC_CN / GB2312, but the issue generalizes to other EUC_* family encodings): input-time validation is purely structural (byte-pattern based), whereas conversion to UTF-8 uses a stricter mapping-table check. The consequence is that byte sequences such as 0xA2A3 — which satisfy the "lead byte in 0xA1–0xFE, trail byte in 0xA1–0xFE" structural rule — can be successfully stored in text columns, silently round-trip when client_encoding = EUC_CN, and then fail at SELECT time the moment a client connects with client_encoding = UTF8, because euc_cn_to_utf8 has no mapping entry for that codepoint.

Architecturally this is a violation of a user's reasonable expectation that data stored in a text column is always retrievable in any compatible client encoding. PostgreSQL's encoding contract is effectively "we store valid sequences of the server encoding", but "valid" here is defined much more loosely than the encoding standard itself defines it. The database becomes a store of bytes that look like EUC_CN rather than characters in EUC_CN.

Where the asymmetry lives in the code

The relevant code paths are:

src/backend/utils/mb/wchar.c — pg_euccn_verifychar() / the generic EUC verifiers, which only check that each byte of a multibyte sequence falls in the high-bit range. They do not consult the conversion tables.
src/backend/utils/mb/conversion_procs/euc_cn_and_mic/euc_cn_and_mic.c and the generated map utf8_to_euc_cn.map / euc_cn_to_utf8.map — these are strict: unmapped code points raise ERROR: character ... has no equivalent in UTF8.

The verifier is invoked on every COPY FROM row and every client message (pg_verify_mbstr), so it is on a hot path; the conversion is only invoked when a client actually requests a different encoding. This is precisely why the historical design chose cheap structural verification.

Proposed Solutions and Their Tradeoffs

Three possible responses are on the table:

Document the current behavior explicitly. State in the docs that input validation for legacy encodings is structural, not mapping-based, and that such data may be unretrievable under a different client_encoding. Lowest cost, no behavior change. David Johnston favors at least this.
Tighten validation unconditionally — make pg_euccn_verifychar() (and siblings) consult the actual assigned code-point ranges, matching what MySQL's func_gb2312_uni_onechar does (three contiguous range lookups into tables tab_gb2312_uni{0,1,2}). This closes the hole at input time but imposes a runtime cost on every row ingested, and it is a backwards-incompatible change: data that currently loads will start being rejected.
Opt-in strict validation — expose a function (or a GUC / per-encoding flag) users can invoke in triggers or CHECK constraints to enforce strictness where they care. This is David's preferred middle path: the core stays fast and compatible, the foot-gun is avoidable for users who know they need UTF-8 round-tripping.

The cost argument, re-examined by benchmark

The most technically interesting contribution is OP's benchmark (May 6), which directly attacks the "runtime cost" rationale for keeping validation purely structural. Measured on a Chinese text corpus:

Strategy	Time
`is_gb2312_simd` (SIMD range check)	181 µs
`is_gb2312_ranges_pg` (current PG-style structural)	467 µs
`is_gb2312_lookup` (table lookup, strict)	490 µs
`is_gb2312_ranges_mysql` (MySQL-style strict range)	1267 µs
`is_gb2312_rs` (Rust encoding_rs)	2823 ms-scale figure — note units differ
`iconv`	2564 µs-scale
`icu`	3258 µs-scale

The headline result: a lookup-table strict validator runs at roughly the same cost as PostgreSQL's current structural verifier (490 µs vs 467 µs — within ~5%), and a SIMD-accelerated strict validator is ~2.5× faster than the status quo. This significantly undercuts the "we can't afford strict validation" argument. The MySQL-style nested-range approach is genuinely slower (~2.7×), but that's an implementation choice, not a fundamental constraint.

The implication is that a strict validator could be introduced without measurable ingestion regression, provided it uses a lookup-table or SIMD representation rather than the naive range-cascade MySQL uses.

Key Technical Insights

Why structural-only verification was chosen historically

PostgreSQL's MB infrastructure predates widespread UTF-8 dominance, and the verifier was designed to be a cheap gate against obvious corruption (stray high-bit bytes, truncated sequences), not a full encoding validator. The conversion tables were added later and independently, so the two notions of "valid" drifted apart. No one deliberately decided "accept unassigned code points"; it's an emergent property.

The real failure mode is delayed

The bug is especially nasty because the data ingests, indexes, replicates, and dumps successfully. It only breaks when:

A client reconnects with client_encoding = UTF8
A pg_dump with UTF-8 output is attempted
A logical replication subscriber with a different encoding consumes the stream

This means hardening is not merely cosmetic — it prevents latent corruption that manifests at migration or disaster-recovery time, which is architecturally the worst moment for it to appear.

Generalization beyond EUC_CN

EUC_JP, EUC_KR, EUC_TW, BIG5, GB18030, SJIS all have the same structural/semantic gap to varying degrees. Any fix should be designed as a per-encoding hook on the pg_wchar_tbl entry (the mblen/verifier function pointers in src/backend/utils/mb/wchar.c), not as a one-off for GB2312.

Opt-in as a pragmatic path

David Johnston's suggestion of an exposed function (e.g., pg_encoding_verify_strict(text, name)) is attractive because:

It sidesteps the backwards-compatibility problem entirely.
It composes with existing CHECK-constraint and trigger infrastructure.
It lets the benchmark question be decided per-workload rather than globally.

A plausible patch would add pg_mb_verify_strict(bytea, encoding) returning boolean, backed by per-encoding strict verifiers that consult the existing *_to_utf8.map tables (which already exist in the tree — no new data needed).

Participant Weight

chenloveit (OP) — reporter and benchmark author. Not (visibly) a committer; brings empirical data and a concrete reference implementation (the NUAV repo). Technical contribution is substantial: the benchmark effectively rebuts the cost objection.
David G. Johnston — major non-committer contributor, highly active on -hackers, strong voice on documentation and user-facing semantics. His position ("document it, maybe offer opt-in, don't pay the cost by default") represents the likely consensus default; committers would need a strong reason to override it. No committer has weighed in on this thread yet, which is itself a signal that the proposal as framed (tighten by default) is unlikely to land without a more detailed patch and broader buy-in.

What a Credible Patch Would Look Like

Given the thread's trajectory, the patch most likely to be accepted would:

Add a paragraph to doc/src/sgml/charset.sgml documenting the structural-only guarantee.
Add a new SQL-callable function pg_mb_strict_verify(text) (or similar) using lookup tables derived from the existing conversion maps.
Not change the default verifier behavior in any release — or at most gate it behind a GUC like strict_multibyte_validation = off defaulting to off, with a deprecation path.

A patch that simply tightens pg_euccn_verifychar() would almost certainly be rejected on compatibility grounds regardless of the benchmark, because existing databases contain such sequences and pg_upgrade / logical replication would break.