Core Problem: Structural vs. Semantic Validation Mismatch in Legacy Encodings
This thread surfaces a long-standing asymmetry in PostgreSQL's handling of legacy multibyte encodings (specifically EUC_CN / GB2312, but the issue generalizes to other EUC_* family encodings): input-time validation is purely structural (byte-pattern based), whereas conversion to UTF-8 uses a stricter mapping-table check. The consequence is that byte sequences such as 0xA2A3 — which satisfy the "lead byte in 0xA1–0xFE, trail byte in 0xA1–0xFE" structural rule — can be successfully stored in text columns, silently round-trip when client_encoding = EUC_CN, and then fail at SELECT time the moment a client connects with client_encoding = UTF8, because euc_cn_to_utf8 has no mapping entry for that codepoint.
Architecturally this is a violation of a user's reasonable expectation that data stored in a text column is always retrievable in any compatible client encoding. PostgreSQL's encoding contract is effectively "we store valid sequences of the server encoding", but "valid" here is defined much more loosely than the encoding standard itself defines it. The database becomes a store of bytes that look like EUC_CN rather than characters in EUC_CN.
Where the asymmetry lives in the code
The relevant code paths are:
- `src/backend/utils/mb/wchar.c` — `pg_euccn_verifychar()` and the generic EUC verifiers, which only check that each byte of a multibyte sequence falls in the high-bit range. They do not consult the conversion tables.
- `src/backend/utils/mb/conversion_procs/euc_cn_and_mic/euc_cn_and_mic.c` and the generated maps `utf8_to_euc_cn.map` / `euc_cn_to_utf8.map` — these are strict: unmapped code points raise `ERROR: character ... has no equivalent in UTF8`.
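To make the structural rule concrete, here is a minimal C sketch — illustrative only, not the actual `pg_euccn_verifychar()` source:

```c
/*
 * Minimal illustrative sketch (not the actual PostgreSQL source) of the
 * structural rule the EUC_CN verifier enforces: a high-bit lead byte must
 * be followed by a high-bit trail byte, both in 0xA1-0xFE.  Nothing here
 * consults the set of assigned GB2312 code points, so an unassigned pair
 * such as 0xA2 0xA3 passes.
 */
#include <stdbool.h>

static bool
euccn_structurally_valid_char(const unsigned char *s, int len)
{
    if (len >= 1 && s[0] < 0x80)
        return true;                    /* plain ASCII byte */
    if (len < 2)
        return false;                   /* truncated multibyte sequence */
    return s[0] >= 0xA1 && s[0] <= 0xFE &&
           s[1] >= 0xA1 && s[1] <= 0xFE;
}
```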
The verifier is invoked on every COPY FROM row and every client message (pg_verify_mbstr), so it is on a hot path; the conversion is only invoked when a client actually requests a different encoding. This is precisely why the historical design chose cheap structural verification.
Proposed Solutions and Their Tradeoffs
Three possible responses are on the table:
- Document the current behavior explicitly. State in the docs that input validation for legacy encodings is structural, not mapping-based, and that such data may be unretrievable under a different `client_encoding`. Lowest cost, no behavior change. David Johnston favors at least this.
- Tighten validation unconditionally — make `pg_euccn_verifychar()` (and siblings) consult the actual assigned code-point ranges, matching what MySQL's `func_gb2312_uni_onechar` does (three contiguous range lookups into tables `tab_gb2312_uni{0,1,2}`; a simplified sketch follows this list). This closes the hole at input time but imposes a runtime cost on every row ingested, and it is a backwards-incompatible change: data that currently loads will start being rejected.
- Opt-in strict validation — expose a function (or a GUC / per-encoding flag) users can invoke in triggers or CHECK constraints to enforce strictness where they care. This is David's preferred middle path: the core stays fast and compatible, the foot-gun is avoidable for users who know they need UTF-8 round-tripping.
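For concreteness, here is a deliberately simplified sketch of the range-cascade style of check the second option refers to. The band boundaries below come from the GB2312 row layout, not from MySQL's source, and the per-row holes that MySQL resolves with its `tab_gb2312_uni{0,1,2}` tables are ignored here — which is exactly why a real implementation needs the tables:

```c
/*
 * Simplified range-cascade sketch in the style of MySQL's
 * func_gb2312_uni_onechar.  Boundaries are the GB2312 row bands
 * (rows 1-9 symbols, rows 16-87 hanzi); holes inside the symbol rows are
 * NOT handled here -- e.g. 0xA2 0xA3 still passes -- which is why MySQL
 * backs each band with a lookup table instead of bare range tests.
 */
#include <stdbool.h>

static bool
gb2312_in_assigned_band(unsigned char lead, unsigned char trail)
{
    if (trail < 0xA1 || trail > 0xFE)
        return false;                   /* trail byte out of range */
    if (lead >= 0xA1 && lead <= 0xA9)
        return true;                    /* symbol rows 1-9 (contain holes) */
    if (lead >= 0xB0 && lead <= 0xF7)
        return true;                    /* hanzi rows 16-87 */
    return false;                       /* rows 10-15 and 88-94 unassigned */
}
```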
The cost argument, re-examined by benchmark
The most technically interesting contribution is OP's benchmark (May 6), which directly attacks the "runtime cost" rationale for keeping validation purely structural. Measured on a Chinese text corpus:
| Strategy | Time |
|---|---|
| `is_gb2312_simd` (SIMD range check) | 181 µs |
| `is_gb2312_ranges_pg` (current PG-style structural) | 467 µs |
| `is_gb2312_lookup` (table lookup, strict) | 490 µs |
| `is_gb2312_ranges_mysql` (MySQL-style strict range) | 1267 µs |
| `is_gb2312_rs` (Rust encoding_rs) | 2823 ms (note: reported unit differs from the other rows) |
| `iconv` | 2564 µs |
| `icu` | 3258 µs |
The headline result: a lookup-table strict validator runs at roughly the same cost as PostgreSQL's current structural verifier (490 µs vs 467 µs — within ~5%), and a SIMD-accelerated strict validator is ~2.5× faster than the status quo. This significantly undercuts the "we can't afford strict validation" argument. The MySQL-style nested-range approach is genuinely slower (~2.7×), but that's an implementation choice, not a fundamental constraint.
The implication is that a strict validator could be introduced without measurable ingestion regression, provided it uses a lookup-table or SIMD representation rather than the naive range-cascade MySQL uses.
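A sketch of what such a lookup-table validator could look like, assuming a hypothetical 8 KB bitmap generated offline from `euc_cn_to_utf8.map`, one bit per possible two-byte value; all names here are illustrative:

```c
/*
 * Lookup-table strict validator sketch (the strategy the benchmark calls
 * is_gb2312_lookup).  Assumes gb2312_assigned[] is a hypothetical 8 KB
 * bitmap generated offline from euc_cn_to_utf8.map: bit (lead<<8 | trail)
 * is set iff that byte pair maps to a character.  Per character this costs
 * a shift, a mask, and one L1-resident load, consistent with it
 * benchmarking within ~5% of the purely structural check.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

extern const uint8_t gb2312_assigned[8192]; /* hypothetical generated table */

bool
euccn_strict_verify_mb(const unsigned char *s, size_t len)
{
    size_t  i = 0;

    while (i < len)
    {
        if (s[i] < 0x80)
        {
            i++;                        /* ASCII byte */
            continue;
        }
        if (i + 1 >= len)
            return false;               /* truncated multibyte sequence */

        {
            unsigned    cp = ((unsigned) s[i] << 8) | s[i + 1];

            if (!(gb2312_assigned[cp >> 3] & (1u << (cp & 7))))
                return false;           /* unassigned or structurally bad */
        }
        i += 2;
    }
    return true;
}
```

Note that the bitmap subsumes the structural check for free: bits for structurally invalid pairs are simply never set, so no separate range test is needed.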
Key Technical Insights
Why structural-only verification was chosen historically
PostgreSQL's MB infrastructure predates widespread UTF-8 dominance, and the verifier was designed to be a cheap gate against obvious corruption (stray high-bit bytes, truncated sequences), not a full encoding validator. The conversion tables were added later and independently, so the two notions of "valid" drifted apart. No one deliberately decided "accept unassigned code points"; it's an emergent property.
The real failure mode is delayed
The bug is especially nasty because the data ingests, indexes, replicates, and dumps successfully. It only breaks when:
- A client reconnects with `client_encoding = UTF8`
- A `pg_dump` with UTF-8 output is attempted
- A logical replication subscriber with a different encoding consumes the stream
This means hardening is not merely cosmetic — it prevents latent corruption that manifests at migration or disaster-recovery time, which is architecturally the worst moment for it to appear.
Generalization beyond EUC_CN
EUC_JP, EUC_KR, EUC_TW, BIG5, GB18030, SJIS all have the same structural/semantic gap to varying degrees. Any fix should be designed as a per-encoding hook on the pg_wchar_tbl entry (the mblen/verifier function pointers in src/backend/utils/mb/wchar.c), not as a one-off for GB2312.
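To illustrate, a hypothetical extension of the dispatch entry — the `verifychar_strict` slot and all names below are invented for this sketch, not existing PostgreSQL API:

```c
/*
 * Hypothetical shape of a per-encoding strict hook, sketched against the
 * real pg_wchar_tbl dispatch table (src/include/mb/pg_wchar.h), which
 * carries per-encoding mblen/dsplen/verifier function pointers.  The
 * verifychar_strict slot is an invented addition for illustration.
 */
typedef int (*mbchar_verifier_sketch) (const unsigned char *s, int len);

typedef struct
{
    mbchar_verifier_sketch verifychar;        /* today's structural check */
    mbchar_verifier_sketch verifychar_strict; /* NEW: mapping-aware check;
                                               * NULL where the encoding has
                                               * no strict table */
    int         maxmblen;
    /* ... remaining existing members elided ... */
} pg_wchar_tbl_sketch;
```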
Opt-in as a pragmatic path
David Johnston's suggestion of an exposed function (e.g., pg_encoding_verify_strict(text, name)) is attractive because:
- It sidesteps the backwards-compatibility problem entirely.
- It composes with existing CHECK-constraint and trigger infrastructure.
- It lets the benchmark question be decided per-workload rather than globally.
A plausible patch would add pg_mb_verify_strict(bytea, encoding) returning boolean, backed by per-encoding strict verifiers that consult the existing *_to_utf8.map tables (which already exist in the tree — no new data needed).
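A hedged sketch of the fmgr glue such a patch might carry. The C symbol, the one-argument SQL signature, and the hardwired EUC_CN helper are all assumptions (a real patch would dispatch on the encoding argument); `euccn_strict_verify_mb()` is the bitmap validator sketched above:

```c
/*
 * Hypothetical SQL-callable wrapper; none of these symbols exist in
 * PostgreSQL today.  Hardwires EUC_CN for brevity -- a real patch would
 * look up the strict verifier for the requested encoding.
 */
#include "postgres.h"
#include "fmgr.h"
#include "varatt.h"     /* VARDATA_ANY et al. (in postgres.h before PG 16) */

extern bool euccn_strict_verify_mb(const unsigned char *s, size_t len);

PG_FUNCTION_INFO_V1(pg_mb_verify_strict);

Datum
pg_mb_verify_strict(PG_FUNCTION_ARGS)
{
    bytea  *raw = PG_GETARG_BYTEA_PP(0);

    PG_RETURN_BOOL(euccn_strict_verify_mb(
        (const unsigned char *) VARDATA_ANY(raw),
        (size_t) VARSIZE_ANY_EXHDR(raw)));
}
```

Exposed this way it composes directly with the CHECK-constraint and trigger paths David describes — something like `CHECK (pg_mb_verify_strict(convert_to(col, 'EUC_CN')))`.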
Participant Weight
- chenloveit (OP) — reporter and benchmark author. Not (visibly) a committer; brings empirical data and a concrete reference implementation (the NUAV repo). Technical contribution is substantial: the benchmark effectively rebuts the cost objection.
- David G. Johnston — major non-committer contributor, highly active on -hackers, strong voice on documentation and user-facing semantics. His position ("document it, maybe offer opt-in, don't pay the cost by default") represents the likely consensus default; committers would need a strong reason to override it. No committer has weighed in on this thread yet, which is itself a signal that the proposal as framed (tighten by default) is unlikely to land without a more detailed patch and broader buy-in.
What a Credible Patch Would Look Like
Given the thread's trajectory, the patch most likely to be accepted would:
- Add a paragraph to `doc/src/sgml/charset.sgml` documenting the structural-only guarantee.
- Add a new SQL-callable function `pg_mb_verify_strict(text)` (or similar) using lookup tables derived from the existing conversion maps.
- Not change the default verifier behavior in any release — or at most gate it behind a GUC like `strict_multibyte_validation`, defaulting to `off`, with a deprecation path.
A patch that simply tightens pg_euccn_verifychar() would almost certainly be rejected on compatibility grounds regardless of the benchmark, because existing databases contain such sequences and pg_upgrade / logical replication would break.