Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

First seen: 2026-05-02 02:31:12+00:00 · Messages: 12 · Participants: 4

Latest Update

2026-05-14 · claude-opus-4-6

What's New in This Round

This round contains only two messages. Ishii added pgsql-hackers to the CC, a procedural step with no technical content. chenloveit's reply introduces a new architectural argument against the GUC approach, notable because it comes from the OP himself and reverses his own earlier design.

chenloveit abandons his own GUC prototype in favor of encoding variants

chenloveit now argues against the GUC-based encoding_validation parameter he himself implemented in the GitHub prototype, citing a concrete pg_dumpall failure scenario:

  1. Database populated under encoding_validation = 'native' (permissive)
  2. Cluster dumped via pg_dumpall
  3. New cluster initialized with encoding_validation = 'read_compatible' (strict)
  4. Restore fails because previously-accepted bytes are now rejected

This is the standard "dump/restore asymmetry" argument that kills most GUC-gated strictness proposals in PostgreSQL. It's the same class of problem that affects standard_conforming_strings transitions and similar behavioral GUCs.
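
The failure sequence above can be reproduced in a small model. This is only a sketch under stated assumptions: the thread does not spell out the exact semantics of the two modes, so here 'native' is assumed to check byte structure only and 'read_compatible' is assumed to additionally require that the bytes convert to Unicode. The function names are hypothetical, and Python's gb2312 codec stands in for PostgreSQL's EUC_CN conversion.

```python
# Toy model of the dump/restore asymmetry; not PostgreSQL code.
# Assumption: 'native' = structural check, 'read_compatible' = must also convert.

def valid_native(data: bytes) -> bool:
    """Accept ASCII bytes and two-byte pairs with both bytes in 0xA1..0xFE."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
            continue
        if i + 1 >= len(data):
            return False  # truncated two-byte character
        if not (0xA1 <= data[i] <= 0xFE and 0xA1 <= data[i + 1] <= 0xFE):
            return False
        i += 2
    return True


def valid_read_compatible(data: bytes) -> bool:
    """Structural check plus: the bytes must decode via the gb2312 codec."""
    if not valid_native(data):
        return False
    try:
        data.decode("gb2312")
    except UnicodeDecodeError:
        return False
    return True


VALIDATORS = {"native": valid_native, "read_compatible": valid_read_compatible}


def load_rows(rows, encoding_validation):
    """Simulate loading rows into a cluster running under the given setting."""
    check = VALIDATORS[encoding_validation]
    for row in rows:
        if not check(row):
            raise ValueError(f"invalid byte sequence: {row!r}")
    return list(rows)


good = "\u4e2d\u6587".encode("gb2312")  # two assigned GB2312 characters
odd = b"\xaa\xa1"                       # well-formed pair in an unassigned row

# Steps 1-2: populate under the permissive setting, then "dump".
dump = load_rows([good, odd], "native")

# Steps 3-4: restore into a cluster using the strict setting.
try:
    load_rows(dump, "read_compatible")
    restore_ok = True
except ValueError:
    restore_ok = False

print("restore succeeded:", restore_ok)
```

The point of the model is that the dump carries the data but not the validation setting it was accepted under, so the outcome of the restore depends on configuration that lives outside the dump.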

New proposal: encoding variants rather than configuration

chenloveit's alternative is architecturally distinct: rather than a runtime switch, register new encoding identifiers (e.g., a strict EUC_CN_STRICT or similar variant) as first-class encodings in pg_encoding.

This directly addresses Ishii's objection about per-encoding granularity and the dump/restore problem in one stroke, at the cost of encoding-namespace proliferation and the need to wire up new pg_wchar_tbl entries, conversion procs, etc.
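
A rough model of why tying strictness to the encoding identifier sidesteps the asymmetry is sketched below. Everything in it is illustrative: the Encoding and Database classes, the registry, and the dump format are hypothetical and do not correspond to PostgreSQL internals. EUC_CN_STRICT is the example variant name from the thread, and Python's gb2312 codec is again assumed as a stand-in for the real conversion.

```python
# Toy model of encoding variants; not PostgreSQL code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def structurally_valid(data: bytes) -> bool:
    """Permissive check: ASCII, or two-byte pairs with bytes in 0xA1..0xFE."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
        elif (i + 1 < len(data)
              and 0xA1 <= data[i] <= 0xFE
              and 0xA1 <= data[i + 1] <= 0xFE):
            i += 2
        else:
            return False
    return True


def convertible(data: bytes) -> bool:
    """Strict check: structurally valid and decodable via the gb2312 codec."""
    if not structurally_valid(data):
        return False
    try:
        data.decode("gb2312")
    except UnicodeDecodeError:
        return False
    return True


@dataclass(frozen=True)
class Encoding:
    name: str
    verify: Callable[[bytes], bool]


# Hypothetical registry: strictness is a property of the encoding name itself.
ENCODINGS: Dict[str, Encoding] = {
    "EUC_CN": Encoding("EUC_CN", structurally_valid),
    "EUC_CN_STRICT": Encoding("EUC_CN_STRICT", convertible),
}


@dataclass
class Database:
    encoding: str
    rows: List[bytes] = field(default_factory=list)

    def insert(self, data: bytes) -> None:
        if not ENCODINGS[self.encoding].verify(data):
            raise ValueError(f"invalid byte sequence for {self.encoding}")
        self.rows.append(data)


def dump(db: Database) -> dict:
    """The dump records the encoding name alongside the data."""
    return {"encoding": db.encoding, "rows": list(db.rows)}


def restore(dumped: dict) -> Database:
    """Restore re-validates under the same encoding the data was written with."""
    db = Database(dumped["encoding"])
    for row in dumped["rows"]:
        db.insert(row)
    return db


odd = b"\xaa\xa1"  # well-formed pair in an unassigned GB2312 row

legacy = Database("EUC_CN")
legacy.insert(odd)                # accepted by the permissive encoding
restored = restore(dump(legacy))  # round-trips: same rules on both sides

strict = Database("EUC_CN_STRICT")
try:
    strict.insert(odd)
    rejected_at_insert = False
except ValueError:
    rejected_at_insert = True     # bad bytes never reach a strict dump
```

In this model the validation rules travel with the dump because they are implied by the encoding name, and each database picks its own variant, which is the per-encoding granularity the GUC lacked.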

Significance

This is a meaningful design evolution: the thread has now cycled through three mechanism proposals (documentation-only → global GUC → encoding variants), each addressing objections raised against the prior one. The encoding-variant approach is the first to address both Ishii's per-encoding granularity objection and the dump/restore problem at once.

However, it introduces its own complications: encoding proliferation, whether ICU/libc locale combinations work with variant names, and how pg_upgrade handles databases that want to switch. No committer has yet reacted to this proposal.