Analysis: ENCODING option handling for COPY TO FORMAT JSON
Core Problem
Commit 7dadd38cda9 added JSON output format support to COPY TO in the v19 development cycle. The implementation bypassed the normal per-attribute encoding pipeline: in CopyToJsonOneRow(), the row is serialized as a whole via composite_to_json() and handed directly to CopySendData(). The text and CSV paths, by contrast, funnel per-attribute output through pg_server_to_any() when need_transcoding is set.
This produces two user-visible defects:
- Explicit ENCODING is silently ignored. COPY t TO '/tmp/out.json' WITH (FORMAT json, ENCODING 'LATIN1') on a UTF-8 server writes UTF-8 bytes to the file. The user asked for LATIN1 and got something else, with no error and no warning.
- Implicit client_encoding mismatch for COPY TO STDOUT. Since file_encoding defaults to client_encoding when unspecified, a client whose client_encoding differs from the server encoding also receives unconverted server-encoded bytes. This is not merely a correctness issue but a security hazard: sending un-transcoded bytes across an encoding boundary can produce sequences that the client's decoder interprets unexpectedly (classic encoding-confusion attack surface, the same class of issue that motivates strict transcoding at protocol boundaries).
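The failure mode is easy to illustrate outside PostgreSQL. A minimal Python sketch (not PostgreSQL code) of what the unfixed behavior delivers: the server emits UTF-8 bytes, but the consumer decodes them as LATIN1 because that is what it asked for.

```python
# The server writes UTF-8 bytes regardless of the requested ENCODING.
server_bytes = '{"name": "café"}'.encode('utf-8')

# The user's tooling, trusting ENCODING 'LATIN1', decodes as LATIN1.
as_latin1 = server_bytes.decode('latin-1')

print(as_latin1)  # mojibake instead of an error: {"name": "cafÃ©"}
```

The 0xC3 0xA9 UTF-8 pair for é decodes as two LATIN1 characters, so the corruption is silent rather than a hard failure, which is exactly why the bug went unnoticed.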
Design Tension: Spec Compliance vs. PostgreSQL's Encoding Model
RFC 8259 mandates JSON text be UTF-8. A defensible reading is therefore "COPY TO JSON should always emit UTF-8, regardless of server or client encoding, and ENCODING should be rejected." That was the initial proposal.
Tom Lane rejected this framing on two grounds that carry decisive architectural weight:
- The implementation does not actually enforce UTF-8 today. composite_to_json() returns text in the server encoding. If the server is SQL_ASCII or LATIN1 or EUC_JP, the JSON bytes emitted are in that encoding, not UTF-8. Claiming RFC compliance as justification for rejecting ENCODING would be fiction: the feature is already non-compliant in the general case, and rejecting the option does nothing to fix that.
- Backwards-compatibility pressure after v19 ships. Once the feature is released with a particular behavior, even a buggy one, the project accumulates users who depend on it. Deferring the design question ("should JSON always be UTF-8?") past the v19 release would lock in the current broken behavior. This is the standard "open item" logic: fix it before the release locks the ABI/UX.
The stronger, consistent solution is therefore to make JSON behave like text/CSV: run the fully-formed JSON buffer through pg_server_to_any() before sending, honoring both explicit ENCODING and the implicit client_encoding path.
Patch Evolution
- v1/v2: Added JSON to the list of formats that reject ENCODING at option-parsing time in ProcessCopyOptions() (the same pattern as the existing rejections of DELIMITER, NULL, DEFAULT, and HEADER for JSON). Minimal and defensive, but it punts the real problem.
- Final direction: Insert a pg_server_to_any() call on the JSON buffer produced by composite_to_json() in the CopyToJsonOneRow() path, gated by need_transcoding. This makes ENCODING functional rather than rejected, and transparently fixes the STDOUT/client_encoding case because need_transcoding is already set correctly by BeginCopyTo() based on the resolved file_encoding vs. the server encoding.
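The final direction can be modeled in a few lines. This is a Python sketch, not the actual C patch; the function names mirror the PostgreSQL symbols for readability but the bodies are stand-ins.

```python
def pg_server_to_any(data: bytes, src: str, dest: str) -> bytes:
    """Stand-in for PostgreSQL's pg_server_to_any(): transcode bytes
    from the server encoding to the requested file encoding."""
    return data.decode(src).encode(dest)

def copy_to_json_one_row(row_json: bytes, server_enc: str,
                         file_enc: str) -> bytes:
    # need_transcoding is resolved once up front, analogous to what
    # BeginCopyTo() derives from the resolved file_encoding.
    need_transcoding = (server_enc != file_enc)
    if need_transcoding:
        row_json = pg_server_to_any(row_json, server_enc, file_enc)
    return row_json  # then handed to CopySendData()

out = copy_to_json_one_row('{"c": "é"}'.encode('utf-8'),
                           'utf-8', 'latin-1')
# LATIN1 output carries 0xE9 for é, not the UTF-8 pair 0xC3 0xA9.
```

When server and file encodings match, the buffer passes through untouched, so the common case pays nothing, which matches how the text/CSV paths already behave.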
The conversion happens on the whole JSON row rather than per-attribute because composite_to_json() is atomic — it builds the object including braces, commas, and key quoting in server encoding, and only the final bytes need transcoding. This is safe because JSON's structural characters ({},:") are ASCII and thus invariant across all PostgreSQL-supported server encodings (all are ASCII-supersets).
Testing Approach
Andrew Dunstan's revision replaced a round-trip text comparison with pg_read_binary_file() to assert on raw bytes. This matters because:
- A round-trip test via COPY FROM could mask the bug if both directions use the same (wrong) encoding.
- Binary-level assertions directly verify that the LATIN1-encoded accented character (e.g., 0xE9 for é) is present and the UTF-8 sequence (0xC3 0xA9) is absent.
- An added case flipping client_encoding exercises the implicit-transcoding path for STDOUT, which is the security-relevant scenario.
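The byte-level assertions are easy to model outside PostgreSQL. In this sketch, json.dumps with ensure_ascii=False plays the role of composite_to_json() output; the assertions mirror the shape of the pg_read_binary_file() checks, not their exact SQL.

```python
import json

row = json.dumps({"name": "é"}, ensure_ascii=False)  # server-side JSON text
correct = row.encode('latin-1')  # what a working ENCODING 'LATIN1' emits
buggy = row.encode('utf-8')      # what the unfixed code emits

# LATIN1 byte present, UTF-8 sequence absent: the fix worked.
assert b'\xe9' in correct and b'\xc3\xa9' not in correct
# The unfixed path leaves UTF-8 bytes in a file claimed to be LATIN1.
assert b'\xc3\xa9' in buggy and b'\xe9' not in buggy
print('byte-level assertions distinguish fixed from buggy output')
```

A text-level comparison after decoding would pass for both buffers, which is precisely why the test had to drop to raw bytes.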
Architectural Implications
- JSON format is not encoding-transparent. This patch preserves the status quo that JSON inherits server encoding rather than mandating UTF-8. A future, separate discussion could enforce UTF-8 (perhaps via a new option), but doing so would now require a deprecation path.
- need_transcoding is the right abstraction. By reusing the existing flag set during BeginCopy*, the fix integrates cleanly with COPY's existing encoding state machine rather than introducing a JSON-specific code path.
- Open-items discipline. Tom's insistence on fixing this pre-release rather than deferring it is a concrete example of how the PostgreSQL project treats the user-visible behavior of newly committed features as effectively frozen at release.