Analysis: ENCODING option handling for COPY TO FORMAT JSON
Core Problem
Commit 7dadd38cda9 added JSON output format support to COPY TO in the v19 development cycle. The implementation bypassed the normal per-attribute encoding pipeline: in CopyToJsonOneRow(), the row is serialized as a whole via composite_to_json() and handed directly to CopySendData(). The text and CSV paths, by contrast, funnel per-attribute output through pg_server_to_any() when need_transcoding is set.
This produces two user-visible defects:
- Explicit ENCODING is silently ignored. COPY t TO '/tmp/out.json' WITH (FORMAT json, ENCODING 'LATIN1') on a UTF-8 server writes UTF-8 bytes to the file. The user asked for LATIN1 and got something else, with no error and no warning.
- Implicit client_encoding mismatch for COPY TO STDOUT. Since file_encoding defaults to client_encoding when unspecified, a client whose client_encoding differs from the server encoding also receives unconverted server-encoded bytes. This is not merely a correctness issue but a security hazard: sending un-transcoded bytes across an encoding boundary can produce sequences that the client's decoder interprets unexpectedly (classic encoding-confusion attack surface, the same class of issue that motivates strict transcoding at protocol boundaries).
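The failure mode is easy to illustrate outside PostgreSQL. A minimal Python sketch (not PostgreSQL code) of what the unfixed behavior delivers: the server emits UTF-8 bytes, but the consumer decodes them as LATIN1 because that is what it asked for.

```python
# The server writes UTF-8 bytes regardless of the requested ENCODING.
server_bytes = '{"name": "café"}'.encode('utf-8')

# The user's tooling, trusting ENCODING 'LATIN1', decodes as LATIN1.
as_latin1 = server_bytes.decode('latin-1')

print(as_latin1)  # mojibake instead of an error: {"name": "cafÃ©"}
```

The 0xC3 0xA9 UTF-8 pair for é decodes as two LATIN1 characters, so the corruption is silent rather than a hard failure, which is exactly why the bug went unnoticed.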
Design Tension: Spec Compliance vs. PostgreSQL's Encoding Model
RFC 8259 mandates JSON text be UTF-8. A defensible reading is therefore "COPY TO JSON should always emit UTF-8, regardless of server or client encoding, and ENCODING should be rejected." That was the initial proposal.
Tom Lane rejected this framing on two grounds that carry decisive architectural weight:
- The implementation does not actually enforce UTF-8 today. composite_to_json() returns text in the server encoding. If the server is SQL_ASCII or LATIN1 or EUC_JP, the JSON bytes emitted are in that encoding, not UTF-8. Claiming RFC compliance as justification for rejecting ENCODING would be fiction: the feature is already non-compliant in the general case, and rejecting the option does nothing to fix that.
- Backwards-compatibility pressure after v19 ships. Once the feature is released with a particular behavior, even a buggy one, the project accumulates users who depend on it. Deferring the design question ("should JSON always be UTF-8?") past the v19 release would lock in the current broken behavior. This is the standard "open item" logic: fix it before the release locks the ABI/UX.
The stronger, consistent solution is therefore to make JSON behave like text/CSV: run the fully-formed JSON buffer through pg_server_to_any() before sending, honoring both explicit ENCODING and the implicit client_encoding path.
Patch Evolution
- v1/v2: Added JSON to the list of formats that reject ENCODING at option-parsing time in ProcessCopyOptions() (the same pattern as the existing rejections of DELIMITER, NULL, DEFAULT, and HEADER for JSON). Minimal and defensive, but it punts the real problem.
- Final direction: Insert a pg_server_to_any() call on the JSON buffer produced by composite_to_json() in the CopyToJsonOneRow() path, gated by need_transcoding. This makes ENCODING functional rather than rejected, and transparently fixes the STDOUT/client_encoding case because need_transcoding is already set correctly by BeginCopyTo() based on the resolved file_encoding vs. the server encoding.
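The final direction can be modeled in a few lines. This is a Python sketch, not the actual C patch; the function names mirror the PostgreSQL symbols for readability but the bodies are stand-ins.

```python
def pg_server_to_any(data: bytes, src: str, dest: str) -> bytes:
    """Stand-in for PostgreSQL's pg_server_to_any(): transcode bytes
    from the server encoding to the requested file encoding."""
    return data.decode(src).encode(dest)

def copy_to_json_one_row(row_json: bytes, server_enc: str,
                         file_enc: str) -> bytes:
    # need_transcoding is resolved once up front, analogous to what
    # BeginCopyTo() derives from the resolved file_encoding.
    need_transcoding = (server_enc != file_enc)
    if need_transcoding:
        row_json = pg_server_to_any(row_json, server_enc, file_enc)
    return row_json  # then handed to CopySendData()

out = copy_to_json_one_row('{"c": "é"}'.encode('utf-8'),
                           'utf-8', 'latin-1')
# LATIN1 output carries 0xE9 for é, not the UTF-8 pair 0xC3 0xA9.
```

When server and file encodings match, the buffer passes through untouched, so the common case pays nothing, which matches how the text/CSV paths already behave.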
The conversion happens on the whole JSON row rather than per-attribute because composite_to_json() is atomic — it builds the object including braces, commas, and key quoting in server encoding, and only the final bytes need transcoding. This is safe because JSON's structural characters ({},:") are ASCII and thus invariant across all PostgreSQL-supported server encodings (all are ASCII-supersets).
Testing Approach
Andrew Dunstan's revision replaced a round-trip text comparison with pg_read_binary_file() to assert on raw bytes. This matters because:
- A round-trip test via COPY FROM could mask the bug if both directions use the same (wrong) encoding.
- Binary-level assertions directly verify that the LATIN1-encoded accented character (e.g., 0xE9 for é) is present and the UTF-8 sequence (0xC3 0xA9) is absent.
- An added case flipping client_encoding exercises the implicit-transcoding path for STDOUT, which is the security-relevant scenario.
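The byte-level assertions are easy to model outside PostgreSQL. In this sketch, json.dumps with ensure_ascii=False plays the role of composite_to_json() output; the assertions mirror the shape of the pg_read_binary_file() checks, not their exact SQL.

```python
import json

row = json.dumps({"name": "é"}, ensure_ascii=False)  # server-side JSON text
correct = row.encode('latin-1')  # what a working ENCODING 'LATIN1' emits
buggy = row.encode('utf-8')      # what the unfixed code emits

# LATIN1 byte present, UTF-8 sequence absent: the fix worked.
assert b'\xe9' in correct and b'\xc3\xa9' not in correct
# The unfixed path leaves UTF-8 bytes in a file claimed to be LATIN1.
assert b'\xc3\xa9' in buggy and b'\xe9' not in buggy
print('byte-level assertions distinguish fixed from buggy output')
```

A text-level comparison after decoding would pass for both buffers, which is precisely why the test had to drop to raw bytes.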
Architectural Implications
- JSON format is not encoding-transparent. This patch preserves the status quo that JSON inherits server encoding rather than mandating UTF-8. A future, separate discussion could enforce UTF-8 (perhaps via a new option), but doing so would now require a deprecation path.
- need_transcoding is the right abstraction. By reusing the existing flag set during BeginCopy*, the fix integrates cleanly with COPY's existing encoding state machine rather than introducing a JSON-specific code path.
- Open-items discipline. Tom's insistence on fixing this pre-release rather than deferring it is a concrete example of how the PostgreSQL project treats the user-visible behavior of newly committed features as effectively frozen at release.