[PATCH] Reject ENCODING option for COPY TO FORMAT JSON

First seen: 2026-04-20 06:06:27+00:00 · Messages: 9 · Participants: 4

Latest Update

2026-05-06 · opus 4.7

Analysis: ENCODING option handling for COPY TO FORMAT JSON

Core Problem

Commit 7dadd38cda9 added JSON output format support to COPY TO in the v19 development cycle. The implementation bypassed the normal per-attribute encoding pipeline: in CopyToJsonOneRow(), the row is serialized as a whole via composite_to_json() and handed directly to CopySendData(). The text and CSV paths, by contrast, funnel per-attribute output through pg_server_to_any() when need_transcoding is set.

This produces two user-visible defects:

  1. Explicit ENCODING is silently ignored. COPY t TO '/tmp/out.json' WITH (FORMAT json, ENCODING 'LATIN1') on a UTF-8 server writes UTF-8 bytes to the file. The user asked for LATIN1 and got something else with no error, no warning.
  2. Implicit client_encoding mismatch for COPY TO STDOUT. Since file_encoding defaults to client_encoding when unspecified, a client with client_encoding differing from the server encoding also receives unconverted server-encoded bytes. This is not merely a correctness issue but a security hazard: sending un-transcoded bytes across an encoding boundary can produce sequences that the client's decoder interprets unexpectedly (classic encoding-confusion attack surface, the same class of issue that motivates strict transcoding at protocol boundaries).

Design Tension: Spec Compliance vs. PostgreSQL's Encoding Model

RFC 8259 mandates that JSON text exchanged between systems be encoded in UTF-8. A defensible reading is therefore "COPY TO JSON should always emit UTF-8, regardless of server or client encoding, and ENCODING should be rejected." That was the initial proposal.

Tom Lane rejected this framing on architectural grounds that carried decisive weight.

The stronger, consistent solution is therefore to make JSON behave like text/CSV: run the fully-formed JSON buffer through pg_server_to_any() before sending, honoring both explicit ENCODING and the implicit client_encoding path.
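The shape of that fix can be sketched in Python. The names server_to_any, send_json_row, and the need_transcoding parameter mirror the C-side concepts (pg_server_to_any(), CopyToJsonOneRow(), the need_transcoding flag); this is an illustration under those assumptions, not the patch itself:

```python
def server_to_any(buf: bytes, server_enc: str, target_enc: str) -> bytes:
    """Stand-in for pg_server_to_any(): convert a server-encoded buffer
    to the target (file or client) encoding; no-op when they match."""
    if server_enc == target_enc:
        return buf
    return buf.decode(server_enc).encode(target_enc)

def send_json_row(json_row: str, server_enc: str, file_enc: str,
                  need_transcoding: bool) -> bytes:
    # composite_to_json() equivalent: the row arrives as one complete
    # JSON object already in the server encoding.
    buf = json_row.encode(server_enc)
    # The fix: run the fully-formed buffer through the conversion,
    # gated on the same need_transcoding flag the text/CSV paths use.
    if need_transcoding:
        buf = server_to_any(buf, server_enc, file_enc)
    return buf

# Explicit ENCODING 'LATIN1' on a UTF-8 server now actually yields LATIN1 bytes.
out = send_json_row('{"v": "café"}', "utf-8", "latin-1", True)
```

The whole-buffer conversion, rather than per-attribute conversion, is what distinguishes this path from text/CSV; the next paragraph explains why that is safe.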

Patch Evolution

The conversion happens on the whole JSON row rather than per-attribute because composite_to_json() is atomic — it builds the object including braces, commas, and key quoting in server encoding, and only the final bytes need transcoding. This is safe because JSON's structural characters ({},:") are ASCII and thus invariant across all PostgreSQL-supported server encodings (all are ASCII-supersets).
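The ASCII-invariance claim is directly checkable: every JSON structural character encodes to its single ASCII byte in any ASCII-superset encoding. A quick Python verification (the encodings listed are representative examples, not the full set PostgreSQL supports):

```python
structural = '{},:"[]'
for enc in ("utf-8", "latin-1", "euc_jp", "iso8859_5"):
    # Each structural character must map to its ASCII byte unchanged,
    # so whole-buffer transcoding cannot corrupt the JSON framing.
    assert structural.encode(enc) == structural.encode("ascii")
print("structural characters are invariant")
```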

Testing Approach

Andrew Dunstan's revision replaced a round-trip text comparison with pg_read_binary_file() to assert on the raw bytes written. This matters because a text-mode read passes back through encoding conversion on its way to the client, which can silently undo, and thereby mask, the very transcoding defect under test.
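The hazard of round-trip comparison can be shown in miniature (a Python analogy, not the regression test itself): once each file's bytes are decoded with their "correct" encoding, buggy and fixed outputs become indistinguishable, while a raw-byte comparison separates them.

```python
expected = '{"v": "café"}'.encode("latin-1")  # what ENCODING 'LATIN1' should write
buggy    = '{"v": "café"}'.encode("utf-8")    # what the unpatched code writes

# Round-trip comparison: decoding each file with its own encoding before
# comparing makes both sides identical text, so the bug is invisible.
assert expected.decode("latin-1") == buggy.decode("utf-8")

# Raw-byte comparison (the pg_read_binary_file() approach) catches it.
assert expected != buggy
```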

Architectural Implications

  1. JSON format is not encoding-transparent. This patch preserves the status quo that JSON inherits server encoding rather than mandating UTF-8. A future, separate discussion could enforce UTF-8 (perhaps via a new option), but doing so would now require a deprecation path.
  2. need_transcoding is the right abstraction. By reusing the existing flag set during BeginCopy*, the fix integrates cleanly with COPY's existing encoding state machine rather than introducing a JSON-specific code path.
  3. Open-items discipline. Tom's insistence on fixing this pre-release rather than deferring is a concrete example of how the PostgreSQL project treats user-visible behavior of newly-committed features as effectively frozen at release.