COPY JSON: use trailing commas in FORCE_ARRAY output

First seen: 2026-05-06 06:40:24+00:00 · Messages: 6 · Participants: 4

Latest Update

2026-05-06 · opus 4.7

COPY JSON: Trailing Commas in FORCE_ARRAY Output

Context and Architectural Background

PostgreSQL 19 (unreleased at the time of this thread) introduces COPY TO ... WITH (FORMAT json), a long-requested feature that allows bulk export of query results in JSON form. When combined with the FORCE_ARRAY option, the output is wrapped as a single JSON array containing one object per row, as opposed to the default newline-delimited JSON (NDJSON / JSON Lines) style where each row is an independent top-level JSON value.

The implementation of FORCE_ARRAY in the COPY TO code path (primarily in src/backend/commands/copyto.c / copytojson.c) emits rows incrementally rather than buffering the full result set in memory — which is essential for COPY's streaming semantics on potentially very large tables. To achieve this while still producing syntactically valid JSON, the original implementation chose the simplest streaming-friendly approach: emit a newline after each row and prepend a comma to every row except the first.

This yields output like:

[
{"id":1,...}
,{"id":2,...}
,{"id":3,...}
]

The result is valid JSON (JSON does not care about whitespace placement), but it is stylistically unusual. Virtually every hand-written or tool-generated JSON array places the separator comma at the end of the preceding line (trailing-comma style in the layout sense — not the JSON5 semantic trailing-comma-before-]).

The Core Problem

Evan Li reports that the leading-comma style is surprising and inconsistent with the existing convention established by json_agg(), which uses trailing commas. Because FORMAT json for COPY has not yet shipped in a release, there is a narrow window to change the emitted format without creating a compatibility break for scripts, diff tools, or downstream parsers that may have pinned expectations to the leading-comma layout.

The argument for changing it now is therefore a combination of:

  1. Consistency with existing in-core JSON output (json_agg, row_to_json in aggregates, etc.).
  2. Principle of least astonishment for users familiar with standard JSON pretty-printers.
  3. Zero cost to change pre-release versus a permanent wart if released as-is.

Streaming Constraint and the Implementation Challenge

The reason the leading-comma pattern was chosen initially is non-trivial. To place the comma at the end of the previous row in a streaming emitter, you must know, at the moment you finish writing row N, whether row N+1 exists. COPY's row-by-row architecture does not naturally provide that lookahead.

The standard solution — and the one the submitter claims to implement — is to invert the logic: instead of "emit comma before every row except the first," emit "comma + newline before every row except the first, then emit the row, and emit a final newline only before the closing ]." Equivalently, track whether a row has been emitted yet, and for each subsequent row, emit ,\n before writing it. The key insight is that ,\n{...} and \n,{...} are visually distinct but structurally identical from a streaming standpoint — both only require knowing whether the current row is the first one, not whether another row follows.

Concretely, the transformation is roughly:

This preserves O(1) per-row memory (no full-result buffering) and imposes no additional branches in the hot path. The submitter explicitly notes: "It does not buffer the whole result, so it should not have any performance impact."

Design Considerations and Tradeoffs

Several subtle points are worth flagging even though the thread (as presented) contains only the initial post:

  1. Empty-array case: COPY of an empty relation with FORCE_ARRAY must still produce [] (or [\n]). The patch must ensure the "first row" flag logic does not emit a stray comma or newline in this case.

  2. Final newline before ]: There is a minor choice about whether the closing bracket appears on its own line (...}\n]) or immediately after the last row (...}]). json_agg pretty-printing shows the latter, but for streaming COPY the former is more natural and matches the existing opening [\n.

  3. Symmetry with COPY FROM: COPY FROM with FORMAT json parses JSON input. The parser should already accept both comma styles since it uses the standard JSON lexer, so this output change does not require a corresponding input-side change — but it's worth verifying that the parser's test coverage exercises both layouts.

  4. Regression tests: Any patch here must update the expected output in copy.out / copy2.out regression files that exercise FORCE_ARRAY. This is mechanical but essential.

  5. Documentation: The COPY reference page's examples of FORMAT json + FORCE_ARRAY output (if any) must be updated to reflect the new layout.

Why This Matters Architecturally

On the surface this is cosmetic. But it touches a broader PostgreSQL principle: once a wire/text format ships, it is effectively frozen. The project has repeatedly paid long-term costs for format choices made hastily in a release (e.g., various EXPLAIN format quirks, pg_dump escaping choices). Catching stylistic issues during the pre-release window of a new feature is precisely the kind of review pgsql-hackers is designed for, and the submitter's framing — "not worth changing post-release, but worth changing now" — reflects correct project-cultural instincts.

The patch is low-risk: localized to the JSON COPY emitter, no semantic change to produced JSON (both forms parse identically under RFC 8259), no performance impact, and trivially testable.

Likely Reception

Given that the FORMAT json COPY work in PG19 was itself a contested, multi-cycle effort (Joe Conway and others worked on earlier versions; the feature was deferred multiple times), reviewers sensitive to that history would likely welcome a small pre-release polish rather than resist it. The change aligns COPY's output with json_agg, which is the closest precedent in the codebase, giving the patch a strong consistency argument that typically carries weight with committers like Tom Lane and Andrew Dunstan (the latter being the primary JSON-subsystem maintainer).

Open Questions Not Addressed in the Initial Post