COPY JSON: Trailing Commas in FORCE_ARRAY Output
Context and Architectural Background
PostgreSQL 19 (unreleased at the time of this thread) introduces COPY TO ... WITH (FORMAT json), a long-requested feature that allows bulk export of query results in JSON form. When combined with the FORCE_ARRAY option, the output is wrapped as a single JSON array containing one object per row, as opposed to the default newline-delimited JSON (NDJSON / JSON Lines) style where each row is an independent top-level JSON value.
The implementation of FORCE_ARRAY in the COPY TO code path (primarily in src/backend/commands/copyto.c / copytojson.c) emits rows incrementally rather than buffering the full result set in memory — which is essential for COPY's streaming semantics on potentially very large tables. To achieve this while still producing syntactically valid JSON, the original implementation chose the simplest streaming-friendly approach: emit a newline after each row and prepend a comma to every row except the first.
This yields output like:
[
{"id":1,...}
,{"id":2,...}
,{"id":3,...}
]
The result is valid JSON (JSON does not care about whitespace placement), but it is stylistically unusual. Virtually every hand-written or tool-generated JSON array places the separator comma at the end of the preceding line (trailing-comma style in the layout sense — not the JSON5 semantic trailing-comma-before-]).
The Core Problem
Evan Li reports that the leading-comma style is surprising and inconsistent with the existing convention established by json_agg(), which uses trailing commas. Because FORMAT json for COPY has not yet shipped in a release, there is a narrow window to change the emitted format without creating a compatibility break for scripts, diff tools, or downstream parsers that may have pinned expectations to the leading-comma layout.
The argument for changing it now is therefore a combination of:
- Consistency with existing in-core JSON output (
json_agg,row_to_jsonin aggregates, etc.). - Principle of least astonishment for users familiar with standard JSON pretty-printers.
- Zero cost to change pre-release versus a permanent wart if released as-is.
Streaming Constraint and the Implementation Challenge
The reason the leading-comma pattern was chosen initially is non-trivial. To place the comma at the end of the previous row in a streaming emitter, you must know, at the moment you finish writing row N, whether row N+1 exists. COPY's row-by-row architecture does not naturally provide that lookahead.
The standard solution — and the one the submitter claims to implement — is to invert the logic: instead of "emit comma before every row except the first," emit "comma + newline before every row except the first, then emit the row, and emit a final newline only before the closing ]." Equivalently, track whether a row has been emitted yet, and for each subsequent row, emit ,\n before writing it. The key insight is that ,\n{...} and \n,{...} are visually distinct but structurally identical from a streaming standpoint — both only require knowing whether the current row is the first one, not whether another row follows.
Concretely, the transformation is roughly:
- Before: For row i: if i>0 emit
\n,; then emit row JSON. After loop: emit\n]. - After: For row i: if i>0 emit
,\n; then emit row JSON. After loop: emit\n].
This preserves O(1) per-row memory (no full-result buffering) and imposes no additional branches in the hot path. The submitter explicitly notes: "It does not buffer the whole result, so it should not have any performance impact."
Design Considerations and Tradeoffs
Several subtle points are worth flagging even though the thread (as presented) contains only the initial post:
-
Empty-array case:
COPYof an empty relation withFORCE_ARRAYmust still produce[](or[\n]). The patch must ensure the "first row" flag logic does not emit a stray comma or newline in this case. -
Final newline before
]: There is a minor choice about whether the closing bracket appears on its own line (...}\n]) or immediately after the last row (...}]).json_aggpretty-printing shows the latter, but for streaming COPY the former is more natural and matches the existing opening[\n. -
Symmetry with COPY FROM: COPY FROM with
FORMAT jsonparses JSON input. The parser should already accept both comma styles since it uses the standard JSON lexer, so this output change does not require a corresponding input-side change — but it's worth verifying that the parser's test coverage exercises both layouts. -
Regression tests: Any patch here must update the expected output in
copy.out/copy2.outregression files that exerciseFORCE_ARRAY. This is mechanical but essential. -
Documentation: The
COPYreference page's examples ofFORMAT json+FORCE_ARRAYoutput (if any) must be updated to reflect the new layout.
Why This Matters Architecturally
On the surface this is cosmetic. But it touches a broader PostgreSQL principle: once a wire/text format ships, it is effectively frozen. The project has repeatedly paid long-term costs for format choices made hastily in a release (e.g., various EXPLAIN format quirks, pg_dump escaping choices). Catching stylistic issues during the pre-release window of a new feature is precisely the kind of review pgsql-hackers is designed for, and the submitter's framing — "not worth changing post-release, but worth changing now" — reflects correct project-cultural instincts.
The patch is low-risk: localized to the JSON COPY emitter, no semantic change to produced JSON (both forms parse identically under RFC 8259), no performance impact, and trivially testable.
Likely Reception
Given that the FORMAT json COPY work in PG19 was itself a contested, multi-cycle effort (Joe Conway and others worked on earlier versions; the feature was deferred multiple times), reviewers sensitive to that history would likely welcome a small pre-release polish rather than resist it. The change aligns COPY's output with json_agg, which is the closest precedent in the codebase, giving the patch a strong consistency argument that typically carries weight with committers like Tom Lane and Andrew Dunstan (the latter being the primary JSON-subsystem maintainer).
Open Questions Not Addressed in the Initial Post
- Does the patch also adjust the placement of the newline after
[and before], or only the inter-row separator? - Are there regression tests included in the patch?
- What does the output look like for a single-row result — is it
[\n{...}\n]or[{...}]? - Is there any interaction with the
HEADERoption (which is meaningless for JSON but must be rejected cleanly)?