Making printtup Faster: Rethinking the Type Output Function API
Core Problem
printtup — the DestReceiver callback used to format tuples for the wire protocol in text mode — shows up prominently in CPU profiles (~85% self+children for a trivial SELECT * FROM pg_class scan in Andy Fan's measurements). The hot spots are not the I/O itself but the plumbing of converting Datums to wire bytes:
__strlen_avx2 — 8.35%
__memcpy_avx_unaligned_erms — 4.27%
palloc / AllocSetAlloc — ~10% combined
The architectural source of this waste is the type output function API itself. Every typoutput function returns a freshly-palloc'd cstring:
- The output function (e.g. jsonb_out, textout, int4out) allocates a buffer (sometimes via a transient StringInfoData, sometimes via palloc directly) and produces a NUL-terminated string.
- The caller (printtup) then calls strlen() to recover the length that the output function already knew.
- The caller memcpys the bytes into DR_printtup.buf.
- socket_putmessage later memcpys them again into PqSendBuffer.
- send(2) copies from userspace into the kernel.
So for each datum there are effectively three copies and a redundant strlen, plus at least one palloc whose size is often mispredicted (see JsonbToCStringWorker, which initializes a 1024-byte StringInfo then immediately repallocs based on a size estimate — copying the still-empty buffer). int4out is an extreme case: essentially the only work it does is a palloc and a small memcpy.
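The difference between the two calling conventions can be sketched in standalone C. This is a minimal illustration, not PostgreSQL code: OutBuf, outbuf_init, outbuf_reserve, legacy_int4_out, emit_legacy and emit_append are hypothetical names, malloc stands in for palloc, and OutBuf is only a rough stand-in for a StringInfo.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a StringInfo-like growable buffer. */
typedef struct OutBuf
{
    char   *data;
    size_t  len;    /* bytes used, excluding the trailing NUL */
    size_t  cap;    /* bytes allocated */
} OutBuf;

static void
outbuf_init(OutBuf *b)
{
    b->cap = 64;
    b->len = 0;
    b->data = malloc(b->cap);
    assert(b->data != NULL);
    b->data[0] = '\0';
}

/* Ensure room for `extra` more bytes plus a trailing NUL. */
static void
outbuf_reserve(OutBuf *b, size_t extra)
{
    if (b->len + extra + 1 <= b->cap)
        return;
    while (b->len + extra + 1 > b->cap)
        b->cap *= 2;
    b->data = realloc(b->data, b->cap);
    assert(b->data != NULL);
}

/* Legacy shape: allocate, format, return a NUL-terminated string. */
static char *
legacy_int4_out(int32_t value)
{
    char *s = malloc(12);           /* "-2147483648" plus NUL */

    assert(s != NULL);
    snprintf(s, 12, "%" PRId32, value);
    return s;
}

/* Caller of the legacy shape: strlen to rediscover the length, then copy. */
static void
emit_legacy(OutBuf *b, int32_t value)
{
    char   *s = legacy_int4_out(value);
    size_t  n = strlen(s);

    outbuf_reserve(b, n);
    memcpy(b->data + b->len, s, n + 1);
    b->len += n;
    free(s);
}

/* Append shape: format straight into the caller's buffer. */
static void
emit_append(OutBuf *b, int32_t value)
{
    int n;

    outbuf_reserve(b, 11);          /* widest int32 text is 11 bytes */
    n = snprintf(b->data + b->len, 12, "%" PRId32, value);
    b->len += (size_t) n;
}
```

Both emit functions leave the same bytes in the buffer; the append shape simply skips the temporary allocation, the strlen, and one memcpy.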
Input functions have a symmetric defect: textin, int4in, etc. receive a cstring without a length, forcing internal strlen calls and precluding SIMD/SWAR tricks that require a known length bound. This is particularly painful in COPY FROM.
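As a rough sketch of what a length-aware input interface could look like, the hypothetical parse_int32_len below takes an explicit length and never looks for a NUL terminator. It is scalar on purpose: the point is the interface shape, since a known bound is what would let a vectorized version load whole words safely. The name and signature are assumptions for illustration, not an existing PostgreSQL function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical length-aware int32 parser. The input does not need to be
 * NUL-terminated, so a field can be parsed in place inside a larger line.
 * Returns false on empty input, stray characters, or overflow.
 */
static bool
parse_int32_len(const char *s, size_t len, int32_t *result)
{
    size_t  i = 0;
    bool    neg = false;
    int64_t acc = 0;

    if (len > 0 && (s[0] == '-' || s[0] == '+'))
    {
        neg = (s[0] == '-');
        i = 1;
    }
    if (i == len)
        return false;               /* empty, or a sign with no digits */

    for (; i < len; i++)
    {
        unsigned d = (unsigned) ((unsigned char) s[i] - '0');

        if (d > 9)
            return false;           /* not a decimal digit */
        acc = acc * 10 + d;
        if (acc > (int64_t) INT32_MAX + 1)
            return false;           /* too large even for INT32_MIN */
    }

    if (neg)
        acc = -acc;
    if (acc > INT32_MAX)
        return false;               /* positive value out of range */

    *result = (int32_t) acc;
    return true;
}
```

A caller that has already located field boundaries, as COPY FROM does, could pass a pointer into the line and the field width directly, with no intermediate NUL-terminated copy.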
Design Space and Disagreements
Four designs were weighed:
1. Andy Fan's "optional second function" proposal ({type}print)
Add a new, optional per-type function alongside typoutput that takes (Datum, StringInfo) and appends directly to the caller's buffer. Callers in hot paths (printtup initially) look up the print function; if absent, fall back to the legacy out function. No catalog change initially — the mapping would be hardcoded for "common" types (int2/4/8, float4/8, date/time/timestamp, text, varchar, numeric).
Tradeoff: minimizes churn, keeps extension compatibility trivially, but bifurcates the type I/O API and introduces a "third I/O function" maintenance burden that Andy himself eventually acknowledged as smelly.
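The lookup-with-fallback shape can be sketched as follows. Everything here is hypothetical and simplified for illustration: a long stands in for Datum, Sink is a fixed-size stand-in for the caller's StringInfo, and TypeEntry, int_print, int_out, bool_out and emit_value are invented names rather than anything from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fixed-size sink, for brevity; a real buffer would grow. */
typedef struct Sink
{
    char    data[256];
    size_t  len;
} Sink;

typedef char *(*legacy_out_fn) (long value);
typedef void (*print_fn) (long value, Sink *sink);

/* Per-type entry: every type has `out`; `print` is optional. */
typedef struct TypeEntry
{
    legacy_out_fn out;
    print_fn      print;    /* NULL when the type has no fast path */
} TypeEntry;

static char *
int_out(long value)
{
    char *s = malloc(24);

    assert(s != NULL);
    snprintf(s, 24, "%ld", value);
    return s;
}

static void
int_print(long value, Sink *sink)
{
    size_t room = sizeof(sink->data) - sink->len;
    int    n = snprintf(sink->data + sink->len, room, "%ld", value);

    assert(n > 0 && (size_t) n < room);
    sink->len += (size_t) n;
}

static char *
bool_out(long value)
{
    char *s = malloc(2);

    assert(s != NULL);
    s[0] = value ? 't' : 'f';
    s[1] = '\0';
    return s;
}

/* Hot-path caller: one branch picks the fast path or the legacy one. */
static void
emit_value(const TypeEntry *t, long value, Sink *sink)
{
    char   *s;
    size_t  n;

    if (t->print != NULL)
    {
        t->print(value, sink);
        return;
    }

    s = t->out(value);
    n = strlen(s);
    assert(sink->len + n < sizeof(sink->data));
    memcpy(sink->data + sink->len, s, n + 1);
    sink->len += n;
    free(s);
}
```

The single NULL check in emit_value is the dispatch cost that the rest of the discussion argues about; types without a print function keep working unchanged.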
2. David Rowley's "rewrite all output functions" proposal
Change the signature of typoutput itself to (Datum, StringInfo). Hide the transition by having OutputFunctionCall() fake up a StringInfo for legacy functions (possibly via initReadOnlyStringInfo to avoid an extra copy). Extend the same treatment to typinput (pass length) and typsend.
Rowley's case is quantitative and pointed: dispatching on the function signature costs a single predictable cmp/jne in the hot path, so the worry that dispatch overhead could swamp the savings is, in his words, "off by 1 order of magnitude at the minimum, 2–3+ for medium/large varlena." He also ties the input-side change to unlocking SIMD in pg_strtoint32_safe and in the COPY delimiter scan, referencing commit ca6fde922, where SIMD yielded ~4× gains. His DDR5 bandwidth argument (32–64 GB/s cannot be reached by byte-at-a-time processing) is the broader justification for a length-aware input API.
3. Tom Lane's skepticism → conditional acceptance
Tom's initial objection is concrete: you can never remove the old API because non-core datatypes exist. Therefore call sites must dispatch on both APIs forever, and that dispatch overhead may swamp the savings. His later messages soften to: a new API can live alongside the old one, need not be SQL-callable (solving Andy's concern about select int4out(8) breaking), and realistically only a handful of types (textout, bytea, int*) will benefit enough to be worth converting. He explicitly doubts numeric_out or point_out would show measurable wins.
Tom points at the "soft error handling" work (d9f7f5d32, ccff2d20e) as the model: incremental, infrastructure-first, convert a few callees as demonstration, never force a flag day.
4. Andres Freund's alternative: fcinfo-carried context
Rather than changing the function signature, pass an optional "output context" through FunctionCallInfo. Output/send functions that know about it can allocate into that context (effectively the caller's StringInfo), falling back to CurrentMemoryContext otherwise. This preserves the existing signature entirely — pure ABI compatibility — while letting opt-in functions avoid the copy. Andres notes the pattern works cleanly for send (already StringInfo-based) and for simple types, but gets ugly where pg_server_to_client encoding conversion pre-allocates its own buffer.
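A minimal sketch of the idea, under heavy simplification: CallInfo stands in for FunctionCallInfo, its context field carries an optional tagged struct, and the output function keeps one signature while choosing where to write. CtxTag, OutputCtx, CallInfo and int_out_ctx are hypothetical names, and malloc stands in for allocation in CurrentMemoryContext.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum CtxTag
{
    CTX_NONE = 0,
    CTX_OUTPUT = 1
} CtxTag;

/* Optional context a caller may attach: a caller-owned output buffer. */
typedef struct OutputCtx
{
    CtxTag  tag;    /* must be first so it can be inspected safely */
    char   *buf;
    size_t  len;
    size_t  cap;
} OutputCtx;

/* Simplified call-info: one argument plus an optional context. */
typedef struct CallInfo
{
    long    arg;
    void   *context;    /* NULL, or a struct that starts with a CtxTag */
} CallInfo;

/*
 * Same signature for every caller. With an output context attached, the
 * text is written into the caller's buffer and a pointer into that buffer
 * is returned; otherwise a freshly allocated string is returned.
 */
static char *
int_out_ctx(CallInfo *ci)
{
    OutputCtx *ctx = ci->context;
    char      *s;

    if (ctx != NULL && ctx->tag == CTX_OUTPUT)
    {
        size_t room = ctx->cap - ctx->len;
        char  *start = ctx->buf + ctx->len;
        int    n = snprintf(start, room, "%ld", ci->arg);

        assert(n > 0 && (size_t) n < room);
        ctx->len += (size_t) n;
        return start;           /* owned by the caller's buffer */
    }

    s = malloc(24);
    assert(s != NULL);
    snprintf(s, 24, "%ld", ci->arg);
    return s;                   /* caller must free */
}
```

A caller that attaches a context knows the result lives in its own buffer and must not free it; a caller that attaches nothing sees the old behaviour.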
Andres's orthogonal suggestion — fix pg_server_to_client() first so it writes directly into the destination StringInfo instead of pessimistically copying — is identified as an independent win for printtup()→pg_sendcountedtext() and a prerequisite for any serious textout optimization. Tom endorses this route.
Key Technical Insights
The cstring return convention is the problem, not printtup. Every alternative proposal — Andy's print function, David's rewrite, Andres's fcinfo context — is ultimately about letting the output function write into caller-owned memory of known location, avoiding the palloc + strlen + memcpy trio.
SQL-callability is a real constraint on signature changes. Andy surfaced the concrete break: SELECT int4out(8) works today because int4out has a well-typed SQL signature. Changing it to take internal (StringInfo) makes the function un-invokable from SQL. Tom's response clarifies this is moot only if the new API is a parallel one — the old SQL-callable int4out must persist forever for non-core compatibility.
Type output function skew. Andy's catalog survey is illuminating:
- 621 pg_type rows, 97 distinct typoutput values
- array_out: 296 types
- record_out: 214 types
Converting array_out and record_out alone covers (296 + 214) / 621 ≈ 82% of types by count. Because both are themselves callers of per-element output functions, they gain again from every leaf scalar type that is converted.
strlen is a delimiter search in disguise. Andy's observation that strlen is essentially a SIMD-able search for \0 dovetails with David's point about SIMD delimiter search in COPY. Glibc's __strlen_avx2 is already SIMD — the waste isn't slow strlen, it's doing it at all when the producer knew the length.
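To make the "same search, different byte" point concrete, here is a small SWAR (SIMD within a register) sketch that finds a byte in a length-bounded buffer eight bytes at a time. With delim set to '\0' it behaves like a bounded strlen; with a tab or comma it is a delimiter scan. The function name find_byte_swar is hypothetical, and this is the classic zero-byte bit trick rather than the code used by glibc or PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Return the index of the first `delim` in s[0..len), or len if absent.
 * Requires a known length: whole 8-byte words are loaded without ever
 * reading past s + len.
 */
static size_t
find_byte_swar(const char *s, size_t len, unsigned char delim)
{
    const uint64_t ones = UINT64_C(0x0101010101010101);
    const uint64_t highs = UINT64_C(0x8080808080808080);
    const uint64_t pattern = ones * delim;  /* delim in every byte */
    size_t         i = 0;

    for (; i + 8 <= len; i += 8)
    {
        uint64_t w;

        memcpy(&w, s + i, 8);       /* unaligned-safe load */
        w ^= pattern;               /* matching bytes become zero */
        if ((w - ones) & ~w & highs)
            break;                  /* this word contains a match */
    }

    /* Pinpoint the match inside the word, or scan the short tail. */
    for (; i < len; i++)
    {
        if ((unsigned char) s[i] == delim)
            return i;
    }
    return len;
}
```

The word-at-a-time test only answers "is there a match in these eight bytes", which keeps it independent of byte order; the scalar loop then locates the exact position.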
The initReadOnlyStringInfo trick. Rowley points out that a wrapper making the legacy API look like the new API no longer requires a memcpy — a read-only StringInfo can point directly at the cstring. This defuses much of Tom's overhead concern for the wrapper direction.
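The wrapper direction can be sketched like this. StrView, view_init_readonly, legacy_int_out and legacy_out_as_view are hypothetical names loosely modeled on the idea of a read-only StringInfo; the convention that maxlen == 0 marks a borrowed buffer is an assumption of this sketch. Note that the adapter still pays one strlen, because the legacy function does not report a length, but it copies nothing.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Length-carrying view over bytes it may or may not own. */
typedef struct StrView
{
    char   *data;
    size_t  len;
    size_t  maxlen;     /* 0 marks a borrowed, read-only view here */
} StrView;

/* Point the view at existing bytes without copying them. */
static void
view_init_readonly(StrView *v, char *data, size_t len)
{
    v->data = data;
    v->len = len;
    v->maxlen = 0;
}

/* Legacy-style output function: returns an allocated C string. */
static char *
legacy_int_out(long value)
{
    char *s = malloc(24);

    assert(s != NULL);
    snprintf(s, 24, "%ld", value);
    return s;
}

/*
 * Adapter that makes a legacy function look like a length-aware one:
 * one strlen, no memcpy. Returns the underlying string so the caller
 * can free it once it is done with the view.
 */
static char *
legacy_out_as_view(char *(*out) (long), long value, StrView *view)
{
    char *s = out(value);

    view_init_readonly(view, s, strlen(s));
    return s;
}
```

A consumer written against the length-aware shape can then treat converted and unconverted types uniformly, reading view.data and view.len in both cases.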
Likely Path Forward
Based on the concluding Tom/Andres exchange, the consensus shape is:
- A new optional output API living alongside typoutput, not SQL-callable, discovered via a new pg_proc entry or a flag.
- Fix pg_server_to_client first to write into a caller-supplied StringInfo: an independent benefit and a required building block.
- Convert only high-value types (textout, byteaout, int[248]out, perhaps timestamp). Tom explicitly doubts that numeric_out or point_out would show measurable wins.
- Input-side length propagation treated as a separate, parallel project (Rowley's interest), with COPY FROM and SIMD int parsing as the headline wins.
Patch Status
Andy posted a PoC showing ~18% improvement (0.134ms → 0.110ms) on SELECT * FROM demo where demo has oid/text/oid/text columns. He later split work into 0001–0003 (preparatory) + 0004 (17 _print functions wired into printtup). The thread's two-year gap (2024 → 2026) suggests the work stalled awaiting API consensus, which the final Tom/Andres exchange begins to unblock by endorsing the parallel-API route.