Make printtup a bit faster

First seen: 2024-08-29 09:40:14+00:00 · Messages: 24 · Participants: 5

Latest Update

2026-06-04 · claude-opus-4-6

Monthly Summary: Make printtup a bit faster (May 2026)

Overview

After a two-year hiatus (2024–2026), this thread reached design consensus in May 2026. The long-standing debate over how to eliminate the palloc + strlen + memcpy overhead in PostgreSQL's tuple output path converged on Andres Freund's fcinfo-carried-context approach, with a rough prototype posted and all active participants aligned.

The Problem

printtup, the DestReceiver callback that formats tuples for the wire protocol, dominates CPU profiles (~85% self+children on trivial queries). The root cause is the typoutput function API: every output function returns a freshly palloc'd cstring, so the caller must strlen() the result (recovering a length the producer already knew), memcpy it into the send buffer, and eventually free the allocation. For high-throughput queries this adds an avoidable allocation, an extra copy, and a pointless strlen per datum.
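The per-datum cost described above can be sketched in plain C. This is a minimal illustration, not PostgreSQL code: SendBuf, sendbuf_append, legacy_int4_out and legacy_send_int4 are hypothetical names, and malloc/free stand in for palloc/pfree.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a growable send buffer (not PostgreSQL's StringInfo). */
typedef struct SendBuf {
    char  *data;
    size_t len;
    size_t cap;
} SendBuf;

static void sendbuf_init(SendBuf *b)
{
    b->cap = 64;
    b->len = 0;
    b->data = malloc(b->cap);
    assert(b->data != NULL);
    b->data[0] = '\0';
}

static void sendbuf_append(SendBuf *b, const char *src, size_t n)
{
    if (b->len + n + 1 > b->cap) {
        while (b->len + n + 1 > b->cap)
            b->cap *= 2;
        b->data = realloc(b->data, b->cap);
        assert(b->data != NULL);
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    b->data[b->len] = '\0';
}

/* Legacy-style output function: formats into a fresh allocation and
 * discards the length that snprintf just computed. */
static char *legacy_int4_out(int value)
{
    char *result = malloc(12);          /* sign + 10 digits + NUL */
    assert(result != NULL);
    snprintf(result, 12, "%d", value);
    return result;
}

/* Legacy-style caller: rediscovers the length, copies, then frees. */
static void legacy_send_int4(SendBuf *out, int value)
{
    char  *str = legacy_int4_out(value);    /* allocation */
    size_t len = strlen(str);               /* length recomputed */
    sendbuf_append(out, str, len);          /* copy into the send buffer */
    free(str);                              /* allocation released */
}
```

Each call to legacy_send_int4 pays for one allocation, one strlen, one copy, and one free, which is the overhead the thread aims to remove for converted types.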

Design Convergence

Three competing designs were evaluated:

  1. Andy Fan's {type}print functions — optional per-type functions taking (Datum, StringInfo), hardcoded for common types. Andy withdrew this in favor of the consensus approach.

  2. David Rowley's full signature rewrite — change typoutput itself to (Datum, StringInfo). Not defended after Tom Lane's objections about permanent backward-compatibility dispatch costs.

  3. Andres Freund's fcinfo-context approach (selected) — pass an optional "output context" (containing the caller's StringInfo) through FunctionCallInfo. Opt-in output functions append directly to the caller's buffer; legacy functions continue returning cstring unchanged. Uses PG_WINDOW_OBJECT() / WindowObjectIsValid() as the in-tree precedent for smuggling typed side-channels through fcinfo.
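The selected design can be illustrated with a small self-contained mock. Everything here is an assumption made for illustration: CallInfo, OutputContext, OutBuf and the "return NULL when the buffer was used" convention are hypothetical and are not taken from the posted prototype. The sketch only shows the general shape: a tagged context pointer that converted functions check and unconverted functions ignore.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* All types below are hypothetical stand-ins, not PostgreSQL definitions. */
typedef enum NodeTag { T_Invalid = 0, T_OutputContext = 1 } NodeTag;

typedef struct OutBuf { char data[256]; size_t len; } OutBuf;

typedef struct OutputContext {
    NodeTag type;       /* first field: lets the callee verify the context */
    OutBuf *buf;        /* caller's buffer to append into */
} OutputContext;

typedef struct CallInfo {
    void *context;      /* optional side channel; NULL for legacy callers */
    int   arg;
} CallInfo;

/* Validity check in the spirit of WindowObjectIsValid(): non-NULL and tagged. */
static int output_context_is_valid(const CallInfo *fcinfo)
{
    return fcinfo->context != NULL &&
           *(const NodeTag *) fcinfo->context == T_OutputContext;
}

/* Converted function: appends in place when a context is supplied,
 * otherwise keeps the legacy allocate-and-return contract. */
static char *int4_out(CallInfo *fcinfo)
{
    if (output_context_is_valid(fcinfo)) {
        OutBuf *buf = ((OutputContext *) fcinfo->context)->buf;
        size_t  room = sizeof(buf->data) - buf->len;
        int     n = snprintf(buf->data + buf->len, room, "%d", fcinfo->arg);
        assert(n > 0 && (size_t) n < room);
        buf->len += (size_t) n;
        return NULL;    /* assumed convention: output already in caller's buffer */
    }
    char *result = malloc(12);
    assert(result != NULL);
    snprintf(result, 12, "%d", fcinfo->arg);
    return result;
}

/* Unconverted function: never looks at the context. */
static char *legacy_bool_out(CallInfo *fcinfo)
{
    char *result = malloc(2);
    assert(result != NULL);
    result[0] = fcinfo->arg ? 't' : 'f';
    result[1] = '\0';
    return result;
}

/* Caller that works with both converted and unconverted functions. */
static void send_datum(OutBuf *out, char *(*outfn)(CallInfo *), int value)
{
    OutputContext ctx = { T_OutputContext, out };
    CallInfo      fcinfo = { &ctx, value };
    char         *str = outfn(&fcinfo);

    if (str != NULL) {  /* legacy path: strlen + copy + free */
        size_t len = strlen(str);
        assert(out->len + len < sizeof(out->data));
        memcpy(out->data + out->len, str, len + 1);
        out->len += len;
        free(str);
    }
}
```

Because the caller handles both return conventions, functions can be converted one at a time, which matches the "no flag day" rollout described in this section.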

Tom Lane endorsed the rollout strategy modeled on commit d9f7f5d32 (soft error reporting for input functions): infrastructure first, convert a handful of high-value callees as demonstration, no flag day.

Prototype Posted

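Andres posted a two-patch sketch (2026-05-06):

  1. Patch 1: the pg_server_to_client StringInfo refactor.
  2. Patch 2: the fcinfo-context infrastructure.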

The prototype is self-described as "very rough" and not yet benchmarkable, but establishes the concrete shape of the solution.

Key Technical Insights

Prior Benchmark

Andy's earlier PoC showed ~18% improvement (0.134ms → 0.110ms) on SELECT * FROM demo with oid/text columns, using the now-superseded {type}print approach. The fcinfo-context approach should yield similar or better gains with cleaner architecture.
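The ~18% figure is a latency reduction, and the same two measurements can also be read as a speedup ratio. A minimal sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>
#include <stdio.h>

/* Fraction of latency removed: (before - after) / before. */
static double latency_reduction(double before_ms, double after_ms)
{
    assert(before_ms > 0.0);
    return (before_ms - after_ms) / before_ms;
}

/* Speedup ratio: before / after. */
static double speedup_ratio(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}

/* Prints both readings of the PoC measurement quoted above. */
static void print_poc_numbers(void)
{
    printf("reduction: %.1f%%, ratio: %.3fx\n",
           100.0 * latency_reduction(0.134, 0.110),
           speedup_ratio(0.134, 0.110));
}
```

For 0.134ms → 0.110ms this gives a reduction of about 17.9% and a ratio of about 1.218×.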

Status

The design debate is closed. Next steps:

  1. Andres polishes Patch 1 (pg_server_to_client StringInfo refactor) for independent commit
  2. Patch 2 matures with proper helper macros and benchmark validation
  3. High-value output functions converted incrementally per the soft-error-reporting playbook

History (1 prior analysis)

2026-06-04 · claude-opus-4-6

What's New in This Round

Andy Fan returns after a ~4-week gap with a complete, benchmarked patch series implementing Andres's fcinfo-context approach across multiple data types. This is the first time the thread has production-quality numbers covering a broad type matrix in both text and binary wire formats.

Patch Series Structure (4 patches)

  • 0001 + 0002: Andres's original rough prototype patches, now carried forward as the foundation (the pg_server_to_client / StringInfo refactor and the fcinfo-context infrastructure).
  • 0003 (Andy's new work): Extends the optimization to additional data types beyond Andres's demo, adds helper functions/macros to make conversion easier, and includes a bugfix for binary format (details not elaborated in the message).
  • 0004: Adds binary format support to pgbench — test-only tooling to enable the benchmark methodology.

Benchmark Results — Key Observations

Andy benchmarks 11 data types, each in both binary and text result formats, using pgbench with extended query protocol, single client, 1000 transactions.

Binary format results (strongest wins):

Type         Latency ratio (master/patched)   Interpretation
int2         1.287×                           ~29% faster
int4         1.275×                           ~28% faster
float4       1.272×                           ~27% faster
float8       1.271×                           ~27% faster
time         1.273×                           ~27% faster
timestamp    1.243×                           ~24% faster
timestamptz  1.228×                           ~23% faster
text         1.241×                           ~24% faster
numeric      1.160×                           ~16% faster
int8         1.011×                           ~1% (negligible)
date         1.029×                           ~3% (negligible)

Text format results (moderate wins):

Type         Latency ratio (master/patched)   Interpretation
int8         1.271×                           ~27% faster
int2         1.225×                           ~22% faster
int4         1.217×                           ~22% faster
time         1.211×                           ~21% faster
date         1.169×                           ~17% faster
text         1.148×                           ~15% faster
timestamp    1.150×                           ~15% faster
timestamptz  1.107×                           ~11% faster
float4       1.090×                           ~9% faster
float8       1.079×                           ~8% faster
numeric      1.010×                           ~1% (negligible)
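The "% faster" column in the tables above is the latency ratio minus one. The same ratio can be converted into the fraction of latency removed, which is a smaller number. A short sketch with hypothetical helper names:

```c
#include <assert.h>
#include <stdio.h>

/* Given r = master_latency / patched_latency, return the fraction of
 * latency removed: 1 - patched/master = 1 - 1/r. */
static double reduction_from_ratio(double ratio)
{
    assert(ratio > 0.0);
    return 1.0 - 1.0 / ratio;
}

/* Prints the latency reduction implied by each ratio in the array. */
static void print_reductions(const double *ratios, int n)
{
    for (int i = 0; i < n; i++)
        printf("%.3fx -> %.1f%% lower latency\n",
               ratios[i], 100.0 * reduction_from_ratio(ratios[i]));
}
```

For example, the 1.287× ratio for binary int2 corresponds to roughly 22% lower latency per transaction, and 1.010× to roughly 1%.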

Notable patterns:

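  • numeric shows almost no gain in text mode (~1%) and the smallest of the meaningful gains in binary mode (~16%). The text result is consistent with Tom Lane's earlier prediction that numeric_out's computation dominates, leaving the palloc/strlen/memcpy overhead proportionally irrelevant.
  • int8 is anomalous: negligible in binary (~1%) but the largest win in text (~27%). date shows the same shape (~3% binary, ~17% text). One suggested explanation is that binary int8 is just an 8-byte network-order copy, so there is little to save, but binary int4 and float8 are equally simple and still gain ~27-28%, so these numbers alone do not fully explain the asymmetry. The text-mode win fits the general pattern: string conversion plus palloc overhead is proportionally large.
  • float4/float8 show the inverse asymmetry: large wins in binary (avoiding a palloc for a simple 4- or 8-byte send) but modest wins in text, where the shortest-decimal conversion (float_to_shortest_decimal_buf for float4, double_to_shortest_decimal_buf for float8) dominates.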
  • Simple integer types (int2/int4) win big in both modes — exactly as predicted in earlier rounds since their output functions are almost pure palloc overhead.

Significance

This is the first time the fcinfo-context approach has been benchmarked across a representative type set. The results support the design's value: master/patched latency ratios of roughly 1.15–1.29× for most of the common simple types, with the gain tapering to near zero where computation dominates (numeric in text format). The "convert only high-value types" strategy endorsed by Tom is consistent with these numbers: the patch series apparently converts ~11 types and gets meaningful wins on 9 of them in binary format.

The patch is now in a reviewable state with Andres's infrastructure as the base and Andy's extensions covering the type breadth.