Support for 8-byte TOAST values, round two — Technical Analysis
The Core Problem: TOAST OID Exhaustion
PostgreSQL's TOAST (The Oversized-Attribute Storage Technique) mechanism stores large field values out-of-line in a companion TOAST relation. Each out-of-line value is identified by a 4-byte Oid (chunk_id), allocated from a cluster-wide OID counter with collision checks against the per-TOAST-table unique index on chunk_id.
This 32-bit identifier space has become an operational hazard on large, write-heavy systems:
- Collision-driven slowdowns at scale. GetNewOidWithIndex() must loop and probe the TOAST index when the space becomes densely populated. On a single TOAST relation approaching billions of live chunk_ids, insert latency becomes dominated by OID collision retries, and in pathological cases can live-lock.
- Hard ceiling. A TOAST relation cannot store more than ~2^32 distinct live chunk_id values. Because every out-of-line datum is located by the pair (toastrelid, chunk_id), this caps the number of distinct TOAST-ed values a relation can hold at any one time.
- Wraparound. OID allocation has historically skipped values below FirstNormalObjectId after the counter wraps, further shrinking the usable space and making wraparound-driven collisions more frequent once the counter laps.
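The retry behavior described above can be illustrated with a toy model. This is a sketch only, not PostgreSQL's code: ToyToastIndex, toy_next_oid and toy_new_chunk_id are hypothetical names, and the "index" is simply a contiguous range of taken IDs standing in for the unique-index probe. Only the value 16384 for FirstNormalObjectId is taken from PostgreSQL itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors PostgreSQL's FirstNormalObjectId; everything else is hypothetical. */
#define TOY_FIRST_NORMAL_OBJECT_ID 16384u

/* Stand-in for the unique index on chunk_id: every ID in
 * [used_lo, used_hi] counts as already taken. */
typedef struct ToyToastIndex {
    uint32_t used_lo;
    uint32_t used_hi;
} ToyToastIndex;

static bool toy_index_contains(const ToyToastIndex *idx, uint32_t id)
{
    return id >= idx->used_lo && id <= idx->used_hi;
}

/* Advance the shared 32-bit counter, skipping the reserved low range
 * once the counter has wrapped around to a small value. */
static uint32_t toy_next_oid(uint32_t *counter)
{
    if (*counter < TOY_FIRST_NORMAL_OBJECT_ID)
        *counter = TOY_FIRST_NORMAL_OBJECT_ID;
    return (*counter)++;   /* unsigned overflow wraps to 0 by definition */
}

/* Keep drawing candidates until one is absent from the index.
 * Returns the chosen ID and stores the number of probes in *probes.
 * If every ID were taken this loop would never end, which is the
 * live-lock case mentioned in the text. */
static uint32_t toy_new_chunk_id(uint32_t *counter, const ToyToastIndex *idx,
                                 uint64_t *probes)
{
    uint32_t candidate;

    assert(counter != NULL && idx != NULL && probes != NULL);
    *probes = 0;
    do {
        candidate = toy_next_oid(counter);
        (*probes)++;
    } while (toy_index_contains(idx, candidate));
    return candidate;
}
```

With the counter sitting at the start of a run of 1,000 taken IDs, a single allocation in this model costs 1,001 probes; the real cost is higher still because each probe is an index lookup.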
The goal of this patch set is to lift that ceiling by allowing a TOAST relation to use a 64-bit identifier (Oid8) instead of a 32-bit Oid, while retaining full binary compatibility for existing clusters that don't opt in.
Why This Is Round Two: Rejection of the Callback Approach
The prior revision (v1–v19, referenced by message aFOnKHG7Wn-Srnpv@paquier.xyz) abstracted over the varlena external pointer variants using function pointer dispatch — a callback per vartag_external. Reviewer feedback was decisive on two points:
- Performance: Detoasting sits on the hot path of virtually every query that touches wide columns. Indirect calls through function pointers defeat inlining in detoast.c, toast_internals.c, and heaptoast.c, and inject branch-predictor pressure where a direct switch is already well-predicted.
- Readability / maintainability: Pointer redirections obscured what is, fundamentally, a narrow dispatch on a tag byte.
Michael Paquier's v20 abandons that design entirely in favor of what he calls the "brutal" approach: open-coded branching on either (a) the vartag_external of an in-memory varlena datum, or (b) the atttype of the TOAST relation's chunk_id column when reading from disk. No function pointers, no vtables — just two concrete shapes handled inline.
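As a rough illustration of what open-coded dispatch looks like on both sides, consider the sketch below. It is not taken from the patch: the tag values, the ToyChunkIdType enum, and both function names are hypothetical. The payload sizes of 16 and 20 bytes correspond to the two external pointer layouts discussed later in this analysis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag values; the patch defines its own. */
typedef enum ToyVartag {
    TOY_VARTAG_ONDISK_OID = 1,    /* pointer carrying a 4-byte value ID */
    TOY_VARTAG_ONDISK_OID8 = 2    /* pointer carrying an 8-byte value ID */
} ToyVartag;

/* Hypothetical stand-in for the atttype of the TOAST table's chunk_id. */
typedef enum ToyChunkIdType {
    TOY_CHUNK_ID_OID,
    TOY_CHUNK_ID_OID8
} ToyChunkIdType;

/* Write side: the type of the chunk_id column decides which tag to emit. */
static ToyVartag toy_tag_for_chunk_id_type(ToyChunkIdType type)
{
    assert(type == TOY_CHUNK_ID_OID || type == TOY_CHUNK_ID_OID8);
    switch (type) {
    case TOY_CHUNK_ID_OID8:
        return TOY_VARTAG_ONDISK_OID8;
    case TOY_CHUNK_ID_OID:
    default:
        return TOY_VARTAG_ONDISK_OID;
    }
}

/* Read side: the tag byte decides how many payload bytes follow.
 * A direct switch like this is trivially inlinable, unlike a call
 * through a per-tag function pointer. Returns 0 for unknown tags. */
static size_t toy_payload_size(uint8_t tag)
{
    switch (tag) {
    case TOY_VARTAG_ONDISK_OID:
        return 16;   /* rawsize + extinfo + valueid + toastrelid */
    case TOY_VARTAG_ONDISK_OID8:
        return 20;   /* rawsize + extinfo + valueid lo/hi + toastrelid */
    default:
        return 0;
    }
}
```

The cost of this style is that each call site repeats the switch; the benefit is that the compiler sees every branch target.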
The Design
Opt-in per table via reloption
A new toast_value_type reloption takes values oid (default, current behavior) or oid8. Setting this at CREATE TABLE time causes the TOAST relation to be built with chunk_id of type oid8 instead of oid. Crucially:
- No in-place migration. A VACUUM FULL or table rewrite cannot change the TOAST type. This is an explicit implementation-simplicity tradeoff: supporting mid-life TOAST type switching would require rewriting every external varlena pointer stored in every heap tuple, which crosses into territory that would need either a logical rewrite or a dual-format reader over a long transition. Paquier punts on this, a defensible choice given that users motivated to flip the bit can do so via logical replication or a pg_dump and restore into a freshly created table.
- Dump / binary-upgrade support. pg_dump and pg_upgrade must preserve the chunk_id atttype so that existing on-disk external pointers continue to resolve. This is non-negotiable: an upgraded cluster whose TOAST table silently changed identifier width would produce garbage on every detoast.
The Oid8 counter and wraparound
The 64-bit TOAST value ID is allocated from a new counter persisted in pg_control, extended by 4 bytes. Reusing the control file rather than introducing a new SLRU or file avoids crash-recovery complications; pg_control is already fsync'd on checkpoint and replicated via base backup and streaming.
The counter preserves the existing semantics of skipping the [0, FirstNormalObjectId) range on wraparound. Note the subtlety: with 64 bits, wraparound is effectively unreachable at any realistic write rate, but the skip logic is retained for consistency with how OIDs behave and to keep the low range reserved for future sentinel use.
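To put numbers on "effectively unreachable", here is a small sketch of a 64-bit allocator with the same skip rule, plus a back-of-the-envelope estimate of how long a full lap takes. Both functions are hypothetical illustrations, not the patch's code, and the allocation rate passed in is an assumption chosen by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors PostgreSQL's FirstNormalObjectId, widened to 64 bits. */
#define TOY_FIRST_NORMAL_OBJECT_ID UINT64_C(16384)

/* Hand out the next 64-bit value ID, skipping [0, FirstNormalObjectId)
 * whenever the counter is found in the reserved low range. */
static uint64_t toy_next_oid8(uint64_t *counter)
{
    assert(counter != NULL);
    if (*counter < TOY_FIRST_NORMAL_OBJECT_ID)
        *counter = TOY_FIRST_NORMAL_OBJECT_ID;
    return (*counter)++;   /* wraps to 0 after UINT64_MAX */
}

/* Years needed to consume all 2^64 values at a sustained rate of
 * ids_per_second, using a 365.25-day year. */
static double toy_years_to_wrap(double ids_per_second)
{
    const double total_ids = 18446744073709551616.0;      /* 2^64 */
    const double seconds_per_year = 365.25 * 24.0 * 3600.0;

    assert(ids_per_second > 0.0);
    return total_ids / ids_per_second / seconds_per_year;
}
```

At an assumed one million allocations per second, toy_years_to_wrap returns a little over 584,000 years, which is why the skip logic matters for consistency rather than for any practical wraparound.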
The new varlena external tag
A new vartag_external value is introduced alongside a new on-disk / in-memory struct:
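typedef struct varatt_external_oid8
{
    int32  va_rawsize;     /* original data size (includes header) */
    uint32 va_extinfo;     /* external saved size and compression method */
    uint32 va_valueid_lo;  /* low 32 bits of the 64-bit value ID */
    uint32 va_valueid_hi;  /* high 32 bits of the 64-bit value ID */
    Oid    va_toastrelid;  /* OID of the TOAST table holding the value */
} varatt_external_oid8;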
Two design points deserve attention:
- Split into _lo / _hi uint32 pair instead of a single uint64. This is deliberate to avoid 8-byte alignment padding inside the struct on platforms where uint64 forces 8-byte alignment. The struct as declared is 20 bytes with no internal padding, matching how varatt_external (16 bytes) is treated as a packed on-disk representation via memcpy through VARATT_EXTERNAL_GET_POINTER / SET_POINTER macros. Preserving the pack-without-padding invariant is important because these structs are embedded directly into heap tuples as part of the varlena datum; any padding difference would change the on-disk footprint across architectures.
- The tag, not the atttype, drives in-memory dispatch. Once a tuple is loaded, all that's available is the varlena datum itself. The vartag_external byte immediately after the VARTAG_1B_E header tells detoast code which struct shape to read. Conversely, when a backend is creating a new external pointer (in toast_save_datum / heap_toast_insert_or_update), it must consult the TOAST relation's chunk_id atttype to decide which shape to write. This is the two-sided dispatch Paquier refers to.
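The layout argument can be checked with a standalone sketch. The types below use standard C integer types in place of PostgreSQL's int32 / uint32 / Oid, and every name (toy_external_oid8, toy_external_u64, the helper functions) is hypothetical. The second struct exists only to show what a single 64-bit member would do to the size on platforms that align uint64_t to 8 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same member order as the struct in the text, with stand-in types. */
typedef struct toy_external_oid8 {
    int32_t  rawsize;
    uint32_t extinfo;
    uint32_t valueid_lo;
    uint32_t valueid_hi;
    uint32_t toastrelid;
} toy_external_oid8;

/* Comparison only: one uint64_t member instead of the lo/hi pair.
 * Where uint64_t is 8-byte aligned, tail padding grows this to 24 bytes. */
typedef struct toy_external_u64 {
    int32_t  rawsize;
    uint32_t extinfo;
    uint64_t valueid;
    uint32_t toastrelid;
} toy_external_u64;

/* Compile-time check that the lo/hi layout is exactly 20 bytes. */
typedef char toy_size_check[(sizeof(toy_external_oid8) == 20) ? 1 : -1];

/* Reassemble the 64-bit value ID from its two halves. */
static uint64_t toy_get_valueid(const toy_external_oid8 *p)
{
    return ((uint64_t) p->valueid_hi << 32) | p->valueid_lo;
}

/* Split a 64-bit value ID into the two 32-bit members. */
static void toy_set_valueid(toy_external_oid8 *p, uint64_t id)
{
    p->valueid_lo = (uint32_t) (id & UINT64_C(0xFFFFFFFF));
    p->valueid_hi = (uint32_t) (id >> 32);
}

/* Copy the pointer out of a possibly unaligned tuple buffer, the same
 * memcpy-based approach the text describes for varatt_external. */
static toy_external_oid8 toy_read_pointer(const unsigned char *buf)
{
    toy_external_oid8 out;

    memcpy(&out, buf, sizeof out);
    return out;
}
```

Because the pointer is always copied with memcpy rather than dereferenced in place, the struct never needs to be aligned inside the tuple, and the 20-byte footprint stays the same on every architecture with 4-byte int32_t alignment.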
Layering of the patch set
The patch set is deliberately stacked so that ground-work lands independently of the user-visible feature:
- Renames and cleanup of varatt_external references, making the "OID-ness" explicit in identifiers so the eventual oid8 variant reads naturally alongside.
- Infrastructure for the Oid8 counter in pg_control, and for carrying the TOAST value type through the catalog / reloption machinery.
- Feature patch (last) adds the new vartag_external and the varatt_external_oid8 struct.
An interesting property Paquier highlights: if you apply every patch except the last, an oid8 TOAST table works — but using the existing varatt_external with a truncated identifier. This is not a useful configuration in itself but makes each patch independently reviewable and testable, which matters given the breadth of code touched (detoast, compression, reorderbuffer.c, amcheck).
Touched Subsystems and Why Each Matters
- detoast.c / toast_internals.c / heaptoast.c: the read path. Must branch on vartag_external to select the right struct layout and extract the 64-bit value ID.
- reorderbuffer.c: logical decoding reassembles TOAST chunks for changes it streams. It builds its own hash keyed by chunk_id; the key type must widen to Oid8 for oid8 TOAST tables. This is a correctness-critical path: mis-keying would silently drop or duplicate TOAST chunks in replication output.
- amcheck: verifies TOAST references from heap tuples. Must understand both external pointer shapes to validate consistency.
- Compression path: creating new external pointers after compression must emit the correct tag and struct.
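The mis-keying hazard for chunk reassembly can be shown with a toy table. This sketch bears no relation to the actual reorderbuffer.c data structures: ToyChunkTable and its functions are hypothetical, and a linear scan stands in for the hash. The point is only that a 32-bit key merges two distinct 64-bit value IDs whose low halves match.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8

typedef struct ToySlot {
    bool     used;
    uint64_t key;
    int      nchunks;
} ToySlot;

typedef struct ToyChunkTable {
    ToySlot slots[TOY_SLOTS];
    bool    wide_keys;   /* true: key on all 64 bits; false: truncate to 32 */
} ToyChunkTable;

/* Derive the lookup key, truncating when the table uses narrow keys. */
static uint64_t toy_key(const ToyChunkTable *t, uint64_t value_id)
{
    return t->wide_keys ? value_id : (uint64_t) (uint32_t) value_id;
}

/* Record one chunk for value_id. Returns the number of chunks now
 * stored under its key, or -1 if the table is full. */
static int toy_add_chunk(ToyChunkTable *t, uint64_t value_id)
{
    uint64_t key = toy_key(t, value_id);
    int i;

    assert(t != NULL);
    for (i = 0; i < TOY_SLOTS; i++)
        if (t->slots[i].used && t->slots[i].key == key)
            return ++t->slots[i].nchunks;
    for (i = 0; i < TOY_SLOTS; i++) {
        if (!t->slots[i].used) {
            t->slots[i].used = true;
            t->slots[i].key = key;
            t->slots[i].nchunks = 1;
            return 1;
        }
    }
    return -1;
}

/* Number of distinct values the table believes it is reassembling. */
static int toy_distinct_values(const ToyChunkTable *t)
{
    int i, n = 0;

    for (i = 0; i < TOY_SLOTS; i++)
        if (t->slots[i].used)
            n++;
    return n;
}
```

Feeding the IDs 0x5 and 0x100000005 into a narrow-keyed table yields one entry with two chunks, that is, two unrelated values spliced together; the wide-keyed table keeps them apart.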
Paquier reports the final feature patch footprint is modest — 9 files, +537 / -208 — which is credible only because the ground-work patches absorbed most of the churn.
Key Tradeoffs and Open Questions
- No in-place conversion is the single biggest usability compromise. Users with an existing hot TOAST table approaching the OID ceiling must do a logical migration. This is the right call for a first cut but will likely generate follow-up requests.
- Reloption granularity is per-table. There is no cluster-wide default to make every new table oid8. Installations that know they want oid8 everywhere would have to set it at CREATE TABLE time or via a template. A GUC-driven default would be a natural follow-up.
- pg_control growth. Adding 4 bytes to pg_control is cheap but is an on-disk format change requiring pg_upgrade handling and a PG_CONTROL_VERSION bump. This is routine for major versions.
- The "brutal" approach leaves the door open to a third variant. If someone later proposes, say, a variable-length value ID, they'd have to add yet another vartag_external and another open-coded branch. The callback approach would have absorbed new variants more gracefully, a point reviewers may revisit, though Paquier's performance argument remains strong.
Weight of the Proposer
Michael Paquier is a longtime PostgreSQL committer with deep familiarity with the storage, TOAST, and pg_control subsystems. The fact that this is a determined v20 resubmission after a round of pushback, with the design substantially rethought rather than merely patched, signals both personal investment and technical seriousness. The mention of pgconf.dev discussion and explicit targeting of v20 (the PostgreSQL major version) indicates this is being staged for a full release cycle of review.
Status at This Point in the Thread
Only the initial post exists in this excerpt. No reviewer responses, no benchmark numbers, no committer pushback on the new design are yet present. The real technical debate — whether the "brutal" dispatch remains clean as more call sites are touched, whether pg_control is the right home for the counter, and whether in-place conversion can be bolted on later — will play out in subsequent messages.