Refactor code around GUC default_toast_compression

First seen: 2026-05-01 07:50:46+00:00 · Messages: 4 · Participants: 3

Latest Update

2026-05-06 · opus 4.7

Refactoring the TOAST Compression Method Registry

Architectural Context

PostgreSQL's TOAST (The Oversized-Attribute Storage Technique) subsystem supports a fixed, built-in set of compression methods. The on-disk encoding is painfully tight: only 2 bits in va_tcinfo/va_extinfo of a compressed varlena header identify the compression algorithm. That is a hard ceiling of four possible values, one of which is already reserved as TOAST_INVALID_COMPRESSION_ID. Currently only two are used (pglz, lz4).
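To make the bit budget concrete, here is a minimal standalone sketch of packing and unpacking a 2-bit method ID in the top bits of a 32-bit info word. The 30/2 split is modeled on how va_extinfo is divided into a size field and a method field; every name below (SKETCH_SIZE_BITS, pack_info, info_cmid, info_size) is hypothetical and not taken from the patch or from the server headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the layout mimics a 30-bit size plus 2-bit method ID. */
#define SKETCH_SIZE_BITS 30
#define SKETCH_SIZE_MASK ((1U << SKETCH_SIZE_BITS) - 1)

/* Pack a size (low 30 bits) and a compression ID (top 2 bits) into one word. */
static uint32_t
pack_info(uint32_t size, uint32_t cmid)
{
    assert(size <= SKETCH_SIZE_MASK);
    assert(cmid <= 3);          /* two bits: only 0..3 can be represented */
    return size | (cmid << SKETCH_SIZE_BITS);
}

/* Recover the size stored in the low 30 bits. */
static uint32_t
info_size(uint32_t info)
{
    return info & SKETCH_SIZE_MASK;
}

/* Recover the compression ID stored in the top 2 bits. */
static uint32_t
info_cmid(uint32_t info)
{
    return info >> SKETCH_SIZE_BITS;
}
```

Because the ID shares a word with the size, widening it would mean taking bits away from the size field or changing the header format, which is why the four-value ceiling is treated as hard.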

The 2-bit on-disk ID is only one of three representations of "which compression method":

  1. On-disk ID (ToastCompressionId, 0..3): the raw bits stored in the varlena header.
  2. Catalog char (pg_attribute.attcompression: 'p', 'l', or '\0'): what TOAST_PGLZ_COMPRESSION/TOAST_LZ4_COMPRESSION macros expand to.
  3. GUC integer for default_toast_compression: historically stored the char value, masquerading as an int for the GUC machinery.

These three representations are scattered across toast_compression.[ch], detoast.c, toast_internals.c, pg_column_compression(), and the GUC table, with ad-hoc switch statements translating between them. Every addition of a new compression method (zstd is a long-discussed candidate) requires touching all of these sites and carefully preserving invariants between the char encoding, the GUC enum, and the on-disk bits.
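As an illustration of the kind of hand-written translation being described, the sketch below models the on-disk ID and the catalog char with two independent switches. It is a simplified stand-in, not code from the tree: the SKETCH_ and sketch_ names are hypothetical, and the numeric ID values are an assumption about the current layout.

```c
/* 1. On-disk ID: the 2-bit value in the varlena header (assumed values). */
typedef enum SketchCompressionId
{
    SKETCH_PGLZ_ID = 0,
    SKETCH_LZ4_ID = 1,
    SKETCH_INVALID_ID = 2
} SketchCompressionId;

/* 2. Catalog char, as stored in pg_attribute.attcompression. */
#define SKETCH_PGLZ_CHAR    'p'
#define SKETCH_LZ4_CHAR     'l'
#define SKETCH_INVALID_CHAR '\0'

/* One hand-written translation: catalog char to on-disk ID. */
static SketchCompressionId
sketch_char_to_id(char method)
{
    switch (method)
    {
        case SKETCH_PGLZ_CHAR:
            return SKETCH_PGLZ_ID;
        case SKETCH_LZ4_CHAR:
            return SKETCH_LZ4_ID;
        default:
            return SKETCH_INVALID_ID;
    }
}

/* A second, independent translation: on-disk ID to user-visible name. */
static const char *
sketch_id_to_name(SketchCompressionId id)
{
    switch (id)
    {
        case SKETCH_PGLZ_ID:
            return "pglz";
        case SKETCH_LZ4_ID:
            return "lz4";
        default:
            return "invalid";
    }
}
```

Adding a third method to this sketch means editing the enum, the macros, and both switches, and nothing forces the four edits to agree with each other.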

The Patch's Core Idea

Michael Paquier's proposal introduces a centralized method registry in toast_compression.c that holds, per method:

Translation helpers (MethodToCompressionId, CompressionIdToMethod, and GUC ↔ char mappings) replace the hand-written switches. The semantic cleanup also removes TOAST_INVALID_COMPRESSION_ID from the GUC enum list, which never made sense as a user-selectable value — it was an artifact of reusing the on-disk ID space for GUC values.

Crucially, the GUC default_toast_compression now stores a new, independent integer encoding (TOAST_PGLZ_COMPRESSION_GUC, TOAST_LZ4_COMPRESSION_GUC) rather than the char value 'p'/'l' cast to int. This decouples the GUC representation from both the catalog char and the on-disk ID.
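A registry of this kind might look roughly like the following. This is a sketch under assumptions, not the patch's code: the struct layout, the array name, and the sketch_ helpers are all hypothetical, and the assignment of GUC value 0 to pglz and 1 to lz4 is assumed from the thread's statement that the new encoding holds 0 or 1.

```c
#include <stddef.h>

/* Hypothetical per-method entry tying the representations together. */
typedef struct SketchMethod
{
    const char *name;       /* user-visible name */
    char        method;     /* catalog char in attcompression */
    int         guc;        /* independent GUC encoding (assumed values) */
    int         cmid;       /* on-disk 2-bit ID (assumed values) */
} SketchMethod;

static const SketchMethod sketch_methods[] = {
    {"pglz", 'p', 0, 0},
    {"lz4", 'l', 1, 1},
};

#define SKETCH_NMETHODS (sizeof(sketch_methods) / sizeof(sketch_methods[0]))

/* Catalog char to on-disk ID; returns -1 for an unknown method. */
static int
sketch_method_to_cmid(char method)
{
    for (size_t i = 0; i < SKETCH_NMETHODS; i++)
    {
        if (sketch_methods[i].method == method)
            return sketch_methods[i].cmid;
    }
    return -1;
}

/* GUC value to catalog char; returns '\0' for an unknown value. */
static char
sketch_guc_to_method(int guc)
{
    for (size_t i = 0; i < SKETCH_NMETHODS; i++)
    {
        if (sketch_methods[i].guc == guc)
            return sketch_methods[i].method;
    }
    return '\0';
}
```

With a table like this, a new method is one added row, and every translation helper picks it up without further edits.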

Key Design Decisions and Tradeoffs

Decision: Keep varlena-pointer decompression paths untouched

Ayush Tiwari asked why pg_column_compression() still has its own cmid-to-name switch instead of routing through CompressionIdToMethod(). Paquier's response is telling: the code paths that start from a raw varlena pointer and extract the on-disk ID are intentionally left with explicit switches, because they are the hottest decompression paths and are tightly coupled to the 2-bit on-disk layout. Centralizing them would mean an indirection through the registry on every detoast. The registry is therefore a configuration/catalog-facing abstraction, not a runtime dispatch table for decompression.

This is a reasonable line to draw, but it does mean "adding a new compression method" still requires editing the decompression switches — the patch is a cleanup, not a full pluggability framework.

Disagreement: Naming and the silent-misbehavior risk

Evan Chao raises the sharpest concern: the semantic meaning of default_toast_compression (the C global int) changes. Before the patch it holds 'p' (112) or 'l' (108); after, it holds 0 or 1. Any out-of-tree extension that reads this global directly will compile cleanly and silently do the wrong thing — e.g., treating 0 as '\0' (invalid) or indexing into the wrong array.

His proposed mitigation is to rename the variable (and the DEFAULT_TOAST_COMPRESSION macro) to include a _GUC suffix, forcing a compile error in third-party code. This is the classic Postgres API-break hygiene argument: in a major-version bump you are allowed to change semantics, but you should make the change loud rather than silent. The argument has merit and Paquier has not yet pushed back on it in this thread.
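The failure mode is easy to reproduce in isolation. The sketch below uses a hypothetical stand-in variable (sketch_default_toast_compression) rather than the real server global, and a hypothetical extension helper written against the old 'p'/'l' meaning. It assumes that pglz maps to 0 in the new encoding, which the thread implies but does not spell out here.

```c
/* Hypothetical stand-in for the server global; not the real declaration. */
static int sketch_default_toast_compression;

/*
 * Hypothetical extension code written against the old meaning, where the
 * global held the catalog char ('p' or 'l') stored in an int.
 */
static int
ext_default_is_pglz(void)
{
    return sketch_default_toast_compression == 'p';
}
```

Setting the stand-in to 'p' makes the helper return 1, as the extension author intended. Setting it to 0, the assumed new encoding for pglz, makes the same helper return 0 with no compiler diagnostic. If the variable were renamed instead, the reference inside the helper would no longer resolve and the extension would fail to build, which is the loud outcome the rename is meant to produce.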

Evan's point #3 is a readability hazard: TOAST_COMPRESS_PGLZ (value 0) and TOAST_PGLZ_COMPRESSION (value 'p') differ only subtly in name but belong to completely different numeric spaces. His suggested TOAST_PGLZ_COMPRESS_ID vs TOAST_PGLZ_COMPRESS_METHOD makes the distinction explicit.

Minor: Ordering of MethodToCompressionId call

Evan's point #4 is a stylistic preference: pull the MethodToCompressionId translation out of the switch so the default/error case runs first. It is a small readability win. Functional behavior is unchanged, because the elog(ERROR, ...) in the default branch is unreachable as long as MethodToCompressionId accepts exactly the same set of values the switch handles.

What's at Stake Architecturally

This refactor matters beyond aesthetics because the TOAST compression extensibility story has been stuck for years. Every serious proposal (zstd, snappy, per-column custom codecs) trips on the same scattered representations. By collapsing the char/GUC/on-disk translations into one registry, the patch lowers the activation energy for adding a third built-in method. That matters because two bits give four values, and with pglz, lz4, and INVALID already taken there is only one slot left for a built-in method before the on-disk format itself must change. Clean bookkeeping is a prerequisite for anyone proposing to consume that slot.

It is also notable what the patch does not attempt: it does not introduce a compression method AM, does not make compression methods extension-definable, and does not expand the on-disk bits. It is a tactical internal cleanup targeted at v20, not a pluggability overhaul.

Review Consensus

Both reviewers are supportive of the direction. The outstanding items after v2 are:

v2 already addressed Ayush's three points: the stray alter_index.sgml hunk was removed, the stale comment on default_toast_compression was deleted entirely, and the unused CompressionIdToMethod() helper was removed.