Refactoring the TOAST Compression Method Registry
Architectural Context
PostgreSQL's TOAST (The Oversized-Attribute Storage Technique) subsystem supports a small, fixed set of built-in compression methods. The on-disk encoding is painfully tight: only 2 bits in va_tcinfo/va_extinfo of a compressed varlena header identify the compression algorithm. That is a hard ceiling of four possible values, one of which is already reserved as TOAST_INVALID_COMPRESSION_ID. Currently only two are used (pglz, lz4).
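As a rough illustration of how tight that encoding is, here is a minimal sketch of a 32-bit header word that shares 30 bits of size with a 2-bit method ID. The 30/2 split is the assumption being illustrated, and every name starting with sketch_ or SKETCH_ is hypothetical rather than taken from the server headers.

```c
#include <stdint.h>

/* Hypothetical names; the 30-bit size / 2-bit ID split is an assumption. */
#define SKETCH_EXTSIZE_BITS 30
#define SKETCH_EXTSIZE_MASK ((1U << SKETCH_EXTSIZE_BITS) - 1)

/* Pack a raw size (low 30 bits) and a method ID (top 2 bits) into one word. */
static uint32_t
sketch_pack_tcinfo(uint32_t rawsize, uint32_t cmid)
{
    return (rawsize & SKETCH_EXTSIZE_MASK) |
           ((cmid & 0x3U) << SKETCH_EXTSIZE_BITS);
}

/* Extract the 2-bit method ID: only the values 0..3 can ever come back. */
static uint32_t
sketch_get_cmid(uint32_t tcinfo)
{
    return tcinfo >> SKETCH_EXTSIZE_BITS;
}

/* Extract the 30-bit raw size. */
static uint32_t
sketch_get_rawsize(uint32_t tcinfo)
{
    return tcinfo & SKETCH_EXTSIZE_MASK;
}
```

Whatever the real macro names are, the point carries over: the ID field cannot express a fifth value without changing the layout.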
The 2-bit on-disk ID is only one of three representations of "which compression method":
- On-disk ID (ToastCompressionId, 0..3): the raw bits stored in the varlena header.
- Catalog char (pg_attribute.attcompression: 'p', 'l', or '\0'): what the TOAST_PGLZ_COMPRESSION/TOAST_LZ4_COMPRESSION macros expand to.
- GUC integer for default_toast_compression: historically stored the char value, masquerading as an int for the GUC machinery.
These three representations are scattered across toast_compression.[ch], detoast.c, toast_internals.c, pg_column_compression(), and the GUC table, with ad-hoc switch statements translating between them. Every addition of a new compression method (zstd is a long-discussed candidate) requires touching all of these sites and carefully preserving invariants between the char encoding, the GUC enum, and the on-disk bits.
The Patch's Core Idea
Michael Paquier's proposal introduces a centralized method registry in toast_compression.c that holds, per method:
- Name (human-readable, for pg_column_compression() output and error messages)
- GUC enum value (the integer the GUC machinery stores)
- attcompression char value (for the catalog)
- Varatt on-disk ID (the 2-bit value)
Translation helpers (MethodToCompressionId, CompressionIdToMethod, and GUC ↔ char mappings) replace the hand-written switches. The semantic cleanup also removes TOAST_INVALID_COMPRESSION_ID from the GUC enum list, which never made sense as a user-selectable value — it was an artifact of reusing the on-disk ID space for GUC values.
Crucially, the GUC default_toast_compression now stores a new, independent integer encoding (TOAST_PGLZ_COMPRESSION_GUC, TOAST_LZ4_COMPRESSION_GUC) rather than the char value 'p'/'l' cast to int. This decouples the GUC representation from both the catalog char and the on-disk ID.
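The shape of such a registry can be sketched as a small static table with lookup helpers. This is not the patch's code: the struct, the table, and the helper names are hypothetical, and the numeric values (GUC 0/1, on-disk 0/1) are assumptions chosen to match the values mentioned in this summary.

```c
#include <stddef.h>

/* Hypothetical registry entry: one row per built-in method. */
typedef struct SketchMethod
{
    const char *name;           /* human-readable name */
    int         guc_value;      /* integer the GUC machinery stores */
    char        attcompression; /* catalog char */
    int         varatt_id;      /* 2-bit on-disk ID */
} SketchMethod;

/* Assumed values, for illustration only. */
static const SketchMethod sketch_methods[] = {
    {"pglz", 0, 'p', 0},
    {"lz4", 1, 'l', 1},
};

#define SKETCH_NMETHODS (sizeof(sketch_methods) / sizeof(sketch_methods[0]))

/* Find a method by its GUC integer; NULL if the value is unknown. */
static const SketchMethod *
sketch_method_by_guc(int guc_value)
{
    for (size_t i = 0; i < SKETCH_NMETHODS; i++)
        if (sketch_methods[i].guc_value == guc_value)
            return &sketch_methods[i];
    return NULL;
}

/* Find a method by its catalog char; NULL for unknown chars such as '\0'. */
static const SketchMethod *
sketch_method_by_char(char c)
{
    for (size_t i = 0; i < SKETCH_NMETHODS; i++)
        if (sketch_methods[i].attcompression == c)
            return &sketch_methods[i];
    return NULL;
}

/* GUC integer -> catalog char, or '\0' when the value is unknown. */
static char
sketch_guc_to_char(int guc_value)
{
    const SketchMethod *m = sketch_method_by_guc(guc_value);

    return m ? m->attcompression : '\0';
}

/* Catalog char -> on-disk ID, or -1 when the char is unknown. */
static int
sketch_char_to_varatt_id(char c)
{
    const SketchMethod *m = sketch_method_by_char(c);

    return m ? m->varatt_id : -1;
}
```

With a table like this, adding a method means adding one row, and each translation is a lookup instead of another hand-written switch.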
Key Design Decisions and Tradeoffs
Decision: Keep varlena-pointer decompression paths untouched
Ayush Tiwari asked why pg_column_compression() still has its own cmid-to-name switch instead of routing through CompressionIdToMethod(). Paquier's response is telling: the code paths that start from a raw varlena pointer and extract the on-disk ID are intentionally left with explicit switches, because they are the hottest decompression paths and are tightly coupled to the 2-bit on-disk layout. Centralizing them would mean an indirection through the registry on every detoast. The registry is therefore a configuration/catalog-facing abstraction, not a runtime dispatch table for decompression.
This is a reasonable line to draw, but it does mean "adding a new compression method" still requires editing the decompression switches — the patch is a cleanup, not a full pluggability framework.
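For contrast, the kind of explicit switch that stays on the varlena side might look like the sketch below. The enum and function names are hypothetical, and the numeric value given to the invalid ID is an assumption; real server code would raise an error rather than return NULL.

```c
#include <stddef.h>

/* Hypothetical on-disk IDs; the value of the invalid ID is an assumption. */
enum
{
    SKETCH_PGLZ_ID = 0,
    SKETCH_LZ4_ID = 1,
    SKETCH_INVALID_ID = 2
};

/*
 * Map an on-disk ID straight to a method name with no registry lookup.
 * A new method needs a new case here, independent of any registry row.
 */
static const char *
sketch_cmid_to_name(unsigned int cmid)
{
    switch (cmid)
    {
        case SKETCH_PGLZ_ID:
            return "pglz";
        case SKETCH_LZ4_ID:
            return "lz4";
        default:
            return NULL;    /* stand-in for an error report */
    }
}
```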
Disagreement: Naming and the silent-misbehavior risk
Evan Chao raises the sharpest concern: the semantic meaning of default_toast_compression (the C global int) changes. Before the patch it holds 'p' (112) or 'l' (108); after, it holds 0 or 1. Any out-of-tree extension that reads this global directly will compile cleanly and silently do the wrong thing — e.g., treating 0 as '\0' (invalid) or indexing into the wrong array.
His proposed mitigation is to rename the variable (and the DEFAULT_TOAST_COMPRESSION macro) to include a _GUC suffix, forcing a compile error in third-party code. This is the classic Postgres ABI-break hygiene argument: in a major-version bump you are allowed to change semantics, but you should make the change loud rather than silent. The argument has merit and Paquier has not yet pushed back on it in this thread.
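The hazard is easy to reproduce in miniature. The function below stands in for a hypothetical out-of-tree extension check written against the old char encoding; the name ext_default_is_pglz is invented for this sketch, and the new-encoding value 0 for pglz is taken from the description above.

```c
#include <stdbool.h>

/*
 * Hypothetical extension code: it assumes the global holds the catalog
 * char, so it compares against 'p' (112).
 */
static bool
ext_default_is_pglz(int default_toast_compression_value)
{
    return default_toast_compression_value == 'p';
}
```

Passing the old encoding ('p') returns true, while passing the new encoding for pglz (0) returns false, and the compiler has nothing to complain about in either case. If the global were renamed, the extension's reference to the old name would fail to compile, which is the loud failure Evan is asking for.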
In his point #3, Evan also notes a readability hazard: TOAST_COMPRESS_PGLZ (value 0) and TOAST_PGLZ_COMPRESSION (value 'p') differ only subtly in name but belong to completely different numeric spaces. His suggested TOAST_PGLZ_COMPRESS_ID vs TOAST_PGLZ_COMPRESS_METHOD makes the distinction explicit.
Minor: Ordering of MethodToCompressionId call
Evan's point #4 is a stylistic preference: pull the translation out of the switch so the default/error case runs first. It's a small readability win. Functional behavior is unchanged as long as MethodToCompressionId accepts the same set of values as the switch, because the elog(ERROR, ...) in the default branch is then unreachable.
What's at Stake Architecturally
This refactor matters beyond aesthetics because the TOAST compression extensibility story has been stuck for years. Every serious proposal (zstd, snappy, per-column custom codecs) trips on the same scattered representations. By collapsing the char/GUC/on-disk translations into one registry, the patch lowers the activation energy for adding a third built-in method. That matters because, with only 2 bits, INVALID reserved, and pglz and lz4 already assigned, there is only one slot left for a built-in method before the on-disk format itself must change. Clean bookkeeping is a prerequisite for anyone proposing to consume that slot.
It is also notable what the patch does not attempt: it does not introduce a compression method AM, does not make compression methods extension-definable, and does not expand the on-disk bits. It is a tactical internal cleanup targeted at v20, not a pluggability overhaul.
Review Consensus
Both reviewers are supportive of the direction. The outstanding items after v2 are:
- Whether to rename default_toast_compression (and DEFAULT_TOAST_COMPRESSION) to make the ABI break loud (Evan's strongest point, unresolved).
- Naming disambiguation between on-disk-ID macros and catalog-char macros.
- Minor code-motion of MethodToCompressionId() out of the switch.
v2 already addressed Ayush's three points: the stray alter_index.sgml hunk was removed, the stale comment on default_toast_compression was deleted entirely, and the unused CompressionIdToMethod() helper was removed.