2026-05-14 · claude-opus-4-6

Proposal: Adding Compression of Temporary Files in PostgreSQL

The Core Problem

PostgreSQL's query executor spills intermediate results to temporary files when work_mem is exhausted. This happens during hash joins (when inner batches exceed memory), sorts, GiST index builds, tuplestores, and other operations. These temporary files can grow enormous — the thread demonstrates cases generating 20+ GB of temp data for a single query — producing heavy I/O load that becomes the dominant bottleneck for complex analytical queries.

While PostgreSQL already supports compression in several contexts (TOAST, WAL backup blocks, base backups), temporary files written by the executor have never been compressed. This is a significant architectural gap: temp files are ephemeral, sequential-access data that is an ideal candidate for transparent compression, yet they currently impose full I/O costs on every byte.

The core value proposition is straightforward: compress temp data before writing to disk, reducing I/O volume at the cost of CPU. The fundamental question the thread grapples with is whether this tradeoff is actually beneficial in practice, and under what conditions.

Architectural Design and Key Technical Decisions

Where Compression Lives: The BufFile Layer

The patch inserts compression at the BufFile abstraction layer (src/backend/storage/file/buffile.c), which is PostgreSQL's buffered temporary file interface. This is architecturally significant because BufFile is the common layer used by hash joins, tuplestores, and other spilling operations. By compressing at this level, the feature can potentially benefit multiple executor node types without each needing custom compression logic.

The implementation modifies BufFileDumpBuffer() (write path) and BufFileLoadBuffer() (read path) to transparently compress/decompress data blocks. Each block is written with a CompressHeader containing both the compressed and original lengths, enabling the read path to allocate appropriately and verify decompression correctness.

The Random Access Constraint

A critical architectural constraint is that compression breaks random access (BufFileSeek). Compressed blocks have variable sizes on disk, so seeking to a logical offset is no longer a simple calculation. The initial patch therefore only enables compression for hash join spill files (which are read sequentially after being written) and later extends to tuplestores that don't require backward scanning (EXEC_FLAG_BACKWARD).

Tomas Vondra raises an interesting future direction: maintaining an offset index (mapping logical block numbers to physical file offsets) could enable random access on compressed files. He estimates this would cost ~128KB per 1GB of temp data — negligible overhead. This would unlock compression for tuplesorts and other operations requiring seeks, but it remains future work.

Compression Algorithm Selection

The thread explores four compression algorithms with very different characteristics:

pglz: PostgreSQL's built-in algorithm. Always available, no external dependencies. However, benchmarking consistently shows it is catastrophically slow for this use case — 2-7x slower than uncompressed in many scenarios. Tomas's TPC-H results and Filip's later benchmarks both confirm this. The final recommendation is to not offer pglz for temp file compression at all, as it creates a "trap for users."
LZ4: Fast compression with moderate ratios. Requires --with-lz4 at build time. Shows near-neutral performance when I/O isn't the bottleneck, and meaningful speedups under I/O pressure. The worst-case overhead is ~44% slower (single connection, everything cached), but under real I/O pressure it delivers 2-2.5x speedups.
zstd: Best compression ratios and, surprisingly, the best speedups under I/O pressure (up to 3.5x). Slightly more CPU-intensive than LZ4 but the better compression ratio more than compensates when I/O is the bottleneck.
gzip/libz: Added experimentally by Tomas. Suffers from a Windows portability issue (uLongf vs size_t type mismatch) that Álvaro Herrera flags from CI failures.

Compression Block Size

Filip's later benchmarks reveal that the unit of compression matters significantly. The patch initially compresses at BLCKSZ (8KB) granularity. Filip introduces COMPRESS_BLCKSZ and tests 8KB, 32KB, and 64KB:

Block Size	Time (% of uncompressed)	Compressed Size
8 KB	58%	7.47 GB
32 KB	52%	7.22 GB
64 KB	56%	7.14 GB

32KB is the sweet spot: fewer compress/decompress calls (4x fewer than 8KB), less per-block header overhead, better compression ratios, without the cache pressure issues of 64KB blocks.

Memory Management Evolution

The patch went through several iterations on buffer management:

Initial: palloc/pfree per compression operation — expensive, especially for buffers >8KB that bypass the memory context freelist.
Static shared buffer: A single buffer shared across all BufFiles in the backend. Tomas identifies a flaw: if two files use different compression methods, the buffer sizing could be wrong, and pfree() of the shared buffer while another file references it is unsafe.
Per-file allocation (final): Each compressed BufFile gets its own buffer. Filip notes the static buffer provided "negligible performance benefit while keeping memory allocated for the backend's lifetime."

The curOffset Correction Problem

A subtle and persistent bug/confusion involves the curOffset adjustment in BufFileDumpBuffer. The existing code adjusts curOffset to account for partially-consumed buffers. With compression, the bytes written to disk differ from the logical bytes, requiring a different correction. The code:

if (!file->compress)
    file->curOffset -= (file->nbytes - file->pos);
else
    if (nbytesOriginal - file->pos != 0)
        file->curOffset -= bytestowrite;

Tomas and Dmitry Dolgov both flag that the compressed path's use of bytestowrite from the last loop iteration would be incorrect if the while loop executes multiple times (for tuples wider than BLCKSZ). Dmitry notes that existing tests don't exercise this case, and asks how to construct such a test — wide text columns get TOASTed before reaching the temp file, making it hard to trigger multi-loop writes naturally.

GUC Design

The feature is controlled by temp_file_compression GUC (default: none). Tomas identifies an important correctness issue: the patch initially stored only a boolean compress flag per file and checked the GUC at compression/decompression time. But the GUC can change mid-session (e.g., SET temp_file_compression between cursor FETCH calls), so the actual compression method must be recorded per-file at creation time.

The Performance Debate: The Central Tension

The most significant technical drama in this thread is the performance evaluation, which reveals a nuanced and somewhat discouraging picture.

Tomas's Benchmark (March 2026)

Tomas constructs a rigorous benchmark: hash joins with 1M/10M/100M rows, varying duplicates (compressibility), 1/4/8 concurrent connections, on a 64GB RAM machine with SSD/NVMe storage. His findings:

pglz: 1.6-6.9x slower. Unacceptable.
lz4: 0.58-1.48x of baseline. Mostly slower, occasionally faster. Only consistently wins with 100M rows × 8 connections × high compressibility.

The problem: with 64GB RAM and 8GB shared_buffers, ~56GB remains for page cache. Even 8 connections × 10GB = 80GB of temp files largely stays cached. Compression's I/O reduction doesn't help because there wasn't much synchronous I/O to begin with. The CPU cost of compression becomes pure overhead.

Tomas's conclusion is sobering: "I feel rather awful about this, mostly because I'm the one who suggested working on this back in 2024."

Filip's Counter-Benchmark (May 2026)

Filip runs the same benchmark on systems with real I/O pressure:

Machine A: HDD storage (rotational), 64GB RAM
Machine B: SATA SSD, RAM capped to 16GB

Results are dramatically different:

lz4: 39-94% of baseline (up to 2.5x speedup)
zstd: 27-98% of baseline (up to 3.5x speedup)
Even pglz wins under extreme I/O pressure (w=8 on HDD), though it never beats lz4/zstd

The insight: compression's value is entirely determined by the I/O-to-CPU cost ratio. On systems with fast storage and ample page cache, compression is overhead. On I/O-constrained systems (slow storage, limited RAM, concurrent workloads competing for page cache), compression delivers substantial wins.

The Opt-In Resolution

The resolution of this tension is the opt-in GUC approach: default to none (no regression risk), let administrators enable lz4/zstd when they know their systems are I/O-constrained. Tomas acknowledges this but notes a practical concern: "systems generally are not under such pressure 24/7, but only for some part of a day. But people will mostly set the GUC in the config file."

Unresolved Issues and Future Work

Compression for tuplesort: Requires random access (seeking to arbitrary positions for external merge sort). The offset-index approach could solve this but isn't implemented.
Stream compression: LZ4 and zstd both support streaming modes that maintain compression context across blocks, potentially improving ratios. Tomas notes this was very beneficial for pg_dump compression. However, it would require a different API than block-at-a-time compression.
Generic compression API: Alexander Korotkov suggests abstracting WAL/TOAST/temp compression into a shared API. Tomas pushes back: the use cases are too different (block vs. stream, random vs. sequential access). This mirrors the decision made for pg_dump, which has its own compression abstraction.
Compression block size as a tunable: Filip's 32KB results suggest this matters, but it's not yet a GUC or even a well-defined compile-time constant in the patch.
The BufFileSeek/BufFileTell correctness gap: Zsolt Parragi notes that seek/tell don't work correctly with compression enabled, but there are no assertions protecting against misuse.
Windows portability: The gzip/libz support has type mismatches on Windows (uLongf vs size_t).

Proposal: Adding compression of temporary files

Latest Update