Technical Analysis: Secondary Hash to Split Stuck Hash-Join Batches
The Core Problem: Hash Join Batch Splitting Degeneracy
PostgreSQL's hash join executor partitions the inner relation into batches when the hash table exceeds work_mem. The partitioning uses the upper bits of the hash value to assign tuples to batches. When nbatch is doubled, one additional bit is consumed to split each existing batch into two.
The fundamental degeneracy occurs when many tuples share identical upper hash bits. In this scenario, batch doubling cannot split the batch — all tuples remain mapped to the same batch regardless of how many times nbatch is increased. This is analogous to all keys landing in the same hash bucket in a hash table, but at the batch-partitioning level.
Why This Matters Architecturally
The hash join executor currently has two fallback paths when confronted with stuck batches:
-
Futile batch doubling: The executor keeps doubling
nbatch, consuming metadata resources and I/O for batch file management, but never actually reducing the size of the problematic batch. PG18 introduced a cap to prevent runaway doubling. -
Silent
work_memviolation: The executor inflatesspaceAllowedbeyond the configuredwork_memto accommodate the oversized batch in memory. This violates the administrator's memory configuration guarantees — a hash join might consume multiples ofwork_memwithout any visibility into why.
Both paths are unsatisfying. The first wastes resources; the second breaks the memory contract. PG18 improved the situation by capping futile doubling but still falls back to inflating memory, meaning work_mem violations remain routine for skewed data distributions.
Proposed Solution: Rehash-Based Sub-Batching
Mechanism
The proposal introduces ExecHashRehashBatch, a new function invoked when ExecHashIncreaseNumBatches detects a stuck batch (a batch that cannot be split by the standard bit-based partitioning):
- Detection: When batch doubling fails to reduce the stuck batch's size, the rehash path is triggered.
- Secondary hash: A different hash function is applied to tuples in the stuck batch, producing independent bits for partitioning.
- Sub-batch creation: Tuples are assigned to sub-batches based on the secondary hash value.
- Recursive splitting: Sub-batches are handled identically to normal batches — they can be further split if needed until each fits within
spaceAllowed. - Graceful fallback: If the secondary hash also cannot split the batch (e.g., all tuples are truly identical), the existing inflate-memory path is taken.
Key Design Decisions
- GUC-gated:
enable_hashjoin_rehashallows the feature to be toggled, preserving backward compatibility and enabling A/B testing. - Sub-batches reuse existing infrastructure: Rather than introducing a parallel batch management system, sub-batches flow through the same code paths as normal batches.
- Secondary hash function independence: The secondary hash must produce values uncorrelated with the primary hash to actually split stuck batches. If tuples collide on primary hash bits, an independent hash function will (with high probability) distribute them across different sub-batches.
Additional Heuristics Explored
The author describes layered intelligence on top of the rehash mechanism:
- Planner-guided skew detection: Use cardinality estimates to distinguish "many tuples with same hash" (skew) from "few very large tuples." This determines whether to attempt batch splitting or jump directly to rehash.
- Avoiding PG18 memory inflation: When skew is detected, skip
spaceAllowedinflation entirely and attempt rehash first — maintainingwork_memcompliance. - Diminishing returns detection: If a batch split achieves less than a threshold reduction, switch to rehash rather than continuing to split.
Technical Implications and Tradeoffs
Advantages
- Respects
work_mem: The primary win — hash joins can stay within configured memory limits even with skewed data. - Minimal runtime overhead: The benchmark shows nearly identical execution time (10.071ms vs 10.288ms), suggesting the rehash cost is amortized.
- Memory reduction: 5974kB → 3083kB in the demonstration case, bringing the join under the 4MB
work_memlimit.
Concerns and Open Questions
-
Probe-side coordination: When the inner side is rehashed into sub-batches, the outer (probe) side must also be able to find matching tuples. The probe side needs to know which sub-batch to check, requiring the same secondary hash computation during probing. This coordination complexity is not detailed in the RFC.
-
Temp file I/O amplification: Sub-batches likely require additional temp file writes/reads. The net I/O cost versus simply holding the oversized batch in memory is workload-dependent.
-
Hash function choice: The secondary hash must be efficiently computable and independent of the primary. The choice of function affects both splitting quality and CPU overhead.
-
Interaction with parallel hash join: Parallel hash join has its own batch management complexities. How sub-batches interact with parallel workers needs consideration.
-
EXPLAIN output: The prototype adds
Rehash Repartitions: 1 (max sub-batches: 2)to EXPLAIN output, providing visibility into when rehashing occurs.
Context: Prior Work
The RFC references Tomas Vondra's description of the stuck-batch problem, indicating this is a known long-standing issue in the community. The PG18 cap on batch doubling was a partial mitigation but not a full solution. This proposal represents a more fundamental fix that attacks the root cause rather than just limiting the symptoms.
Assessment
This is a well-motivated proposal addressing a real architectural weakness in PostgreSQL's hash join implementation. The key insight — that when primary hash bits are exhausted for partitioning, an independent hash function can provide fresh partitioning entropy — is sound. The main technical risk lies in the probe-side coordination complexity and potential I/O amplification, which would need to be carefully evaluated in a full patch submission.