Bloom Filter Pushdown for Hash Joins: Deep Technical Analysis
Core Problem
Hash joins in PostgreSQL process outer tuples by probing the hash table for each one. When a join is highly selective (many outer tuples have no match), or when batching forces spilling to disk, significant work is wasted on tuples that will ultimately be discarded. The fundamental insight is that a Bloom filter — a probabilistic data structure that can definitively say "not present" but only probabilistically say "present" — can serve as a cheap pre-filter (≈50 cycles vs ≈500 cycles for a full hash table probe) to reject non-matching tuples before expensive operations.
This thread represents the third iteration of this idea (after 2015/16 and 2018 attempts), but with a critical new dimension: instead of using the Bloom filter only within the hash join node itself, the filter is pushed down to the scan node at the leaf of the outer subtree, functioning like an additional qual evaluated at tuple-production time.
Architectural Context and Why This Matters
The Star-Join Multiplication Effect
For a star-schema query joining a fact table F with dimension tables D1, D2, D3 in a left-deep plan:
HJ
/ \
D3 HJ
/ \
D2 HJ
/ \
D1 F
Without pushdown, each hash join node independently filters tuples — the outer side of the top HJ still receives all tuples that passed through D1 and D2's joins. With pushdown, all three Bloom filters are evaluated at the F scan, meaning tuples are discarded before they ever enter any hash join node. If each join has 20% selectivity, the scan produces only ~1% of tuples, and this benefit compounds multiplicatively.
The Bottom-Up Planning Incompatibility
This is the thread's central architectural tension. PostgreSQL's optimizer works bottom-up: scan paths are costed before join algorithms are chosen. But Bloom filter pushdown is specific to hash joins. The scan node cannot know at planning time whether it will be under a hash join (which enables the filter) or a merge/nested-loop join (which doesn't).
The proposed resolution is to treat this as an opportunistic post-planning optimization: planning proceeds normally, then create_hashjoin_plan() retroactively attaches filters to scan nodes. This means:
- The optimizer may miss plans that would be optimal with Bloom filter pushdown
- EXPLAIN output shows estimated rows computed without filter knowledge, while actual rows reflect the filter — creating confusing discrepancies
The alternative — maintaining 2^N path sets for N joins (one with filter, one without) — is combinatorially explosive and rejected as impractical.
Implementation Architecture (PoC)
The patch is surprisingly compact (~1000 lines of real logic):
create_hashjoin_plan(): After plan construction, walks the outer subtree to find a leaf scan node viafind_bloom_filter_recipient()ExecInitHashJoin(): Initializes the Bloom filter structure so the scan can locate it viaes_bloom_producerslookup- Hash node: Builds the filter alongside the hash table during the build phase
ExecScanExtended(): Scan nodes probe pushed-down filters, treating them as additional quals
Adaptive Behavior
The patch implements a sampling-based adaptive mechanism to handle filters that aren't selective enough:
- Track match rate per 1000 probes
- If >90% of tuples pass the filter → switch to 1% sampling mode (effectively disabling the filter)
- If sampling detects the rate drops below 80% → re-enable full probing
- This bounds worst-case regression to ~3% (from filter construction overhead) vs the ~50% regression seen with naïve always-on probing at high match rates
Key Sizing Dilemma
Bloom filters are sized by (ndistinct, target_false_positive_rate). Problems:
- Memory budget is constrained (fraction of work_mem, which the hash table already claims)
- ndistinct estimates can be wildly wrong (2x overcount → ~10% FPR; 10x → ~50% FPR)
- A "degraded" filter (10% FPR) may still be hugely beneficial if join selectivity is 1%
- Need a cost model to determine when building/probing is net-positive
Extension to Custom Scans and Column Stores
Andrew Dunstane identifies a compelling use case: columnar/compressed table AMs (implemented via CustomScan) can leverage Bloom filters at a coarser granularity — skipping entire row groups or eliminating dictionary entries — rather than the per-row rejection that heap scans get.
Per-Column vs. Combined Filters
The PoC builds a single filter on hash32(key1, key2, ...) — the combined hash of all join keys. This is maximally selective for per-row testing but cannot be decomposed for per-column dictionary checks.
The proposed solution is to optionally emit per-column filters in addition to the combined filter:
- Combined filter: precise per-row rejection (stays default, unchanged for heap)
- Per-column filters: less selective but enable row-group/dictionary skipping in column stores
- Only built when a recipient signals it can use them (opt-in via
CUSTOMPATH_SUPPORT_BLOOM_FILTERSflag)
This distinction matters because per-column filters are strictly less selective: with build pairs {(1,10),(2,20)}, an outer tuple (1,20) passes both per-column filters despite matching no build row. The combined filter correctly rejects it.
Multi-Source Join Keys
A more complex scenario arises when join keys come from different sides of a sub-join:
HJ (A.x = B.x AND A.y = C.y)
/ \
A HJ
/ \
B C
The combined filter on (x,y) can't be pushed to either B or C individually. Options:
- Push the combined filter above the BC join (not really "pushdown")
- Build separate filters on (x) and (y), push each to the appropriate side
- Both approaches simultaneously
Follow-up Patches (Andrew's 3 patches)
- CustomScan recipient support: Adds
T_CustomScantofind_bloom_filter_recipient(), correspondingset_customscan_references()fix, and the opt-in flag mechanism - Per-key filter building: Builds individual per-column filters alongside the combined filter when the recipient flag is set
- Eager filter creation: Disables the delayed hash-build optimization so the filter is available when the scan starts (necessary for storage-level consumers that need the filter before producing any tuples)
Open Technical Questions
- Cost model: What match-rate threshold makes a filter net-positive? Benchmark shows 75% is the inflection point, but this likely varies with tuple width, I/O patterns, etc.
- Memory allocation: How to budget filter memory relative to work_mem without starving the hash table?
- EXPLAIN clarity: How to present estimated vs. actual rows when the filter wasn't known at estimation time?
- Interaction with extended statistics and FK constraints: Modern stats infrastructure might provide better ndistinct estimates, reducing the sizing problem
- Delayed build vs. eager build: The standard optimization of building the hash table lazily (after the outer side produces some tuples) conflicts with scan-level consumers that need the filter upfront
Performance Characteristics
From the benchmark (star-join with 1-3 dimensions, selectivity 1%-100%):
- Up to 3x speedup with filter pushdown enabled
- ~3% regression in worst case (100% match rate) with adaptive sampling
- Without pushdown (filter only in HJ node): no measurable difference from master
- Benefits scale with number of dimensions and inverse selectivity