hashjoins vs. Bloom filters (yet again)

First seen: 2026-05-30 00:55:43+00:00 · Messages: 4 · Participants: 2

Latest Update

2026-06-01 · claude-opus-4-6

Bloom Filter Pushdown for Hash Joins: Deep Technical Analysis

Core Problem

Hash joins in PostgreSQL process outer tuples by probing the hash table for each one. When a join is highly selective (many outer tuples have no match), or when batching forces spilling to disk, significant work is wasted on tuples that will ultimately be discarded. The fundamental insight is that a Bloom filter — a probabilistic data structure that can definitively say "not present" but only probabilistically say "present" — can serve as a cheap pre-filter (≈50 cycles vs ≈500 cycles for a full hash table probe) to reject non-matching tuples before expensive operations.

This thread represents the third iteration of this idea (after 2015/16 and 2018 attempts), but with a critical new dimension: instead of using the Bloom filter only within the hash join node itself, the filter is pushed down to the scan node at the leaf of the outer subtree, functioning like an additional qual evaluated at tuple-production time.

Architectural Context and Why This Matters

The Star-Join Multiplication Effect

For a star-schema query joining a fact table F with dimension tables D1, D2, D3 in a left-deep plan:

        HJ
      /    \
    D3     HJ
         /    \
        D2    HJ
            /    \
           D1     F

Without pushdown, each hash join node independently filters tuples — the outer side of the top HJ still receives all tuples that passed through D1 and D2's joins. With pushdown, all three Bloom filters are evaluated at the F scan, meaning tuples are discarded before they ever enter any hash join node. If each join has 20% selectivity, the scan produces only ~1% of tuples, and this benefit compounds multiplicatively.

The Bottom-Up Planning Incompatibility

This is the thread's central architectural tension. PostgreSQL's optimizer works bottom-up: scan paths are costed before join algorithms are chosen. But Bloom filter pushdown is specific to hash joins. The scan node cannot know at planning time whether it will be under a hash join (which enables the filter) or a merge/nested-loop join (which doesn't).

The proposed resolution is to treat this as an opportunistic post-planning optimization: planning proceeds normally, then create_hashjoin_plan() retroactively attaches filters to scan nodes. This means:

  1. The optimizer may miss plans that would be optimal with Bloom filter pushdown
  2. EXPLAIN output shows estimated rows computed without filter knowledge, while actual rows reflect the filter — creating confusing discrepancies

The alternative — maintaining 2^N path sets for N joins (one with filter, one without) — is combinatorially explosive and rejected as impractical.

Implementation Architecture (PoC)

The patch is surprisingly compact (~1000 lines of real logic):

  1. create_hashjoin_plan(): After plan construction, walks the outer subtree to find a leaf scan node via find_bloom_filter_recipient()
  2. ExecInitHashJoin(): Initializes the Bloom filter structure so the scan can locate it via es_bloom_producers lookup
  3. Hash node: Builds the filter alongside the hash table during the build phase
  4. ExecScanExtended(): Scan nodes probe pushed-down filters, treating them as additional quals

Adaptive Behavior

The patch implements a sampling-based adaptive mechanism to handle filters that aren't selective enough:

Key Sizing Dilemma

Bloom filters are sized by (ndistinct, target_false_positive_rate). Problems:

Extension to Custom Scans and Column Stores

Andrew Dunstane identifies a compelling use case: columnar/compressed table AMs (implemented via CustomScan) can leverage Bloom filters at a coarser granularity — skipping entire row groups or eliminating dictionary entries — rather than the per-row rejection that heap scans get.

Per-Column vs. Combined Filters

The PoC builds a single filter on hash32(key1, key2, ...) — the combined hash of all join keys. This is maximally selective for per-row testing but cannot be decomposed for per-column dictionary checks.

The proposed solution is to optionally emit per-column filters in addition to the combined filter:

This distinction matters because per-column filters are strictly less selective: with build pairs {(1,10),(2,20)}, an outer tuple (1,20) passes both per-column filters despite matching no build row. The combined filter correctly rejects it.

Multi-Source Join Keys

A more complex scenario arises when join keys come from different sides of a sub-join:

        HJ (A.x = B.x AND A.y = C.y)
      /    \
     A     HJ
         /    \
        B      C

The combined filter on (x,y) can't be pushed to either B or C individually. Options:

  1. Push the combined filter above the BC join (not really "pushdown")
  2. Build separate filters on (x) and (y), push each to the appropriate side
  3. Both approaches simultaneously

Follow-up Patches (Andrew's 3 patches)

  1. CustomScan recipient support: Adds T_CustomScan to find_bloom_filter_recipient(), corresponding set_customscan_references() fix, and the opt-in flag mechanism
  2. Per-key filter building: Builds individual per-column filters alongside the combined filter when the recipient flag is set
  3. Eager filter creation: Disables the delayed hash-build optimization so the filter is available when the scan starts (necessary for storage-level consumers that need the filter before producing any tuples)

Open Technical Questions

  1. Cost model: What match-rate threshold makes a filter net-positive? Benchmark shows 75% is the inflection point, but this likely varies with tuple width, I/O patterns, etc.
  2. Memory allocation: How to budget filter memory relative to work_mem without starving the hash table?
  3. EXPLAIN clarity: How to present estimated vs. actual rows when the filter wasn't known at estimation time?
  4. Interaction with extended statistics and FK constraints: Modern stats infrastructure might provide better ndistinct estimates, reducing the sizing problem
  5. Delayed build vs. eager build: The standard optimization of building the hash table lazily (after the outer side produces some tuples) conflicts with scan-level consumers that need the filter upfront

Performance Characteristics

From the benchmark (star-join with 1-3 dimensions, selectivity 1%-100%):