2026-06-04 · claude-opus-4-6

Incremental Update: v2 Patch Series (Hash Join-Local Bloom Filters) and Architectural Discussion

Major Development: v2 Patch Series — Bloom Filters Within Hash Join Scope

Tomas Vondra posted a new, independent patch series (labeled v2 but orthogonal to the v1 pushdown patch) that adds Bloom filters within the hash join node itself — both for serial and parallel joins. This is a deliberate scoping-down to tackle shared prerequisite problems (adaptive behavior, sizing, batching) before attempting full pushdown.

Key Technical Details of v2

Adaptive Build Logic:

For nbatch=1 (in-memory): delay filter construction until 1000 lookups have been observed; only build if <90% of lookups find a match
For nbatch>1 (batched/spilling): must build the filter immediately before spilling, because tuples from later batches need to be filtered against the complete build set
Periodic rechecks allow late filter construction if match rates drop

Adaptive Probe Logic:

Track probes vs. rejections continuously
If <10% of probes are rejected → disable the filter (it's not paying for itself)
When disabled, sample 1% of probes to detect if conditions change
Re-enable if >20% rejection rate detected in samples

Parallel Hash Join Extension (0002):

Direct extension of serial case but operates on shared-memory hash tables
Inherently more complex due to parallel hash build coordination

Sizing/False-Positive Trade-off Issue Identified:

Current lib/bloomfilter.c targets 1-2% FPR, sometimes achieving 0.077% FPR
Vondra questions whether accepting ~10% FPR for a much smaller filter and fewer hash functions (fewer bits to check per probe) would be net-positive
Example: with work_mem=64MB, filter is 16384kB with 10 hash functions for a join using 82784kB across 16 batches

Performance Results (v2, scale=100: 100M fact × 10M dimension)

Serial (workers=0):

20% faster for joins with >50% outer tuple discard rate
~2x faster at 5% selectivity (95% discarded)
Adaptive thresholds correctly identify the break-even point

Parallel:

Benefits more limited and inverted pattern: serial benefits most from batched joins, parallel benefits most from unbatched joins
Vondra suspects possible bug in parallel path or hardware bottleneck; requests review

Planning/Costing: Academic Paper Reference

Vondra references a 2025 paper "Including Bloom Filters in Bottom-up Optimization" (arXiv:2505.02994v1) that directly addresses the planning problem from the prior analysis. Key observations:

Paper's solution resembles the "tracking separate path sets" idea but uses a two-phase process
Vondra considers this too invasive (would require fundamental rework of join rel construction)
Instead proposes a pathkeys-like analogy: define "potentially interesting filters" at baserel scan time, construct extra paths only for those, and reject at join planning if the join isn't a hash join

Architectural Framing: Runtime Knowledge Pushdown

Oleg Bartunov articulated a broader architectural vision distinguishing two roles:

Local HJ optimization — no physical ordering requirements, pure CPU savings from avoiding probes/spills
Scan-layer pushdown — requires coarse-grained structures to skip (partitions, row groups, dictionaries, BRIN summaries, min/max metadata)

He frames the real feature as a pluggable runtime-filter mechanism — not Bloom-specific — where join-derived knowledge flows downward against normal tuple flow, consumable by heap, CustomScan, FDW, columnar AMs, or chunk-based storage at whatever granularity they support.

Top-Down Planning Alternative

Andrei Lepikhov suggests using create_upper_paths_hook for a "top-down" iteration after bottom-up planning (referencing his "MiddleOut" approach). This could theoretically resolve the planning incompatibility but Vondra considers it too complex to integrate with the already-complex Bloom filter work.

Interaction with Executor Batching Work

Vondra notes that ongoing executor batching work could provide "row group" semantics even for in-HJ-node filters, partially bridging the gap between local and pushdown approaches (though tuples still flow up to the join node).

2026-06-01 · claude-opus-4-6

Bloom Filter Pushdown for Hash Joins: Deep Technical Analysis

Core Problem

Hash joins in PostgreSQL process outer tuples by probing the hash table for each one. When a join is highly selective (many outer tuples have no match), or when batching forces spilling to disk, significant work is wasted on tuples that will ultimately be discarded. The fundamental insight is that a Bloom filter — a probabilistic data structure that can definitively say "not present" but only probabilistically say "present" — can serve as a cheap pre-filter (≈50 cycles vs ≈500 cycles for a full hash table probe) to reject non-matching tuples before expensive operations.

This thread represents the third iteration of this idea (after 2015/16 and 2018 attempts), but with a critical new dimension: instead of using the Bloom filter only within the hash join node itself, the filter is pushed down to the scan node at the leaf of the outer subtree, functioning like an additional qual evaluated at tuple-production time.

Architectural Context and Why This Matters

The Star-Join Multiplication Effect

For a star-schema query joining a fact table F with dimension tables D1, D2, D3 in a left-deep plan:

Without pushdown, each hash join node independently filters tuples — the outer side of the top HJ still receives all tuples that passed through D1 and D2's joins. With pushdown, all three Bloom filters are evaluated at the F scan, meaning tuples are discarded before they ever enter any hash join node. If each join has 20% selectivity, the scan produces only ~1% of tuples, and this benefit compounds multiplicatively.

The Bottom-Up Planning Incompatibility

This is the thread's central architectural tension. PostgreSQL's optimizer works bottom-up: scan paths are costed before join algorithms are chosen. But Bloom filter pushdown is specific to hash joins. The scan node cannot know at planning time whether it will be under a hash join (which enables the filter) or a merge/nested-loop join (which doesn't).

The proposed resolution is to treat this as an opportunistic post-planning optimization: planning proceeds normally, then create_hashjoin_plan() retroactively attaches filters to scan nodes. This means:

The optimizer may miss plans that would be optimal with Bloom filter pushdown
EXPLAIN output shows estimated rows computed without filter knowledge, while actual rows reflect the filter — creating confusing discrepancies

The alternative — maintaining 2^N path sets for N joins (one with filter, one without) — is combinatorially explosive and rejected as impractical.

Implementation Architecture (PoC)

The patch is surprisingly compact (~1000 lines of real logic):

create_hashjoin_plan(): After plan construction, walks the outer subtree to find a leaf scan node via find_bloom_filter_recipient()
ExecInitHashJoin(): Initializes the Bloom filter structure so the scan can locate it via es_bloom_producers lookup
Hash node: Builds the filter alongside the hash table during the build phase
ExecScanExtended(): Scan nodes probe pushed-down filters, treating them as additional quals

Adaptive Behavior

The patch implements a sampling-based adaptive mechanism to handle filters that aren't selective enough:

Track match rate per 1000 probes
If >90% of tuples pass the filter → switch to 1% sampling mode (effectively disabling the filter)
If sampling detects the rate drops below 80% → re-enable full probing
This bounds worst-case regression to ~3% (from filter construction overhead) vs the ~50% regression seen with naïve always-on probing at high match rates

Key Sizing Dilemma

Bloom filters are sized by (ndistinct, target_false_positive_rate). Problems:

Memory budget is constrained (fraction of work_mem, which the hash table already claims)
ndistinct estimates can be wildly wrong (2x overcount → ~10% FPR; 10x → ~50% FPR)
A "degraded" filter (10% FPR) may still be hugely beneficial if join selectivity is 1%
Need a cost model to determine when building/probing is net-positive

Extension to Custom Scans and Column Stores

Andrew Dunstane identifies a compelling use case: columnar/compressed table AMs (implemented via CustomScan) can leverage Bloom filters at a coarser granularity — skipping entire row groups or eliminating dictionary entries — rather than the per-row rejection that heap scans get.

Per-Column vs. Combined Filters

The PoC builds a single filter on hash32(key1, key2, ...) — the combined hash of all join keys. This is maximally selective for per-row testing but cannot be decomposed for per-column dictionary checks.

The proposed solution is to optionally emit per-column filters in addition to the combined filter:

Combined filter: precise per-row rejection (stays default, unchanged for heap)
Per-column filters: less selective but enable row-group/dictionary skipping in column stores
Only built when a recipient signals it can use them (opt-in via CUSTOMPATH_SUPPORT_BLOOM_FILTERS flag)

This distinction matters because per-column filters are strictly less selective: with build pairs {(1,10),(2,20)}, an outer tuple (1,20) passes both per-column filters despite matching no build row. The combined filter correctly rejects it.

Multi-Source Join Keys

A more complex scenario arises when join keys come from different sides of a sub-join:

        HJ (A.x = B.x AND A.y = C.y)
      /    \
     A     HJ
         /    \
        B      C

The combined filter on (x,y) can't be pushed to either B or C individually. Options:

Push the combined filter above the BC join (not really "pushdown")
Build separate filters on (x) and (y), push each to the appropriate side
Both approaches simultaneously

Follow-up Patches (Andrew's 3 patches)

CustomScan recipient support: Adds T_CustomScan to find_bloom_filter_recipient(), corresponding set_customscan_references() fix, and the opt-in flag mechanism
Per-key filter building: Builds individual per-column filters alongside the combined filter when the recipient flag is set
Eager filter creation: Disables the delayed hash-build optimization so the filter is available when the scan starts (necessary for storage-level consumers that need the filter before producing any tuples)

Open Technical Questions

Cost model: What match-rate threshold makes a filter net-positive? Benchmark shows 75% is the inflection point, but this likely varies with tuple width, I/O patterns, etc.
Memory allocation: How to budget filter memory relative to work_mem without starving the hash table?
EXPLAIN clarity: How to present estimated vs. actual rows when the filter wasn't known at estimation time?
Interaction with extended statistics and FK constraints: Modern stats infrastructure might provide better ndistinct estimates, reducing the sizing problem
Delayed build vs. eager build: The standard optimization of building the hash table lazily (after the outer side produces some tuples) conflicts with scan-level consumers that need the filter upfront

Performance Characteristics

From the benchmark (star-join with 1-3 dimensions, selectivity 1%-100%):

Up to 3x speedup with filter pushdown enabled
~3% regression in worst case (100% match rate) with adaptive sampling
Without pushdown (filter only in HJ node): no measurable difference from master
Benefits scale with number of dimensions and inverse selectivity

hashjoins vs. Bloom filters (yet again)

Latest Update