Optimize RelfilenumberMapInvalidateCallback for Logical Decoding Performance
Core Problem
The thread addresses a significant performance bottleneck in PostgreSQL's logical decoding (walsender) pathway, specifically in the RelfilenumberMapInvalidateCallback function. This function is part of the relfilenumber map cache infrastructure — the mapping between relation OIDs and their physical file numbers (formerly relfilenode numbers, renamed in PG16+).
Architectural Context
PostgreSQL's logical decoding must process catalog invalidation messages embedded in WAL to maintain a consistent snapshot of the system catalog as it replays changes. The RelfilenumberMapInvalidateCallback is invoked whenever an invalidation message indicates that a relation's physical storage mapping has changed (e.g., due to DDL like ALTER TABLE, TRUNCATE, ANALYZE with statistics changes, etc.).
The relfilenumber map cache (RelfilenumberMapHash) is a hash table keyed by RelfilenumberMapKey (which contains the relfilenumber and the tablespace OID) and maps to the relation OID. This is used during logical decoding to resolve physical file references back to logical relations.
The Scaling Problem
The current implementation has a critical algorithmic deficiency:
When invalidating a SPECIFIC relation (relid != InvalidOid):
- The code performs a FULL sequential scan of the entire hash table
- It must iterate all entries to find those matching the given relid
- This is O(n) where n = number of cached entries
This is because the hash table is keyed by (relfilenumber, tablespace) → relid, but invalidation messages arrive with a relid. There is no reverse index from relid back to the hash entry.
In multi-tenant deployments or systems with many tables (common in SaaS/PaaS scenarios), the cache can grow to tens of thousands of entries. A single DDL statement can generate multiple invalidation messages, and each ALTER TABLE in a migration touching many tables compounds the problem quadratically: N tables × O(N) per invalidation = O(N²) total cost.
Why This Matters for Logical Replication
Logical decoding/replication is increasingly critical infrastructure for:
- Zero-downtime upgrades
- Change data capture (CDC) pipelines
- Multi-region replication
Walsender throughput directly impacts replication lag. When DDL-heavy workloads (schema migrations, ANALYZE runs on many tables) cause the walsender to stall processing invalidations, replication lag spikes can cascade into production issues. The benchmarks show this can reach 60+ seconds for 50,000 tables with separate-transaction DDL — an unacceptable stall for a replication stream.
Proposed Solution
The patch introduces two supplementary data structures to enable O(1) invalidation:
1. Reverse Hash Table (RelfilenumberReverseHash)
A new hash table mapping relid → RelfilenumberMapKey. This provides O(1) lookup when an invalidation message specifies a particular relation:
- On invalidation for relid X: look up X in reverse hash → get the
(relfilenumber, tablespace)key → remove from both the forward hash and the reverse hash - Maintains bidirectional consistency: insertions and deletions update both hash tables
This is a classic space-time tradeoff — doubling the memory for the map cache in exchange for eliminating the O(n) scan.
2. Singly-Linked List (NegativeEntryList)
Negative cache entries (where relid == InvalidOid) represent "known-missing" mappings — relfilenumber values that have been looked up but don't correspond to any known relation. These cannot be indexed in the reverse hash (since their relid is Invalid/0), so they are tracked in a separate linked list.
When a "wildcard" invalidation occurs (invalidating all entries), both the hash tables and this list are cleared. When a specific relid is invalidated, negative entries are unaffected (they don't match any specific relid anyway).
Algorithmic Complexity Improvement
| Operation | Before | After |
|---|---|---|
| Invalidate specific relid | O(n) scan | O(1) hash lookup + delete |
| Invalidate all (relid=0) | O(n) scan | O(n) — unchanged, must clear all |
| Insert new mapping | O(1) | O(1) — two hash inserts |
| Memory overhead | 1 hash table | 2 hash tables + linked list |
Performance Analysis
The benchmark results are compelling and demonstrate the expected quadratic-to-linear improvement:
Separate transactions (50K tables): 61.8s → 4.1s (15.2× speedup)
- Each ALTER TABLE is its own transaction, so the cache grows incrementally and each invalidation scans the full cache at that point. The improvement is roughly linear vs. the quadratic baseline.
Single transaction (50K tables): 29.2s → 0.34s (85.2× speedup)
- All DDL in one transaction means the invalidation processing happens during commit/decode of a large transaction. The cache is fully populated (all 50K entries) when invalidations fire, making each one maximally expensive. The 85× speedup for 50K tables is consistent with eliminating O(n) per invalidation: previously ~50K × 50K/2 comparisons reduced to ~50K × O(1).
The fact that DML-only performance is unchanged confirms the patch has no regression for the common case (DML-only workloads don't trigger relfilenumber invalidations frequently).
Design Considerations and Potential Discussion Points
Memory Overhead
The reverse hash table roughly doubles the memory consumption of the relfilenumber map cache. For 50K tables, this might be ~4-8 MB additional memory per walsender process. Given that this cache only exists in logical decoding contexts and the memory is proportional to active relations, this is likely acceptable.
Maintenance Burden
Two synchronized data structures (forward and reverse hash) plus a linked list introduce more code paths that must remain consistent. Any bug that desynchronizes them would cause cache corruption — entries that can't be invalidated or phantom invalidations.
Alternative Approaches Not Discussed (Yet)
- Could use a single hash table with a secondary index
- Could batch invalidations and do a single scan
- Could use a hash table keyed by relid with a list of relfilenumber entries (since one relid could theoretically map to multiple relfilenumbers across tablespaces, though this is unusual)
Negative Entry Handling
The linked list for negative entries is simpler than a hash but has O(n) traversal for full invalidation. This is acceptable because negative entries should be rare in production (they represent failed lookups) and full invalidation already requires clearing the main hash table anyway.
Status
This is an initial submission (v1 patch) with no responses yet. Given the clear performance improvement and the relatively contained scope of changes (localized to the relfilenumber map cache in src/backend/utils/cache/ area), this patch has good potential for acceptance if:
- The implementation is clean and handles edge cases (concurrent access, memory contexts)
- The additional memory overhead is deemed acceptable
- No regressions are found in the regression test suite or stress testing
- The code is well-documented explaining the dual-hash-table design
The patch targets a real-world pain point that is increasingly relevant as PostgreSQL adoption grows in multi-tenant/microservice architectures with large numbers of tables.