Vacuumlo Improvements — May 2026 Summary
Overview
A critical correctness bug was identified in vacuumlo: the tool silently skips columns typed as domains over oid or lo, causing it to incorrectly classify referenced large objects as orphaned and delete them without warning. The thread rapidly converged on a fix and then expanded scope to address additional blind spots.
The Core Bug: Domain-over-OID Blindness
vacuumlo scans pg_attribute joined with pg_type to find columns that could hold large object references, but only checks for exact type matches (oid or lo). It does not resolve domain types via pg_type.typbasetype. Since domains over base types are idiomatic PostgreSQL, this is a fundamental correctness failure in reference resolution — not an edge case.
The failure mode is particularly dangerous because:
- No warnings are emitted for skipped columns
- Dry-run mode provides no indication of why objects appear orphaned
- The existing documentation note (from commit
64c604898e) is buried and doesn't emphasize data-loss risk
Fix: Recursive Domain Resolution (Patch Provided)
Nathan Bossart provided a patch that modifies the catalog query to recursively walk pg_type.typbasetype chains (handling domains-over-domains) until reaching a base type (typbasetype = 0), then checks if that base type is oid or lo. This is a well-understood catalog query pattern with negligible performance impact given shallow type hierarchies.
Scope Expansion: Composite Types and Arrays
Sami Imseih proposed extending detection to:
- Composite types with embedded
oid/lofields - Array types (
OID[],LO[])
Both are catalog-detectable but introduce architectural complexity: vacuumlo currently generates simple column IN (...) predicates, but composites require field-access expressions (e.g., ("data")."lo_ref") and arrays require unnest() or ANY() constructs. Imseih himself noted diminishing returns, since users sometimes store LO references in text columns which are fundamentally undetectable.
Converged Strategy
The thread settled on a three-pronged approach (explicitly agreed by Bossart):
- Improve detection — recursive domain resolution (immediate fix), plus composite/array support (future work)
- Improve documentation — make data-loss risk prominent, document what
vacuumlocannot catch - Steer users toward the
loextension — recommendlo_manage()triggers as the preferred lifecycle management approach, with statement-levelBEFORE TRUNCATEtriggers to handle theTRUNCATEgap
Additional Notes
- Bossart identified that
vacuumlo's test suite is inadequate ("pretty sad"), and this work is an opportunity to improve coverage - The idea of user-specified reference queries was floated but deferred due to complexity with multi-database mode (dblink)
- The original domain exclusion appears to be an oversight rather than a deliberate design decision