Vacuumlo Silent Data Loss with Domain-over-OID Columns
Core Problem
The vacuumlo utility — PostgreSQL's tool for removing orphaned large objects — has a fundamental blind spot in its column-type scanning logic that can cause silent, irreversible data loss. When vacuumlo scans the system catalog to identify which columns hold large object references, it performs a simple name-based type check: it looks only for columns typed exactly as oid or lo. Crucially, it does not resolve domain types back to their base types via pg_type.typbasetype.
This means that if a user creates a domain over oid (a completely standard PostgreSQL practice for adding semantic clarity or constraints), any large object OIDs stored in those domain-typed columns will be invisible to vacuumlo's reference scan. The tool will conclude those large objects are orphaned and delete them, destroying referenced data with no warning.
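A minimal sketch of the failure scenario (the domain, table, and file paths are hypothetical, chosen for illustration):

```sql
-- Idiomatic domain over oid, created for semantic clarity:
CREATE DOMAIN image_ref AS oid;

CREATE TABLE photos (
    id        integer PRIMARY KEY,
    thumbnail oid,        -- matched by vacuumlo's name-based type check
    full_size image_ref   -- attribute type is the domain, so the scan skips it
);

INSERT INTO photos
VALUES (1, lo_import('/tmp/thumb.png'), lo_import('/tmp/full.png'));

-- vacuumlo counts the thumbnail reference but never sees full_size;
-- the second large object is reported as orphaned and unlinked.
```

The implicit cast from oid to the domain succeeds silently, so nothing at schema-design or insert time hints that the column has fallen outside vacuumlo's scan.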
Why This Matters Architecturally
Large objects in PostgreSQL live in the pg_largeobject system catalog, separate from the tables that reference them. The only link between a large object and its "owner" row is the OID value stored in a user column. This is an inherently fragile reference model — there are no foreign keys, no system-level tracking of which large objects are in use. vacuumlo exists precisely to bridge this gap by scanning all columns that could hold LO references and building a set of "referenced" OIDs. Any large object not in that set is deemed orphaned.
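The fragility of this reference model is visible with nothing but core functions (the table name and file path below are illustrative):

```sql
CREATE TABLE docs (id integer PRIMARY KEY, body oid);

-- lo_import() creates rows in pg_largeobject and returns the new OID;
-- storing that OID in docs.body is the only association the system has.
INSERT INTO docs VALUES (1, lo_import('/etc/hostname'));

-- Deleting the row severs the association without touching the LO:
DELETE FROM docs WHERE id = 1;
-- The large object's pages remain in pg_largeobject until lo_unlink()
-- (or vacuumlo, acting on its scan results) removes them.
```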
The correctness of this entire approach hinges on vacuumlo being exhaustive in its column scan. Missing even one column that stores LO references turns a maintenance operation into a data-destruction operation. The domain-type blind spot is therefore not a minor edge case — it's a fundamental correctness bug in the reference-resolution logic.
The Subtlety of the Failure Mode
Several factors make this particularly dangerous:
- No diagnostic output: vacuumlo emits no warning when it skips domain-typed columns. There is no indication that a column was excluded from the scan.
- Dry-run is misleading: The --dry-run/-n flag will report large objects as "would be removed," but provides no context about why they appear orphaned. A careful operator running dry-run first would still see no red flags.
- Domains are idiomatic: Creating domains over base types is a well-established PostgreSQL best practice for adding semantic meaning and constraints to columns. Users doing this are following good database design principles.
- Documentation is buried: While the limitation is documented (added by commit 64c604898e), it's a parenthetical note that's easy to miss, and it doesn't explicitly call out the data-loss consequence.
Proposed Solutions
1. Resolve Domain Types Recursively (Patch Provided)
Nathan Bossart identified that the fix is relatively straightforward: modify vacuumlo's catalog query to recursively resolve domain types back to their base types using pg_type.typbasetype. The recursion is necessary because PostgreSQL allows domains over domains (e.g., CREATE DOMAIN my_oid AS oid; CREATE DOMAIN my_special_oid AS my_oid;), so a simple single-level lookup is insufficient — a recursive CTE or equivalent is needed.
Bossart provided an initial patch for testing. This is the most complete fix as it eliminates the blind spot entirely.
2. Emit Warnings for Skipped Domain Columns
A lighter-weight alternative (or complement): have vacuumlo detect domain-over-oid/lo columns and emit a WARNING even if it doesn't include them in the scan. This would at least alert operators to the risk.
3. Documentation Improvements
Multiple participants agreed that at minimum, the documentation should be improved to:
- Make the data-loss risk far more prominent (not a parenthetical)
- Recommend the lo extension and lo_manage() trigger as the preferred approach
- Position vacuumlo as a legacy cleanup tool rather than a routine maintenance requirement
4. Steer Users Toward the lo Extension
Sami Imseih advocated for a broader strategy of recommending the lo extension with lo_manage() triggers, which handle LO lifecycle management at the row level. This approach works for both simple types and domains over simple types, and can be extended with custom triggers for complex cases (e.g., composite types holding LO references). The one gap is TRUNCATE (per-row triggers don't fire), but this can be handled with a statement-level BEFORE TRUNCATE trigger.
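The row-level part of this approach follows the pattern in the lo extension's documentation; the TRUNCATE workaround below is a sketch under stated assumptions (the table name image and the function unlink_rasters are hypothetical):

```sql
CREATE EXTENSION lo;

-- The extension's lo type is a domain over oid with managed lifecycle:
CREATE TABLE image (title text, raster lo);

-- lo_manage() unlinks the old LO when the column is updated or the row deleted:
CREATE TRIGGER t_raster BEFORE UPDATE OR DELETE ON image
    FOR EACH ROW EXECUTE FUNCTION lo_manage(raster);

-- TRUNCATE does not fire per-row triggers, so a statement-level trigger
-- must unlink the referenced LOs while the rows are still visible:
CREATE FUNCTION unlink_rasters() RETURNS trigger AS $$
BEGIN
    PERFORM lo_unlink(raster) FROM image WHERE raster IS NOT NULL;
    RETURN NULL;  -- return value is ignored for statement-level triggers
END $$ LANGUAGE plpgsql;

CREATE TRIGGER t_truncate BEFORE TRUNCATE ON image
    FOR EACH STATEMENT EXECUTE FUNCTION unlink_rasters();
```

With these triggers in place, large objects are cleaned up transactionally at the moment the referencing row goes away, rather than by a periodic external scan.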
5. User-Specified Reference Queries
Bossart floated the idea of allowing users to specify which tables/columns refer to LOs, possibly via a user-provided query. This would handle arbitrary schemas but adds complexity, especially with vacuumlo's multi-database mode (via dblink), where each database may have different schemas.
Key Technical Details
The Catalog Query Fix
The core of vacuumlo's logic queries pg_attribute joined with pg_type to find columns of type oid or lo. The fix requires changing this to either:
- Use a recursive CTE (WITH RECURSIVE) that walks pg_type.typbasetype chains until it reaches a base type (typbasetype = 0), then checks whether that base type is oid or the lo type
- Or, equivalently, chase pg_type.typbasetype through repeated self-joins in a subquery; note that PostgreSQL has no Oracle-style CONNECT BY, so WITH RECURSIVE is the idiomatic construct for this kind of hierarchical lookup
This is a well-understood pattern in PostgreSQL catalog queries and is unlikely to have performance concerns since the type hierarchy is typically shallow.
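A hedged sketch of what the recursive resolution could look like (this is not Bossart's actual patch; the column list and relkind filtering of the real vacuumlo query are simplified here):

```sql
-- Walk each attribute's type through any chain of domains to its base type,
-- then keep only columns whose resolved base type is oid or lo.
WITH RECURSIVE resolved AS (
    SELECT a.attrelid, a.attname, a.atttypid AS typid
    FROM pg_attribute a
    WHERE a.attnum > 0 AND NOT a.attisdropped
    UNION ALL
    SELECT r.attrelid, r.attname, t.typbasetype
    FROM resolved r
    JOIN pg_type t ON t.oid = r.typid
    WHERE t.typbasetype <> 0          -- still a domain: keep descending
)
SELECT c.relname, r.attname
FROM resolved r
JOIN pg_class c ON c.oid = r.attrelid
JOIN pg_type t ON t.oid = r.typid
WHERE t.typbasetype = 0               -- reached a non-domain base type
  AND (t.oid = 'oid'::regtype OR t.typname = 'lo');
```

Termination is guaranteed because PostgreSQL forbids cyclic domain definitions, so every typbasetype chain bottoms out at a base type in a handful of steps.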
Test Infrastructure Gap
Bossart noted that vacuumlo's test suite is "pretty sad," suggesting this is an opportunity to improve test coverage. This is relevant because the existing tests clearly didn't cover domain-typed columns — a gap that allowed this bug to persist.
Historical Context
Commit 64c604898e added the documentation note about domains being excluded, but neither the commit message nor the associated thread explain why domains were excluded. Given vacuumlo's quiet commit history, the most likely explanation is that domain resolution was simply never implemented — an oversight rather than a deliberate design decision.
Assessment
This thread represents a clear, well-motivated bug report that quickly converged on a solution. The fix is straightforward (recursive domain resolution in a catalog query), the risk is severe (silent data loss), and the patch was provided within two days. The main open questions are around test coverage and the broader documentation/recommendation strategy around the lo extension.