Deep Technical Analysis: Data-Level Dependencies in pg_dump/pg_upgrade
The Core Problem
PostgreSQL's pg_dump architecture separates output into distinct sections: pre-data (schema definitions), data (table contents via COPY), and post-data (indexes, constraints, triggers). This separation is fundamental to how pg_dump enables parallel restore and logical ordering. However, it creates an implicit assumption: no schema definition depends on the data content of another table.
This assumption breaks with certain extension patterns, most notably PostGIS. The canonical example involves PostGIS's spatial_ref_sys table, which contains spatial reference system definitions (SRID entries). When a user table has a column defined with a geometry type constrained to a specific SRID (e.g., geometry(Point, 4326)), the column's type modifier validation requires that the referenced SRID already exists as a row in spatial_ref_sys at the time the column is created. This creates a cross-section dependency: a pre-data object (table DDL) depends on another table's data section content.
During pg_restore or pg_upgrade, this manifests as:
- Pre-data section runs: all table definitions are created
- User table with geometry column references SRID 4326
- The type modifier function queries
spatial_ref_sys— but it's empty because data hasn't been loaded yet - Restore fails
This is architecturally significant because pg_depend — PostgreSQL's dependency tracking catalog — records object-to-object dependencies (OIDs), not object-to-row dependencies. There is no existing catalog infrastructure to express "this column definition depends on specific rows in another table."
Proposed Solutions
Solution 1: --extra-dependencies Option (Original Patch)
Jeevan Chalke's patch introduces a user-facing option for both pg_dump and pg_upgrade:
Mechanism:
- Users specify dependencies in the format
table#referenced_table - pg_dump injects these as synthetic edges in the topological sort graph (the same graph used for standard
pg_depend-based ordering) - For referenced tables, data is dumped immediately after the table's schema definition, breaking the normal section separation
Key Implementation Details:
- Leverages
dumpTableData()infrastructure but invokes it during pre-data emission - Effectively creates a "pre-data data" subsection for specific tables
- The dependency hint forces ordering in the sort algorithm without requiring catalog changes
Tradeoffs:
- Requires user knowledge of implicit dependencies (bad UX)
- Works for pg_upgrade where the tool can be told about known extension patterns
- Doesn't solve the general case; every new extension with this pattern needs manual configuration
- Breaks the clean section separation that parallel restore relies on
Solution 2: user_catalog_table + pg_dump Ordering (Peter Eisentraut)
Peter Eisentraut proposes a more principled approach:
Mechanism:
- PostGIS marks
spatial_ref_syswithALTER TABLE ... SET (user_catalog_table = true) - pg_dump is modified to recognize user catalog tables and dump their contents in the pre-data section (before other DDL that might depend on them)
Why this is architecturally superior:
user_catalog_tablealready exists in PostgreSQL (it enables HOT-avoidance and marks tables as infrastructure-like)- No user-facing options needed; the dependency information is stored in the catalog (
reloptions) - pg_dump can generically handle all such cases without per-extension knowledge
- The extension author declares intent, not the end user
Remaining work: pg_dump doesn't currently inspect user_catalog_table for ordering decisions, so code changes are still needed.
Solution 3: Extension-First Restoration (Paul Ramsey's Suggestion)
Paul Ramsey questions why extensions aren't restored completely (schema + data) before user objects:
Rationale:
- Extension tables like
spatial_ref_sysare already known to pg_dump as extension members (viapg_extension_config_dump) - If all extension objects (including their data) were fully materialized before any user DDL, the dependency would be naturally satisfied
Challenges:
- This would require restructuring pg_dump's section model significantly
- Extensions can themselves depend on user objects in complex upgrade scenarios
- May introduce circular dependency issues in edge cases
- Parallel restore optimization assumes section boundaries
Architectural Tension
The fundamental tension in this thread is between three philosophies:
- User-side workaround (Chalke): Give users/tools a knob to inject missing dependency information. Pragmatic but inelegant.
- Catalog-driven autodiscovery (Eisentraut): Use existing catalog mechanisms (
user_catalog_table) to let pg_dump infer the correct behavior. Clean but requires extension cooperation. - Structural reordering (Ramsey): Change pg_dump's fundamental ordering to handle extensions as atomic units. Most comprehensive but most invasive.
Matthias van de Meent's feedback crystallizes the community preference: dependencies should be stored in catalogs and resolved automatically, not require user input. This aligns with PostgreSQL's general philosophy of catalog-driven behavior.
Current Status
As of the last message (May 2025), no consensus has been reached. Peter Eisentraut's user_catalog_table approach appears to have the most traction as a "principled" solution, but Paul Ramsey raises valid questions about whether it introduces unintended side effects for extension tables and whether a simpler ordering change (extension-complete-first) would suffice. The original patch remains a proof-of-concept without commitment path.