Technical Analysis: Statistics Import/Export for Performance Testing
Core Problem
The thread addresses a fundamental question about the PostgreSQL statistics import/export feature (introduced in recent versions): can exported statistics from a production environment be used in a staging environment to reproduce the same query plans for performance testing?
This is a common enterprise scenario: organizations maintain separate production and staging environments with different data volumes but want to validate that application changes won't cause plan regressions in production. The Statistics Import and Export feature (pg_dump --statistics-only, pg_restore_relation_stats(), pg_restore_attribute_stats()) seems like it could solve this, but the reality is more nuanced.
Why This Matters Architecturally
The PostgreSQL planner does not rely solely on pg_statistic catalog entries to estimate costs. Several other factors influence plan selection:
1. Relation Size Estimation (table_block_relation_estimate_size())
This is the critical architectural insight in the thread. The planner's row estimation works as follows:
density = reltuples / relpages (from pg_class)
estimated_rows = density * actual_current_pages (from filesystem)
The function RelationGetNumberOfBlocks(rel) queries the actual physical size of the relation on disk. This means even if you perfectly replicate pg_class.reltuples and pg_class.relpages via statistics import, the planner will still use the real number of blocks in the staging table to scale the tuple estimate. This design exists intentionally — it allows statistics to remain useful as tables grow between ANALYZE runs without becoming stale.
This is well-documented in the row estimation examples: the planner multiplies tuple density by current pages, making statistics import insufficient on its own when table sizes differ.
2. Index OID Ordering
When multiple paths have "fuzzily equal" costs (see add_path() and compare_path_costs_fuzzily()), the tiebreaker can depend on the order indexes are presented. Since get_relation_info() sorts indexes by OID, and OIDs are not guaranteed to match after dump/restore, plan choices could differ even with identical cost estimates. This is an edge case but represents a real source of non-determinism.
3. GUC Configuration
Parameters like random_page_cost, seq_page_cost, effective_cache_size, work_mem, cpu_tuple_cost, etc., directly affect cost calculations and must be identical between environments.
Key Technical Disagreement
An important exchange occurred between Corey Huinker (author of the statistics import/export feature) and David Rowley regarding how table_block_relation_estimate_size() works:
- Huinker initially claimed that the function determines relation size from
pg_classvalues (which are set bypg_restore_relation_stats()), implying statistics import would be sufficient. - Rowley corrected this, pointing out that while
pg_class.relpagesandpg_class.reltuplesare used to compute tuple density, the actual page count comes fromRelationGetNumberOfBlocks(rel)— a filesystem-level call that returns the real table size.
This correction is architecturally significant: it means statistics import alone cannot make the planner behave identically unless the physical table sizes also match. Huinker acknowledged the correction.
Proposed Solutions
Workaround (David Rowley)
Use get_relation_info_hook to override:
- Relation sizes (to match production)
- Index ordering (to ensure deterministic tiebreaking)
This hook-based approach would require an extension but could theoretically produce identical plans.
Documentation Patch (Yugo Nagata)
A documentation patch (CF entry: https://commitfest.postgresql.org/patch/6155/) adds a warning to the statistics manipulation functions documentation:
"Manually restored statistics do not guarantee that the same query plans will be generated as in the source environment, since factors such as relation sizes, index OIDs, and configuration parameters may affect planner behavior."
Cross-Reference Follow-up (Vellaipandiyan)
A follow-up documentation patch adds a cross-reference from the statistics manipulation warning to the planner statistics documentation section.
Limitations Even With Perfect Statistics Matching
Even if plans could be made identical, Huinker correctly notes that statistics import only helps with EXPLAIN plan generation in isolation. It cannot reproduce:
work_memoverflow behavior with real data volumes- Buffer cache behavior and I/O patterns
- Lock contention from concurrent workloads
- Actual tuple processing costs with different data distributions
Status
The documentation patch is registered in commitfest. The thread represents an important clarification of the feature's intended scope versus user expectations, and the documentation improvement helps set appropriate expectations for the statistics import/export feature.