Import Statistics in postgres_fdw before resorting to sampling.

First seen: 2025-08-12 17:02:39+00:00 · Messages: 82 · Participants: 7

Latest Update

2026-05-14 · claude-opus-4-6

Import Statistics in postgres_fdw: Deep Technical Analysis

Core Problem

When PostgreSQL performs ANALYZE on a foreign table managed by postgres_fdw, it must collect a representative sample of rows from the remote server across the network, then locally compute statistics (histograms, MCVs, distinct counts, etc.) from that sample. This is architecturally problematic for several reasons:

  1. Network bandwidth: Fetching thousands of sample rows across the wire is expensive, especially for wide tables or high-latency connections.
  2. Sample quality degradation: The local sample size is constrained by network cost, meaning the remote server's own ANALYZE (operating on local storage with much larger samples) produces superior statistics.
  3. Redundant computation: If the remote table already has up-to-date statistics in its pg_statistic catalog, re-deriving them locally from a inferior sample is wasteful.

The proposed solution allows postgres_fdw to directly import pre-computed statistics from the remote server's pg_stats view, bypassing the expensive sampling operation entirely.

Architectural Design & Evolution

The FDW Callback Interface

The patch introduces a new FDW callback ImportForeignStatistics (originally named ImportStatistics) in the FdwRoutine struct. This is analogous to existing callbacks like AnalyzeForeignTable and ImportForeignSchema. The callback is invoked by analyze_rel() in the core before attempting the traditional sampling path.

Initially, a two-callback design was proposed by Etsuro Fujita:

This was later simplified back to a single callback when the fallback-to-sampling behavior was accepted, making the importability check less critical as a separate step.

Data Flow

  1. Query pg_class on the remote server for relation-level stats (relpages, reltuples, relkind)
  2. If reltuples indicates the table has been analyzed (not 0 or -1), query pg_stats for attribute-level statistics
  3. Match remote column stats to local column definitions via name mapping
  4. Call pg_restore_relation_stats() and pg_restore_attribute_stats() via SPI to import the data
  5. If any step fails, fall back to traditional row sampling

Why SPI + pg_restore_*_stats() Instead of Direct Catalog Manipulation

An early design question was whether to construct VacAttrStats and call update_attstats() directly. This was rejected because:

The SPI approach uses a prepared statement for attribute stats (called once per column) and parameterized queries for relation stats.

Column Matching Strategy

The implementation builds a RemoteAttributeMapping array that maps local attribute names to remote attribute names (via column_name options). The array is sorted by remote attname, and the remote query uses ORDER BY s.attname, enabling an O(N) merge-join match rather than O(N²) nested loops. A critical correctness fix addressed collation: the ORDER BY must use COLLATE "C" to match C-library qsort.

Inherited Statistics Handling

For remote tables that are inheritance parents, the query uses DISTINCT ON (s.attname) ... ORDER BY s.attname, s.inherited DESC to prefer inherited stats (inh=true) over non-inherited. This is correct because querying a foreign table backed by an inheritance parent fetches rows from all children.

A subtle bug was identified: if the local foreign table is an inheritance parent, the original design would skip the recursive ANALYZE needed for inherited stats. This was fixed by ensuring ImportForeignStatistics only replaces the non-recursive ANALYZE path.

Stale Statistics Problem

A significant design tension emerged around what happens when remote stats are stale. Fujita argued strongly that:

This led to the final decision to default the feature to off, with explicit opt-in required.

The remote_analyze Option (Deferred)

A remote_analyze option was proposed to automatically ANALYZE the remote table before fetching stats. This was ultimately deferred to a future patch due to:

Critical Bug Fix: Stale Local Stats

Fujita discovered that re-importing stats would leave old attribute statistics intact if new stats had NULL values for some stat kinds (e.g., removing histogram_bounds when data becomes uniform). The fix: call pg_clear_attribute_stats before pg_restore_attribute_stats for each column.

Extended Statistics Limitation

The StatisticsAreImportable check (before it was removed) disqualified tables with extended statistics objects, because those require actual row samples to compute and cannot be derived from imported base statistics.

Partition/Sharding Limitation

A major acknowledged limitation: when foreign tables are children in an inheritance/partition hierarchy, acquire_inherited_sample_rows() still needs actual row samples to compute parent-level statistics. The import path only helps with direct ANALYZE of individual foreign tables, not the recursive case. This means the common sharding use case (partitioned table with foreign table partitions) only gets 50% benefit — individual partition stats are imported, but parent stats still require sampling.

Post-Commit Issues

After commit, Coverity flagged unsafe strncpy() usage where remote_attname could exceed NAMEDATALEN. The fix replaced fixed-size char buffers with dynamically allocated strings (pstrdup) and added a free_remattrmap() cleanup function, with the NULL checks converted to assertions since build_remattrmap() guarantees non-NULL values.

Key Design Decisions Summary

Decision Outcome Rationale
SPI vs direct catalog writes SPI Reuses validation logic in pg_restore_*_stats
Default for import_stats false Cannot guarantee freshness of remote stats
Fallback to sampling Yes (with WARNING) User expects ANALYZE to produce some stats
Error vs warning on failure Error for structural/network failures, notice for "no stats found" Network errors likely affect fallback too
remote_analyze Deferred to v20 Complexity, permissions, time constraints
Single vs dual callback Single (ImportForeignStatistics) Simpler when fallback is allowed
Protocol for remote queries Simple query (PQsendQuery) Less overhead than extended query protocol