Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

First seen: 2020-06-12 23:28:10+00:00 · Messages: 172 · Participants: 22

Latest Update

2026-05-18 · claude-opus-4-6

Deep Technical Analysis: Reducing EXPLAIN ANALYZE Timing Overhead via RDTSC

Core Problem

EXPLAIN ANALYZE with TIMING ON in PostgreSQL nearly doubles query execution time for simple, row-intensive queries. The root cause is the overhead of calling clock_gettime() twice per executor node per tuple (once in InstrStartNode(), once in InstrStopNode()). For a 50M row sequential scan, this means ~100M timestamp acquisitions, each involving:

  1. The rdtscp instruction (used internally by the Linux vDSO implementation of clock_gettime), which acts as a one-way pipeline barrier — it waits for all prior instructions to complete before reading the TSC
  2. Normalization of the raw cycle count into struct timespec (division into seconds + nanoseconds)
  3. Function call overhead through the vDSO

The pipeline stall from rdtscp is the dominant cost. Profile data showed that InstrStartNode + InstrStopNode consumed 43% of total execution time (21.74% + 21.44%), with the rdtscp instruction itself accounting for 65% of the time within those functions. This destroys instruction-level parallelism, effectively turning a superscalar CPU into an in-order processor for those critical paths.

Architectural Significance

This problem affects:

Solution Architecture (Multi-Phase)

Phase 1: Represent instr_time as int64 nanoseconds (committed Jan 2023)

The first optimization converted instr_time from struct timespec (two fields: seconds + nanoseconds) to a single int64 counting nanoseconds. This:

Range analysis confirmed int64 nanoseconds provides ~292 years of range — sufficient since instr_time is never persisted to disk.

Phase 2: Direct TSC access via RDTSC (committed April 2026)

The core innovation: bypass clock_gettime() entirely and read the CPU's Time-Stamp Counter directly using the non-serializing RDTSC instruction for per-node timing.

Key design decisions:

RDTSC vs RDTSCP: The patch uses RDTSC (no pipeline barrier) for INSTR_TIME_SET_CURRENT_FAST() in InstrStartNode()/InstrStopNode(), and RDTSCP (half-barrier) for INSTR_TIME_SET_CURRENT() used in overall query timing. The rationale: for per-node timing, the barrier actually makes measurements less accurate by changing execution behavior, while for whole-query timing, precision matters more.

TSC-to-nanoseconds conversion: Since TSC returns raw cycle counts at the "reference frequency" (invariant since Nehalem ~2008), conversion requires knowing the TSC frequency. The patch implements a scaled integer multiplication approach:

ns = (ticks * ticks_per_ns_scaled) >> TICKS_TO_NS_SHIFT

With overflow detection for intervals > ~6.5 days, falling back to division. This avoids floating-point entirely in the hot path.

Frequency detection hierarchy:

  1. CPUID leaf 0x40000010 (KVM/VMware hypervisors)
  2. CPUID leaf 0x15/0x16 (bare metal Intel)
  3. TSC calibration loop (AMD, unknown hypervisors) — measures cycles elapsed over wall-clock time, converging in <1ms typically

Safety mechanisms:

Performance Results

Scenario System Clock TSC Clock Speedup
COUNT(*) 50M rows 4202ms → 1889ms (after int64) 1477ms 1.6x faster timing
Timing overhead ratio 2.06x slowdown 1.27x slowdown
pg_test_timing loop 23.5ns 11.8ns 2x faster
TPCH queries baseline ~20% faster with ANALYZE
Complex plan (LIMIT+OFFSET 10M) 2.06x overhead 1.27x overhead 1.63x improvement

Key Technical Challenges Resolved

Hypervisor TSC Frequency Detection

A critical bug was discovered on AWS Windows instances ("drongo" buildfarm member): CPUID 0x15/0x16 reported 7 kHz as TSC frequency under the "nitro" hypervisor, causing wildly incorrect timings (negative values, 6.5 billion ms execution times). The fix: never trust CPUID 0x15/0x16 when running under a hypervisor — always use the hypervisor-specific CPUID leaf or calibration instead.

EXEC_BACKEND (Windows) Support

On Windows (fork-less architecture), the TSC frequency determined by the postmaster must be passed to child processes through BackendParameters, since each connection goes through restore_backend_variables rather than inheriting memory via fork.

Dynamic GUC Changes

Changing timing_clock_source mid-query produces nonsensical results because stored instr_time values are in one "frame of reference" (TSC ticks vs nanoseconds) and subsequent reads are in another. This is documented as unsupported behavior, with the GUC set to SUSET level to prevent casual misuse.

Compilation Overhead

Including <immintrin.h> (for __rdtsc()) in the widely-included instr_time.h doubled build times. Solution: use GCC/Clang built-in functions (__builtin_ia32_rdtsc(), __builtin_ia32_rdtscp()) directly, avoiding expensive header inclusion. Only MSVC requires <intrin.h>.

Design Tradeoffs

  1. Linux clocksource check vs. self-detection: The patch checks /sys/devices/system/clocksource/clocksource0/current_clocksource on Linux as part of "auto" detection, encoding an OS dependency. This was debated but kept because the kernel's TSC validation (watchdog, multi-socket sync) is more sophisticated than what's practical to replicate.

  2. GUC vs. automatic-only: Robert Haas argued that a GUC is essential if auto-detection can't be 100% reliable — the alternative is abandoning the feature entirely. The timing_clock_source GUC provides the necessary escape hatch.

  3. Calibration loop at startup: Adds <50ms to postmaster startup on AMD systems and unknown hypervisors. Acceptable given the benefits, and avoided on Windows EXEC_BACKEND children by passing the frequency from the parent.

  4. Not converting track_io_timing: Explicitly deferred as future work, though RDTSC would significantly reduce the overhead of IO timing in page-cache-heavy workloads.