2026-05-18 · claude-opus-4-6

Deep Technical Analysis: Reducing EXPLAIN ANALYZE Timing Overhead via RDTSC

Core Problem

EXPLAIN ANALYZE with TIMING ON in PostgreSQL nearly doubles query execution time for simple, row-intensive queries. The root cause is the overhead of calling clock_gettime() twice per executor node per tuple (once in InstrStartNode(), once in InstrStopNode()). For a 50M row sequential scan, this means ~100M timestamp acquisitions, each involving:

The rdtscp instruction (used internally by the Linux vDSO implementation of clock_gettime), which acts as a one-way pipeline barrier — it waits for all prior instructions to complete before reading the TSC
Normalization of the raw cycle count into struct timespec (division into seconds + nanoseconds)
Function call overhead through the vDSO

The pipeline stall from rdtscp is the dominant cost. Profile data showed that InstrStartNode + InstrStopNode consumed 43% of total execution time (21.74% + 21.44%), with the rdtscp instruction itself accounting for 65% of the time within those functions. This destroys instruction-level parallelism, effectively turning a superscalar CPU into an in-order processor for those critical paths.

Architectural Significance

This problem affects:

Query optimization debugging: Timing skew makes it hard to identify actual bottlenecks
Relative timing accuracy: Two nodes with 2x actual runtime difference appear closer due to constant timing overhead
Production monitoring: Features like track_io_timing/track_wal_io_timing carry similar overhead, discouraging their use
pg_stat_statements: Timing overhead inflates reported query durations

Solution Architecture (Multi-Phase)

Phase 1: Represent instr_time as int64 nanoseconds (committed Jan 2023)

The first optimization converted instr_time from struct timespec (two fields: seconds + nanoseconds) to a single int64 counting nanoseconds. This:

Eliminated normalization arithmetic in hot paths
Reduced Instrumentation struct size from 448 to 368 bytes (80 bytes saved)
Reduced instruction count in InstrStopNode from 1016 to 96 instructions
Enabled compiler auto-vectorization of BufferUsageAccumDiff (four INSTR_TIME_ACCUM_DIFF calls vectorized into one SIMD operation)
Yielded ~10% improvement in timing-heavy benchmarks

Range analysis confirmed int64 nanoseconds provides ~292 years of range — sufficient since instr_time is never persisted to disk.

Phase 2: Direct TSC access via RDTSC (committed April 2026)

The core innovation: bypass clock_gettime() entirely and read the CPU's Time-Stamp Counter directly using the non-serializing RDTSC instruction for per-node timing.

Key design decisions:

RDTSC vs RDTSCP: The patch uses RDTSC (no pipeline barrier) for INSTR_TIME_SET_CURRENT_FAST() in InstrStartNode()/InstrStopNode(), and RDTSCP (half-barrier) for INSTR_TIME_SET_CURRENT() used in overall query timing. The rationale: for per-node timing, the barrier actually makes measurements less accurate by changing execution behavior, while for whole-query timing, precision matters more.

TSC-to-nanoseconds conversion: Since TSC returns raw cycle counts at the "reference frequency" (invariant since Nehalem ~2008), conversion requires knowing the TSC frequency. The patch implements a scaled integer multiplication approach:

ns = (ticks * ticks_per_ns_scaled) >> TICKS_TO_NS_SHIFT

With overflow detection for intervals > ~6.5 days, falling back to division. This avoids floating-point entirely in the hot path.

Frequency detection hierarchy:

CPUID leaf 0x40000010 (KVM/VMware hypervisors)
CPUID leaf 0x15/0x16 (bare metal Intel)
TSC calibration loop (AMD, unknown hypervisors) — measures cycles elapsed over wall-clock time, converging in <1ms typically

Safety mechanisms:

Requires CPU flags: RDTSCP available, TSC invariant, TSC_ADJUST (Intel)
GUC timing_clock_source with values: auto (default), system, tsc
auto mode on Linux also checks if kernel clocksource is "tsc"
Falls back to system clock if TSC frequency can't be determined
pg_test_timing enhanced to show both clock sources and warn if calibrated frequency diverges >10% from CPUID frequency

Performance Results

Scenario	System Clock	TSC Clock	Speedup
COUNT(*) 50M rows	4202ms → 1889ms (after int64)	1477ms	1.6x faster timing
Timing overhead ratio	2.06x slowdown	1.27x slowdown	—
pg_test_timing loop	23.5ns	11.8ns	2x faster
TPCH queries	baseline	~20% faster with ANALYZE	—
Complex plan (LIMIT+OFFSET 10M)	2.06x overhead	1.27x overhead	1.63x improvement

Key Technical Challenges Resolved

Hypervisor TSC Frequency Detection

A critical bug was discovered on AWS Windows instances ("drongo" buildfarm member): CPUID 0x15/0x16 reported 7 kHz as TSC frequency under the "nitro" hypervisor, causing wildly incorrect timings (negative values, 6.5 billion ms execution times). The fix: never trust CPUID 0x15/0x16 when running under a hypervisor — always use the hypervisor-specific CPUID leaf or calibration instead.

EXEC_BACKEND (Windows) Support

On Windows (fork-less architecture), the TSC frequency determined by the postmaster must be passed to child processes through BackendParameters, since each connection goes through restore_backend_variables rather than inheriting memory via fork.

Dynamic GUC Changes

Changing timing_clock_source mid-query produces nonsensical results because stored instr_time values are in one "frame of reference" (TSC ticks vs nanoseconds) and subsequent reads are in another. This is documented as unsupported behavior, with the GUC set to SUSET level to prevent casual misuse.

Compilation Overhead

Including <immintrin.h> (for __rdtsc()) in the widely-included instr_time.h doubled build times. Solution: use GCC/Clang built-in functions (__builtin_ia32_rdtsc(), __builtin_ia32_rdtscp()) directly, avoiding expensive header inclusion. Only MSVC requires <intrin.h>.

Design Tradeoffs

Linux clocksource check vs. self-detection: The patch checks /sys/devices/system/clocksource/clocksource0/current_clocksource on Linux as part of "auto" detection, encoding an OS dependency. This was debated but kept because the kernel's TSC validation (watchdog, multi-socket sync) is more sophisticated than what's practical to replicate.
GUC vs. automatic-only: Robert Haas argued that a GUC is essential if auto-detection can't be 100% reliable — the alternative is abandoning the feature entirely. The timing_clock_source GUC provides the necessary escape hatch.
Calibration loop at startup: Adds <50ms to postmaster startup on AMD systems and unknown hypervisors. Acceptable given the benefits, and avoided on Windows EXEC_BACKEND children by passing the frequency from the parent.
Not converting track_io_timing: Explicitly deferred as future work, though RDTSC would significantly reduce the overhead of IO timing in page-cache-heavy workloads.

Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Latest Update