Technical Analysis: RISC-V Zbb Popcount Optimization and Clang Auto-Vectorization Bug
Core Problem
This thread encompasses two distinct but related problems for PostgreSQL on RISC-V:
- Performance gap: RISC-V CPUs with the Zbb (bit manipulation) and Zbc (carry-less multiplication) extensions can execute popcount and CRC32C operations dramatically faster than software fallbacks (4× and 2000× respectively), but PostgreSQL has no mechanism to exploit these hardware features.
- Correctness bug: Clang 20's auto-vectorizer generates incorrect RISC-V Vector (RVV) scatter-store instructions for certain loop patterns, causing silent data corruption in des_init() and potentially zic.c. This is a compiler miscompilation that affects the buildfarm animal "greenfly."
The architectural significance lies in RISC-V's design philosophy of optional extensions — unlike x86-64 where SSE2 is a baseline guarantee, or AArch64 where NEON is mandatory, RISC-V makes even basic bit manipulation instructions optional. This forces any optimization to include runtime CPU feature detection (dispatch), adding complexity and a small performance tax on the non-accelerated path.
The Clang Auto-Vectorization Bug (0001)
Root Cause
Clang 20 at -O2 with -march=rv64gcv (or, as reported in the thread, even without explicit RVV enablement) transforms scatter-write idioms of the form dst[idx[i]] = expr into RVV vsoxei8 instructions (ordered indexed stores with 8-bit indices, i.e. a vector scatter). The code generated for these scatter stores is wrong: it produces incorrect permutation tables in des_init(), so the un_pbox values come out wrong.
The bug maps to LLVM issue #176001 ("RISC-V Wrong code at -O1"), which involves a vector peephole optimization with vmerge folding and was reportedly fixed by PR #176077. However, the fix has not shipped in any released Clang version available on RISC-V platforms (Clang 20 as packaged in Debian trixie and Ubuntu noble is affected).
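For readers unfamiliar with the idiom, the following is a minimal sketch of a scatter-write loop of the dst[idx[i]] = expr shape, together with a scalar consistency check. The function names and the tiny permutation are hypothetical and are not taken from crypt-des.c; the sketch only illustrates the loop shape that the vectorizer may turn into indexed stores.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scatter-write idiom: the store address depends on perm[i], so a
 * vectorizer has to emit indexed (scatter) stores to vectorize it.
 * Hypothetical helper; assumes perm holds each value 0..n-1 once.
 */
void invert_permutation(const uint8_t *perm, uint8_t *inverse, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inverse[perm[i]] = (uint8_t) i;
}

/*
 * Consistency check: returns 1 if inverse[perm[i]] == i for all i,
 * 0 otherwise. A miscompiled scatter loop would fail this check.
 */
int is_inverse(const uint8_t *perm, const uint8_t *inverse, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (inverse[perm[i]] != i)
            return 0;
    return 1;
}
```

A check of this kind, run once after table initialization, is one way a miscompiled table could be detected at startup rather than surfacing later as corrupt output.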
Scope of Impact
Greg performed a systematic audit of the PostgreSQL source tree, scanning for the scatter-write idiom and compiling with Clang 20 at -O2 -march=rv64gcv. Results:
| File | Pattern | Risk |
|---|---|---|
| contrib/pgcrypto/crypt-des.c | 3× vsoxei8 indexed scatter | Confirmed miscompile |
| src/timezone/zic.c (near line 2330) | 2× vsoxei8 indexed scatter | Same shape, likely affected |
| contrib/pg_trgm/trgm_op.c | 3× vsse8 strided store | Different pattern, lower risk |
The zic.c case is particularly insidious: it runs at make install time to compile IANA timezone data. A miscompile would produce silently corrupt timezone files that the test suite would likely never catch, since tests exercise only a tiny fraction of tzdata.
Proposed Solutions (Evolved)
The initial approach of sprinkling pg_memory_barrier() around affected loops was rightly criticized by Andres as treating symptoms rather than causes. The discussion converged toward:
1. Per-loop pragma vectorize(disable): doesn't scale as new sites are discovered
2. Per-function attribute: coarser but defensive within a function
3. Per-file -fno-tree-vectorize for affected files on clang+riscv: simple, static
4. Global -fno-tree-vectorize for clang+riscv64 until a fixed version: broadest fix
5. Hard error in configure for affected clang version range: cleanest if version bounds are known
The thread was heading toward option (3) or (4)+(5), pending a build of Clang from LLVM HEAD on the RISC-V hardware to determine if the upstream fix actually resolves the issue and to pin a specific version cutoff.
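As an illustration of the per-loop pragma option, here is a minimal sketch that disables loop vectorization for a single scatter loop only when building with Clang for RISC-V. The macro name NO_VECTORIZE_LOOP and the function are hypothetical; the pragma itself is Clang's documented loop hint. Whether suppressing the loop vectorizer is sufficient to avoid the miscompile is not established here, since the thread was still waiting on test results.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical macro: expands to Clang's loop hint only for Clang on
 * RISC-V, and to nothing elsewhere, so other compilers do not warn
 * about an unknown pragma.
 */
#if defined(__clang__) && defined(__riscv)
#define NO_VECTORIZE_LOOP _Pragma("clang loop vectorize(disable)")
#else
#define NO_VECTORIZE_LOOP
#endif

/* Same scatter-write shape as before, with vectorization opted out. */
void invert_permutation_novec(const uint8_t *perm, uint8_t *inverse, size_t n)
{
    NO_VECTORIZE_LOOP
    for (size_t i = 0; i < n; i++)
        inverse[perm[i]] = (uint8_t) i;
}
```

The drawback noted in the discussion is visible even in this small sketch: every affected loop needs its own annotation, which is why the coarser per-file or global compiler-flag options were being favored.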
Performance Patches (0002, 0003)
Popcount via Zbb (0002)
The Zbb extension provides a hardware cpop (count population) instruction. Benchmarks show:
- GCC with Zbb: 4.17× faster (2288 MB/s vs 548 MB/s)
- Clang with Zbb: 4.02× faster (1795 MB/s vs 447 MB/s)
The patch follows the existing pattern used for ARM NEON popcount: compile-time detection of Zbb support to enable the code path, with runtime detection to actually use it. This is necessary because RISC-V binaries compiled with -march=rv64gc_zbb will SIGILL on CPUs without Zbb.
The initial v1 patch naively added -march=rv64gc_zbb globally to CFLAGS without runtime dispatch — a fatal flaw immediately caught by Andres. v2 corrected this to match the ARM pattern.
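The runtime-dispatch pattern described above can be sketched roughly as follows. This is not the patch's code: all names are hypothetical, the CPU probe is a stub that always reports "no Zbb", and the "accelerated" routine simply uses the GCC/Clang builtin __builtin_popcount as a stand-in for a function that would be compiled with Zbb enabled. On Linux, the riscv_hwprobe interface is one way a real probe could be implemented.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*popcount_fn)(const unsigned char *buf, size_t len);

/* Portable fallback: clear the lowest set bit until the byte is zero. */
static uint64_t popcount_portable(const unsigned char *buf, size_t len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = buf[i];
        while (b) {
            b &= (unsigned char) (b - 1);
            total++;
        }
    }
    return total;
}

/* Stand-in for a Zbb-compiled routine (assumes a GCC/Clang builtin). */
static uint64_t popcount_accelerated(const unsigned char *buf, size_t len)
{
    uint64_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += (uint64_t) __builtin_popcount(buf[i]);
    return total;
}

/* Hypothetical probe: a real one would query the OS for Zbb support. */
static bool cpu_has_zbb(void)
{
    return false;
}

static uint64_t popcount_choose(const unsigned char *buf, size_t len);

/* Starts at the chooser; the first call replaces it with the real routine. */
static popcount_fn popcount_impl = popcount_choose;

static uint64_t popcount_choose(const unsigned char *buf, size_t len)
{
    popcount_impl = cpu_has_zbb() ? popcount_accelerated : popcount_portable;
    return popcount_impl(buf, len);
}

/* Public entry point: always one indirect call, accelerated or not. */
uint64_t popcount_buffer(const unsigned char *buf, size_t len)
{
    return popcount_impl(buf, len);
}
```

The indirect call in popcount_buffer is the overhead paid even on CPUs without the extension, which is part of what the maintenance-cost discussion in the thread is about.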
CRC32C via Zbc (0003)
Adapted from the Google Abseil project's RISC-V CRC32C implementation, this patch uses the clmul (carry-less multiply) instruction from the Zbc extension to accelerate the CRC32C computation:
- GCC with Zbc: 2004× faster (308,052 MB/s vs 154 MB/s)
- Clang with Zbc: 1807× faster (309,282 MB/s vs 171 MB/s)
The speedup is attributed to the difference in approach: the software CRC32C fallback is described as a byte-at-a-time table lookup, while clmul processes 64 bits per instruction and pipelines well. This is architecturally significant because CRC32C protects every WAL record, so an improvement of this magnitude in CRC computation, if it holds up, could meaningfully impact write-heavy workloads on RISC-V.
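To make the two approaches concrete, the sketch below shows a byte-at-a-time table-driven CRC32C and a software model of what a single clmul computes (the low 64 bits of a carry-less product). It is illustrative only: the function names are hypothetical, this is neither PostgreSQL's fallback nor the Abseil-derived patch, and the folding algorithm that turns clmul into a full CRC is not shown.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t crc32c_table[256];
static int crc32c_table_ready = 0;

/* Build the lookup table for the reflected Castagnoli polynomial. */
static void crc32c_init_table(void)
{
    for (uint32_t n = 0; n < 256; n++) {
        uint32_t c = n;
        for (int k = 0; k < 8; k++)
            c = (c & 1) ? (c >> 1) ^ 0x82F63B78u : (c >> 1);
        crc32c_table[n] = c;
    }
    crc32c_table_ready = 1;
}

/* Byte-at-a-time CRC32C: one table lookup per input byte. */
uint32_t crc32c_bytewise(const void *data, size_t len)
{
    const unsigned char *p = (const unsigned char *) data;
    uint32_t crc = 0xFFFFFFFFu;

    if (!crc32c_table_ready)
        crc32c_init_table();
    while (len--)
        crc = crc32c_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Software model of clmul: multiply a and b as polynomials over GF(2)
 * (XOR instead of add, no carries) and keep the low 64 bits.
 */
uint64_t clmul_lo64(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}
```

The bytewise routine performs a dependent table lookup for every input byte, whereas a clmul-based implementation folds whole 64-bit words with a handful of instructions; the software model above needs a 64-iteration loop to do what the hardware instruction does in one step.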
Key Design Tension: "Is the Juice Worth the Squeeze?"
Andres raised the fundamental question: does RISC-V have enough production adoption to justify the maintenance burden of CPU dispatch logic? His arguments:
- No real-world PostgreSQL production workloads on RISC-V yet
- CPU dispatch adds runtime overhead even when the extension isn't used
- RISC-V's fragmented extension landscape makes this an ongoing burden
- Even AVX-512 popcount on x86-64 was questionably worthwhile
Greg's counter-arguments:
- The patches follow existing patterns (ARM NEON) — bounded complexity
- RISC-V servers, cloud instances (Scaleway), and desktops exist
- CRC32C improvement is enormous (2000×) and affects WAL performance
- The patches serve as durable reference for when RISC-V matures
Nathan Bossart added context that AVX-512 popcount was partially motivated by vector operations that used popcount heavily, and that the trend is toward using simd.h abstractions (SSE2/NEON baseline) rather than architecture-specific optimizations.
Buildfarm Considerations
The thread also addressed buildfarm logistics:
- "greenfly" was switched from Clang to GCC per John Naylor's suggestion
- Álvaro Herrera noted that changing compilers requires a new animal (different personality)
- A new animal "mollusk" was registered for the same hardware running GCC
- greenfly was disabled (--test mode) until the Clang situation resolves
Current State
As of the last message, the thread is blocked on:
- Clang building from LLVM HEAD on greenfly (~24h compile time on the RISC-V CPU)
- Confirmation of whether the upstream fix resolves the DES and zic.c issues
- A v5 patch that will drop pg_memory_barrier() in favor of a proper vectorization suppression strategy
The performance patches (0002, 0003) remain viable but face philosophical resistance about whether RISC-V optimizations are premature given the platform's current adoption level.