Add RISC-V Zbb popcount optimization

First seen: 2026-03-21 16:54:10+00:00 · Messages: 20 · Participants: 6

Latest Update

2026-06-01 · claude-opus-4-6

Technical Analysis: RISC-V Zbb Popcount Optimization and Clang Auto-Vectorization Bug

Core Problem

This thread encompasses two distinct but related problems for PostgreSQL on RISC-V:

  1. Performance gap: RISC-V CPUs with the Zbb (bit manipulation) and Zbc (carry-less multiplication) extensions can execute popcount and CRC32C operations far faster than the software fallbacks (reported speedups of roughly 4x and 2000x respectively), but PostgreSQL has no mechanism to exploit these hardware features.

  2. Correctness bug: Clang 20's auto-vectorizer generates incorrect RISC-V Vector (RVV) scatter-store instructions for certain loop patterns, causing silent data corruption in des_init() and potentially zic.c. This is a compiler miscompilation that affects the buildfarm animal "greenfly."

The architectural significance lies in RISC-V's design philosophy of optional extensions — unlike x86-64 where SSE2 is a baseline guarantee, or AArch64 where NEON is mandatory, RISC-V makes even basic bit manipulation instructions optional. This forces any optimization to include runtime CPU feature detection (dispatch), adding complexity and a small performance tax on the non-accelerated path.

The Clang Auto-Vectorization Bug (0001)

Root Cause

Clang 20 at -O2 with -march=rv64gcv (or, according to the thread, even without explicit RVV enablement) transforms scatter-write idioms of the form dst[idx[i]] = expr into RVV vsoxei8 instructions (ordered indexed stores with 8-bit indices). The codegen for these scatter stores is buggy: in des_init() it builds incorrect permutation tables, so the un_pbox values come out wrong.
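For readers unfamiliar with the idiom, the following is a minimal sketch of the loop shape in question, not the actual code from crypt-des.c. The function names are hypothetical. The first function inverts a permutation with a scatter write; the second is a simple self-check that would notice a wrong table of the kind described above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scatter-write idiom: the store address depends on a loaded index,
 * i.e. dst[idx[i]] = expr. Here it builds the inverse of a permutation.
 * (Illustrative only; names are not taken from the PostgreSQL source.)
 */
void invert_permutation(const uint8_t *perm, uint8_t *inverse, size_t n)
{
    for (size_t i = 0; i < n; i++)
        inverse[perm[i]] = (uint8_t) i;
}

/* Returns 1 if inverse[perm[i]] == i for every i, otherwise 0. */
int inverse_is_consistent(const uint8_t *perm, const uint8_t *inverse, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (inverse[perm[i]] != i)
            return 0;
    return 1;
}
```

A check like inverse_is_consistent() is one way a miscompiled table could be detected at startup, though nothing in the thread says such a check was proposed.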

The bug maps to LLVM issue #176001 ("RISC-V Wrong code at -O1") involving vector peephole optimization with vmerge folding, reportedly fixed by PR #176077. However, the fix hasn't shipped in any released Clang version available on RISC-V platforms (Clang 20 in Debian/Ubuntu noble/trixie is affected).

Scope of Impact

Greg performed a systematic audit of the PostgreSQL source tree, scanning for the scatter-write idiom and compiling with Clang 20 at -O2 -march=rv64gcv. Results:

File                            Pattern                    Risk
contrib/pgcrypto/crypt-des.c    vsoxei8 indexed scatter    Confirmed miscompile
src/timezone/zic.c ~L2330       vsoxei8 indexed scatter    Same shape, likely affected
contrib/pg_trgm/trgm_op.c       vsse8 strided store        Different pattern, lower risk

The zic.c case is particularly insidious: it runs at make install time to compile IANA timezone data. A miscompile would produce silently corrupt timezone files that the test suite would likely never catch, since tests exercise only a tiny fraction of tzdata.

Proposed Solutions (Evolved)

The initial approach of sprinkling pg_memory_barrier() around affected loops was rightly criticized by Andres as treating symptoms rather than causes. The discussion converged toward:

  1. Per-loop #pragma clang loop vectorize(disable): does not scale as new sites are discovered
  2. Per-function attribute: coarser but defensive within a function
  3. Per-file -fno-tree-vectorize for affected files on clang+riscv: simple, static
  4. Global -fno-tree-vectorize for clang+riscv64 until a fixed version: broadest fix
  5. Hard error in configure for the affected clang version range: cleanest if version bounds are known

The thread was heading toward option (3) or (4)+(5), pending a build of Clang from LLVM HEAD on the RISC-V hardware to determine if the upstream fix actually resolves the issue and to pin a specific version cutoff.
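As an illustration of option (1), the sketch below wraps Clang's loop pragma in a macro so that it only takes effect for clang on RISC-V. The macro name NO_VECTORIZE_LOOP and the function scatter_bytes() are hypothetical and do not come from any posted patch.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper macro: expands to Clang's per-loop pragma only when
 * building with clang for RISC-V, and to nothing everywhere else.
 */
#if defined(__clang__) && defined(__riscv)
#define NO_VECTORIZE_LOOP _Pragma("clang loop vectorize(disable)")
#else
#define NO_VECTORIZE_LOOP
#endif

/* A scatter-write loop with auto-vectorization suppressed on affected builds. */
void scatter_bytes(uint8_t *dst, const uint8_t *idx, const uint8_t *src, size_t n)
{
    NO_VECTORIZE_LOOP
    for (size_t i = 0; i < n; i++)
        dst[idx[i]] = src[i];
}
```

Options (3) and (4) need no source change at all, since the flag is added by the build system, which is part of why the per-loop approach was seen as the least scalable.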

Performance Patches (0002, 0003)

Popcount via Zbb (0002)

The Zbb extension provides a hardware cpop (count population) instruction. Benchmarks reported in the thread show roughly a 4x speedup over the software fallback.

The patch follows the existing pattern used for ARM NEON popcount: compile-time detection of Zbb support decides whether the accelerated code path is built, and runtime detection decides whether it is used. Runtime detection is necessary because a binary compiled with -march=rv64gc_zbb will raise SIGILL when it executes a Zbb instruction on a CPU without the extension.
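The build-time plus run-time split can be sketched with a function pointer that resolves itself on first call. This is a hedged illustration, not the patch itself: all function names are hypothetical, and the use of the Linux riscv_hwprobe system call for detection is an assumption (the thread summary does not say which detection mechanism the patch uses). It also assumes kernel headers recent enough to provide asm/hwprobe.h with the Zbb bit.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__riscv) && defined(__linux__)
#include <asm/hwprobe.h>
#include <sys/syscall.h>
#include <unistd.h>
#endif

/* Portable fallback: clear the lowest set bit until none remain. */
int popcount64_portable(uint64_t x)
{
    int n = 0;

    while (x)
    {
        x &= x - 1;
        n++;
    }
    return n;
}

/*
 * Stand-in for the accelerated routine. In a real build this would live in
 * a file compiled with Zbb enabled so that the builtin lowers to cpop.
 */
int popcount64_fast(uint64_t x)
{
    return __builtin_popcountll((unsigned long long) x);
}

/* Assumption: ask the kernel via riscv_hwprobe; report "no Zbb" elsewhere. */
static int cpu_has_zbb(void)
{
#if defined(__riscv) && defined(__linux__) && defined(RISCV_HWPROBE_EXT_ZBB)
    struct riscv_hwprobe pair = {.key = RISCV_HWPROBE_KEY_IMA_EXT_0};

    if (syscall(__NR_riscv_hwprobe, &pair, 1, 0, NULL, 0) == 0)
        return (pair.value & RISCV_HWPROBE_EXT_ZBB) != 0;
#endif
    return 0;
}

static int popcount64_choose(uint64_t x);

/* Starts at the chooser, then points at the selected implementation. */
static int (*popcount64_impl) (uint64_t) = popcount64_choose;

static int popcount64_choose(uint64_t x)
{
    popcount64_impl = cpu_has_zbb() ? popcount64_fast : popcount64_portable;
    return popcount64_impl(x);
}

/* Public entry point: every call pays one indirect jump. */
int popcount64(uint64_t x)
{
    return popcount64_impl(x);
}
```

The indirect call in popcount64() is the dispatch cost that applies even on CPUs where only the portable path is available.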

The initial v1 patch naively added -march=rv64gc_zbb globally to CFLAGS without runtime dispatch — a fatal flaw immediately caught by Andres. v2 corrected this to match the ARM pattern.

CRC32C via Zbc (0003)

Adapted from the Google Abseil project's RISC-V CRC32C implementation, this uses the clmul (carry-less multiply) instruction from the Zbc extension to accelerate CRC32C. The thread reports a speedup of roughly 2000x over the software fallback.

The reported speedup is attributed to the software CRC32C fallback being a table-driven loop (slicing-by-8 in PostgreSQL), while clmul folds 64 bits per instruction and pipelines well. This is architecturally significant because CRC32C protects every WAL record (data page checksums use a separate algorithm), so a 2000x improvement in CRC computation could meaningfully affect write-heavy workloads on RISC-V.
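To make the primitive concrete, here is a small sketch of what a carry-less multiply computes, alongside a bit-at-a-time CRC32C reference. Neither function is the Abseil-derived folding algorithm from the patch; both are illustrative software models with hypothetical names. clmul64_soft() models the low 64 bits of the product, which is what the Zbc clmul instruction returns on RV64.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Carry-less multiply: like schoolbook binary multiplication, but partial
 * products are combined with XOR instead of addition. Returns the low
 * 64 bits of the product.
 */
uint64_t clmul64_soft(uint64_t a, uint64_t b)
{
    uint64_t r = 0;

    for (int i = 0; i < 64; i++)
        if ((b >> i) & 1)
            r ^= a << i;
    return r;
}

/*
 * Bit-at-a-time CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
 * Slow, but useful as a reference when validating a faster implementation.
 */
uint32_t crc32c_bitwise(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}
```

Because a CRC is a polynomial remainder over GF(2), and carry-less multiplication is polynomial multiplication over GF(2), a clmul-based implementation can fold many input bits per step instead of shifting one bit at a time as the reference does.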

Key Design Tension: "Is the Juice Worth the Squeeze?"

Andres raised the fundamental question: does RISC-V have enough production adoption to justify the maintenance burden of CPU dispatch logic? His arguments:

  1. No real-world PostgreSQL production workloads on RISC-V yet
  2. CPU dispatch adds runtime overhead even when the extension isn't used
  3. RISC-V's fragmented extension landscape makes this an ongoing burden
  4. Even AVX-512 popcount on x86-64 was questionably worthwhile

Greg's counter-arguments:

  1. The patches follow existing patterns (ARM NEON) — bounded complexity
  2. RISC-V servers, cloud instances (Scaleway), and desktops exist
  3. CRC32C improvement is enormous (2000×) and affects WAL performance
  4. The patches serve as durable reference for when RISC-V matures

Nathan Bossart added context that AVX-512 popcount was partially motivated by vector operations that used popcount heavily, and that the trend is toward using simd.h abstractions (SSE2/NEON baseline) rather than architecture-specific optimizations.

Buildfarm Considerations

The thread also addressed buildfarm logistics for greenfly, the RISC-V animal affected by the miscompile.

Current State

As of the last message, the thread is blocked on:

  1. Clang building from LLVM HEAD on greenfly (~24h compile time on the RISC-V CPU)
  2. Confirmation of whether the upstream fix resolves the DES and zic.c issues
  3. A v5 patch that will drop pg_memory_barrier() in favor of a proper vectorization suppression strategy

The performance patches (0002, 0003) remain viable but face philosophical resistance about whether RISC-V optimizations are premature given the platform's current adoption level.