vectorized CRC on ARM64

First seen: 2025-05-14 10:36:27+00:00 · Messages: 25 · Participants: 4

Latest Update

2026-05-06 · opus 4.7

Vectorized CRC on ARM64: Bringing PMULL-Based CRC32C to AArch64

Core Problem and Architectural Context

PostgreSQL uses CRC32C pervasively — every WAL record is protected by a CRC, as are control files, backup manifests, and replication messages. Because WAL insertion happens under the WALInsertLock and is on the hot path of every write transaction, the throughput of pg_comp_crc32c() has direct implications for OLTP write performance. Historically PostgreSQL has had three implementations selected at configure/runtime:

  1. sb8 — a software "slicing-by-8" table-driven fallback.
  2. armv8 — uses the ARMv8 CRC32 extension (scalar CRC32CX/CRC32CW/CRC32CB instructions), introduced years ago. Processes 8 bytes per instruction after an 8-byte alignment preamble (a sketch follows this list).
  3. SSE4.2 / VPCLMULQDQ on x86 — in PG 18, commit 3c6e8c12389 added a vectorized implementation using carryless multiplication (PCLMUL) to fold 64-byte chunks in parallel, dramatically speeding up long CRCs (relevant for big WAL records, large base backup manifests, etc.).
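
For orientation, here is a minimal sketch of the scalar armv8 path, assuming the ACLE CRC intrinsics; this is illustrative, not the committed code (the real implementation lives in src/port/pg_crc32c_armv8.c):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arm_acle.h>

/*
 * Illustrative sketch: align to 8 bytes with byte-wide CRC32CB, then
 * consume 8 bytes per CRC32CX instruction, then finish the tail.
 */
__attribute__((target("+crc")))
static uint32_t
crc32c_armv8_sketch(uint32_t crc, const unsigned char *p, size_t len)
{
    /* scalar preamble: byte at a time until 8-byte aligned */
    while (((uintptr_t) p & 7) && len > 0)
    {
        crc = __crc32cb(crc, *p++);
        len--;
    }
    /* main loop: 8 bytes per instruction */
    while (len >= 8)
    {
        uint64_t    v;

        memcpy(&v, p, 8);       /* avoids strict-aliasing issues */
        crc = __crc32cd(crc, v);
        p += 8;
        len -= 8;
    }
    /* tail */
    while (len > 0)
    {
        crc = __crc32cb(crc, *p++);
        len--;
    }
    return crc;
}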

On ARM64, the analog of PCLMULQDQ is the PMULL/PMULL2 instructions from the "crypto" extension (they operate on 64-bit halves of 128-bit NEON registers, producing 128-bit polynomial products). The thread's goal is to reach feature parity with the x86 vectorized path: fold multiple 16-byte lanes in parallel via PMULL, then reduce down to a scalar CRC32C instruction at the tail.
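
The folding primitive is easiest to see with intrinsics. Below is a minimal sketch of one 16-byte fold step; the helper name fold_16 and the constant layout are illustrative, not the committed code:

#include <arm_neon.h>

/*
 * One 16-byte fold step (illustrative). The two 64-bit halves of the
 * accumulator are carryless-multiplied by precomputed folding constants
 * (x^(n+64) mod P and x^n mod P for fold distance n), and both products
 * are XORed together with the next 16 bytes of input. Needs a +crypto
 * target for PMULL (gcc, or clang >= 16 per the thread).
 */
__attribute__((target("+crypto")))
static inline uint64x2_t
fold_16(uint64x2_t acc, uint64x2_t data, poly64x2_t k)
{
    uint64x2_t  lo = vreinterpretq_u64_p128(
        vmull_p64((poly64_t) vgetq_lane_u64(acc, 0), vgetq_lane_p64(k, 0)));
    uint64x2_t  hi = vreinterpretq_u64_p128(
        vmull_high_p64(vreinterpretq_p64_u64(acc), k));

    return veorq_u64(veorq_u64(lo, hi), data);
}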

Design of the Patch Series

The patches evolved substantially across v1→v5, but the final shape committed in 648818ba3 has these components:

1. Runtime dispatch with a fallback

A function pointer, pg_comp_crc32c, is initialized by pg_comp_crc32c_choose(); this is the same indirection model used on x86. Importantly, for ARM, Naylor chose to always go through the pointer, diverging from the x86 approach, in which small constant-length inputs are inlined to avoid the indirect-call overhead:

"I think always indirecting through a pointer will have less risk of regressions in a realistic setting than for x86 since Arm chips typically have low latency for carryless multiplication instructions."

Nonetheless, to protect the small-input case on the WAL-insert-lock critical path, a direct call is still emitted for compile-time-constant small inputs, via pg_integer_constant_p(), a portable replacement for __builtin_constant_p. The portable wrapper was needed because MSVC on ARM tripped over the GCC builtin, which caused a post-commit buildfarm failure on hoatzin.
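
A minimal sketch of the dispatch shape, assuming simplified names and an illustrative threshold (this is not the committed macro):

/*
 * Illustrative dispatch shape. When the compiler targets the CRC
 * extension unconditionally, compile-time-constant small lengths can
 * call the scalar implementation directly; everything else goes through
 * the pointer installed by pg_comp_crc32c_choose(). The 20-byte
 * threshold is made up for illustration.
 */
#define COMP_CRC32C(crc, data, len) \
    ((crc) = (pg_integer_constant_p(len) && (len) < 20) ? \
     pg_comp_crc32c_armv8((crc), (data), (len)) : \
     pg_comp_crc32c((crc), (data), (len)))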

2. PMULL main loop

The generated code folds the input through 4 × 16-byte accumulators using PMULL/PMULL2 + EOR, which on Apple cores can fuse into a single µop when adjacent — a micro-architectural detail Naylor confirmed via Dougall Johnson's Firestorm tables. The same Python generator used for x86 emits the ARM variant by swapping CLMUL mnemonics for NEON equivalents.
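
Such a helper looks roughly like this as inline asm (an illustrative sketch; the committed helpers differ in detail):

#include <arm_neon.h>

__attribute__((target("+crypto")))
static inline uint64x2_t
clmul_lo_eor(uint64x2_t a, uint64x2_t b, uint64x2_t c)
{
    uint64x2_t  r;

    /* keep pmull and eor adjacent so Apple cores can fuse the pair */
    __asm__("pmull  %0.1q, %1.1d, %2.1d\n\t"
            "eor    %0.16b, %0.16b, %3.16b"
            : "=&w"(r)
            : "w"(a), "w"(b), "w"(c));
    return r;
}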

The main loop requires 16-byte alignment (vs. the scalar path's 8-byte alignment), so a preamble consumes up to 15 bytes through the scalar instructions first. A minimum-input-length cutoff (~80 bytes) guarantees at least one full vector iteration; below that, the code drops straight into the scalar CRC32C path. Benchmarks on a Neoverse N1 showed the vector loop winning even at 64 bytes and running roughly 2× faster on long inputs.
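
Putting the pieces together, the vector entry point has roughly this shape (a sketch; the function name and constants approximate the committed code, and the scalar sketch from earlier stands in for the real fallback):

#include <stddef.h>
#include <stdint.h>
#include <arm_acle.h>

__attribute__((target("+crc")))
static uint32_t
pg_comp_crc32c_pmull_sketch(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    /* below the cutoff, a full vector iteration isn't guaranteed */
    if (len < 80)
        return crc32c_armv8_sketch(crc, p, len);    /* scalar path, above */

    /* scalar preamble: up to 15 bytes until p is 16-byte aligned */
    while ((uintptr_t) p & 15)
    {
        crc = __crc32cb(crc, *p++);
        len--;
    }

    /* ... fold through 4 x 16-byte accumulators with PMULL/PMULL2 ... */
    /* ... reduce the accumulators, finish the tail with scalar CRC32C ... */

    return crc;
}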

3. Inline assembly vs. intrinsics

A substantive review point from Haibo Tristan Yim: given that the rest of the code uses NEON intrinsics (vld1q_u64, etc.), why are the PMULL+EOR helpers inline assembly? Naylor's answer is pragmatic rather than principled — the upstream reference implementation used inline asm, and converting it to intrinsics would require re-validation across multiple (compiler × vendor) combinations too close to the PG 19 feature freeze. The risk/reward of changing it was judged unfavorable. This is a documented tech-debt item for future cleanup, not a conclusion that intrinsics are unsuitable.

A small but telling artifact: one NEON-literal construction in the generated code confused pgindent (unmatched parens from (uint64x2_t){crc0, 0}), so that one line was rewritten using veorq_u64() intrinsics purely to placate the formatter. A comment explaining this was itself reformatted by pgindent in the commit that Bossart ribbed Naylor about ("koel is going to complain…").
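
For concreteness, one plausible intrinsics-only spelling of that line follows; the committed rewrite may differ, and vsetq_lane_u64/vdupq_n_u64 are standard NEON intrinsics used here purely for illustration:

/* before: the compound literal whose braces tripped pgindent */
x = veorq_u64(x, (uint64x2_t){crc0, 0});

/* after: the same value built from intrinsics only */
x = veorq_u64(x, vsetq_lane_u64(crc0, vdupq_n_u64(0), 0));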

4. Build system: the +crypto flag problem

This is where most of the design friction occurred. The existing autoconf logic probes how the CRC intrinsics can be enabled and sets CFLAGS_CRC accordingly: empty when the default compiler target already includes the CRC instructions, or -march=armv8-a+crc when an extra flag is needed.

PMULL additionally requires the +crypto feature. Naylor's initial v3 added a fourth probe that rewrote CFLAGS_CRC to -march=armv8-a+crc+simd+crypto, which had two bad failure modes: it could silently clobber a packager-specified -march elsewhere in CFLAGS, and it made the outcome depend on the ordering of the configure tests.

Resolution (v4): use GCC/Clang __attribute__((target("+crypto"))) on the PMULL functions, so the global CFLAGS don't need the feature flag at all. Works back to gcc 6.3; fails gracefully on clang <16 (just no PMULL support there). Nathan Bossart endorsed this direction as the sturdier option.
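
At the source level the resolution is small; a sketch, with an illustrative helper name:

#include <arm_neon.h>

/*
 * Only this function is compiled with +crypto enabled; the rest of the
 * translation unit and the global CFLAGS are untouched.
 */
__attribute__((target("+crypto")))
static inline poly128_t
clmul(poly64_t a, poly64_t b)
{
    return vmull_p64(a, b);
}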

5. macOS runtime detection regression

The runtime-check dispatcher pg_crc32c_armv8_available() uses getauxval() / elf_aux_info() to probe HWCAP bits; neither exists on macOS. v3 consequently caused pg_comp_crc32c_choose() to recurse infinitely on macOS, because the "is armv8 CRC available" probe returned false even though the compiler was targeting the CRC extension unconditionally. Bossart diagnosed it by tracing through the #else return false; #endif fallthrough.

The v5 fix establishes the correct default before attempting runtime detection:

#ifdef USE_ARMV8_CRC32C
    pg_comp_crc32c = pg_comp_crc32c_armv8;   /* assume, since compiler targets it */
#else
    pg_comp_crc32c = pg_comp_crc32c_sb8;
#endif
    /* then attempt runtime upgrade to pmull path */

This preserves the pre-existing macOS behavior (the scalar armv8 CRC32C path is always used) while allowing Linux to upgrade to PMULL when HWCAP advertises it.
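
A sketch of the runtime-upgrade probe itself, with names and guards illustrative (HAVE_GETAUXVAL / HAVE_ELF_AUX_INFO stand in for the configure checks):

#include <stdbool.h>
#if defined(HAVE_GETAUXVAL) || defined(HAVE_ELF_AUX_INFO)
#include <sys/auxv.h>           /* getauxval() / elf_aux_info(), AT_HWCAP */
#endif

static bool
pg_crc32c_pmull_available(void)
{
#if defined(HAVE_GETAUXVAL) && defined(HWCAP_PMULL)
    /* Linux: the kernel advertises PMULL via the auxiliary vector */
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
#elif defined(HAVE_ELF_AUX_INFO) && defined(HWCAP_PMULL)
    /* FreeBSD spelling of the same probe */
    unsigned long hwcap = 0;

    elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap));
    return (hwcap & HWCAP_PMULL) != 0;
#else
    /* e.g. macOS: no probe available, keep the compile-time default */
    return false;
#endif
}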

6. Post-commit buildfarm fallout

Two fixes landed after the main commit:

  1. Swapping the direct __builtin_constant_p use for the portable pg_integer_constant_p(), after MSVC on ARM choked on the GCC builtin and broke the hoatzin buildfarm member (noted above).
  2. A pgindent-appeasing pass over the NEON-literal workaround and its comment, anticipated in the "koel is going to complain…" exchange (koel is the buildfarm member that checks indentation).

Key Technical Tradeoffs

Decision                              | Alternative                                  | Why chosen
Indirect call even for small inputs   | Inline constant-length path like x86         | ARM PMULL latency is low; indirection cost is a smaller fraction of total
__attribute__((target)) for +crypto   | Append +crypto to CFLAGS_CRC                 | Doesn't clobber packager-specified -march; no interaction with test ordering
Inline asm for PMULL helpers          | NEON intrinsics                              | Inherited from upstream; revalidation cost too high pre-freeze
16-byte alignment always              | Skip alignment on short inputs (as x86 does) | Four accumulators need 16 bytes anyway; ARM alignment cost is lower
80-byte cutoff to vector path         | Lower cutoff                                 | Guarantees at least one full loop iteration after the preamble

Participant Dynamics

John Naylor authored and iterated the series (v1 through v5) and made the pragmatic scoping calls: keeping the inherited inline asm, and always dispatching through the function pointer. Nathan Bossart was the primary reviewer, diagnosing the macOS infinite recursion, endorsing the target-attribute approach over rewriting CFLAGS_CRC, and ribbing Naylor over the pgindent slip. Haibo Tristan Yim contributed the inline-asm-versus-intrinsics review question.

Architectural Significance

This completes the x86+ARM vectorized CRC story for PG 19. Given WAL's centrality, even single-digit percent CRC improvements show up in benchmarks with large records (large tuples, logical decoding output, etc.). More importantly, the committed structure — compile-time target attributes + runtime HWCAP probing + per-platform default fallback — establishes a reusable pattern for future SIMD dispatch in PostgreSQL (e.g. if SVE2 or future ARM extensions become worth targeting, the scaffolding is in place).