vectorized CRC on ARM64

First seen: 2025-05-14 10:36:27+00:00 · Messages: 25 · Participants: 4

Latest Update

2026-05-06 · opus 4.7

Vectorized CRC on ARM64: Bringing PMULL-Based CRC32C to AArch64

Core Problem and Architectural Context

PostgreSQL uses CRC32C pervasively — every WAL record is protected by a CRC, as are control files, backup manifests, and replication messages. Because WAL insertion happens under the WALInsertLock and is on the hot path of every write transaction, the throughput of pg_comp_crc32c() has direct implications for OLTP write performance. Historically PostgreSQL has had three implementations selected at configure/runtime:

  1. sb8 — a software "slicing-by-8" table-driven fallback.
  2. armv8 — uses the ARMv8 CRC32 extension (scalar CRC32CX/CRC32CW/CRC32CB instructions), introduced years ago. Processes 8 bytes per instruction after an 8-byte alignment preamble (a sketch follows this list).
  3. SSE4.2 / VPCLMULQDQ on x86 — in PG 18, commit 3c6e8c12389 added a vectorized implementation using carryless multiplication (PCLMUL) to fold 64-byte chunks in parallel, dramatically speeding up long CRCs (relevant for big WAL records, large base backup manifests, etc.).
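
For orientation, here is a minimal sketch of the scalar armv8 path, assuming the ACLE CRC intrinsics; this is illustrative, not the committed code (the real implementation lives in src/port/pg_crc32c_armv8.c):

#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arm_acle.h>

/*
 * Illustrative sketch: align to 8 bytes with byte-wide CRC32CB, then
 * consume 8 bytes per CRC32CX instruction, then finish the tail.
 */
__attribute__((target("+crc")))
static uint32_t
crc32c_armv8_sketch(uint32_t crc, const unsigned char *p, size_t len)
{
    /* scalar preamble: byte at a time until 8-byte aligned */
    while (((uintptr_t) p & 7) && len > 0)
    {
        crc = __crc32cb(crc, *p++);
        len--;
    }
    /* main loop: 8 bytes per instruction */
    while (len >= 8)
    {
        uint64_t    v;

        memcpy(&v, p, 8);       /* avoids strict-aliasing issues */
        crc = __crc32cd(crc, v);
        p += 8;
        len -= 8;
    }
    /* tail */
    while (len > 0)
    {
        crc = __crc32cb(crc, *p++);
        len--;
    }
    return crc;
}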

On ARM64, the analog of PCLMULQDQ is the PMULL/PMULL2 instructions from the "crypto" extension (they operate on 64-bit halves of 128-bit NEON registers, producing 128-bit polynomial products). The thread's goal is to reach feature parity with the x86 vectorized path: fold multiple 16-byte lanes in parallel via PMULL, then reduce down to a scalar CRC32C instruction at the tail.
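
The folding primitive is easiest to see with intrinsics. Below is a minimal sketch of one 16-byte fold step; the helper name fold_16 and the constant layout are illustrative, not the committed code:

#include <arm_neon.h>

/*
 * One 16-byte fold step (illustrative). The two 64-bit halves of the
 * accumulator are carryless-multiplied by precomputed folding constants
 * (x^(n+64) mod P and x^n mod P for fold distance n), and both products
 * are XORed together with the next 16 bytes of input. Needs a +crypto
 * target for PMULL (gcc, or clang >= 16 per the thread).
 */
__attribute__((target("+crypto")))
static inline uint64x2_t
fold_16(uint64x2_t acc, uint64x2_t data, poly64x2_t k)
{
    uint64x2_t  lo = vreinterpretq_u64_p128(
        vmull_p64((poly64_t) vgetq_lane_u64(acc, 0), vgetq_lane_p64(k, 0)));
    uint64x2_t  hi = vreinterpretq_u64_p128(
        vmull_high_p64(vreinterpretq_p64_u64(acc), k));

    return veorq_u64(veorq_u64(lo, hi), data);
}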

Design of the Patch Series

The patches evolved substantially across v1→v5, but the final shape committed in 648818ba3 has these components:

1. Runtime dispatch with a fallback

A function pointer, pg_comp_crc32c, is initialized by pg_comp_crc32c_choose(); this is the same indirection model used on x86. Importantly, for ARM, Naylor chose to always go through the pointer, diverging from the x86 approach, in which small constant-length inputs are inlined to avoid the indirect-call overhead:

"I think always indirecting through a pointer will have less risk of regressions in a realistic setting than for x86 since Arm chips typically have low latency for carryless multiplication instructions."

Nonetheless, to protect the small-input case on the WAL-insert-lock critical path, a direct call is still emitted for compile-time-constant small inputs, via pg_integer_constant_p(), a portable replacement for __builtin_constant_p. The portable wrapper was needed because MSVC on ARM tripped over the GCC builtin, which caused a post-commit buildfarm failure on hoatzin.
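
A minimal sketch of the dispatch shape, assuming simplified names and an illustrative threshold (this is not the committed macro):

/*
 * Illustrative dispatch shape. When the compiler targets the CRC
 * extension unconditionally, compile-time-constant small lengths can
 * call the scalar implementation directly; everything else goes through
 * the pointer installed by pg_comp_crc32c_choose(). The 20-byte
 * threshold is made up for illustration.
 */
#define COMP_CRC32C(crc, data, len) \
    ((crc) = (pg_integer_constant_p(len) && (len) < 20) ? \
     pg_comp_crc32c_armv8((crc), (data), (len)) : \
     pg_comp_crc32c((crc), (data), (len)))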

2. PMULL main loop

The generated code folds the input through 4 × 16-byte accumulators using PMULL/PMULL2 + EOR, which on Apple cores can fuse into a single µop when adjacent — a micro-architectural detail Naylor confirmed via Dougall Johnson's Firestorm tables. The same Python generator used for x86 emits the ARM variant by swapping CLMUL mnemonics for NEON equivalents.
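
Such a helper looks roughly like this as inline asm (an illustrative sketch; the committed helpers differ in detail):

#include <arm_neon.h>

__attribute__((target("+crypto")))
static inline uint64x2_t
clmul_lo_eor(uint64x2_t a, uint64x2_t b, uint64x2_t c)
{
    uint64x2_t  r;

    /* keep pmull and eor adjacent so Apple cores can fuse the pair */
    __asm__("pmull  %0.1q, %1.1d, %2.1d\n\t"
            "eor    %0.16b, %0.16b, %3.16b"
            : "=&w"(r)
            : "w"(a), "w"(b), "w"(c));
    return r;
}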

The main loop requires 16-byte alignment (vs. the scalar path's 8-byte alignment), so a preamble consumes up to 15 bytes through the scalar instructions first. A minimum-input-length cutoff (~80 bytes) guarantees at least one full vector iteration; below that, the code drops straight into the scalar CRC32C path. Benchmarks on a Neoverse N1 showed the vector loop winning even at 64 bytes and running roughly 2× faster on long inputs.
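
Putting the pieces together, the vector entry point has roughly this shape (a sketch; the function name and constants approximate the committed code, and the scalar sketch from earlier stands in for the real fallback):

#include <stddef.h>
#include <stdint.h>
#include <arm_acle.h>

__attribute__((target("+crc")))
static uint32_t
pg_comp_crc32c_pmull_sketch(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    /* below the cutoff, a full vector iteration isn't guaranteed */
    if (len < 80)
        return crc32c_armv8_sketch(crc, p, len);    /* scalar path, above */

    /* scalar preamble: up to 15 bytes until p is 16-byte aligned */
    while ((uintptr_t) p & 15)
    {
        crc = __crc32cb(crc, *p++);
        len--;
    }

    /* ... fold through 4 x 16-byte accumulators with PMULL/PMULL2 ... */
    /* ... reduce the accumulators, finish the tail with scalar CRC32C ... */

    return crc;
}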

3. Inline assembly vs. intrinsics

A substantive review point from Haibo Tristan Yim: given that the rest of the code uses NEON intrinsics (vld1q_u64, etc.), why are the PMULL+EOR helpers inline assembly? Naylor's answer is pragmatic rather than principled — the upstream reference implementation used inline asm, and converting it to intrinsics would require re-validation across multiple (compiler × vendor) combinations too close to the PG 19 feature freeze. The risk/reward of changing it was judged unfavorable. This is a documented tech-debt item for future cleanup, not a conclusion that intrinsics are unsuitable.

A small but telling artifact: one NEON-literal construction in the generated code confused pgindent (unmatched parens from (uint64x2_t){crc0, 0}), so that one line was rewritten using veorq_u64() intrinsics purely to placate the formatter. A comment explaining this was itself reformatted by pgindent in the commit that Bossart ribbed Naylor about ("koel is going to complain…").
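
For concreteness, one plausible intrinsics-only spelling of that line follows; the committed rewrite may differ, and vsetq_lane_u64/vdupq_n_u64 are standard NEON intrinsics used here purely for illustration:

/* before: the compound literal whose braces tripped pgindent */
x = veorq_u64(x, (uint64x2_t){crc0, 0});

/* after: the same value built from intrinsics only */
x = veorq_u64(x, vsetq_lane_u64(crc0, vdupq_n_u64(0), 0));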

4. Build system: the +crypto flag problem

This is where most of the design friction occurred. The existing autoconf logic probes how the CRC intrinsics can be enabled and sets CFLAGS_CRC accordingly: empty when the default compiler target already includes the CRC instructions, or -march=armv8-a+crc when an extra flag is needed.

PMULL additionally requires the +crypto feature. Naylor's initial v3 added a fourth probe that rewrote CFLAGS_CRC to -march=armv8-a+crc+simd+crypto, which had two bad failure modes: it could silently clobber a packager-specified -march elsewhere in CFLAGS, and it made the outcome depend on the ordering of the configure tests.

Resolution (v4): use GCC/Clang __attribute__((target("+crypto"))) on the PMULL functions, so the global CFLAGS don't need the feature flag at all. Works back to gcc 6.3; fails gracefully on clang <16 (just no PMULL support there). Nathan Bossart endorsed this direction as the sturdier option.
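
At the source level the resolution is small; a sketch, with an illustrative helper name:

#include <arm_neon.h>

/*
 * Only this function is compiled with +crypto enabled; the rest of the
 * translation unit and the global CFLAGS are untouched.
 */
__attribute__((target("+crypto")))
static inline poly128_t
clmul(poly64_t a, poly64_t b)
{
    return vmull_p64(a, b);
}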

5. macOS runtime detection regression

The runtime-check dispatcher pg_crc32c_armv8_available() uses getauxval() / elf_aux_info() to probe HWCAP bits; neither exists on macOS. v3 consequently caused pg_comp_crc32c_choose() to recurse infinitely on macOS, because the "is armv8 CRC available" probe returned false even though the compiler was targeting the CRC extension unconditionally. Bossart diagnosed it by tracing through the #else return false; #endif fallthrough.

The v5 fix establishes the correct default before attempting runtime detection:

#ifdef USE_ARMV8_CRC32C
    pg_comp_crc32c = pg_comp_crc32c_armv8;   /* assume, since compiler targets it */
#else
    pg_comp_crc32c = pg_comp_crc32c_sb8;
#endif
    /* then attempt runtime upgrade to pmull path */

This preserves the pre-existing macOS behavior (the scalar armv8 CRC32C path is always used) while allowing Linux to upgrade to PMULL when HWCAP advertises it.
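
A sketch of the runtime-upgrade probe itself, with names and guards illustrative (HAVE_GETAUXVAL / HAVE_ELF_AUX_INFO stand in for the configure checks):

#include <stdbool.h>
#if defined(HAVE_GETAUXVAL) || defined(HAVE_ELF_AUX_INFO)
#include <sys/auxv.h>           /* getauxval() / elf_aux_info(), AT_HWCAP */
#endif

static bool
pg_crc32c_pmull_available(void)
{
#if defined(HAVE_GETAUXVAL) && defined(HWCAP_PMULL)
    /* Linux: the kernel advertises PMULL via the auxiliary vector */
    return (getauxval(AT_HWCAP) & HWCAP_PMULL) != 0;
#elif defined(HAVE_ELF_AUX_INFO) && defined(HWCAP_PMULL)
    /* FreeBSD spelling of the same probe */
    unsigned long hwcap = 0;

    elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap));
    return (hwcap & HWCAP_PMULL) != 0;
#else
    /* e.g. macOS: no probe available, keep the compile-time default */
    return false;
#endif
}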

6. Post-commit buildfarm fallout

Two fixes landed after the main commit:

  1. Swapping the direct __builtin_constant_p use for the portable pg_integer_constant_p(), after MSVC on ARM choked on the GCC builtin and broke the hoatzin buildfarm member (noted above).
  2. A pgindent-appeasing pass over the NEON-literal workaround and its comment, anticipated in the "koel is going to complain…" exchange (koel is the buildfarm member that checks indentation).

Key Technical Tradeoffs

Decision                              | Alternative                                  | Why chosen
Indirect call even for small inputs   | Inline constant-length path like x86         | ARM PMULL latency is low; indirection cost is a smaller fraction of total
__attribute__((target)) for +crypto   | Append +crypto to CFLAGS_CRC                 | Doesn't clobber packager-specified -march; no interaction with test ordering
Inline asm for PMULL helpers          | NEON intrinsics                              | Inherited from upstream; revalidation cost too high pre-freeze
16-byte alignment always              | Skip alignment on short inputs (as x86 does) | Four accumulators need 16 bytes anyway; ARM alignment cost is lower
80-byte cutoff to vector path         | Lower cutoff                                 | Guarantees at least one full loop iteration after the preamble

Participant Dynamics

John Naylor authored and iterated the series (v1 through v5) and made the pragmatic scoping calls: keeping the inherited inline asm, and always dispatching through the function pointer. Nathan Bossart was the primary reviewer, diagnosing the macOS infinite recursion, endorsing the target-attribute approach over rewriting CFLAGS_CRC, and ribbing Naylor over the pgindent slip. Haibo Tristan Yim contributed the inline-asm-versus-intrinsics review question.

Architectural Significance

This completes the x86+ARM vectorized CRC story for PG 19. Given WAL's centrality, even single-digit percent CRC improvements show up in benchmarks with large records (large tuples, logical decoding output, etc.). More importantly, the committed structure — compile-time target attributes + runtime HWCAP probing + per-platform default fallback — establishes a reusable pattern for future SIMD dispatch in PostgreSQL (e.g. if SVE2 or future ARM extensions become worth targeting, the scaffolding is in place).