Can I Beat Clang’s Auto-Vectorizer on Apple Silicon? A SAXPY Case Study

On modern CPUs, the “slow” part of your code often isn’t the math; it’s how you feed the math units. Compilers try to fix this with clever optimizations such as auto-vectorization: transforming scalar loops into SIMD (Single Instruction, Multiple Data) operations that process multiple data points in parallel. I wanted to know: if I hand-write NEON intrinsics on Apple Silicon, tune the loop unrolling, and stare at the assembly, can I beat Clang’s auto-vectorizer? Short answer: not easily. This post walks through:

- A tiny vectorization playground for the SAXPY (Single-Precision A·X Plus Y) kernel on Apple Silicon (sketched below).
- Benchmark results for scalar vs. auto-vectorized vs. manual NEON implementations.
- How changing the loop unrolling in the manual NEON version affects performance.
- A look at the AArch64 assembly Clang generates for the auto-vectorized version.
- What this says about hand-written vector code vs. compiler auto-vectorization.

...
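For reference, SAXPY itself is tiny: y[i] = a * x[i] + y[i] over single-precision floats. A minimal scalar baseline (function names here are mine, not necessarily the post's) looks like this:

```c
#include <stddef.h>

// Scalar SAXPY reference: y[i] = a * x[i] + y[i]
void saxpy_scalar(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Marking the pointers `restrict` tells the compiler that x and y don't alias, which makes auto-vectorization straightforward; without it, Clang typically guards the vector loop with a runtime overlap check.

And here is a hand-written NEON counterpart in the spirit of the manual implementation the post benchmarks: a sketch assuming a 2× unroll (two 128-bit vectors, so 8 floats per iteration) with a scalar tail. The unroll factors the post actually tunes may differ.

```c
#include <arm_neon.h>
#include <stddef.h>

// Manual NEON SAXPY, unrolled 2x: 8 floats per loop iteration.
void saxpy_neon_unroll2(size_t n, float a, const float *restrict x, float *restrict y) {
    float32x4_t va = vdupq_n_f32(a);      // broadcast a into all 4 lanes
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        float32x4_t x0 = vld1q_f32(x + i);
        float32x4_t x1 = vld1q_f32(x + i + 4);
        float32x4_t y0 = vld1q_f32(y + i);
        float32x4_t y1 = vld1q_f32(y + i + 4);
        y0 = vfmaq_f32(y0, va, x0);       // fused multiply-add: y0 += a * x0
        y1 = vfmaq_f32(y1, va, x1);
        vst1q_f32(y + i, y0);
        vst1q_f32(y + i + 4, y1);
    }
    for (; i < n; ++i)                    // scalar tail for the n % 8 leftovers
        y[i] = a * x[i] + y[i];
}
```

Both compile with `clang -O3` on Apple Silicon; NEON is always available on AArch64, so no extra target flags are needed.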

November 22, 2025 · 6 min · Samarth Narang
