Can I Beat Clang’s Auto-Vectorizer on Apple Silicon? A SAXPY Case Study

On modern CPUs, the “slow” part of your code often isn’t the math; it’s how you feed the math units. Compilers try to fix this with clever optimizations such as auto-vectorization: transforming scalar loops into SIMD (Single Instruction, Multiple Data) operations that process multiple data points in parallel. I wanted to know: if I hand-write NEON intrinsics on Apple Silicon, tune the loop unrolling, and stare at the assembly, can I beat Clang’s auto-vectorizer? Short answer: not easily. This post walks through:

- A tiny vectorization playground for the SAXPY (Single-Precision A·X Plus Y) kernel on Apple Silicon (sketched below).
- Benchmark results for scalar vs. auto-vectorized vs. manual NEON implementations.
- How changing the loop unrolling in the manual NEON version affects performance.
- A look at the AArch64 assembly Clang generates for the auto-vectorized version.
- What this says about hand-written vector code vs. compiler auto-vectorization.

...
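For reference, SAXPY itself is tiny: y[i] = a * x[i] + y[i] over single-precision floats. A minimal scalar baseline (function names here are mine, not necessarily the post's) looks like this:

```c
#include <stddef.h>

// Scalar SAXPY reference: y[i] = a * x[i] + y[i]
void saxpy_scalar(size_t n, float a, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Marking the pointers `restrict` tells the compiler that x and y don't alias, which makes auto-vectorization straightforward; without it, Clang typically guards the vector loop with a runtime overlap check.

And here is a hand-written NEON counterpart in the spirit of the manual implementation the post benchmarks: a sketch assuming a 2× unroll (two 128-bit vectors, so 8 floats per iteration) with a scalar tail. The unroll factors the post actually tunes may differ.

```c
#include <arm_neon.h>
#include <stddef.h>

// Manual NEON SAXPY, unrolled 2x: 8 floats per loop iteration.
void saxpy_neon_unroll2(size_t n, float a, const float *restrict x, float *restrict y) {
    float32x4_t va = vdupq_n_f32(a);      // broadcast a into all 4 lanes
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        float32x4_t x0 = vld1q_f32(x + i);
        float32x4_t x1 = vld1q_f32(x + i + 4);
        float32x4_t y0 = vld1q_f32(y + i);
        float32x4_t y1 = vld1q_f32(y + i + 4);
        y0 = vfmaq_f32(y0, va, x0);       // fused multiply-add: y0 += a * x0
        y1 = vfmaq_f32(y1, va, x1);
        vst1q_f32(y + i, y0);
        vst1q_f32(y + i + 4, y1);
    }
    for (; i < n; ++i)                    // scalar tail for the n % 8 leftovers
        y[i] = a * x[i] + y[i];
}
```

Both compile with `clang -O3` on Apple Silicon; NEON is always available on AArch64, so no extra target flags are needed.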

November 22, 2025 · 6 min · Samarth Narang
