Can I Beat Clang’s Auto-Vectorizer on Apple Silicon? A SAXPY Case Study

On modern CPUs, the “slow” part of your code often isn’t the math; it’s how you feed the math units. Compilers try to fix this with clever optimizations such as auto-vectorization: transforming scalar loops into SIMD (Single Instruction, Multiple Data) operations that process multiple data points in parallel. I wanted to know: if I hand-write NEON intrinsics on Apple Silicon, tune the loop unrolling, and stare at the assembly, can I beat Clang’s auto-vectorizer? Short answer: not easily. This post walks through:

- A tiny vectorization playground for the SAXPY (Single-Precision A·X Plus Y) kernel on Apple Silicon.
- Benchmark results for scalar vs. auto-vectorized vs. manual NEON implementations.
- How changing loop unrolling in the manual NEON version affects performance.
- A look at the AArch64 assembly generated by Clang for the auto-vectorized version.
- What this says about hand-written vector code vs. compiler auto-vectorization.

...
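To make the comparison concrete, here is a minimal sketch of the two endpoints of that spectrum: a plain scalar SAXPY loop that Clang is free to auto-vectorize, and a hand-written NEON version using a 4-wide fused multiply-add. The function names and the 4-wide layout are my own illustrative choices, not the post’s exact benchmark harness.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Scalar SAXPY: y[i] = a * x[i] + y[i]. At -O2/-O3 Clang will typically
// auto-vectorize this loop on Apple Silicon.
void saxpy_scalar(float a, const float *x, float *y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

// Hand-written NEON version: 4 floats per iteration, scalar tail for the rest.
void saxpy_neon(float a, const float *x, float *y, std::size_t n) {
  float32x4_t va = vdupq_n_f32(a);          // broadcast a into all 4 lanes
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    float32x4_t vx = vld1q_f32(x + i);
    float32x4_t vy = vld1q_f32(y + i);
    vst1q_f32(y + i, vfmaq_f32(vy, va, vx)); // vy + va * vx (fused multiply-add)
  }
  for (; i < n; ++i)                         // remainder elements
    y[i] = a * x[i] + y[i];
}
```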

November 22, 2025 · 6 min · Samarth Narang

Data Structure and Iterator Kung Fu in LLVM

Practical patterns, zero-copy views, and safe mutation loops for faster LLVM passes.
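As a taste of the patterns covered there, here is a minimal sketch (my own illustration, not code from the post) of two such idioms: an ArrayRef parameter as a zero-copy view over caller-owned storage, and make_early_inc_range for safely erasing instructions while iterating.

```cpp
// Illustrative sketch only; names and the dead-instruction example are mine.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/Transforms/Utils/Local.h"

using namespace llvm;

// Zero-copy view: ArrayRef borrows the caller's storage instead of copying it.
static unsigned countNonZero(ArrayRef<int> Values) {
  return count_if(Values, [](int V) { return V != 0; });
}

// Safe mutation loop: make_early_inc_range advances the iterator before the
// body runs, so erasing the current instruction does not invalidate iteration.
static void dropDeadInstructions(Function &F) {
  for (Instruction &I : make_early_inc_range(instructions(F)))
    if (isInstructionTriviallyDead(&I))
      I.eraseFromParent();
}
```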

November 9, 2025 · 8 min · Samarth Narang

Booting Up: A Verbose Debug Build of Life and Compilers

“If life had a compiler, I’d probably still be tuning the optimization flags.” Welcome to Tiled Thoughts — my verbose debug build. I’m Samarth, a compiler engineer at Qualcomm. My work revolves around building efficient ML systems, contributing to open-source compiler infrastructures like LLVM and MLIR, and exploring the intersection of programming languages and machine learning. This blog is where I log the things that don’t quite fit into a Git commit message — reflections, experiments, and observations tiled across compilers, ML systems, open-source, and education. ...

November 5, 2025 · 1 min · Samarth Narang

Want to get notified when new posts go live?

✉️ Email me to subscribe