I tried to hand-write NEON intrinsics on Apple Silicon, tune loop unrolling, and beat Clang’s auto-vectorizer. Spoiler: it’s harder than it looks.
On modern CPUs, the “slow” part of your code often isn’t the math; it’s how you feed the math units. Compilers try to fix this with clever optimizations such as auto-vectorization: transforming scalar loops into SIMD (Single Instruction, Multiple Data) operations that process multiple data points in parallel.
I wanted to know:
If I hand-write NEON intrinsics on Apple Silicon, tune the loop unrolling, and stare at the assembly, can I beat Clang’s auto-vectorizer?
Short answer: not easily.
This post walks through:
- A tiny vectorization playground for the SAXPY (Single-Precision A·X Plus Y) kernel on Apple Silicon.
- Benchmark results for scalar vs auto-vectorized vs manual NEON implementations.
- How changing loop unrolling in the manual NEON version affects performance.
- A look at the AArch64 assembly generated by Clang for the auto-vectorized version.
- What this says about hand-written vector code vs compiler auto-vectorization.
Code
All the code in this post lives in:
https://github.com/snarang181/vector-playground
The core binary is vec_bench, a microbenchmark that runs different
kernels/variants and prints out performance metrics, such as time taken
and Giga Floating Point Operations per Second (GFLOPS).
1. The Playground: vec_bench on Apple Silicon
The benchmark supports:
- Kernels
  - saxpy: The SAXPY operation, defined as Y = A * X + Y, where A is a scalar and X and Y are vectors.
  - dot: The dot product, defined as result = sum(X[i] * Y[i]).
- Variants
  - scalar: A straightforward scalar implementation with vectorization explicitly disabled.
  - auto: A version that relies on Clang’s auto-vectorization capabilities.
  - neon: A hand-written NEON intrinsics implementation, with configurable loop unrolling.
The CLI looks like:
./vec_bench \
--kernel saxpy \
--variant auto \
--n 2000000 \
--iters 200
and outputs performance metrics like:
Benchmark Results:
Kernel: saxpy
Variant: auto
Size: 2000000
Iterations: 200
Unroll Factor: 2
Total Time (s): 0.0592365
Performance (GFLOPS): 13.5052
Checksum: -93580.5
- --n: Size of the vectors.
- --iters: Number of iterations to run the benchmark.
- GFLOP/s assumes 2 FLOPs per element processed for SAXPY (1 multiplication + 1 addition).
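For reference, the GFLOPS figure falls straight out of that FLOP count. A minimal sketch of the arithmetic (not the repo’s exact code):

double gflops(std::size_t n, int iters, double total_seconds) {
// SAXPY: 2 FLOPs per element (1 mul + 1 add), over n elements and iters iterations
double flops = 2.0 * static_cast<double>(n) * iters;
return flops / (total_seconds * 1e9);
}

Plugging in the run above: 2 × 2,000,000 × 200 / (0.0592365 × 10⁹) ≈ 13.5 GFLOP/s, matching the reported number.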
Building on Apple Silicon
To build the benchmark on Apple Silicon, use:
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS_RELEASE="-O3 -ffast-math -march=native" \
..
cmake --build .
This basically tells Clang to optimize aggressively (-O3), relax IEEE floating-point rules so it can reorder and fuse operations (-ffast-math), and target the native architecture, i.e. Apple Silicon (-march=native).
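As a sanity check that auto-vectorization actually kicked in, Clang’s optimization remarks are handy. These are standard Clang flags, though the exact remark wording varies by version:

clang++ -O3 -ffast-math -march=native \
  -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
  -c src/kernels.cpp

Clang prints a remark for each loop it vectorized (including the chosen width and interleave count) and explains why whenever it gives up.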
The Three SAXPY Variants
- Scalar Variant: Vectorization disabled using
#pragma clang loop vectorize(disable). In this baseline version, each element of the vectors is processed one at a time.
void saxpy_scalar(float* y, const float* x, float a, std::size_t n) {
#pragma clang loop vectorize(disable)
for (std::size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
- Auto-Vectorized Variant: Relies on Clang’s auto-vectorization. Here, we simply write the loop normally and let Clang handle vectorization.
void saxpy_auto(float* y, const float* x, float a, std::size_t n) {
#pragma clang loop vectorize(enable)
for (std::size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
Same code, but the compiler is free to:
- Use NEON/SIMD instructions.
- Unroll loops.
- Use FMA (Fused Multiply-Add) instructions if available.
- Do other optimizations.
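If you want more control without dropping all the way to intrinsics, Clang also accepts explicit vectorization hints. A sketch (saxpy_auto_hinted is a hypothetical name, not in the repo, and the pragmas are hints, not guarantees):

void saxpy_auto_hinted(float* y, const float* x, float a, std::size_t n) {
// ask Clang for 4-wide vectors, interleaved 4x; it may still choose otherwise
#pragma clang loop vectorize_width(4) interleave_count(4)
for (std::size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}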
- Manual NEON Variant: Hand-written NEON intrinsics performing SAXPY. The user-facing API looks like:
void saxpy_manual_unrolled(float *y, const float *x, float a, std::size_t n, int unroll_factor);
which dispatches to different implementations based on the
unroll_factor. For example, the unroll=1 version looks like:
static void saxpy_manual_unroll1(float *y, const float *x, float a, std::size_t n) {
#if VECPLAY_HAS_NEON
std::size_t i = 0;
float32x4_t a_vec = vdupq_n_f32(a);
std::size_t limit = n & ~static_cast<std::size_t>(3); // round down to a multiple of 4 (a bare ~3u would truncate the mask to 32 bits)
for (; i < limit; i += 4) {
float32x4_t x0 = vld1q_f32(&x[i]);
float32x4_t y0 = vld1q_f32(&y[i]);
y0 = vmlaq_f32(y0, x0, a_vec);
vst1q_f32(&y[i], y0);
}
// scalar tail
for (; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
#else
// NEON unavailable: parameters intentionally unused to avoid warnings
(void) y;
(void) x;
(void) a;
(void) n;
#endif
}
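Higher unroll factors follow the same pattern, with more independent vector chains per iteration. The repo’s exact unroll-2 code may differ, but a sketch in the same style looks like:

static void saxpy_manual_unroll2(float *y, const float *x, float a, std::size_t n) {
#if VECPLAY_HAS_NEON
std::size_t i = 0;
float32x4_t a_vec = vdupq_n_f32(a);
std::size_t limit = n & ~static_cast<std::size_t>(7); // multiple of 8
for (; i < limit; i += 8) {
// two independent load/FMA/store chains give the core more ILP
float32x4_t x0 = vld1q_f32(&x[i]);
float32x4_t x1 = vld1q_f32(&x[i + 4]);
float32x4_t y0 = vld1q_f32(&y[i]);
float32x4_t y1 = vld1q_f32(&y[i + 4]);
y0 = vmlaq_f32(y0, x0, a_vec);
y1 = vmlaq_f32(y1, x1, a_vec);
vst1q_f32(&y[i], y0);
vst1q_f32(&y[i + 4], y1);
}
// scalar tail
for (; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
#else
(void) y; (void) x; (void) a; (void) n;
#endif
}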
2. Benchmarking: Scalar vs Auto-Vectorized vs Manual NEON
All benchmarks below are on:
- Apple Silicon Mac (M1 Pro)
- Compiler flags: -O3 -ffast-math -march=native
- Kernel: saxpy
- Vector size (n): 2,000,000
- Iterations: 200
| Variant | Unroll | Time (s) | GFLOP/s |
|---|---|---|---|
| scalar | — | 0.2482 | 3.22 |
| auto | — | 0.0650 | 12.31 |
| manual (NEON) | 1 | 0.0871 | 9.19 |
| manual (NEON) | 2 | 0.0552 | 14.50 |
| manual (NEON) | 4 | 0.0666 | 12.01 |
Observations
- Going from scalar to auto is a massive speedup (3.22 to 12.31 GFLOP/s), essentially “for free” by letting the compiler optimize.
- A naive manual NEON kernel (unroll=1) is actually slower than the auto-vectorized version (9.19 vs 12.31 GFLOP/s).
- However, with loop unrolling (unroll=2), the manual NEON version does beat the auto-vectorized version (14.50 vs 12.31 GFLOP/s).
- Further unrolling (unroll=4) degrades performance again (12.01 GFLOP/s), likely due to instruction cache pressure or diminishing returns.
So, unrolling behaves like a tuning knob:
- Too little unrolling: not enough ILP (Instruction-Level Parallelism) to keep the pipeline busy.
- Too much unrolling: increases code size and register pressure, hurting performance.
- Just right unrolling: maximizes throughput.
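An easy way to find that sweet spot is to sweep the knob. Assuming vec_bench exposes the unroll factor as a CLI flag (I’m guessing --unroll from the “Unroll Factor” field in the output; check ./vec_bench --help for the real name):

for u in 1 2 4; do
./vec_bench --kernel saxpy --variant neon --n 2000000 --iters 200 --unroll $u
done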
3. Assembly Inspection: What’s Going On Under the Hood?
Auto: a 16-element-wide NEON loop
To understand why the auto-vectorized version is so good, let’s look at
the assembly generated by Clang for the saxpy_auto function.
0000000000000370 <__ZN7vecplay10saxpy_autoEPfPKffm>:
// prologue elided: aliasing checks and a separate path for small n
...
3e8: 4f801025 fmla.4s v5, v1, v0[0] // y0 += a * x0
3ec: 4f801046 fmla.4s v6, v2, v0[0] // y1 += a * x1
3f0: 4f801067 fmla.4s v7, v3, v0[0] // y2 += a * x2
3f4: 4f801090 fmla.4s v16, v4, v0[0] // y3 += a * x3
3f8: ad3f1945 stp q5, q6, [x10, #-32]
3fc: ac824147 stp q7, q16, [x10], #64
400: f100416b subs x11, x11, #16 // processed 16 floats
404: 54fffea1 b.ne 0x3d8 <__ZN7vecplay10saxpy_autoEPfPKffm+0x68> // loop back if not done
Key Takeaways from the Assembly
- The auto-vectorized code processes 16 floats per loop iteration using NEON registers (4 NEON vectors of 4 floats each).
- It uses fmla.4s instructions, which are Fused Multiply-Add operations, performing the multiplication and addition in a single instruction. This is a big win for performance.
- It uses ldp and stp instructions to load/store pairs of NEON registers efficiently.
- It has scalar fallbacks for small sizes, but the main loop is highly optimized.
4. Conclusion: Can You Beat Clang’s Auto-Vectorizer?
- Auto-vectorization is a strong baseline.
- Uses NEON and FMA effectively.
- Builds a 16-element-wide vectorized loop that keeps the pipeline busy.
- Handles aliasing and tail cases well.
- Hand-written NEON can beat auto-vectorization, but it’s tricky.
A naive manual NEON kernel (unroll=1) is slower than the auto-vectorized code. However, with careful tuning (unroll=2), it can outperform the compiler-generated code by about 18%. Going too far (unroll=4) hurts performance again.
- Tuning matters.
Loop unrolling is a powerful tool to increase ILP, but it needs to be balanced against code size and register pressure.
Some tips for those wanting to explore further
Exploring the assembly:
To see the assembly generated by Clang, you can use:
cd build
cmake --build .
find . -name "kernels.cpp*" # locate the compiled object file
OBJFILE=./CMakeFiles/vecplay.dir/src/kernels.cpp.o
nm $OBJFILE | grep saxpy_auto # 0000000000000370 T __ZN7vecplay10saxpy_autoEPfPKffm
llvm-objdump -d $OBJFILE | sed -n '/<__ZN7vecplay10saxpy_autoEPfPKffm>/,/^$/p'
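The same recipe works for the other variants: grep for saxpy without a suffix to list every kernel symbol (scalar, auto, and the manual unrolled versions), then disassemble whichever mangled name you want to compare:

nm $OBJFILE | grep saxpy # all SAXPY variants, with their mangled names
llvm-objdump -d $OBJFILE | less # then search for the symbol of interest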