On modern CPUs, the “slow” part of your code often isn’t the math, it’s how you feed the math units. Compilers try to fix this with clever optimizations such as auto-vectorization: transforming scalar loops into SIMD (Single Instruction, Multiple Data) operations that process multiple data points in parallel.
I wanted to know:
If I hand-write NEON intrinsics on Apple Silicon, tune the loop unrolling, and stare at the assembly, can I beat Clang’s auto-vectorizer?
Short answer: not easily.
This post walks through:
- A tiny vectorization playground for the SAXPY (Single-Precision A·X Plus Y) kernel on Apple Silicon.
- Benchmark results for scalar vs auto-vectorized vs manual NEON implementations.
- How changing loop unrolling in the manual NEON version affects performance.
- A look at the AArch64 assembly generated by Clang for the auto-vectorized version.
- What this says about hand-written vector code vs compiler auto-vectorization.
Code
All the code in this post lives in:
https://github.com/snarang181/vector-playground
The core binary is vec_bench, a microbenchmark that runs different kernels/variants and prints out performance metrics, such as time taken and Giga Floating Point Operations per Second (GFLOPS).
1. The Playground: vec_bench on Apple Silicon
The benchmark supports:
- Kernels
  - saxpy: The SAXPY operation, defined as Y = A * X + Y, where A is a scalar and X and Y are vectors.
  - dot: The dot product operation, defined as result = sum(X[i] * Y[i]).
- Variants
  - scalar: A straightforward scalar implementation with vectorization explicitly disabled.
  - auto: A version that relies on Clang’s auto-vectorization capabilities.
  - neon: A hand-written NEON intrinsic implementation, with configurable loop unrolling.
The CLI looks like:
./vec_bench \
--kernel saxpy \
--variant auto \
--n 2000000 \
--iters 200
and outputs performance metrics like:
Benchmark Results:
Kernel: saxpy
Variant: auto
Size: 2000000
Iterations: 200
Unroll Factor: 2
Total Time (s): 0.0592365
Performance (GFLOPS): 13.5052
Checksum: -93580.5
- --n: Size of the vectors.
- --iters: Number of iterations to run the benchmark.
- GFLOP/s assumes 2 FLOPs per element processed for SAXPY (1 multiplication + 1 addition).
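Concretely, that works out to GFLOP/s = (2 * n * iters) / time / 1e9. Here is a minimal sketch of the calculation (the actual reporting code in the repo may be organized differently):
#include <cstddef>
#include <cstdio>

// 2 FLOPs per element (one multiply, one add), repeated `iters` times.
double saxpy_gflops(std::size_t n, int iters, double total_seconds) {
    double flops = 2.0 * static_cast<double>(n) * iters;
    return flops / total_seconds / 1e9;
}

int main() {
    // Numbers from the `auto` run above: n = 2,000,000, 200 iterations, ~0.0592 s.
    std::printf("%.4f GFLOP/s\n", saxpy_gflops(2000000, 200, 0.0592365)); // prints ~13.5052
}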
Building on Apple Silicon
To build the benchmark on Apple Silicon, use:
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS_RELEASE="-O3 -ffast-math -march=native" \
..
cmake --build .
This basically means: -O3 optimizes aggressively, -ffast-math relaxes strict floating-point semantics (which gives the vectorizer more freedom to reorder operations), and -march=native targets the host CPU, i.e. Apple Silicon.
The Three SAXPY Variants
- Scalar Variant: Vectorization is explicitly disabled using #pragma clang loop vectorize(disable). In this baseline version, each element of the vectors is processed one at a time.
void saxpy_scalar(float* y, const float* x, float a, std::size_t n) {
#pragma clang loop vectorize(disable)
for (std::size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
- Auto-Vectorized Variant: Relies on Clang’s auto-vectorization. Here, we simply write the loop normally and let Clang handle vectorization.
void saxpy_auto(float* y, const float* x, float a, std::size_t n) {
#pragma clang loop vectorize(enable)
for (std::size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
Same code, but the compiler is free to:
- Use NEON/SIMD instructions.
- Unroll loops.
- Use FMA (Fused Multiply-Add) instructions if available.
- Do other optimizations.
- Manual NEON Variant: Hand-written NEON intrinsics performing SAXPY. The user-facing API looks like:
void saxpy_manual_unrolled(float *y, const float *x, float a, std::size_t n, int unroll_factor);
which dispatches to different implementations based on unroll_factor. For example, the unroll-factor-1 version looks like:
static void saxpy_manual_unroll1(float *y, const float *x, float a, std::size_t n) {
#if VECPLAY_HAS_NEON
std::size_t i = 0;
float32x4_t a_vec = vdupq_n_f32(a);
std::size_t limit = n & ~static_cast<std::size_t>(3); // round n down to a multiple of 4
for (; i < limit; i += 4) {
float32x4_t x0 = vld1q_f32(&x[i]);
float32x4_t y0 = vld1q_f32(&y[i]);
y0 = vmlaq_f32(y0, x0, a_vec);
vst1q_f32(&y[i], y0);
}
// scalar tail
for (; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
#else
(void) y;
(void) x;
(void) a;
(void) n;
#endif
}
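For comparison, here is a minimal sketch of what an unroll-by-2 body could look like, using two independent NEON accumulators per iteration (the actual unroll-2 implementation in the repo may differ in details such as tail handling):
#include <arm_neon.h>
#include <cstddef>

// Sketch of an unroll-by-2 SAXPY body: 8 floats per iteration across two
// independent load/FMA/store chains, giving the core more in-flight work.
static void saxpy_manual_unroll2(float *y, const float *x, float a, std::size_t n) {
    std::size_t i = 0;
    float32x4_t a_vec = vdupq_n_f32(a);
    std::size_t limit = n & ~static_cast<std::size_t>(7); // round down to a multiple of 8
    for (; i < limit; i += 8) {
        float32x4_t x0 = vld1q_f32(&x[i]);
        float32x4_t x1 = vld1q_f32(&x[i + 4]);
        float32x4_t y0 = vld1q_f32(&y[i]);
        float32x4_t y1 = vld1q_f32(&y[i + 4]);
        y0 = vmlaq_f32(y0, x0, a_vec);
        y1 = vmlaq_f32(y1, x1, a_vec);
        vst1q_f32(&y[i], y0);
        vst1q_f32(&y[i + 4], y1);
    }
    for (; i < n; ++i) { // scalar tail
        y[i] = a * x[i] + y[i];
    }
}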
2. Benchmarking: Scalar vs Auto-Vectorized vs Manual NEON
All benchmarks below are on:
- Apple Silicon Mac (M1 Pro)
- Compiler flags: -O3 -ffast-math -march=native
- Kernel: saxpy
- Vector size (n): 2,000,000
- Iterations: 200
| Variant | Unroll | Time (s) | GFLOP/s |
|---|---|---|---|
| scalar | — | 0.2482 | 3.22 |
| auto | — | 0.0650 | 12.31 |
| manual (NEON) | 1 | 0.0871 | 9.19 |
| manual (NEON) | 2 | 0.0552 | 14.50 |
| manual (NEON) | 4 | 0.0666 | 12.01 |
Observations
- Going from scalar to auto is a massive speedup (3.22 to 12.31 GFLOP/s), essentially “for free” by letting the compiler optimize.
- A naive manual NEON kernel (unroll=1) is actually slower than the auto-vectorized version (9.19 vs 12.31 GFLOP/s).
- However, with loop unrolling (unroll=2), the manual NEON version does beat the auto-vectorized version (14.50 vs 12.31 GFLOP/s).
- Further unrolling (unroll=4) degrades performance again (12.01 GFLOP/s), likely due to instruction cache pressure or diminishing returns.
So, unrolling behaves like a tuning knob:
- Too little unrolling: not enough ILP (Instruction-Level Parallelism) to keep the pipeline busy.
- Too much unrolling: increases code size and register pressure, hurting performance.
- Just right unrolling: maximizes throughput.
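One way to find the sweet spot is simply to sweep the unroll factor and time each configuration. A rough sketch of such a sweep, assuming the saxpy_manual_unrolled API shown earlier (the repo’s vec_bench driver does this more carefully, including the checksum it reports):
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Declared earlier in the post; implemented in the repo's kernel sources.
void saxpy_manual_unrolled(float *y, const float *x, float a, std::size_t n, int unroll_factor);

int main() {
    const std::size_t n = 2000000;
    const int iters = 200;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    const int unroll_factors[] = {1, 2, 4};
    for (int unroll : unroll_factors) {
        auto t0 = std::chrono::steady_clock::now();
        for (int it = 0; it < iters; ++it) {
            saxpy_manual_unrolled(y.data(), x.data(), 2.5f, n, unroll);
        }
        auto t1 = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(t1 - t0).count();
        double gflops = 2.0 * n * iters / secs / 1e9;
        std::printf("unroll=%d  time=%.4f s  %.2f GFLOP/s\n", unroll, secs, gflops);
    }
}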
3. Assembly Inspection: What’s Going On Under the Hood?
Auto: a 16-element-wide NEON loop
To understand why the auto-vectorized version is so good, let’s look at the assembly generated by Clang for the saxpy_auto function.
0000000000000370 <__ZN7vecplay10saxpy_autoEPfPKffm>:
// prologue: pointer-aliasing checks and a separate path for small n (omitted)
...
3e8: 4f801025 fmla.4s v5, v1, v0[0] // y0 += a * x0
3ec: 4f801046 fmla.4s v6, v2, v0[0] // y1 += a * x1
3f0: 4f801067 fmla.4s v7, v3, v0[0] // y2 += a * x2
3f4: 4f801090 fmla.4s v16, v4, v0[0] // y3 += a * x3
3f8: ad3f1945 stp q5, q6, [x10, #-32]
3fc: ac824147 stp q7, q16, [x10], #64
400: f100416b subs x11, x11, #16 // processed 16 floats
404: 54fffea1 b.ne 0x3d8 <__ZN7vecplay10saxpy_autoEPfPKffm+0x68> // loop back if not done
Key Takeaways from the Assembly
- The auto-vectorized code processes 16 floats per loop iteration using NEON registers (4 NEON vectors of 4 floats each).
- It uses fmla.4s instructions, which are Fused Multiply-Add operations, allowing it to perform multiplication and addition in a single instruction. This is a big win for performance (more on FMA in the note after this list).
- It uses ldp and stp instructions to load/store pairs of NEON registers efficiently.
- It has scalar fallbacks for small sizes, but the main loop is highly optimized.
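One detail worth checking in hand-written intrinsics: vmlaq_f32 is a plain multiply-accumulate that the compiler may lower to a separate fmul + fadd, while vfmaq_f32 explicitly requests the fused form (the fmla instruction seen above). A tiny illustration, as a general NEON note rather than something I measured separately here:
#include <arm_neon.h>

// vmlaq_f32(y, x, a): multiply-accumulate; may compile to fmul + fadd.
// vfmaq_f32(y, x, a): fused multiply-add; maps to fmla, with a single rounding step.
float32x4_t saxpy_step_fused(float32x4_t y, float32x4_t x, float32x4_t a_vec) {
    return vfmaq_f32(y, x, a_vec); // y + x * a_vec
}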
4. Conclusion: Can You Beat Clang’s Auto-Vectorizer?
- Auto-vectorization is a strong baseline.
- Uses NEON and FMA effectively.
- Builds a 16-element-wide vectorized loop that keeps the pipeline busy.
- Handles aliasing and tail cases well.
- Hand-written NEON can beat auto-vectorization, but it’s tricky.
A naive manual NEON kernel (unroll=1) is slower than auto-vectorized code. However, with careful tuning (unroll=2), it can outperform the compiler-generated code by about 18%. But going too far (unroll=4) hurts performance again.
- Tuning matters.
Loop unrolling is a powerful tool to increase ILP, but it needs to be balanced against code size and register pressure.
Some tips for those wanting to explore further
Exploring the assembly:
To see the assembly generated by Clang, you can use:
cd build
cmake --build .
find . -name "kernels.cpp*" # locate the compiled object file for kernels.cpp
OBJFILE=./CMakeFiles/vecplay.dir/src/kernels.cpp.o
nm $OBJFILE | grep saxpy_auto # 0000000000000370 T __ZN7vecplay10saxpy_autoEPfPKffm
llvm-objdump -d $OBJFILE | sed -n '/<__ZN7vecplay10saxpy_autoEPfPKffm>/,/^$/p'