I tried to hand-write NEON intrinsics on Apple Silicon, tune loop unrolling, and beat Clang’s auto-vectorizer. Spoiler: it’s harder than it looks.
On modern CPUs, the “slow” part of your code often isn’t the math; it’s how you feed the math units. Compilers try to fix this with clever optimizations such as auto-vectorization: transforming scalar loops into SIMD (Single Instruction, Multiple Data) operations that process multiple data points in parallel.
I wanted to know:
If I hand-write NEON intrinsics on Apple Silicon, tune the loop unrolling, and stare at the assembly, can I beat Clang’s auto-vectorizer?
Short answer: not easily.
This post walks through:
- A tiny vectorization playground for the SAXPY (Single-Precision A·X Plus Y) kernel on Apple Silicon.
- Benchmark results for scalar vs auto-vectorized vs manual NEON implementations.
- How changing loop unrolling in the manual NEON version affects performance.
- A look at the AArch64 assembly generated by Clang for the auto-vectorized version.
- What this says about hand-written vector code vs compiler auto-vectorization.
Code
All the code in this post lives in:
https://github.com/snarang181/vector-playground
The core binary is vec_bench, a microbenchmark that runs different
kernels/variants and prints out performance metrics, such as time taken
and Giga Floating Point Operations per Second (GFLOPS).
1. The Playground: vec_bench on Apple Silicon
The benchmark supports:
- Kernels
  - saxpy: The SAXPY operation, defined as Y = A * X + Y, where A is a scalar and X and Y are vectors.
  - dot: The dot product, defined as result = sum(X[i] * Y[i]).
- Variants
  - scalar: A straightforward scalar implementation with vectorization explicitly disabled.
  - auto: A version that relies on Clang’s auto-vectorization capabilities.
  - neon: A hand-written NEON intrinsics implementation, with configurable loop unrolling.
The CLI looks like:
./vec_bench \
--kernel saxpy \
--variant auto \
--n 2000000 \
--iters 200
and outputs performance metrics like:
Benchmark Results:
Kernel: saxpy
Variant: auto
Size: 2000000
Iterations: 200
Unroll Factor: 2
Total Time (s): 0.0592365
Performance (GFLOPS): 13.5052
Checksum: -93580.5
- --n: Size of the vectors.
- --iters: Number of iterations to run the benchmark.
- GFLOP/s assumes 2 FLOPs per element processed for SAXPY (1 multiplication + 1 addition).
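For reference, the GFLOPS figure falls straight out of that FLOP count. A minimal sketch of the arithmetic (not the repo’s exact code):

double gflops(std::size_t n, int iters, double total_seconds) {
// SAXPY: 2 FLOPs per element (1 mul + 1 add), over n elements and iters iterations
double flops = 2.0 * static_cast<double>(n) * iters;
return flops / (total_seconds * 1e9);
}

Plugging in the run above: 2 × 2,000,000 × 200 / (0.0592365 × 10⁹) ≈ 13.5 GFLOP/s, matching the reported number.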
Building on Apple Silicon
To build the benchmark on Apple Silicon, use:
cmake -DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CXX_FLAGS_RELEASE="-O3 -ffast-math -march=native" \
..
cmake --build .
This basically tells Clang to optimize aggressively (-O3), relax IEEE floating-point rules so it can reorder and fuse operations (-ffast-math), and target the native architecture, i.e. Apple Silicon (-march=native).
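As a sanity check that auto-vectorization actually kicked in, Clang’s optimization remarks are handy. These are standard Clang flags, though the exact remark wording varies by version:

clang++ -O3 -ffast-math -march=native \
  -Rpass=loop-vectorize -Rpass-missed=loop-vectorize \
  -c src/kernels.cpp

Clang prints a remark for each loop it vectorized (including the chosen width and interleave count) and explains why whenever it gives up.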
The Three SAXPY Variants
- Scalar Variant: Vectorization disabled using
#pragma clang loop vectorize(disable). In this baseline version, each element of the vectors is processed one at a time.
void saxpy_scalar(float* y, const float* x, float a, std::size_t n) {
#pragma clang loop vectorize(disable)
for (std::size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
- Auto-Vectorized Variant: Relies on Clang’s auto-vectorization. Here, we simply write the loop normally and let Clang handle vectorization.
void saxpy_auto(float* y, const float* x, float a, std::size_t n) {
#pragma clang loop vectorize(enable)
for (std::size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}
Same code, but the compiler is free to:
- Use NEON/SIMD instructions.
- Unroll loops.
- Use FMA (Fused Multiply-Add) instructions if available.
- Do other optimizations.
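If you want more control without dropping all the way to intrinsics, Clang also accepts explicit vectorization hints. A sketch (saxpy_auto_hinted is a hypothetical name, not in the repo, and the pragmas are hints, not guarantees):

void saxpy_auto_hinted(float* y, const float* x, float a, std::size_t n) {
// ask Clang for 4-wide vectors, interleaved 4x; it may still choose otherwise
#pragma clang loop vectorize_width(4) interleave_count(4)
for (std::size_t i = 0; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
}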
- Manual NEON Variant: Hand-written NEON intrinsics performing SAXPY. The user-facing API looks like:
void saxpy_manual_unrolled(float *y, const float *x, float a, std::size_t n, int unroll_factor);
which dispatches to different implementations based on the
unroll_factor. For example, the unroll=1 version looks like:
static void saxpy_manual_unroll1(float *y, const float *x, float a, std::size_t n) {
#if VECPLAY_HAS_NEON
std::size_t i = 0;
float32x4_t a_vec = vdupq_n_f32(a);
std::size_t limit = n & ~static_cast<std::size_t>(3); // round down to a multiple of 4 (a bare ~3u would truncate the mask to 32 bits)
for (; i < limit; i += 4) {
float32x4_t x0 = vld1q_f32(&x[i]);
float32x4_t y0 = vld1q_f32(&y[i]);
y0 = vmlaq_f32(y0, x0, a_vec);
vst1q_f32(&y[i], y0);
}
// scalar tail
for (; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
#else
// NEON unavailable: parameters intentionally unused to avoid warnings
(void) y;
(void) x;
(void) a;
(void) n;
#endif
}
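Higher unroll factors follow the same pattern, with more independent vector chains per iteration. The repo’s exact unroll-2 code may differ, but a sketch in the same style looks like:

static void saxpy_manual_unroll2(float *y, const float *x, float a, std::size_t n) {
#if VECPLAY_HAS_NEON
std::size_t i = 0;
float32x4_t a_vec = vdupq_n_f32(a);
std::size_t limit = n & ~static_cast<std::size_t>(7); // multiple of 8
for (; i < limit; i += 8) {
// two independent load/FMA/store chains give the core more ILP
float32x4_t x0 = vld1q_f32(&x[i]);
float32x4_t x1 = vld1q_f32(&x[i + 4]);
float32x4_t y0 = vld1q_f32(&y[i]);
float32x4_t y1 = vld1q_f32(&y[i + 4]);
y0 = vmlaq_f32(y0, x0, a_vec);
y1 = vmlaq_f32(y1, x1, a_vec);
vst1q_f32(&y[i], y0);
vst1q_f32(&y[i + 4], y1);
}
// scalar tail
for (; i < n; ++i) {
y[i] = a * x[i] + y[i];
}
#else
(void) y; (void) x; (void) a; (void) n;
#endif
}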
2. Benchmarking: Scalar vs Auto-Vectorized vs Manual NEON
All benchmarks below are on:
- Apple Silicon Mac (M1 Pro)
- Compiler flags: -O3 -ffast-math -march=native
- Kernel: saxpy
- Vector size (n): 2,000,000
- Iterations: 200
| Variant | Unroll | Time (s) | GFLOP/s |
|---|---|---|---|
| scalar | — | 0.2482 | 3.22 |
| auto | — | 0.0650 | 12.31 |
| manual (NEON) | 1 | 0.0871 | 9.19 |
| manual (NEON) | 2 | 0.0552 | 14.50 |
| manual (NEON) | 4 | 0.0666 | 12.01 |
Observations
- Going from scalar to auto is a massive speedup (3.22 to 12.31 GFLOP/s), essentially “for free” by letting the compiler optimize.
- A naive manual NEON kernel (unroll=1) is actually slower than the auto-vectorized version (9.19 vs 12.31 GFLOP/s).
- However, with loop unrolling (unroll=2), the manual NEON version does beat the auto-vectorized version (14.50 vs 12.31 GFLOP/s).
- Further unrolling (unroll=4) degrades performance again (12.01 GFLOP/s), likely due to instruction cache pressure or diminishing returns.
So, unrolling behaves like a tuning knob:
- Too little unrolling: not enough ILP (Instruction-Level Parallelism) to keep the pipeline busy.
- Too much unrolling: increases code size and register pressure, hurting performance.
- Just right unrolling: maximizes throughput.
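An easy way to find that sweet spot is to sweep the knob. Assuming vec_bench exposes the unroll factor as a CLI flag (I’m guessing --unroll from the “Unroll Factor” field in the output; check ./vec_bench --help for the real name):

for u in 1 2 4; do
./vec_bench --kernel saxpy --variant neon --n 2000000 --iters 200 --unroll $u
done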
3. Assembly Inspection: What’s Going On Under the Hood?
Auto: a 16-element-wide NEON loop
To understand why the auto-vectorized version is so good, let’s look at
the assembly generated by Clang for the saxpy_auto function.
0000000000000370 <__ZN7vecplay10saxpy_autoEPfPKffm>:
// prologue elided: aliasing checks and a separate path for small n
...
3e8: 4f801025 fmla.4s v5, v1, v0[0] // y0 += a * x0
3ec: 4f801046 fmla.4s v6, v2, v0[0] // y1 += a * x1
3f0: 4f801067 fmla.4s v7, v3, v0[0] // y2 += a * x2
3f4: 4f801090 fmla.4s v16, v4, v0[0] // y3 += a * x3
3f8: ad3f1945 stp q5, q6, [x10, #-32]
3fc: ac824147 stp q7, q16, [x10], #64
400: f100416b subs x11, x11, #16 // processed 16 floats
404: 54fffea1 b.ne 0x3d8 <__ZN7vecplay10saxpy_autoEPfPKffm+0x68> // loop back if not done
Key Takeaways from the Assembly
- The auto-vectorized code processes 16 floats per loop iteration using NEON registers (4 NEON vectors of 4 floats each).
- It uses fmla.4s instructions, which are Fused Multiply-Add operations, performing the multiplication and addition in a single instruction. This is a big win for performance.
- It uses ldp and stp instructions to load/store pairs of NEON registers efficiently.
- It has scalar fallbacks for small sizes, but the main loop is highly optimized.
4. Conclusion: Can You Beat Clang’s Auto-Vectorizer?
- Auto-vectorization is a strong baseline.
- Uses NEON and FMA effectively.
- Builds a 16-element-wide vectorized loop that keeps the pipeline busy.
- Handles aliasing and tail cases well.
- Hand-written NEON can beat auto-vectorization, but it’s tricky.
A naive manual NEON kernel (unroll=1) is slower than the auto-vectorized code. However, with careful tuning (unroll=2), it can outperform the compiler-generated code by about 18%. Going too far (unroll=4) hurts performance again.
- Tuning matters.
Loop unrolling is a powerful tool to increase ILP, but it needs to be balanced against code size and register pressure.
Some tips for those wanting to explore further
Exploring the assembly:
To see the assembly generated by Clang, you can use:
cd build
cmake --build .
find . -name "kernels.cpp*" # locate the compiled object file
OBJFILE=./CMakeFiles/vecplay.dir/src/kernels.cpp.o
nm $OBJFILE | grep saxpy_auto # 0000000000000370 T __ZN7vecplay10saxpy_autoEPfPKffm
llvm-objdump -d $OBJFILE | sed -n '/<__ZN7vecplay10saxpy_autoEPfPKffm>/,/^$/p'
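The same recipe works for the other variants: grep for saxpy without a suffix to list every kernel symbol (scalar, auto, and the manual unrolled versions), then disassemble whichever mangled name you want to compare:

nm $OBJFILE | grep saxpy # all SAXPY variants, with their mangled names
llvm-objdump -d $OBJFILE | less # then search for the symbol of interest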