On modern CPUs, the “slow” part of your code often isn’t the math; it’s how you feed the math units. Compilers try to fix this with clever optimizations such as auto-vectorization: transforming scalar loops into SIMD (Single Instruction, Multiple Data) operations that process multiple data points in parallel.

I wanted to know:

If I hand-write NEON intrinsics on Apple Silicon, tune the loop unrolling, and stare at the assembly, can I beat Clang’s auto-vectorizer?

Short answer: not easily.

This post walks through:

  • A tiny vectorization playground for the SAXPY (Single-Precision A·X Plus Y) kernel on Apple Silicon.
  • Benchmark results for scalar vs auto-vectorized vs manual NEON implementations.
  • How changing loop unrolling in the manual NEON version affects performance.
  • A look at the AArch64 assembly generated by Clang for the auto-vectorized version.
  • What this says about hand-written vector code vs compiler auto-vectorization.

Code

All the code in this post lives in:

https://github.com/snarang181/vector-playground

The core binary is vec_bench, a microbenchmark that runs different kernels/variants and prints out performance metrics, such as time taken and Giga Floating Point Operations per Second (GFLOPS).


1. The Playground: vec_bench on Apple Silicon

The benchmark supports:

  • Kernels
    • saxpy: The SAXPY operation defined as Y = A * X + Y, where A is a scalar, and X and Y are vectors.
    • dot: The dot product operation defined as result = sum(X[i] * Y[i]). (A scalar sketch of this kernel appears just after this list.)
  • Variants
    • scalar: A straightforward scalar implementation with vectorization explicitly disabled.
    • auto: A version that relies on Clang’s auto-vectorization capabilities.
    • neon: A hand-written NEON intrinsic implementation, with configurable loop unrolling.
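
For concreteness, here is a minimal scalar sketch of the dot kernel (hypothetical name and signature, not necessarily the repo's exact code); the scalar SAXPY loop is shown in full a bit further down.

#include <cstddef>

float dot_ref(const float* x, const float* y, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];  // accumulate X[i] * Y[i]
    }
    return sum;
}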

The CLI looks like:

./vec_bench \
  --kernel saxpy \
  --variant auto \
  --n 2000000 \
  --iters 200

and outputs performance metrics like:

Benchmark Results:
  Kernel: saxpy
  Variant: auto
  Size: 2000000
  Iterations: 200
  Unroll Factor: 2
  Total Time (s): 0.0592365
  Performance (GFLOPS): 13.5052
  Checksum: -93580.5

A few notes on the flags and metrics:

  • --n: Size of the vectors.
  • --iters: Number of iterations to run the benchmark.
  • GFLOP/s assumes 2 FLOPs per element processed for SAXPY (1 multiplication + 1 addition); see the small check below.
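
As a quick sanity check on the reported number (a sketch, not the repo's reporting code), the GFLOP/s figure can be reproduced from the other fields:

#include <cstdio>

int main() {
    // Values taken from the sample output above.
    const double n       = 2'000'000;   // --n
    const double iters   = 200;         // --iters
    const double seconds = 0.0592365;   // Total Time (s)
    const double flops_per_element = 2; // 1 multiply + 1 add per element (SAXPY)

    const double gflops = (flops_per_element * n * iters) / seconds / 1e9;
    std::printf("GFLOP/s: %.4f\n", gflops);  // prints ~13.5, matching 13.5052
    return 0;
}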

Building on Apple Silicon

To build the benchmark on Apple Silicon, use:

cmake -DCMAKE_BUILD_TYPE=Release \
      -DCMAKE_CXX_FLAGS_RELEASE="-O3 -ffast-math -march=native" \
      ..
cmake --build .

This basically means “Clang, optimize aggressively, and target the native architecture (Apple Silicon).”
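
If you want to confirm that Clang actually vectorized the loop (rather than inferring it from timings alone), Clang's optimization remarks are useful. A minimal invocation, assuming the kernel source lives at src/kernels.cpp (as the object path later in this post suggests) and adding any include paths the project needs:

clang++ -O3 -ffast-math -c src/kernels.cpp -o /dev/null \
    -Rpass=loop-vectorize -Rpass-missed=loop-vectorize

This prints a remark for each loop Clang vectorized (including the chosen vector width and interleave count) and for each loop it could not.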

The Three SAXPY Variants

  1. Scalar Variant: Vectorization disabled using #pragma clang loop vectorize(disable). In this baseline version, each element of the vectors is processed one at a time.
void saxpy_scalar(float* y, const float* x, float a, std::size_t n) {
#pragma clang loop vectorize(disable)
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
  2. Auto-Vectorized Variant: Relies on Clang’s auto-vectorization. Here, we simply write the loop normally and let Clang handle vectorization.
void saxpy_auto(float* y, const float* x, float a, std::size_t n) {
#pragma clang loop vectorize(enable)
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

Same code, but the compiler is free to:

  • Use NEON/SIMD instructions.
  • Unroll loops.
  • Use FMA (Fused Multiply-Add) instructions if available.
  • Do other optimizations.
  3. Manual NEON Variant: Hand-written NEON intrinsics performing SAXPY. The user-facing API looks like:
void saxpy_manual_unrolled(float *y, const float *x, float a, std::size_t n, int unroll_factor);

which dispatches to different implementations based on the unroll_factor. For example, the unroll factor of 1 version looks like:

static void saxpy_manual_unroll1(float *y, const float *x, float a, std::size_t n) {
#if VECPLAY_HAS_NEON
  std::size_t i     = 0;
  float32x4_t a_vec = vdupq_n_f32(a);

  std::size_t limit = n & ~std::size_t{3}; // round n down to a multiple of 4
  for (; i < limit; i += 4) {
    float32x4_t x0 = vld1q_f32(&x[i]);
    float32x4_t y0 = vld1q_f32(&y[i]);
    y0             = vmlaq_f32(y0, x0, a_vec);
    vst1q_f32(&y[i], y0);
  }

  // scalar tail
  for (; i < n; ++i) {
    y[i] = a * x[i] + y[i];
  }
#else
  (void) y;
  (void) x;
  (void) a;
  (void) n;
#endif
}
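
For comparison, here is a hedged sketch of what an unroll-by-2 variant might look like (the repo's actual saxpy_manual_unroll2 may differ): two independent load/FMA/store chains per iteration, processing 8 floats at a time.

static void saxpy_manual_unroll2_sketch(float *y, const float *x, float a, std::size_t n) {
#if VECPLAY_HAS_NEON
  std::size_t i     = 0;
  float32x4_t a_vec = vdupq_n_f32(a);

  std::size_t limit = n & ~std::size_t{7}; // round n down to a multiple of 8
  for (; i < limit; i += 8) {
    // Two independent chains give the core more ILP to overlap loads, FMAs, and stores.
    float32x4_t x0 = vld1q_f32(&x[i]);
    float32x4_t x1 = vld1q_f32(&x[i + 4]);
    float32x4_t y0 = vld1q_f32(&y[i]);
    float32x4_t y1 = vld1q_f32(&y[i + 4]);
    y0 = vmlaq_f32(y0, x0, a_vec);
    y1 = vmlaq_f32(y1, x1, a_vec);
    vst1q_f32(&y[i], y0);
    vst1q_f32(&y[i + 4], y1);
  }

  // scalar tail
  for (; i < n; ++i) {
    y[i] = a * x[i] + y[i];
  }
#else
  (void) y; (void) x; (void) a; (void) n;
#endif
}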

2. Benchmarking: Scalar vs Auto-Vectorized vs Manual NEON

All benchmarks below are on:

  • Apple Silicon Mac (M1 Pro)
  • Compiler flags: -O3 -ffast-math -march=native
  • Kernel: saxpy
  • Vector size (n): 2,000,000
  • Iterations: 200
Variant          Unroll   Time (s)   GFLOP/s
scalar           -        0.2482     3.22
auto             -        0.0650     12.31
manual (NEON)    1        0.0871     9.19
manual (NEON)    2        0.0552     14.50
manual (NEON)    4        0.0666     12.01

Observations

  • Going from scalar to auto is a massive speedup (3.22 to 12.31 GFLOP/s), essentially “for free” by letting the compiler optimize.
  • A naive manual NEON kernel (unroll=1) is actually slower than the auto-vectorized version (9.19 vs 12.31 GFLOP/s).
  • However, with loop unrolling (unroll=2), the manual NEON version does beat the auto-vectorized version (14.50 vs 12.31 GFLOP/s).
  • Further unrolling (unroll=4) degrades performance again (12.01 GFLOP/s), likely due to instruction cache pressure or diminishing returns.

So, unrolling behaves like a tuning knob:

  • Too little unrolling: not enough ILP (Instruction-Level Parallelism) to keep the pipeline busy.
  • Too much unrolling: increases code size and register pressure, hurting performance.
  • Just right unrolling: maximizes throughput.

3. Assembly Inspection: What’s Going On Under the Hood?

Auto: a 16-element-wide NEON loop

To understand why the auto-vectorized version is so good, let’s look at the assembly generated by Clang for the saxpy_auto function.

0000000000000370 <__ZN7vecplay10saxpy_autoEPfPKffm>:
// prologue elided: runtime alias checks and handling for small n
...

3e8: 4f801025      fmla.4s v5, v1, v0[0] // y0 += a * x0
3ec: 4f801046      fmla.4s v6, v2, v0[0] // y1 += a * x1
3f0: 4f801067      fmla.4s v7, v3, v0[0] // y2 += a * x2
3f4: 4f801090      fmla.4s v16, v4, v0[0] // y3 += a * x3

3f8: ad3f1945      stp     q5, q6, [x10, #-32]
3fc: ac824147      stp     q7, q16, [x10], #64
400: f100416b      subs    x11, x11, #16 // processed 16 floats
404: 54fffea1      b.ne    0x3d8 <__ZN7vecplay10saxpy_autoEPfPKffm+0x68> // loop back if not done

Key Takeaways from the Assembly

  • The auto-vectorized code processes 16 floats per loop iteration using NEON registers (four NEON vectors of 4 floats each).
  • It uses fmla.4s, a Fused Multiply-Add instruction, so the multiplication and addition happen in a single instruction. This is a big win for performance.
  • It uses ldp and stp instructions to load/store pairs of NEON registers efficiently.
  • It keeps scalar fallbacks for small sizes, but the main loop is highly optimized.
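
To connect the assembly back to source, the inner loop above corresponds roughly to the following intrinsics (a sketch of what the compiler generated, not code from the repo). vfmaq_n_f32 is the intrinsic form of the fmla-by-scalar instruction; the compiler additionally pairs the loads and stores via ldp/stp.

#include <arm_neon.h>
#include <cstddef>

void saxpy_like_clangs_loop(float* y, const float* x, float a, std::size_t n) {
  std::size_t i = 0;
  for (; i + 16 <= n; i += 16) {
    // Four independent fmla.4s chains, each covering 4 floats (16 per iteration).
    float32x4_t y0 = vfmaq_n_f32(vld1q_f32(&y[i]),      vld1q_f32(&x[i]),      a);
    float32x4_t y1 = vfmaq_n_f32(vld1q_f32(&y[i + 4]),  vld1q_f32(&x[i + 4]),  a);
    float32x4_t y2 = vfmaq_n_f32(vld1q_f32(&y[i + 8]),  vld1q_f32(&x[i + 8]),  a);
    float32x4_t y3 = vfmaq_n_f32(vld1q_f32(&y[i + 12]), vld1q_f32(&x[i + 12]), a);
    vst1q_f32(&y[i],      y0);
    vst1q_f32(&y[i + 4],  y1);
    vst1q_f32(&y[i + 8],  y2);
    vst1q_f32(&y[i + 12], y3);
  }
  for (; i < n; ++i) {  // scalar tail, as in the other variants
    y[i] = a * x[i] + y[i];
  }
}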

4. Conclusion: Can You Beat Clang’s Auto-Vectorizer?

  1. Auto-vectorization is a strong baseline.
  • Uses NEON and FMA effectively.
  • Builds a 16-element-wide vectorized loop that keeps the pipeline busy.
  • Handles aliasing and tail cases well.
  2. Hand-written NEON can beat auto-vectorization, but it’s tricky.

A naive manual NEON kernel (unroll=1) is slower than the auto-vectorized code. However, with careful tuning (unroll=2), it can outperform the compiler-generated code by about 18%. But going too far (unroll=4) hurts performance again.

  3. Tuning matters.

Loop unrolling is a powerful tool to increase ILP, but it needs to be balanced against code size and register pressure.


Some tips for those wanting to explore further

Exploring the assembly:

To see the assembly generated by Clang, you can use:

cd build
cmake --build .
find . -name "kernels.cpp*"  # locate the compiled object file
OBJFILE=./CMakeFiles/vecplay.dir/src/kernels.cpp.o
nm $OBJFILE | grep saxpy_auto # 0000000000000370 T __ZN7vecplay10saxpy_autoEPfPKffm
llvm-objdump -d $OBJFILE | sed -n '/<__ZN7vecplay10saxpy_autoEPfPKffm>/,/^$/p'