Can I Beat Clang’s Auto-Vectorizer on Apple Silicon? A SAXPY Case Study

On modern CPUs, the “slow” part of your code often isn’t the math, it’s how you feed the math units. Compilers try to fix this with clever optimizations such as auto-vectorization: transforming scalar loops into SIMD (Single Instruction, Multiple Data) operations that process multiple data points in parallel. I wanted to know: if I hand-write NEON intrinsics on Apple Silicon, tune the loop unrolling, and stare at the assembly, can I beat Clang’s auto-vectorizer? Short answer: not easily. This post walks through: A tiny vectorization playground for the SAXPY (Single-Precision A·X Plus Y) kernel on Apple Silicon. Benchmark results for scalar vs. auto-vectorized vs. manual NEON implementations. How changing the loop unrolling in the manual NEON version affects performance. A look at the AArch64 assembly Clang generates for the auto-vectorized version. What this says about hand-written vector code vs. compiler auto-vectorization. ...
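For reference, SAXPY computes y[i] = a·x[i] + y[i] over two float arrays. Below is a minimal sketch of the scalar kernel next to a hand-written NEON version; the function names and the 4-lane vector step are illustrative, not the exact code benchmarked in the post:

```cpp
#include <arm_neon.h>
#include <cstddef>

// Scalar baseline: one multiply-add per iteration.
void saxpy_scalar(float a, const float *x, float *y, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    y[i] = a * x[i] + y[i];
}

// Manual NEON version: 4 floats per 128-bit vector, fused multiply-add.
void saxpy_neon(float a, const float *x, float *y, std::size_t n) {
  float32x4_t va = vdupq_n_f32(a);           // broadcast a into all 4 lanes
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    float32x4_t vx = vld1q_f32(x + i);       // load 4 elements of x
    float32x4_t vy = vld1q_f32(y + i);       // load 4 elements of y
    vst1q_f32(y + i, vfmaq_f32(vy, va, vx)); // y = y + a * x, one FMA per vector
  }
  for (; i < n; ++i)                         // scalar tail for leftover elements
    y[i] = a * x[i] + y[i];
}
```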

November 22, 2025 · 6 min · Samarth Narang

Quantization in MLIR: Types, Scales, and Where to Put the q

Quantization is one of those things where everyone agrees “it’s important”, but the details are often fuzzy; I mean, no one really wants to think about it in their compiler’s IR, right? In most stacks, quantization lives in an (awkward?) place: High-level frameworks (like PyTorch) expose “quantized layers” or post-training quantization APIs. Backends (like TensorRT or ONNX Runtime) and kernels really care about bit-widths, scales, zero points, and data layouts. The glue in between is often a mishmash of ad-hoc passes, custom operators, and brittle assumptions. MLIR is actually a sweet spot for making quantization less cursed: ...
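As a refresher on the scales and zero points mentioned above, here is a minimal sketch of the usual affine quantization scheme, q = clamp(round(x / scale) + zero_point), written as plain C++; the helper names are mine for illustration, not an MLIR API:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine (asymmetric) int8 quantization: real ≈ scale * (q - zero_point).
std::int8_t quantize(float x, float scale, std::int32_t zero_point) {
  std::int32_t q =
      static_cast<std::int32_t>(std::lround(x / scale)) + zero_point;
  return static_cast<std::int8_t>(std::clamp(q, -128, 127)); // saturate to int8
}

float dequantize(std::int8_t q, float scale, std::int32_t zero_point) {
  return scale * static_cast<float>(static_cast<std::int32_t>(q) - zero_point);
}
```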

November 18, 2025 · 7 min · Samarth Narang

How GPUs Talk: A Practical Guide to Multi-GPU Training and Communication

Data parallel, FSDP, tensor parallel, pipeline parallel, and the collectives that hold them together.

November 15, 2025 · 7 min · Samarth Narang

Data Structure and Iterator Kung Fu in LLVM

Practical patterns, zero-copy views, and safe mutation loops for faster LLVM passes.

November 9, 2025 · 8 min · Samarth Narang

How Rewriting Works in MLIR

If you only remember one thing from this post: rewriting in MLIR is “find a pattern, make a change, repeat until no more changes can be made”, with two key components: Greedy pattern application (canonicalization and local cleanups), and Dialect conversion (legalize/convert regions with invariants about the legal forms of ops). TL;DR Patterns live in a RewritePatternSet and are driven by either applyPatternsGreedily (for local greedy rewrites) or applyPartial/FullConversion (for dialect conversion with legality constraints). Write patterns by subclassing OpRewritePattern<YourOp> and overriding matchAndRewrite with your logic. Rewrite safely using PatternRewriter methods to create, replace, and erase ops. Canonicalization: MLIR has a single canonicalization pass that applies all registered patterns greedily until no more matches are found. Conversion: MLIR’s conversion framework allows you to define legality constraints and convert ops from one dialect to another while preserving invariants. We do this with ConversionTarget and TypeConverter. Folding: complements rewriting by simplifying constant expressions during pattern application. Part 1: The moving pieces RewritePatternSet and PatternRewriter RewritePatternSet is a container for your rewrite patterns. You populate it with instances of your custom patterns. MLIR runs these patterns for you; you don’t directly loop over operations. In your pattern’s matchAndRewrite, you inspect the matched op, optionally create the new IR (using the rewriter’s insertion point), and replace or erase the matched op. Greedy vs. Conversion Greedy (Canonicalization and Local Rewrites) Think “peephole + algebraic simplification”. Use applyPatternsGreedily to apply all patterns in a RewritePatternSet. applyPatternsGreedily(funcOp, std::move(patterns)); Conversion (Dialect Conversion) Define legality constraints for ops via ConversionTarget. Use TypeConverter to handle type conversions. Use applyPartialConversion or applyFullConversion to convert ops while respecting legality. Part 2: Your first greedy rewrite pattern. Let’s fold away arith.addi %x, 0 : i32 into just %x. Yeah, it’s trivial, and MLIR’s canonicalization already does this, but it’s a great starting point. ...
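To make that first greedy rewrite pattern concrete, here is a minimal C++ sketch of an OpRewritePattern that rewrites arith.addi %x, 0 into %x and runs it greedily. It assumes a recent MLIR where the driver is named applyPatternsGreedily (as used above); names like FoldAddZero and funcOp are illustrative, not necessarily the code from the post:

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/IR/Matchers.h"
#include "mlir/IR/PatternMatch.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

// Rewrites `arith.addi %x, %zero : i32` into `%x` when %zero is a constant 0.
struct FoldAddZero : OpRewritePattern<arith::AddIOp> {
  using OpRewritePattern<arith::AddIOp>::OpRewritePattern;

  LogicalResult matchAndRewrite(arith::AddIOp op,
                                PatternRewriter &rewriter) const override {
    // Match: the right-hand operand must be a constant zero.
    if (!matchPattern(op.getRhs(), m_Zero()))
      return failure();
    // Rewrite: forward all uses of the add to its left-hand operand.
    rewriter.replaceOp(op, op.getLhs());
    return success();
  }
};

// Register the pattern and drive it greedily over a function.
void runFoldAddZero(func::FuncOp funcOp) {
  RewritePatternSet patterns(funcOp.getContext());
  patterns.add<FoldAddZero>(funcOp.getContext());
  (void)applyPatternsGreedily(funcOp, std::move(patterns));
}
```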

November 8, 2025 · 5 min · Samarth Narang

Demystifying GPU Terminology: A Compiler Engineer's Field Guide

GPUs aren’t mysterious - just picky. Most performance cliffs are not about the math; they’re about how warps step, how memory is fetched, and how often registers spill. This post decodes the jargon; to be candid, it is me “spilling” my notes, trying to explain things to myself. TL;DR Think in warps, not threads. Coalesce or pay. Tile for reuse in shared memory, but watch the register pressure. Matrix units (Tensor Cores, FMAs, etc.) love the right data types and tile sizes. Occupancy is a balancing act: a tool, not a goal. Just enough to hide latency. 1. Execution Model, Decoded SIMT vs SIMD (why is it confusing?) SIMD (CPU): Single Instruction, Multiple Data. One instruction operates on a fixed-width vector (e.g., a single AVX-512 instruction processes 16 floats at once). SIMT (GPU): Single Instruction, Multiple Threads. Many threads execute the same instruction in lockstep as a warp (NVIDIA) or wavefront (AMD); each thread has its own registers/control flow. Warps/Wavefronts Smallest lockstep unit: ...

November 6, 2025 · 4 min · Samarth Narang

Tile, Fuse, Repeat: Why Layout Matters for AI Performance

Every time a neural network runs, there’s a silent negotiation between compute and memory. It’s naive to think ML compilers optimize just the compute - the FLOPs. In reality, they optimize movement. The most expensive operation in modern compute isn’t your matrix multiply; it’s getting data from memory to the compute units. This post explores how layout, tiling, and fusion are the unsung heroes of ML compiler performance. 1. The Compiler’s Hidden Battle Deep Learning performance is a balancing act between compute and memory. You can think of layout as the grammar of that relationship - the way tensors are arranged, accessed, and aligned. ...
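To make “tiling for reuse” tangible, here is a minimal sketch contrasting a naive matmul loop with a tiled one that keeps small blocks of the inputs hot in cache; the tile size and function names are illustrative, not taken from the post:

```cpp
#include <algorithm>
#include <cstddef>

// Naive matmul: C[i][j] += A[i][k] * B[k][j].
// Every element of C streams a full row/column of the inputs, so data is
// re-fetched from memory long after it has fallen out of cache.
void matmul_naive(const float *A, const float *B, float *C, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j)
      for (std::size_t k = 0; k < n; ++k)
        C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Tiled matmul: work on T x T blocks so each block of A and B is reused
// many times while it is still resident in cache.
void matmul_tiled(const float *A, const float *B, float *C, std::size_t n) {
  constexpr std::size_t T = 64; // illustrative tile size; tune per cache level
  for (std::size_t i0 = 0; i0 < n; i0 += T)
    for (std::size_t j0 = 0; j0 < n; j0 += T)
      for (std::size_t k0 = 0; k0 < n; k0 += T)
        for (std::size_t i = i0; i < std::min(i0 + T, n); ++i)
          for (std::size_t k = k0; k < std::min(k0 + T, n); ++k)
            for (std::size_t j = j0; j < std::min(j0 + T, n); ++j)
              C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```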

November 5, 2025 · 4 min · Samarth Narang

Booting Up: A Verbose Debug Build of Life and Compilers

“If life had a compiler, I’d probably still be tuning the optimization flags.” Welcome to Tiled Thoughts — my verbose debug build. I’m Samarth, a compiler engineer at Qualcomm. My work revolves around building efficient ML systems, contributing to open-source compiler infrastructures like LLVM and MLIR, and exploring the intersection of programming languages and machine learning. This blog is where I log the things that don’t quite fit into a Git commit message — reflections, experiments, and observations tiled across compilers, ML systems, open-source, and education. ...

November 5, 2025 · 1 min · Samarth Narang

Want to get notified when new posts go live?

✉️ Email me to subscribe