MLIR for People Who Only Know LLVM IR: A Guided Tour

A practical mental-model bridge from LLVM IR to MLIR for people who already think in terms of functions, basic blocks, and passes. November 25, 2025 · 10 min Table of Contents TL;DR: The Mental Mapping Modules, functions, blocks, and values LLVM IR Mental Model MLIR Mental Model Example: hello, function Dialects: Instruction Sets for Different Domains Dialects as namespaces Operations, Regions, and Nested Control Flow Regions in practice Nested IR everywhere Types and Attributes SSA value types Attributes A side-by-side example LLVM IR MLIR Breakdown Passes and pipelines Pattern rewrites: opt passes with a twist How does this become LLVM IR? How to start reading MLIR as an LLVM person Why MLIR? If you already speak LLVM IR, MLIR can feel like a cousin who redesigned the house while you were out: ...

February 22, 2026 · 10 min

Can I Beat Clang’s Auto-Vectorizer on Apple Silicon? A SAXPY Case Study

I tried to hand-write NEON intrinsics on Apple Silicon, tune loop unrolling, and beat Clang’s auto-vectorizer. Spoiler: it’s harder than it looks. On modern CPUs, the “slow” part of your code often isn’t the math, it’s how you feed the math units. Compilers try to fix this with clever optimizations, such as auto-vectorization: transforming scalar loops into SIMD (Single Instruction, Multiple Data) operations that process multiple data points in parallel. ...

November 22, 2025 · 6 min

Quantization in MLIR: Types, Scales, and Where to Put the q

Quantization is one of those things where everyone agrees “it’s important”, but the details are often fuzzy; I mean, no one really wants to think about it in their compiler’s IR, right? In most stacks, quantization lives in an (awkward?) place: High-level frameworks (like PyTorch) expose “quantized layers” or post-training quantization APIs. Backends (like TensorRT, ONNX Runtime) and kernels really care about bit-widths, scales, zero points, and data layouts. The glue in between is often a mishmash of ad-hoc passes, custom operators, and brittle assumptions. MLIR is actually a sweet spot to make quantization less cursed: ...

November 18, 2025 · 7 min

How GPUs Talk: A Practical Guide to Multi-GPU Training and Communication

An in-depth exploration of multi-GPU training techniques and the communication patterns that enable efficient distributed machine learning. Table of Contents 1. Why Multi-GPU Training is a communication problem 2. How GPUs are wired: intra-node and inter-node 3. Core communication patterns 4. Multi-GPU training strategies 4.1 Distributed Data Parallel (DDP) 4.2 Fully Sharded Data Parallel (FSDP) 4.3 Tensor Parallelism 4.4 Pipeline Parallelism 5. Topology-aware communication: rings, trees, and NVLink 6. Hiding communication latency 7. Conclusion When people talk about scaling deep learning, they usually mean throwing more GPUs at the problem. However, horizontal scaling can only get you so far without efficient communication strategies. ...

November 15, 2025 · 7 min

Data Structure and Iterator Kung Fu in LLVM

Practical patterns, zero-copy views, and safe mutation loops for faster LLVM passes. Table of Contents Why LLVM ships its own containers Core Value Types (which own nothing) StringRef ArrayRef / MutableArrayRef Twine Small-Size Optimized Containers SmallVector<T, N> SmallString SmallPtrSet<T*, N> “Hashy” Workhorses DenseMap<KeyT, ValueT> / DenseSet Custom Keys providing DenseMapInfo<Key> with: StringMap Erasing While Iterating Arenas, Uniquing, and more BumpPtrAllocator FoldingSet Error handling the LLVM way IR-Centric Must-Knows Traversal Idioms Mutation Safety CFG Helpers Range and Iterator (Halloween!) Candy Choosing the Right Data Structure (A Decision Matrix) Common “Shooting Yourself in the Foot” Pitfalls Micro-Benchmarks Compile and Run Conclusion When to use SmallVector vs std::vector, why DenseMap feels like cheating, how StringRef & ArrayRef avoid copies, and the iterator tricks that make LLVM code elegant and fast. ...

November 9, 2025 · 8 min

Demystifying GPU Terminology: A Compiler Engineer’s Field Guide

SIMD, warps, occupancy, coalescing, shared memory, spills, and matrix units—mapped to real compiler decisions. Table of Contents TL;DR 1. Execution Model, Decoded SIMT vs SIMD (why is it confusing?) Warps/Wavefronts CTA (Cooperative Thread Array) / Workgroup Occupancy (It’s not a religion) 2. Memory Hierarchy (where performance is won and lost) Coalesced Access (the golden rule) Shared Memory (on-chip scratchpad) Spills (the invisible tax) 3. Math Units: Matrix Engines, Precision, and Shapes 4. Scheduling and Latency Hiding Warp Scheduling Divergence and Predication 5. Vendor Term Crosswalk 6. Checklists you will actually use 7. Quick Reference Cheat Sheet GPUs aren’t mysterious, just picky. Most performance cliffs are not about the math; they’re about how warps step, how memory is fetched, and how often the registers spill. This post decodes the jargon; to be candid, it is also me “spilling” my notes. ...

November 9, 2025 · 5 min

How Rewriting works in MLIR

An in-depth look at the rewriting mechanisms in MLIR and how they enable powerful optimizations. Table of Contents TL;DR Part 1: The moving pieces RewritePatternSet and PatternRewriter Greedy vs. Conversion Greedy (Canonicalization and Local Rewrites) Conversion (Dialect Conversion) Part 2: Your first greedy rewrite pattern Pattern Definition Part 3: Running it: Tiny IR + Command Build and run Result Part 4: Dialect Conversion in a nutshell Core ingredients Example: Convert toy.addi to arith.addi Key differences from greedy patterns: Part 5: Folding vs. Patterns Part 6: Match helpers, benefits, and ordering Part 7: Debugging and guardrails Part 8: Mini LIT Test Example Conclusion Further Reading If you only remember one thing from this post: rewriting in MLIR is “find a pattern, make a change, repeat until no more changes can be made”, with two key components: ...

November 8, 2025 · 6 min

Booting Up: A Verbose Debug Build of Life and Compilers

Compiler passes, ML systems, and life — debug logs from the path between code and silicon. “If life had a compiler, I’d probably still be tuning the optimization flags.” Welcome to Tiled Thoughts — my verbose debug build. I’m Samarth, a compiler engineer at Qualcomm. My work revolves around building efficient ML systems, contributing to open-source compiler infrastructures like LLVM and MLIR, and exploring the intersection of programming languages and machine learning. This blog is where I log the things that don’t quite fit into a Git commit message — reflections, experiments, and observations tiled across compilers, ML systems, open-source, and education. ...

November 4, 2025 · 1 min