CUDA | Tiled Thoughts: A Verbose Debug Build

SIMD, warps, occupancy, coalescing, shared memory, spills, and matrix units, mapped to real compiler decisions.

Table of Contents

TL;DR
1. Execution Model, Decoded
   SIMT vs SIMD (why is it confusing?)
   Warps/Wavefronts
   CTA (Cooperative Thread Array) / Workgroup
   Occupancy (It’s not a religion)
2. Memory Hierarchy (where performance is won and lost)
   Coalesced Access (the golden rule)
   Shared Memory (on-chip scratchpad)
   Spills (the invisible tax)
3. Math Units: Matrix Engines, Precision, and Shapes
4. Scheduling and Latency Hiding
   Warp Scheduling
   Divergence and Predication
5. Vendor Term Crosswalk
6. Checklists you will actually use
7. Quick Reference Cheat Sheet

GPUs aren’t mysterious, just picky. Most performance cliffs are not about the math; they’re about how warps step, how memory is fetched, and how often registers spill. This post decodes the jargon and, to be candid, it is me “spilling” my notes, trying to explain myself. ...