How GPUs Talk: A Practical Guide to Multi-GPU Training and Communication

An in-depth exploration of multi-GPU training techniques and the communication patterns that enable efficient distributed machine learning.

Table of Contents
1. Why Multi-GPU Training is a communication problem
2. How GPUs are wired: intra-node and inter-node
3. Core communication patterns
4. Multi-GPU training strategies
   4.1 Distributed Data Parallel (DDP)
   4.2 Fully Sharded Data Parallel (FSDP)
   4.3 Tensor Parallelism
   4.4 Pipeline Parallelism
5. Topology-aware communication: rings, trees, and NVLink
6. Hiding communication latency
7. Conclusion

When people talk about scaling deep learning, they usually mean throwing more GPUs at the problem. However, horizontal scaling can only get you so far without efficient communication strategies. ...

November 15, 2025 · 7 min

Demystifying GPU Terminology: A Compiler Engineer’s Field Guide

SIMD, warps, occupancy, coalescing, shared memory, spills, and matrix units, mapped to real compiler decisions.

Table of Contents
TL;DR
1. Execution Model, Decoded
   SIMT vs SIMD (why is it confusing?)
   Warps/Wavefronts
   CTA (Cooperative Thread Array) / Workgroup
   Occupancy (It's not a religion)
2. Memory Hierarchy (where performance is won and lost)
   Coalesced Access (the golden rule)
   Shared Memory (on-chip scratchpad)
   Spills (the invisible tax)
3. Math Units: Matrix Engines, Precision, and Shapes
4. Scheduling and Latency Hiding
   Warp Scheduling
   Divergence and Predication
5. Vendor Term Crosswalk
6. Checklists you will actually use
7. Quick Reference Cheat Sheet

GPUs aren't mysterious, just picky. Most performance cliffs are not about the math; they're about how warps step, how memory is fetched, and how often registers spill. This post decodes the jargon; to be candid, it is me "spilling" my notes as I try to explain them. ...

November 9, 2025 · 5 min