GPUs aren’t mysterious, just picky. Most performance cliffs are not about the math; they’re about how warps step, how memory is fetched, and how often registers spill. This post decodes the jargon. To be candid, it is also me “spilling” my notes, trying to explain things to myself.

TL;DR

  • Think in warps, not threads.
  • Coalesce or pay.
  • Tile for reuse in shared memory, but watch the register pressure.
  • Matrix units (Tensor Cores, Matrix Cores, etc.) love the right data types and tile sizes.
  • Occupancy is a balancing act: a tool, not a goal. Just enough to hide latency.

1. Execution Model, Decoded

SIMT vs SIMD (why is it confusing?)

  • SIMD (CPU): Single Instruction, Multiple Data. One instruction operates on a fixed-width vector (e.g., a single AVX-512 instruction processes 16 floats at once).
  • SIMT (GPU): Single Instruction, Multiple Threads. Many threads execute the same instruction in lockstep as a warp (NVIDIA) or wavefront (AMD); each thread has its own registers/control flow.

Warps/Wavefronts

Smallest lockstep unit:

  • NVIDIA: warp = 32 threads
  • AMD: wavefront = 32/64 threads (depending on architecture)

Tip: choose block sizes as multiples of warp size (e.g., 128, 256) to maximize utilization.
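
For concreteness, here is a minimal sketch of that advice (the kernel name scale and the size n are mine, purely illustrative): a 1D launch with 256 threads per block, a multiple of the 32-thread warp, plus how a thread finds its warp and lane.

    // Hypothetical kernel: block size 256 is a multiple of the 32-thread warp,
    // so no warp in the block is partially filled.
    __global__ void scale(float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        int warp = threadIdx.x / 32;                    // warp index within the block
        int lane = threadIdx.x % 32;                    // position within the warp
        (void)warp; (void)lane;                         // shown only to illustrate the mapping
        if (i < n) x[i] *= a;
    }

    // Host side: round the grid up so every element is covered.
    //   dim3 block(256);
    //   dim3 grid((n + block.x - 1) / block.x);
    //   scale<<<grid, block>>>(d_x, 2.0f, n);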

CTA (Cooperative Thread Array) / Workgroup

A cooperatively scheduled group of threads that share on-chip memory and synchronize at barriers (see the sketch after the list below).

Balance:

  1. Enough warps to hide latency.
  2. Avoid register pressure that causes spills.
  3. Keep shared memory usage in check.
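
A minimal sketch of a CTA cooperating, assuming 256 threads per block and a kernel name (block_sum) of my choosing: each block stages its slice in shared memory, meets at a barrier, and reduces it to one partial sum.

    // Each block reduces blockDim.x (assumed 256) elements to one partial sum.
    __global__ void block_sum(const float* in, float* out, int n) {
        __shared__ float tile[256];                      // on-chip, visible to the whole block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
        __syncthreads();                                 // barrier: all writes to tile are done

        // Tree reduction inside the block.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = tile[0]; // one value per CTA
    }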

Occupancy (It’s not a religion)

Active warps per Streaming Multiprocessor (SM) / Compute Unit (CU) relative to the maximum possible. This helps hide latency but chasing 100% occupancy can backfire due to increased register pressure and shared memory contention.
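
Rather than guessing, you can ask the CUDA runtime how many blocks of a given size fit on an SM and convert that to warps. A small sketch (the helper name report_occupancy is mine; it assumes a kernel such as the block_sum sketch above is visible in the same file):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Query resident blocks per SM for a kernel at a given block size,
    // then express the result as active warps per SM.
    void report_occupancy(const void* kernel, int block_size) {
        int max_blocks = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, kernel, block_size, 0);
        printf("blocks/SM: %d, active warps/SM: %d\n",
               max_blocks, max_blocks * block_size / 32);
    }

    // Usage: report_occupancy((const void*)block_sum, 256);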


2. Memory Hierarchy (where performance is won and lost)

Registers -> Shared Memory (L1) -> L2 Cache -> Global Memory (HBM/DRAM)

Latency increases and bandwidth decreases as you go down the hierarchy.

Coalesced Access (the golden rule)

Adjacent threads access adjacent addresses -> fewer memory transactions -> higher effective bandwidth.

Layout Matters: NHWC/NCHW/blocked layouts decide whether threadIdx.x walks through contiguous memory.

Quick Check: if consecutive values of threadIdx.x do not touch consecutive addresses (e.g., A[threadIdx.x * stride]), the access is strided and performance will suffer.
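
Two toy kernels make the contrast concrete (the names are mine; the stride is whatever your layout forces on you):

    // Coalesced: consecutive threads read consecutive addresses,
    // so a warp's 32 loads collapse into a few wide transactions.
    __global__ void copy_coalesced(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];                 // thread t touches element t
    }

    // Strided: consecutive threads are `stride` elements apart,
    // so the same warp needs many separate transactions.
    __global__ void copy_strided(const float* in, float* out, int n, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n) out[i] = in[i];                 // thread t touches element t * stride
    }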

Shared Memory (on-chip scratchpad)

On-chip, software-managed memory: an explicit scratchpad rather than a hardware cache.

  • Perfect for tiling working sets to maximize data reuse.
  • Beware of bank conflicts: ensure threads access different memory banks to avoid serialization.

Trick: Add padding to spread accesses across banks.
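
The classic example is a tiled matrix transpose. A sketch assuming a 32x32 thread block: the single extra column of padding shifts each row into a different bank, so the column-wise reads do not serialize.

    #define TILE 32

    // Launch with dim3 block(TILE, TILE). The +1 padding column means
    // tile[0][c], tile[1][c], ... fall in different banks.
    __global__ void transpose(const float* in, float* out, int width, int height) {
        __shared__ float tile[TILE][TILE + 1];     // +1 avoids 32-way bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced load
        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;       // swap block coordinates
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
    }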

Spills (the invisible tax)

Too many live variables force the compiler to spill registers to local memory (off-chip), causing massive slowdowns. One concrete control knob is sketched after the mitigation list below.

Mitigation:

  • Shorten live ranges (e.g., reorder computation, or store intermediate results in shared memory).
  • Fuse judiciously to reduce temporary variables.
  • Split kernels if necessary.
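
On NVIDIA, one such knob is __launch_bounds__, which tells the compiler the block size (and optionally a minimum resident-block target) you intend to launch with, so it budgets registers accordingly. The kernel below is a hypothetical placeholder; you can confirm register counts and spill loads/stores by compiling with nvcc -Xptxas -v.

    // Hypothetical kernel: promise at most 256 threads per block and ask for
    // at least 4 resident blocks per SM; the compiler limits registers per
    // thread to make that fit (possibly introducing spills, so measure).
    __global__ void __launch_bounds__(256, 4)
    heavy_kernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // ... many live intermediates here are what drive register pressure ...
            out[i] = in[i] * 2.0f;
        }
    }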

3. Math Units: Matrix Engines, Precision, and Shapes

Modern GPUs ship with specialized matrix units (Tensor Cores on NVIDIA, Matrix Cores on AMD) that accelerate matrix multiplications and convolutions.

  • NVIDIA: Tensor Core
  • AMD: Matrix Core
  • Intel: XMX (GPU) / AMX (CPU)

They love:

  • Low precision types: FP16, BF16, INT8, etc.
  • Right tile sizes: these vary by architecture, but tiles that line up with the native MMA (Matrix Multiply-Accumulate) shapes work best (e.g., 16x16 for NVIDIA Tensor Cores).
  • Proper accumulation types to avoid precision loss: e.g., multiply in FP16, accumulate in FP32.

Compiler Move: Choose pack/layout + tile sizes that map exactly to these hardware shapes for maximum throughput. “Almost right” shapes silently leave throughput on the table.
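
As a sketch of what “mapping to hardware shapes” looks like in source, here is a minimal CUDA WMMA kernel: one warp per 16x16 output tile, FP16 inputs, FP32 accumulation. It assumes row-major A (MxK), B (KxN), C (MxN) with all dimensions multiples of 16, a 32-thread block, and a GPU with Tensor Cores (sm_70 or newer).

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // Each warp computes one 16x16 tile of C = A * B.
    __global__ void wmma_tile_gemm(const half* A, const half* B, float* C,
                                   int M, int N, int K) {
        int tile_m = blockIdx.y;    // which 16-row band of C
        int tile_n = blockIdx.x;    // which 16-column band of C

        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
        wmma::fill_fragment(c_frag, 0.0f);            // FP32 accumulator

        for (int k = 0; k < K; k += 16) {
            // Leading dimensions are the row strides of the full matrices.
            wmma::load_matrix_sync(a_frag, A + tile_m * 16 * K + k, K);
            wmma::load_matrix_sync(b_frag, B + k * N + tile_n * 16, N);
            wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // FP16 multiply, FP32 add
        }
        wmma::store_matrix_sync(C + tile_m * 16 * N + tile_n * 16,
                                c_frag, N, wmma::mem_row_major);
    }

    // Launch: wmma_tile_gemm<<<dim3(N / 16, M / 16), 32>>>(A, B, C, M, N, K);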


4. Scheduling and Latency Hiding

Warp Scheduling

SMs/CUs keep several warps ready to execute. When one warp stalls (e.g., waiting for memory), another can run, hiding latency. If you under-provision warps (low occupancy), the GPU may sit idle.

Divergence and Predication

If a branch is hot and divergent (threads in a warp take different paths), the warp serializes execution, hurting performance. We can then “flatten” branches with predication or data-parallel transformations. If a branch is rare but heavy, split into separate kernels.
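
A toy illustration (kernel names are mine): the branchy form makes a warp walk both paths when its threads disagree, while the select compiles to predicated instructions. For something this small the compiler would likely predicate it anyway, but the shape of the rewrite is the point.

    // Divergent: threads whose x[i] is negative take a different path
    // than their warp-mates, so the warp serializes the two sides.
    __global__ void relu_branchy(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            if (x[i] < 0.0f) x[i] = 0.0f;          // hot, data-dependent branch
        }
    }

    // Flattened: a select / max keeps every thread on the same instruction
    // stream; the condition becomes a predicate instead of a branch.
    __global__ void relu_flat(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = fmaxf(x[i], 0.0f);
    }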


5. Vendor Term Crosswalk

Concept          | NVIDIA        | AMD                | Intel
-----------------|---------------|--------------------|-------------------
Lockstep unit    | Warp (32)     | Wavefront (32/64)  | Sub-group (varies)
SM/CU block      | SM            | CU                 | Xe-core
Matrix engines   | Tensor Cores  | MFMA/WMMA          | XMX (GPU) / AMX

Use the crosswalk to translate docs without changing mental models.


6. Checklists you will actually use

Block size sweep (first pass):

  • Start at warp-size multiples.
  • Cap registers/thread; if spills occur, reduce tile size or split computation.
  • Keep shared memory per block low enough that several blocks can stay resident on an SM.

Coalescing Sanity:

  • Ensure threadIdx.x accesses contiguous memory.
  • If strided, consider layout transformations, add a pack step, or adjust indexing.

Shared Memory Tiling:

  • Tile only when reuse > copy-in + copy-out cost.
  • Add padding to avoid bank conflicts (even a single element of padding helps).

Matrix Unit Mapping:

  • Match dtypes and tile sizes to hardware specs.
  • Prefer FP32 accumulation when using low-precision inputs.

7. Quick Reference Cheat Sheet

Execution Stack

Grid -> Block (CTA) -> Warp -> Thread

Memory Stack

Registers -> Shared Memory -> L2 Cache -> Global Memory

Coalesced vs Non-coalesced

Threads: t0, t1, t2, t3, t4, t5, t6, t7

Coalesced: A[0], A[1], A[2], A[3], A[4], A[5], A[6], A[7]

Non-coalesced: A[0], A[8], A[16], A[24]…