How GPUs Talk: A Practical Guide to Multi-GPU Training and Communication

An in-depth exploration of multi-GPU training techniques and the communication patterns that enable efficient distributed machine learning.

Table of Contents
1. Why Multi-GPU Training is a communication problem
2. How GPUs are wired: intra-node and inter-node
3. Core communication patterns
4. Multi-GPU training strategies
   4.1 Distributed Data Parallel (DDP)
   4.2 Fully Sharded Data Parallel (FSDP)
   4.3 Tensor Parallelism
   4.4 Pipeline Parallelism
5. Topology-aware communication: rings, trees, and NVLink
6. Hiding communication latency
7. Conclusion

When people talk about scaling deep learning, they usually mean throwing more GPUs at the problem. However, horizontal scaling can only get you so far without efficient communication strategies. ...

November 15, 2025 · 7 min

Demystifying GPU Terminology: A Compiler Engineer’s Field Guide

SIMD, warps, occupancy, coalescing, shared memory, spills, and matrix units, mapped to real compiler decisions.

Table of Contents
TL;DR
1. Execution Model, Decoded
   SIMT vs SIMD (why is it confusing?)
   Warps/Wavefronts
   CTA (Cooperative Thread Array) / Workgroup
   Occupancy (It's not a religion)
2. Memory Hierarchy (where performance is won and lost)
   Coalesced Access (the golden rule)
   Shared Memory (on-chip scratchpad)
   Spills (the invisible tax)
3. Math Units: Matrix Engines, Precision, and Shapes
4. Scheduling and Latency Hiding
   Warp Scheduling
   Divergence and Predication
5. Vendor Term Crosswalk
6. Checklists you will actually use
7. Quick Reference Cheat Sheet

GPUs aren't mysterious, just picky. Most performance cliffs are not about the math; they're about how warps step, how memory is fetched, and how often registers spill. This post decodes the jargon; to be candid, it is me "spilling" my notes as I try to explain them. ...

November 9, 2025 · 5 min