NVIDIA GPU Programming (CUDA)
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview
Related fields
Introduction
NVIDIA GPU Architectures
How NVIDIA GPUs have Evolved From Tesla to Ampere to Hopper - techovedas
2017 (Volta) – successor to Pascal (which itself followed Maxwell)
- Introduction of Tensor Cores
- V100
- Jetson AGX Xavier (NVIDIA Jetson AGX Xavier Delivers 32 TeraOps for New Era of AI in Robotics | NVIDIA Technical Blog)
2020 (Ampere)
2022 (Hopper)
2024 (Blackwell)
- B100
- DGX Spark
- Jetson Thor
Main goal
- you want high utilisation to fully use the GPU
- almost all the work of writing GPU software is figuring out how to divide the work you want to do into little blocks that run efficiently on different parts of the GPU → then setting up the data flow on the GPU so you are not just sitting around waiting on memory bottlenecks / for data to arrive → the goal is for everything to be running all the time (see the grid/block sketch below)
- if you get this right, and the workload uses the tensor cores, then the GPU will run at quite high utilisation
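As a concrete picture of the block/thread split, a minimal sketch: a toy grid-stride SAXPY kernel where the launch configuration (256 threads per block, enough blocks to cover the array) is the "divide work into little blocks" part. The kernel name and sizes are illustrative, not from the source.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Toy elementwise kernel: each thread handles one or more elements.
// The grid/block split below is the "divide work into little blocks" idea;
// real kernels pick tile sizes so every SM stays busy.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    // Grid-stride loop: works for any n and keeps all launched threads busy.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 256 threads per block is a common starting point; enough blocks so
    // every SM has several resident blocks to help hide memory latency.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```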
Hardware architecture
Notes on AI Hardware - Benjamin Spector | Stanford MLSys #88 - YouTube
H100 example
SM - streaming multiprocessor
Memory hierarchy
DRAM (HBM) vs SRAM (L2 cache, L1 cache, registers - fastest); see the shared-memory sketch below
- chipsandcheese.com/p/nvidias-h100-funny-l2-and-tons-of-bandwidth
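A minimal sketch of working down that hierarchy: each block stages a tile from global memory (HBM) into shared memory (on-SM SRAM) and reduces it there, with per-thread partials living in registers. The TILE size and kernel name are illustrative, not from the source.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define TILE 256  // one tile of the input staged in shared memory per block

// Each block copies a tile from global memory (HBM) into shared memory
// (on-SM SRAM), then reduces it there; per-thread partials live in registers.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[TILE];

    int gid = blockIdx.x * TILE + threadIdx.x;
    // Stage: global -> register -> shared.
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Tree reduction entirely in shared memory (no further HBM traffic).
    for (int stride = TILE / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + TILE - 1) / TILE;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    double total = 0;
    for (int b = 0; b < blocks; ++b) total += out[b];
    printf("sum = %.0f (expect %d)\n", total, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```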
Systolic Arrays
- how modern matmuls are done in hardware (see the wmma sketch after this list)
- cluster the compute logic together so data does not have to move long distances

- inputs to later rows and columns are padded with leading 0s (skewed) so each PE waits for the previous PE to pass its value along
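Whatever the exact hardware layout, on NVIDIA GPUs this matmul hardware is exposed as the tensor cores, programmable through the warp-level wmma API. A minimal sketch: one warp computes a single 16x16x16 fp16 tile (requires compute capability 7.0+; sizes and names are illustrative).

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
#include <cstdio>

using namespace nvcuda;

// One warp computes a single 16x16 output tile: D = A * B (fp16 in, fp32 out).
// The per-tile matmul itself is done by the tensor core hardware; the warp
// just stages fragments in and out of it. Compile with -arch=sm_70 or newer.
__global__ void wmma_16x16x16(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);     // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);       // the tensor-core matmul
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

int main() {
    half *A, *B; float *D;
    cudaMallocManaged(&A, 256 * sizeof(half));
    cudaMallocManaged(&B, 256 * sizeof(half));
    cudaMallocManaged(&D, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) { A[i] = __float2half(1.0f); B[i] = __float2half(1.0f); }

    wmma_16x16x16<<<1, 32>>>(A, B, D);    // one warp
    cudaDeviceSynchronize();
    printf("D[0] = %.0f (expect 16)\n", D[0]);
    cudaFree(A); cudaFree(B); cudaFree(D);
    return 0;
}
```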
NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog

- NCU → machine code?
- L1 instruction cache
- don’t write gigantic unrolled loops: you will get instruction-cache misses and the kernel will be slow (see the unroll sketch at the end of this section)
- this is mainly how instructions get issued
- At the SM level, there is a Tensor Memory Accelerator (TMA)
- loads memory and generates addresses asynchronously
- not fundamental to the NVIDIA programming model
- but important if you want good performance on H100 (see the async-copy sketch at the end of this section)
- large L1 data cache / shared memory
- user-configurable split between L1 cache and shared memory
- most AI kernels want most of it as shared memory since it’s faster (no cache tagging / lookup procedures)
- and it’s more controllable as shared memory
- generally the goal is to not touch the cache, since you don’t want register spills; you will touch L1 cache if you want some global-memory reuse
- so typically you set (almost) all of it as shared memory and control the addressing yourself
- 4 quadrants (processing blocks) per SM, each with its own warp scheduler

- generally similar to 4090
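On the instruction-cache point above: a minimal sketch of keeping unrolling bounded with `#pragma unroll`, rather than letting a huge loop body unroll fully. The factor 4 and the kernel are arbitrary illustrations, not from the source.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Grid-stride kernel with a bounded unroll factor. A huge fully-unrolled body
// bloats the instruction stream and starts missing in the L1 instruction
// cache; a small fixed factor (4 here, chosen arbitrarily) keeps it compact.
__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    #pragma unroll 4
    for (int k = i; k < n; k += stride)
        x[k] *= a;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 2.0f;

    scale<<<256, 256>>>(x, n, 0.5f);
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);  // expect 1.0
    cudaFree(x);
    return 0;
}
```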
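On the TMA point above: TMA itself is normally driven through CUTLASS/CuTe or inline PTX (cp.async.bulk.tensor) rather than plain CUDA C++. As a stand-in for the same idea (asynchronously staging global memory into shared memory, overlapping work, then waiting), here is a sketch using cooperative_groups::memcpy_async, which is the Ampere-era cp.async path, not TMA proper. Names and sizes are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cstdio>

namespace cg = cooperative_groups;

#define TILE 256

// Asynchronously stage a tile from global memory into shared memory, then
// work on it. This uses cp.async-style copies (CUDA 11+), not the Hopper TMA,
// but illustrates the same pattern: issue the copy, overlap, then wait.
__global__ void sum_tiles(const float* in, float* out) {
    __shared__ float tile[TILE];
    auto block = cg::this_thread_block();

    int base = blockIdx.x * TILE;  // assumes the input is a whole number of tiles
    // Issue the async copy: no thread has to hold the data in registers.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * TILE);
    // (Independent work could overlap with the copy here.)
    cg::wait(block);  // block until the tile has landed in shared memory

    // Trivial use of the staged tile: one thread sums it.
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < TILE; ++i) s += tile[i];
        out[blockIdx.x] = s;
    }
}

int main() {
    const int n = 1 << 12;              // multiple of TILE for simplicity
    const int blocks = n / TILE;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    sum_tiles<<<blocks, TILE>>>(in, out);
    cudaDeviceSynchronize();
    printf("out[0] = %.0f (expect %d)\n", out[0], TILE);
    cudaFree(in); cudaFree(out);
    return 0;
}
```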
Theoretical References
Articles
Courses
Videos
- Notes on AI Hardware - Benjamin Spector | Stanford MLSys #88 - YouTube
- CUDA + ThunderKittens, but increasingly drunk. - YouTube