NVIDIA GPU Programming (CUDA)
Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge
Overview
Related fields
Introduction
NVIDIA GPU Architectures
How NVIDIA GPUs have Evolved From Tesla to Ampere to Hopper - techovedas
2017 (Volta) – successor to Pascal (which itself followed Maxwell)
- Introduction of Tensor Cores
- V100
- Jetson AGX Xavier (NVIDIA Jetson AGX Xavier Delivers 32 TeraOps for New Era of AI in Robotics | NVIDIA Technical Blog)
2020 (Ampere)
2022 (Hopper)
2024 (Blackwell)
- B100
- DGX Spark
- Jetson Thor
Main goal
- you want high utilisation to fully use the GPU
- almost all the work of writing GPU software is figuring out how to divide the work you want to do into little blocks that run efficiently on different parts of the GPU → then setting up the data flow on the GPU so you are not just sitting around waiting on memory bottlenecks / for data to arrive → the goal is for everything to be running all the time (see the grid/block sketch below)
- if you get this right, and the workload uses the tensor cores, then the GPU will run at quite high utilisation
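As a concrete picture of the block/thread split, a minimal sketch: a toy grid-stride SAXPY kernel where the launch configuration (256 threads per block, enough blocks to cover the array) is the "divide work into little blocks" part. The kernel name and sizes are illustrative, not from the source.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Toy elementwise kernel: each thread handles one or more elements.
// The grid/block split below is the "divide work into little blocks" idea;
// real kernels pick tile sizes so every SM stays busy.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    // Grid-stride loop: works for any n and keeps all launched threads busy.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 256 threads per block is a common starting point; enough blocks so
    // every SM has several resident blocks to help hide memory latency.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 3.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```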
Hardware architecture
Notes on AI Hardware - Benjamin Spector | Stanford MLSys #88 - YouTube
H100 example
SM - streaming multiprocessor
Memory hierarchy
DRAM (HBM) vs SRAM (L2 cache, L1 cache, registers - fastest); see the shared-memory sketch below
- chipsandcheese.com/p/nvidias-h100-funny-l2-and-tons-of-bandwidth
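A minimal sketch of working down that hierarchy: each block stages a tile from global memory (HBM) into shared memory (on-SM SRAM) and reduces it there, with per-thread partials living in registers. The TILE size and kernel name are illustrative, not from the source.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

#define TILE 256  // one tile of the input staged in shared memory per block

// Each block copies a tile from global memory (HBM) into shared memory
// (on-SM SRAM), then reduces it there; per-thread partials live in registers.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float tile[TILE];

    int gid = blockIdx.x * TILE + threadIdx.x;
    // Stage: global -> register -> shared.
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Tree reduction entirely in shared memory (no further HBM traffic).
    for (int stride = TILE / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
}

int main() {
    const int n = 1 << 20;
    const int blocks = (n + TILE - 1) / TILE;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    block_sum<<<blocks, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    double total = 0;
    for (int b = 0; b < blocks; ++b) total += out[b];
    printf("sum = %.0f (expect %d)\n", total, n);
    cudaFree(in); cudaFree(out);
    return 0;
}
```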
Systolic Arrays
- how modern matmuls are done in hardware (see the wmma sketch after this list)
- cluster the compute logic together so data does not have to move long distances

- inputs to later rows and columns are padded with leading 0s (skewed) so each PE waits for the previous PE to pass its value along
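Whatever the exact hardware layout, on NVIDIA GPUs this matmul hardware is exposed as the tensor cores, programmable through the warp-level wmma API. A minimal sketch: one warp computes a single 16x16x16 fp16 tile (requires compute capability 7.0+; sizes and names are illustrative).

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
#include <cstdio>

using namespace nvcuda;

// One warp computes a single 16x16 output tile: D = A * B (fp16 in, fp32 out).
// The per-tile matmul itself is done by the tensor core hardware; the warp
// just stages fragments in and out of it. Compile with -arch=sm_70 or newer.
__global__ void wmma_16x16x16(const half* A, const half* B, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(a, A, 16);     // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);       // the tensor-core matmul
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

int main() {
    half *A, *B; float *D;
    cudaMallocManaged(&A, 256 * sizeof(half));
    cudaMallocManaged(&B, 256 * sizeof(half));
    cudaMallocManaged(&D, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) { A[i] = __float2half(1.0f); B[i] = __float2half(1.0f); }

    wmma_16x16x16<<<1, 32>>>(A, B, D);    // one warp
    cudaDeviceSynchronize();
    printf("D[0] = %.0f (expect 16)\n", D[0]);
    cudaFree(A); cudaFree(B); cudaFree(D);
    return 0;
}
```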
NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog

- NCU → machine code?
- L1 instruction cache
- don’t write gigantic unrolled loops: you will get instruction-cache misses and the kernel will be slow (see the unroll sketch at the end of this section)
- this is mainly how instructions get issued
- At the SM level, there is a Tensor Memory Accelerator (TMA)
- loads memory and generates addresses asynchronously
- not fundamental to the NVIDIA programming model
- but important if you want good performance on H100 (see the async-copy sketch at the end of this section)
- large L1 data cache / shared memory
- user-configurable split between L1 cache and shared memory
- most AI kernels want most of it as shared memory since it’s faster (no cache tagging / lookup procedures)
- and it’s more controllable as shared memory
- generally the goal is to not touch the cache, since you don’t want register spills; you will touch L1 cache if you want some global-memory reuse
- so typically you set (almost) all of it as shared memory and control the addressing yourself
- 4 quadrants (processing blocks) per SM, each with its own warp scheduler

- generally similar to 4090
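On the instruction-cache point above: a minimal sketch of keeping unrolling bounded with `#pragma unroll`, rather than letting a huge loop body unroll fully. The factor 4 and the kernel are arbitrary illustrations, not from the source.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Grid-stride kernel with a bounded unroll factor. A huge fully-unrolled body
// bloats the instruction stream and starts missing in the L1 instruction
// cache; a small fixed factor (4 here, chosen arbitrarily) keeps it compact.
__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    #pragma unroll 4
    for (int k = i; k < n; k += stride)
        x[k] *= a;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 2.0f;

    scale<<<256, 256>>>(x, n, 0.5f);
    cudaDeviceSynchronize();
    printf("x[0] = %f\n", x[0]);  // expect 1.0
    cudaFree(x);
    return 0;
}
```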
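On the TMA point above: TMA itself is normally driven through CUTLASS/CuTe or inline PTX (cp.async.bulk.tensor) rather than plain CUDA C++. As a stand-in for the same idea (asynchronously staging global memory into shared memory, overlapping work, then waiting), here is a sketch using cooperative_groups::memcpy_async, which is the Ampere-era cp.async path, not TMA proper. Names and sizes are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cstdio>

namespace cg = cooperative_groups;

#define TILE 256

// Asynchronously stage a tile from global memory into shared memory, then
// work on it. This uses cp.async-style copies (CUDA 11+), not the Hopper TMA,
// but illustrates the same pattern: issue the copy, overlap, then wait.
__global__ void sum_tiles(const float* in, float* out) {
    __shared__ float tile[TILE];
    auto block = cg::this_thread_block();

    int base = blockIdx.x * TILE;  // assumes the input is a whole number of tiles
    // Issue the async copy: no thread has to hold the data in registers.
    cg::memcpy_async(block, tile, in + base, sizeof(float) * TILE);
    // (Independent work could overlap with the copy here.)
    cg::wait(block);  // block until the tile has landed in shared memory

    // Trivial use of the staged tile: one thread sums it.
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < TILE; ++i) s += tile[i];
        out[blockIdx.x] = s;
    }
}

int main() {
    const int n = 1 << 12;              // multiple of TILE for simplicity
    const int blocks = n / TILE;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    sum_tiles<<<blocks, TILE>>>(in, out);
    cudaDeviceSynchronize();
    printf("out[0] = %.0f (expect %d)\n", out[0], TILE);
    cudaFree(in); cudaFree(out);
    return 0;
}
```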
Theoretical References
Articles
Courses
Videos
- Notes on AI Hardware - Benjamin Spector | Stanford MLSys #88 - YouTube
- CUDA + ThunderKittens, but increasingly drunk. - YouTube