NVIDIA GPU Programming (CUDA)


Created: =dateformat(this.file.ctime,"dd MMM yyyy, hh:mm a") | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge


Overview

Introduction

NVIDIA GPU Architectures

How NVIDIA GPUs have Evolved From Tesla to Ampere to Hopper - techovedas

2017 (Volta) – successor to Pascal (which itself succeeded Maxwell)

2020 (Ampere)

2022 (Hopper)

2024 (Blackwell)

  • B100
  • DGX Spark
  • Jetson Thor

Main goal

  • you want high utilisation to fully use the GPU
  • almost all the work of writing GPU software is figuring out how to divide the computation into little blocks that run efficiently on different parts of the GPU → then setting up the data flow on the GPU so you're not just sitting around waiting on memory bottlenecks / for data to arrive → the goal is to have everything running all the time (a minimal sketch of this block decomposition follows this list)
  • if you get this right, and the workload uses the tensor cores, then the GPU will run at quite high utilisation
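As a minimal sketch of that decomposition (the kernel and sizes here are illustrative, not from any source above): a SAXPY launch splits the work over a 1-D grid of thread blocks that the hardware schedules independently across the SMs.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: each thread handles one element; the blocks of
// threads are scheduled independently across the GPU's SMs.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) y[i] = a * x[i] + y[i];              // guard the tail block
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    saxpy<<<blocks, threads>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```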

Hardware architecture

Notes on AI Hardware - Benjamin Spector | Stanford MLSys #88 - YouTube (uses the H100 as its running example)

SM - streaming multiprocessor

Memory hierarchy: off-chip DRAM (HBM) vs on-chip SRAM (L2 cache, L1 cache / shared memory, registers - fastest)

  • chipsandcheese.com/p/nvidias-h100-funny-l2-and-tons-of-bandwidth
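The hypothetical kernel below is annotated with where each value lives in that hierarchy (the block-wide sum itself is just a vehicle for the comments):

```cuda
// Illustrative block-wide sum, annotated with the memory level of each value.
// Assumes it is launched with 256 threads per block.
__global__ void block_sums(const float* __restrict__ in,  // global memory (HBM), via L2
                           float* __restrict__ out, int n) {
    __shared__ float tile[256];                    // shared memory: on-chip SRAM, per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;              // v sits in a register (fastest)
    tile[threadIdx.x] = v;                         // stage through shared memory
    __syncthreads();

    // Tree reduction within the block, entirely in on-chip SRAM.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // single write back to HBM
}
```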

Systolic Arrays

  • how modern matmuls are done in hardware
  • cluster the multiply-accumulate logic together so data never moves long distances
  • inputs are skewed: zeros are fed into the later rows and columns so each PE waits for the operands the previous PEs pass on to it (see the simulation sketch below)
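As a rough illustration (plain host-side C++, nothing NVIDIA ships), this simulates an output-stationary systolic array cycle by cycle: operand rows and columns enter skewed, with zeros padded at the edges, and each PE multiply-accumulates and forwards its inputs to its right/bottom neighbours.

```cuda
#include <cstdio>

const int N = 3;  // N x N grid of processing elements (PEs)

int main() {
    int A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    int B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    int C[N][N] = {};      // accumulator held inside each PE (output-stationary)
    int a_reg[N][N] = {};  // operand each PE forwards to its right neighbour
    int b_reg[N][N] = {};  // operand each PE forwards to its bottom neighbour

    // All operands drain through the array in 3N-2 cycles.
    for (int t = 0; t < 3 * N - 2; ++t) {
        // Sweep bottom-right to top-left so each PE still sees its
        // neighbours' values from the *previous* cycle.
        for (int i = N - 1; i >= 0; --i) {
            for (int j = N - 1; j >= 0; --j) {
                // Row i of A enters at the left edge, skewed by i cycles;
                // zeros are fed outside the valid window.
                int a = (j == 0) ? ((t - i >= 0 && t - i < N) ? A[i][t - i] : 0)
                                 : a_reg[i][j - 1];
                // Column j of B enters at the top edge, skewed by j cycles.
                int b = (i == 0) ? ((t - j >= 0 && t - j < N) ? B[t - j][j] : 0)
                                 : b_reg[i - 1][j];
                C[i][j] += a * b;  // multiply-accumulate in place
                a_reg[i][j] = a;   // forward operands for the next cycle
                b_reg[i][j] = b;
            }
        }
    }

    for (int i = 0; i < N; ++i) {  // prints A x B: 30 24 18 / 84 69 54 / 138 114 90
        for (int j = 0; j < N; ++j) printf("%4d", C[i][j]);
        printf("\n");
    }
    return 0;
}
```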

NVIDIA Hopper Architecture In-Depth | NVIDIA Technical Blog

  • NCU (Nsight Compute) → for inspecting the compiled machine code?
  • L1 instruction cache
    • don't write gigantic fully-unrolled loops: they cause instruction cache misses and the kernel will be slow (see the partial-unroll sketch after this list)
    • this is the main path instructions take on their way to being issued
  • at the SM level there is a Tensor Memory Accelerator (TMA)
    • bulk-loads memory, generating addresses asynchronously
    • not fundamental to the NVIDIA programming model
    • but important if you want good performance on an H100 (see the async-copy sketch after this list)
  • large L1 data cache / shared memory
    • the split between L1 cache and shared memory is user-configurable
    • most AI kernels want most of it as shared memory, since it's faster (no cache tagging / lookup procedures)
      • shared memory is more controllable
      • generally the goal is to not touch the cache, since you don't want register spills; you will touch the L1 cache if you want some global-memory reuse
      • so you'll typically configure it all as shared memory and control the addressing yourself (see the carveout sketch after this list)
  • 4 quadrants
    • generally similar to the 4090
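On the instruction-cache point, the usual fix is to cap the unroll factor rather than unrolling fully. A hypothetical sketch:

```cuda
// Hypothetical kernel: "#pragma unroll 4" caps the unroll factor so the
// generated code stays small enough to live in the L1 instruction cache;
// fully unrolling a long loop can blow past it and stall instruction fetch.
__global__ void accumulate(const float* __restrict__ x, float* out, int n) {
    float acc = 0.0f;
    #pragma unroll 4
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += x[i];
    atomicAdd(out, acc);  // combine the per-thread partial sums
}
```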
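On TMA: it is usually programmed through CUTLASS/CuTe or inline PTX rather than plain CUDA C++. As a portable stand-in for the same idea (asynchronous global→shared copies that the hardware handles while threads do other work), here is a sketch using cooperative_groups::memcpy_async; the kernel and tile size are made up for illustration:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Illustrative: the whole block issues one asynchronous copy from global
// to shared memory and only waits when the data is actually needed.
// Assumes global_in has at least 1024 elements.
__global__ void tile_consumer(const float* __restrict__ global_in, float* out) {
    __shared__ float tile[1024];
    auto block = cg::this_thread_block();

    // Kick off the copy; threads are free to do unrelated work meanwhile.
    cg::memcpy_async(block, tile, global_in, sizeof(float) * 1024);

    cg::wait(block);  // block until the tile has landed in shared memory
    out[block.thread_rank()] = 2.0f * tile[block.thread_rank()];
}
```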
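On the configurable L1/shared split: the CUDA runtime exposes it as a per-kernel carveout hint, plus a separate opt-in for using more than the default 48 KB of dynamic shared memory. A sketch of the host-side setup (the kernel and the 200 KB figure are illustrative; the per-SM maximum varies by architecture, roughly 227 KB per block on H100):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float* data) {
    extern __shared__ float smem[];          // dynamic shared memory
    smem[threadIdx.x] = data[threadIdx.x];   // control the addressing yourself
    __syncthreads();
    data[threadIdx.x] = smem[threadIdx.x];
}

void configure_and_launch(float* data) {
    // Prefer the maximum shared-memory carveout (i.e. minimum L1 cache).
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    // Opt in to more than 48 KB of dynamic shared memory per block.
    int smem_bytes = 200 * 1024;  // illustrative; H100 allows up to ~227 KB
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         smem_bytes);

    my_kernel<<<1, 256, smem_bytes>>>(data);
}
```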

Theoretical References

Articles

Courses

Videos


Code References

Methods

Tools, Frameworks