Profiling model training and inference (FLOPs, MACs, Latency, Throughput)
Created: 15 Dec 2022, 01:49 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, tools
Generic Benchmarking
Benchmarking Latency, Throughput
PyTorch Benchmark - Lei Mao’s Log Book
Latency (p.137)
- Measures the delay for a specific task (p.137)
- in ms
- lower latency is better!
- can be compute- or memory-bound
Throughput (p.138)
- Measures the rate at which data is processed (p.138)
- in videos / s, or images / s, or instances / s
- higher throughput is better!
Latency vs. Throughput (p.139)
- they do not directly correlate ⇒ improving one does not automatically improve the other
- batching / parallel processing across more CUDA cores improves throughput
- but it does not necessarily reduce latency!
- optimising for latency is generally more difficult ⇒ how? (see the timing sketch below)
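A minimal timing sketch, assuming a CUDA GPU and a toy `nn.Linear` model standing in for your own workload: measure per-batch latency with CUDA events, then derive throughput from the batch size.

```python
import torch

# Placeholders: swap in your own model, batch size, and input shape.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda().eval()
batch_size = 64
x = torch.randn(batch_size, 1024, device="cuda")

# Warm-up so CUDA context creation / kernel selection don't pollute the numbers.
with torch.no_grad():
    for _ in range(10):
        model(x)
torch.cuda.synchronize()

# Latency: time forward passes with CUDA events (GPU-side timing).
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
n_iters = 100
start.record()
with torch.no_grad():
    for _ in range(n_iters):
        model(x)
end.record()
torch.cuda.synchronize()  # wait for the GPU before reading the timer

latency_ms = start.elapsed_time(end) / n_iters  # ms per batch (lower is better)
throughput = batch_size * 1000.0 / latency_ms   # instances / s (higher is better)
print(f"latency: {latency_ms:.3f} ms/batch, throughput: {throughput:.1f} samples/s")
```

Increasing the batch size typically raises throughput while per-batch latency grows, which is exactly the latency vs. throughput trade-off above.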
Hardware Specific
Benchmarking FLOPs, MACs
- GitHub - thop: Count the MACs / FLOPs of your PyTorch model.
- GitHub - torchprofile: A general and accurate MACs / FLOPs profiler for PyTorch models
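A minimal counting sketch, assuming the `thop.profile` and `torchprofile.profile_macs` entry points as documented in those repos' READMEs, with a torchvision ResNet-18 standing in for your own model:

```python
import torch
from torchvision.models import resnet18
from thop import profile
from torchprofile import profile_macs

model = resnet18()
dummy_input = torch.randn(1, 3, 224, 224)

# thop: returns (MACs, params); FLOPs are commonly approximated as 2 * MACs.
macs, params = profile(model, inputs=(dummy_input,))
print(f"thop: {macs / 1e9:.2f} GMACs, {params / 1e6:.2f} M params")

# torchprofile: traces the model and counts MACs.
macs = profile_macs(model, dummy_input)
print(f"torchprofile: {macs / 1e9:.2f} GMACs")
```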
PyTorch Profiler / Tensorboard Trace View
FSDL Course Material
Lab 5: Troubleshooting & Testing - The Full Stack
Other tools built upon profiler
Dynolog - Automated trace collection and analysis | PyTorch
- GitHub - facebookincubator/dynolog: Lightweight monitoring daemon for heterogeneous CPU-GPU systems with PyTorch Profiler
- HTA - PyTorch Trace Analysis for the Masses | PyTorch
Tutorials / Docs
- PyTorch Docs Example of doing optimisations based on the PyTorch Profiler
- How to do performance profiling on PyTorch · GitHub
- PyTorch Profiler With TensorBoard — PyTorch Tutorials 2.4.0+cu121 documentation
- autograd profiler — PyTorch 2.1 documentation
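Building on the tutorials above, a minimal sketch of collecting a TensorBoard trace with `torch.profiler`; the model, optimizer, and data are toy placeholders and the schedule values are just an example:

```python
import torch
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

# Toy placeholders for a real training loop.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = [(torch.randn(64, 1024), torch.randint(0, 10, (64,))) for _ in range(20)]

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),  # skip, warm up, then record
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # tell the profiler one step is done

# Then: tensorboard --logdir ./log/profiler  (the trace view needs the torch-tb-profiler plugin)
```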
Tensorboard Trace View
Analysing the Trace view https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-training/artifacts/trace/trace-67j1qxws/latest/files/training_step.pt.trace.json

- Many of these lines (CPU-GPU syncs) are not ideal! We would rather the CPU be far ahead of the GPU in terms of execution, so that the CPU is simply waiting for the GPU to finish ⇒ fully maximising your GPU compute
- e.g. this >

- This is preferred, since the GPU has high utilisation and many things running while the CPU waits to sync! It shows that your bottleneck is the GPU: if the GPU were faster, the time spent waiting for the sync would be shorter
- But then again, it is also worth asking why your CPU is waiting to sync at all ⇒ can the CPU compute other things first before the GPU is ready to sync? (a common culprit is sketched below)
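One common culprit for a CPU that sits waiting on syncs is a per-step device-to-host read, e.g. calling `loss.item()` or printing a GPU tensor every iteration. A hedged sketch of the pattern and one way to reduce the sync frequency (toy model/data; the every-100-steps logging interval is arbitrary):

```python
import torch

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

running_loss = torch.zeros((), device="cuda")
for step in range(1000):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # BAD: loss.item() copies a GPU scalar to the CPU and forces the CPU to wait
    # for all queued GPU work on every single step (one sync per iteration).
    # print(step, loss.item())

    # BETTER: accumulate on the GPU and only sync occasionally when logging.
    running_loss += loss.detach()
    if (step + 1) % 100 == 0:
        print(step, (running_loss / 100).item())  # one sync every 100 steps
        running_loss.zero_()
```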
This poor performance is actually in the Adam optimizer, which is not well optimised here: there are many syncs between CPU and GPU.
There has been work on a fused Adam optimizer in PyTorch that aims to make this better!
A fused=True option was recently added in PyTorch v1.13 for the Adam optimizer (torch GitHub issues 68041, 85507) ⇒ see the sketch below!
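A minimal sketch of opting into the fused Adam kernel (it requires the parameters to live on CUDA; the model here is a placeholder):

```python
import torch

model = torch.nn.Linear(1024, 10).cuda()

# PyTorch >= 1.13: fused=True runs a fused CUDA implementation of the Adam update
# instead of many small per-parameter ops, reducing kernel-launch overhead and syncs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
```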
How to profile?
CUDA Profiling
- developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9956-best-practices-when-benchmarking-cuda-applications_V2.pdf
- Lecture 1 How to profile CUDA kernels in PyTorch - YouTube
- nvidia nsight
Nsight Systems

- Solving Machine Learning Performance Anti-Patterns: a Systematic Approach | paulbridger.com
- Navigating NVIDIA Nsight Systems for Efficient Profiling
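To make regions of the training step easy to find on the Nsight Systems timeline, you can annotate them with NVTX ranges. A minimal sketch (toy model/data; the `nsys` command line is just an example invocation):

```python
import torch

# Profile with, e.g.:  nsys profile -o trace python train.py
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    torch.cuda.nvtx.range_push("forward")       # named block on the Nsight timeline
    loss = torch.nn.functional.cross_entropy(model(x), y)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    optimizer.zero_grad()
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer.step")
    optimizer.step()
    torch.cuda.nvtx.range_pop()
```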
Generic profiler?
ARM profiling
- Profile the Performance of AI and ML Mobile Applications on Arm | Arm Learning Paths
Transclude of Profiling-model-training-and-inference-(FLOPs,-MACs,-Latency,-Throughput)-2025-11-14-14.19.33.excalidraw
Transclude of Profiling-model-training-and-inference-(FLOPs,-MACs,-Latency,-Throughput)-2025-11-14-15.00.14.excalidraw
- Prefill stage – Backend Memory Stall Cycles are ~12% of total backend stall cycles (0.65 / 5.42 mega-cycles)
- Decode stage – Backend Memory Stall Cycles are ~52% of total backend stall cycles (4.01 / 7.7 mega-cycles) ⇒ MEMORY BOUND!