Profiling model training and inference (FLOPs, MACs, Latency, Throughput)
Created: 15 Dec 2022, 01:49 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, tools
Generic Benchmarking
Benchmarking Latency, Throughput
PyTorch Benchmark - Lei Mao’s Log Book
Latency (p.137)
- Measures the delay for a specific task (p.137)
- in ms
- lower latency is better!
- can be compute- or memory-bound
Throughput (p.138)
- Measures the rate at which data is processed (p.138)
- in videos / s, or images / s, or instances / s
- higher throughput is better!
Latency vs. Throughput (p.139)
- they do not directly correlate ⇒ improving one does not automatically improve the other
- batching / parallel processing across more CUDA cores improves throughput
- but it does not necessarily reduce latency!
- optimising for latency is generally more difficult ⇒ how? (see the timing sketch below)
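A minimal timing sketch, assuming a CUDA GPU and a toy `nn.Linear` model standing in for your own workload: measure per-batch latency with CUDA events, then derive throughput from the batch size.

```python
import torch

# Placeholders: swap in your own model, batch size, and input shape.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda().eval()
batch_size = 64
x = torch.randn(batch_size, 1024, device="cuda")

# Warm-up so CUDA context creation / kernel selection don't pollute the numbers.
with torch.no_grad():
    for _ in range(10):
        model(x)
torch.cuda.synchronize()

# Latency: time forward passes with CUDA events (GPU-side timing).
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
n_iters = 100
start.record()
with torch.no_grad():
    for _ in range(n_iters):
        model(x)
end.record()
torch.cuda.synchronize()  # wait for the GPU before reading the timer

latency_ms = start.elapsed_time(end) / n_iters  # ms per batch (lower is better)
throughput = batch_size * 1000.0 / latency_ms   # instances / s (higher is better)
print(f"latency: {latency_ms:.3f} ms/batch, throughput: {throughput:.1f} samples/s")
```

Increasing the batch size typically raises throughput while per-batch latency grows, which is exactly the latency vs. throughput trade-off above.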
Hardware Specific
Benchmarking FLOPs, MACs
- GitHub - thop: Count the MACs / FLOPs of your PyTorch model.
- GitHub - torchprofile: A general and accurate MACs / FLOPs profiler for PyTorch models
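A minimal counting sketch, assuming the `thop.profile` and `torchprofile.profile_macs` entry points as documented in those repos' READMEs, with a torchvision ResNet-18 standing in for your own model:

```python
import torch
from torchvision.models import resnet18
from thop import profile
from torchprofile import profile_macs

model = resnet18()
dummy_input = torch.randn(1, 3, 224, 224)

# thop: returns (MACs, params); FLOPs are commonly approximated as 2 * MACs.
macs, params = profile(model, inputs=(dummy_input,))
print(f"thop: {macs / 1e9:.2f} GMACs, {params / 1e6:.2f} M params")

# torchprofile: traces the model and counts MACs.
macs = profile_macs(model, dummy_input)
print(f"torchprofile: {macs / 1e9:.2f} GMACs")
```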
PyTorch Profiler / Tensorboard Trace View
FSDL Course Material
Lab 5: Troubleshooting & Testing - The Full Stack
Other tools built upon profiler
Dynolog - Automated trace collection and analysis | PyTorch
- GitHub - facebookincubator/dynolog: Lightweight monitoring daemon for heterogeneous CPU-GPU systems with PyTorch Profiler
- HTA - PyTorch Trace Analysis for the Masses | PyTorch
Tutorials / Docs
- PyTorch Docs Example of doing optimisations based on the PyTorch Profiler
- How to do performance profiling on PyTorch · GitHub
- PyTorch Profiler With TensorBoard — PyTorch Tutorials 2.4.0+cu121 documentation
- autograd profiler — PyTorch 2.1 documentation
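Building on the tutorials above, a minimal sketch of collecting a TensorBoard trace with `torch.profiler`; the model, optimizer, and data are toy placeholders and the schedule values are just an example:

```python
import torch
from torch.profiler import profile, schedule, tensorboard_trace_handler, ProfilerActivity

# Toy placeholders for a real training loop.
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loader = [(torch.randn(64, 1024), torch.randint(0, 10, (64,))) for _ in range(20)]

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),  # skip, warm up, then record
    on_trace_ready=tensorboard_trace_handler("./log/profiler"),
    record_shapes=True,
    with_stack=True,
) as prof:
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        prof.step()  # tell the profiler one step is done

# Then: tensorboard --logdir ./log/profiler  (the trace view needs the torch-tb-profiler plugin)
```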
Tensorboard Trace View
Analysing the Trace view https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-training/artifacts/trace/trace-67j1qxws/latest/files/training_step.pt.trace.json

- Many of these lines (CPU-GPU syncs) are not ideal! We would rather the CPU be far ahead of the GPU in terms of execution, so that the CPU is simply waiting for the GPU to finish ⇒ fully maximising your GPU compute
- e.g. this >

- This is preferred, since the GPU has high utilisation and many things running while the CPU waits to sync! It shows that your bottleneck is the GPU: if the GPU were faster, the time spent waiting for the sync would be shorter
- But then again, it is also worth asking why your CPU is waiting to sync at all ⇒ can the CPU compute other things first before the GPU is ready to sync? (a common culprit is sketched below)
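One common culprit for a CPU that sits waiting on syncs is a per-step device-to-host read, e.g. calling `loss.item()` or printing a GPU tensor every iteration. A hedged sketch of the pattern and one way to reduce the sync frequency (toy model/data; the every-100-steps logging interval is arbitrary):

```python
import torch

model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

running_loss = torch.zeros((), device="cuda")
for step in range(1000):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # BAD: loss.item() copies a GPU scalar to the CPU and forces the CPU to wait
    # for all queued GPU work on every single step (one sync per iteration).
    # print(step, loss.item())

    # BETTER: accumulate on the GPU and only sync occasionally when logging.
    running_loss += loss.detach()
    if (step + 1) % 100 == 0:
        print(step, (running_loss / 100).item())  # one sync every 100 steps
        running_loss.zero_()
```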
This poor performance is actually in the Adam optimizer, which is not well optimised here: there are many syncs between CPU and GPU.
There has been work on a fused Adam optimizer in PyTorch that aims to make this better!
A fused=True option was recently added in PyTorch v1.13 for the Adam optimizer (torch GitHub issues 68041, 85507) ⇒ see the sketch below!
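A minimal sketch of opting into the fused Adam kernel (it requires the parameters to live on CUDA; the model here is a placeholder):

```python
import torch

model = torch.nn.Linear(1024, 10).cuda()

# PyTorch >= 1.13: fused=True runs a fused CUDA implementation of the Adam update
# instead of many small per-parameter ops, reducing kernel-launch overhead and syncs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
```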
How to profile?
CUDA Profiling
- developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9956-best-practices-when-benchmarking-cuda-applications_V2.pdf
- Lecture 1 How to profile CUDA kernels in PyTorch - YouTube
- nvidia nsight
Nsight Systems

- Solving Machine Learning Performance Anti-Patterns: a Systematic Approach | paulbridger.com
- Navigating NVIDIA Nsight Systems for Efficient Profiling
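To make regions of the training step easy to find on the Nsight Systems timeline, you can annotate them with NVTX ranges. A minimal sketch (toy model/data; the `nsys` command line is just an example invocation):

```python
import torch

# Profile with, e.g.:  nsys profile -o trace python train.py
model = torch.nn.Linear(1024, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
x = torch.randn(64, 1024, device="cuda")
y = torch.randint(0, 10, (64,), device="cuda")

for _ in range(10):
    torch.cuda.nvtx.range_push("forward")       # named block on the Nsight timeline
    loss = torch.nn.functional.cross_entropy(model(x), y)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    optimizer.zero_grad()
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer.step")
    optimizer.step()
    torch.cuda.nvtx.range_pop()
```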
Generic profiler?
ARM profiling
- Profile the Performance of AI and ML Mobile Applications on Arm | Arm Learning Paths
Transclude of Profiling-model-training-and-inference-(FLOPs,-MACs,-Latency,-Throughput)-2025-11-14-14.19.33.excalidraw
Transclude of Profiling-model-training-and-inference-(FLOPs,-MACs,-Latency,-Throughput)-2025-11-14-15.00.14.excalidraw
- Prefill stage – Backend Memory Stall Cycles are ~12% of total backend stall cycles (0.65 / 5.42 mega-cycles)
- Decode stage – Backend Memory Stall Cycles are ~52% of total backend stall cycles (4.01 / 7.7 mega-cycles) ⇒ MEMORY BOUND!