Profiling model training and inference (FLOPs, MACs, Latency, Throughput)


Created: 15 Dec 2022, 01:49 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, tools


Generic Benchmarking


Benchmarking Latency, Throughput

PyTorch Benchmark - Lei Mao’s Log Book

Latency (p.137)

  • Measures the delay for a specific task (p.137)
  • measured in milliseconds (ms)
  • lower latency is better!
  • can be compute-bound or memory-bound

Throughput (p.138)

  • Measures the rate at which data is processed (p.138)
  • measured in videos/s, images/s, or instances/s
  • higher throughput is better!

Latency vs. Throughput (p.139)

  • they do not correlate with each other; you cannot translate one into the other
  • batching / parallel processing across more CUDA cores improves the throughput
    • but latency does not necessarily reduce!
  • optimising for latency is generally more difficult (how?)
    • e.g. by overlapping compute with memory access
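A minimal sketch of both measurements (plain Python with `time.perf_counter`; names are my own — a real GPU benchmark would additionally need `torch.cuda.synchronize()` or CUDA events before/after timing, since CUDA launches are asynchronous):

```python
import time

def measure_latency(fn, n_warmup=10, n_runs=100):
    """Mean per-call latency in milliseconds (wall clock)."""
    for _ in range(n_warmup):  # warm-up to exclude one-time setup costs
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs * 1e3

def measure_throughput(fn, batch_size, n_runs=100):
    """Items processed per second, when each call handles `batch_size` items."""
    latency_ms = measure_latency(fn, n_runs=n_runs)
    return batch_size * 1e3 / latency_ms

# toy "model": summing a list stands in for a forward pass
data = list(range(1000))
lat = measure_latency(lambda: sum(data))
thr = measure_throughput(lambda: sum(data), batch_size=1)
print(f"latency: {lat:.4f} ms, throughput: {thr:.1f} items/s")
```

Note how throughput is derived from latency *and* batch size: doubling the batch can raise throughput even when per-call latency grows, which is exactly why the two metrics do not translate into each other.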

Hardware Specific


Benchmarking FLOPs, MACs
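Before reaching for a profiler, MACs can be counted analytically from layer shapes (a sketch using the standard formulas: a Linear layer does one multiply-accumulate per weight, a Conv2d does `c_in * k_h * k_w` MACs per output pixel per output channel; by the common convention 1 MAC = 2 FLOPs, one multiply plus one add):

```python
def linear_macs(in_features, out_features):
    # one multiply-accumulate per weight, per output element
    return in_features * out_features

def conv2d_macs(c_in, c_out, k_h, k_w, h_out, w_out):
    # each output pixel of each output channel needs c_in * k_h * k_w MACs
    return c_out * h_out * w_out * c_in * k_h * k_w

def macs_to_flops(macs):
    # 1 MAC = 1 multiply + 1 add = 2 FLOPs (common convention)
    return 2 * macs

# e.g. a 3x3 conv, 64 -> 128 channels, on a 56x56 output feature map
macs = conv2d_macs(64, 128, 3, 3, 56, 56)
print(f"{macs:,} MACs ~= {macs_to_flops(macs):,} FLOPs")
```

Tools like the profilers below automate this per-op, but the hand formulas are useful for sanity-checking their output.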


PyTorch Profiler / Tensorboard Trace View

FSDL Course Material

Lab 5: Troubleshooting & Testing - The Full Stack

Other tools built upon profiler

Dynolog - Automated trace collection and analysis | PyTorch

Tutorials / Docs

Tensorboard Trace View

Analysing the Trace view https://wandb.ai/cfrye59/fsdl-text-recognizer-2022-training/artifacts/trace/trace-67j1qxws/latest/files/training_step.pt.trace.json

  • Many such lines (syncs between CPU and GPU) are not ideal! Would rather have the CPU run far ahead of the GPU in execution, so the CPU is simply waiting for the GPU to finish, fully maximising your GPU compute
  • e.g. this >
  • This pattern is preferred, since the GPU has high utilisation, with many kernels running while the CPU waits to sync! It shows that your bottleneck is the GPU: if the GPU were faster, the time spent waiting for the sync would be shorter
  • But then again, also question why your CPU is waiting to sync at all: can the CPU compute other things first, before the GPU is ready to sync?
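To get such a trace yourself, a minimal `torch.profiler` sketch (CPU-only here so it runs anywhere; on GPU you would add `ProfilerActivity.CUDA` to `activities`):

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(256, 256)
w = torch.randn(256, 256)

# record op-level timings; record_shapes attaches input shapes to each op
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    for _ in range(10):
        y = torch.mm(x, w)

# summary table; prof.export_chrome_trace("trace.json") writes a trace
# viewable in chrome://tracing or TensorBoard's trace view
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```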

This poor perf is actually in the Adam optimizer, which is not well optimised here: it causes many syncs between the CPU and GPU.

There has been work on a fused Adam optimizer in PyTorch that aims to make this better!

A fused=True option was recently added to the Adam optimizer in PyTorch v1.13 (torch GitHub issues 68041 and 85507)!
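Usage is a one-line change (sketch; `fused=True` fuses the per-parameter update loop into a single CUDA kernel, cutting kernel launches and CPU-GPU syncs, and is only supported for CUDA tensors, hence the CPU fallback here):

```python
import torch

model = torch.nn.Linear(16, 4)

# fused Adam needs params on a CUDA device; fall back to the default on CPU
use_fused = torch.cuda.is_available()
opt = torch.optim.Adam(model.parameters(), lr=1e-3, fused=use_fused)

loss = model(torch.randn(8, 16)).sum()
loss.backward()
opt.step()
```

Re-profiling after this change should show far fewer sync lines in the optimizer step of the trace.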


How to profile?

CUDA Profiling

Nsight Systems

Generic profiler?

ARM profiling

C++ Profiling