Optimising GPU code
Created: 29 Nov 2022, 01:57 PM
Tags: knowledge, GeneralDL
https://horace.io/brrr_intro.html
You can understand the efficiency of your deep learning system as consisting of 3 components:
- Compute: Time spent on your GPU computing actual floating point operations (FLOPS)
- GPUs are optimised for matrix multiplication
- If an op is not a matrix multiplication (e.g. normalisation / element-wise ops), it will be slow relative to peak FLOPS; see the sketch after this list
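A minimal PyTorch timing sketch (my own illustration, not from the linked post; it assumes a CUDA GPU and arbitrary tensor sizes). The matmul does roughly 2 * 4096^3 FLOPs per call while the ReLU does only one op per element, so comparing their times against their FLOP counts shows how much better matrix multiplication feeds the GPU:

```python
import time
import torch

assert torch.cuda.is_available()  # assumes a CUDA GPU; sizes are arbitrary

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

def bench(fn, iters=50):
    # Warm up so one-off CUDA init / kernel setup is not measured
    for _ in range(5):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()  # GPU work is async; wait before stopping the clock
    return (time.perf_counter() - start) / iters

matmul_t = bench(lambda: x @ y)             # ~2 * 4096^3 FLOPs per call, compute-bound
pointwise_t = bench(lambda: torch.relu(x))  # ~1 op per element, memory-bandwidth-bound

print(f"matmul:    {matmul_t * 1e3:.2f} ms")
print(f"pointwise: {pointwise_t * 1e3:.2f} ms")
```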
- Memory: Time spent transferring tensors within a GPU (see the transfer-timing sketch after this list)
- Moving the data from:
- CPU to GPU
- one node to another
- CUDA global memory to CUDA shared memory
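A rough sketch of the CPU-to-GPU transfer cost (again my own illustration; the tensor shape is arbitrary). Pinned (page-locked) host memory plus `non_blocking=True` is the standard PyTorch way to make this movement cheaper and overlappable with compute:

```python
import time
import torch

assert torch.cuda.is_available()  # assumes a CUDA GPU; tensor size is arbitrary

cpu_tensor = torch.randn(64, 3, 224, 224)   # pageable host memory
pinned_tensor = cpu_tensor.pin_memory()     # page-locked host memory

def time_transfer(t, non_blocking=False, iters=20):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        t.to("cuda", non_blocking=non_blocking)
    torch.cuda.synchronize()  # wait until all copies actually finish
    return (time.perf_counter() - start) / iters

print(f"pageable -> GPU: {time_transfer(cpu_tensor) * 1e3:.2f} ms")
print(f"pinned   -> GPU: {time_transfer(pinned_tensor, non_blocking=True) * 1e3:.2f} ms")
```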
- Overhead: Everything else
- Python and PyTorch overheads (e.g. op dispatch and kernel launches)
- Use the PyTorch profiler to see how CPU-side activity lines up with GPU kernel activity (see the sketch below)
- Or check the GPU-Util column in nvidia-smi (it is easy to misread as "Volatile GPU-Util": the "Volatile" in the header line above belongs to "Volatile Uncorr. ECC")
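A minimal torch.profiler sketch (the model and input are placeholders, not anything from the note): comparing per-op CPU time against CUDA time shows where the GPU sits idle waiting on Python/PyTorch overhead.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload; swap in your own model and batch (assumes a CUDA GPU)
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
).cuda()
inp = torch.randn(256, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(inp)

# Large CPU time with small CUDA time per op suggests framework overhead,
# i.e. the GPU is idle while Python/PyTorch dispatch the next kernel.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```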
