Optimising GPU code


Created: 29 Nov 2022, 01:57 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a") Tags: knowledge, GeneralDL


Source: https://horace.io/brrr_intro.html (Horace He, "Making Deep Learning Go Brrrr From First Principles")

You can understand the efficiency of your deep learning workload as consisting of three components:

  1. Compute: Time the GPU spends on actual floating-point operations (FLOPs)

    • Throughput is measured in FLOPS (floating-point operations per second)
    • GPUs are optimised for matrix multiplication
    • Anything that is not a matmul (e.g. normalisation, element-wise ops) runs at a small fraction of peak FLOPS; see the compute sketch after this list
  2. Memory: Time spent moving tensors around rather than computing on them (see the bandwidth sketch after this list)

    • Moving the data from:
      • CPU to GPU
      • one node to another
      • CUDA global memory to CUDA shared memory
  3. Overhead: Everything else, i.e. time when the GPU sits idle waiting for work

    • Python and PyTorch framework overheads (dispatch, kernel launches)
    • Use the PyTorch profiler to see how CPU and GPU activity line up; gaps in the GPU timeline while the CPU is busy mean you are overhead-bound (see the profiler sketch after this list)
    • Or check the GPU-Util column in nvidia-smi (the word "Volatile" printed above it in the header belongs to the "Volatile Uncorr. ECC" column, not to GPU-Util)
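
For the compute point, a minimal sketch (assuming PyTorch and a CUDA GPU; the matrix size is arbitrary) that times a matmul against an element-wise add and converts both to achieved FLOPS. The matmul should land far closer to the GPU's peak:

```python
import torch

def time_cuda_ms(fn, warmup=10, iters=100):
    """Average milliseconds per call, timed with CUDA events."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

n = 4096
a = torch.randn(n, n, device="cuda")
b = torch.randn(n, n, device="cuda")

ms_mm = time_cuda_ms(lambda: a @ b)    # ~2*n^3 FLOPs: compute-bound
ms_add = time_cuda_ms(lambda: a + b)   # ~n^2 FLOPs: memory-bound

print(f"matmul: {ms_mm:.3f} ms, {2 * n**3 / (ms_mm * 1e-3) / 1e12:.1f} TFLOPS")
print(f"add:    {ms_add:.3f} ms, {n**2 / (ms_add * 1e-3) / 1e12:.4f} TFLOPS")
```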
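For the memory point, a sketch (again assuming PyTorch and a CUDA GPU; tensor size chosen arbitrarily at ~256 MB) that measures host-to-device copy bandwidth, the first transfer in the list above, and shows the effect of pinned host memory:

```python
import time
import torch

x_cpu = torch.randn(1 << 26)      # 64M float32 values, ~256 MB
x_pinned = x_cpu.pin_memory()     # pinned (page-locked) host memory copies faster

for name, src in [("pageable", x_cpu), ("pinned", x_pinned)]:
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    dst = src.to("cuda")          # host-to-device copy
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    gb = src.numel() * src.element_size() / 1e9
    print(f"{name:8s} H2D: {gb / dt:.1f} GB/s")
```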
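For the overhead point, a profiler sketch (assuming PyTorch; the Linear model is just a hypothetical workload) that records CPU and CUDA activity together. If the table shows large CPU totals but small CUDA totals, or the exported trace shows GPU gaps while the CPU is busy, the run is overhead-bound:

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(x)

# Sort by GPU time to see which kernels dominate.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# prof.export_chrome_trace("trace.json")  # inspect CPU/GPU alignment in a trace viewer
```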