Optimising GPU code
Created: 29 Nov 2022, 01:57 PM
Tags: knowledge, GeneralDL
https://horace.io/brrr_intro.html
You can understand the efficiency of your deep learning system as consisting of 3 components:
- Compute: Time spent on your GPU computing actual floating point operations (FLOPS)
- GPUs are optimised for matrix multiplication
- If an op is not a matrix multiplication (e.g. normalisation / element-wise ops), it will be slow relative to peak FLOPS; see the sketch after this list
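A minimal PyTorch timing sketch (my own illustration, not from the linked post; it assumes a CUDA GPU and arbitrary tensor sizes). The matmul does roughly 2 * 4096^3 FLOPs per call while the ReLU does only one op per element, so comparing their times against their FLOP counts shows how much better matrix multiplication feeds the GPU:

```python
import time
import torch

assert torch.cuda.is_available()  # assumes a CUDA GPU; sizes are arbitrary

x = torch.randn(4096, 4096, device="cuda")
y = torch.randn(4096, 4096, device="cuda")

def bench(fn, iters=50):
    # Warm up so one-off CUDA init / kernel setup is not measured
    for _ in range(5):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()  # GPU work is async; wait before stopping the clock
    return (time.perf_counter() - start) / iters

matmul_t = bench(lambda: x @ y)             # ~2 * 4096^3 FLOPs per call, compute-bound
pointwise_t = bench(lambda: torch.relu(x))  # ~1 op per element, memory-bandwidth-bound

print(f"matmul:    {matmul_t * 1e3:.2f} ms")
print(f"pointwise: {pointwise_t * 1e3:.2f} ms")
```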
- Memory: Time spent transferring tensors within a GPU (see the transfer-timing sketch after this list)
- Moving the data from:
- CPU to GPU
- one node to another
- CUDA global memory to CUDA shared memory
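A rough sketch of the CPU-to-GPU transfer cost (again my own illustration; the tensor shape is arbitrary). Pinned (page-locked) host memory plus `non_blocking=True` is the standard PyTorch way to make this movement cheaper and overlappable with compute:

```python
import time
import torch

assert torch.cuda.is_available()  # assumes a CUDA GPU; tensor size is arbitrary

cpu_tensor = torch.randn(64, 3, 224, 224)   # pageable host memory
pinned_tensor = cpu_tensor.pin_memory()     # page-locked host memory

def time_transfer(t, non_blocking=False, iters=20):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        t.to("cuda", non_blocking=non_blocking)
    torch.cuda.synchronize()  # wait until all copies actually finish
    return (time.perf_counter() - start) / iters

print(f"pageable -> GPU: {time_transfer(cpu_tensor) * 1e3:.2f} ms")
print(f"pinned   -> GPU: {time_transfer(pinned_tensor, non_blocking=True) * 1e3:.2f} ms")
```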
- Overhead: Everything else
- Python and PyTorch overheads (e.g. op dispatch and kernel launches)
- Use the PyTorch profiler to see how CPU-side activity lines up with GPU kernel activity (see the sketch below)
- Or check the GPU-Util column in nvidia-smi (it is easy to misread as "Volatile GPU-Util": the "Volatile" in the header line above belongs to "Volatile Uncorr. ECC")
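A minimal torch.profiler sketch (the model and input are placeholders, not anything from the note): comparing per-op CPU time against CUDA time shows where the GPU sits idle waiting on Python/PyTorch overhead.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload; swap in your own model and batch (assumes a CUDA GPU)
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
).cuda()
inp = torch.randn(256, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        model(inp)

# Large CPU time with small CUDA time per op suggests framework overhead,
# i.e. the GPU is idle while Python/PyTorch dispatch the next kernel.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```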
