Navigating NVIDIA Nsight Systems for Efficient Profiling

omnivore ai-systems profiling

Read on Omnivore | Read Original

Highlights

Nsight systems profiling command is extremely simple:

nsys profile python [INPUT_FILENAME]

That’s it!

But don’t run this just yet, we have to add some flags for making the most out of our profiling:

  1. --trace=cuda,nvtx,osrt
    • Here’s what each option captures:
    • cuda: CUDA API Calls/CUDA kernels information
    • nvtx(NVIDIA Tools Extension): Custom annotations which show up in the visualizer
    • osrt(Operations Systems Runtime): Any communications with hardware, multi-threading synchronizations, and kernel schedulers
  2. -o [OUTPUT_FILENAME]
    • Custom output filename

Here’s the final template:

nsys profile --trace=cuda,nvtx,osrt -o [OUTPUT_FILENAME] python [INPUT_FILENAME] ⤴️

Nsight Systems Baseline Annotated

  • CUDA HW: Shows GPU Utilization Rate
  • python(Names will vary): Shows CPU Utilization Rate
  • OS runtime libraries: Especially useful for viewing thread activities(e.g. synchronizations, semaphores, etc)
  • CUDA API: Shows CUDA API Calls ⤴️

If you want to see detailed information about CUDA Kernel calls, click the CUDA HW dropdown and see the Kernels row. It contains information about grid sizes, block sizes, registers per thread, and such which is useful for calculating the occupancy of SMs.

Nsight Systems Kernel Details ⤴️

We need to put the parts in our interest within the loop of
with nvtx.annotate("YOUR_COMMENT_HERE" ⤴️

Nsight Systems with NVTX Annotations

We now see our annotations showing up in blue in the NVTX row. ⤴️

why the first pass of forward and backward calls were not utilizing the GPU and why they were taking so long. I thought it was a small mistake on my end or with the code.

When I went and did some digging, I realized this is a property of torch.compile! torch.compile will compile the model into optimized kernels as it executes, so naturally it will take much longer during the first run than the rest. ⤴️

torch.compile outputs triton kernels, which is why we see many triton kernel launches in our Nsight Systems too.

Triton Kernels in Nsight Systems ⤴️

After profiling, however, is where the real work starts. You should first identify which category of bottleneck your program is facing out of these:

1. GPU Compute Bound

2. CUDA API Bound

3. Synchronization Bound

4. CPU Compute Bound ⤴️