GPU falling off the bus


Created: 17 Nov 2022, 03:51 PM | Tags: knowledge, unix


When using googoo (the oldest server):

  • GPU 2 died while GPU 0 and GPU 1 were at maximum memory utilisation
  • Calling “nvidia-smi” does not respond; calling “nvidia-smi -L” shows:
  • To find the error I ran “dmesg”, scrolled down and saw:
  • Xid 79 (“GPU has fallen off the bus”) usually indicates overheating or an insufficient/flaky power supply

From <https://forums.developer.nvidia.com/t/unable-to-determine-the-device-handle-for-gpu-gpu-is-lost/57641>
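
Scrolling through “dmesg” by hand is easy to get wrong, so here is a rough sketch of how the Xid lines could be pulled out automatically. The function names and the regular expression are my own, and the line shape shown in the comment is an assumed, illustrative example of how the NVIDIA driver usually logs Xid events, not output copied from googoo.

```python
import re
import subprocess

# Assumed line shape (illustrative, not copied from the original machine):
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
# The "PCI:" prefix is optional because older drivers omit it.
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return (pci_address, xid_code) pairs for every Xid line in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


def read_kernel_log():
    """Return the output of `dmesg` as text (may need root on some systems)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=False
    )
    return result.stdout
```

Calling `find_xid_events(read_kernel_log())` on the affected server should list each logged Xid code with its PCI address, assuming the driver logs in the shape above.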

  • Shen mentioned that this is a known problem; the workaround is to use another, unused GPU
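
One possible way to follow that workaround is sketched below: pick a GPU that is not already in use and restrict the job to it. The helper names and the 100 MiB "idle" threshold are hypothetical choices of mine. The input is assumed to be the text printed by `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`, collected while “nvidia-smi” still responds.

```python
import os


def pick_idle_gpus(csv_text, max_used_mib=100):
    """Return indices of GPUs using at most max_used_mib MiB of memory.

    csv_text is expected to hold one "index, memory_used" pair per line.
    The 100 MiB default is an arbitrary assumption, not a documented limit.
    """
    idle = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue  # skip lines that do not look like "index, used"
        try:
            index, used = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # skip values that are not plain integers
        if used <= max_used_mib:
            idle.append(index)
    return idle


def restrict_to_gpu(index):
    """Expose only one GPU to CUDA programs started from this process."""
    # Must be set before the CUDA runtime (or a framework using it) starts.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
```

For example, `restrict_to_gpu(pick_idle_gpus(text)[0])` would pin the job to the first idle GPU, provided at least one is idle.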