GPU falling off the bus
Created: 17 Nov 2022, 03:51 PM | Modified: =dateformat(this.file.mtime,"dd MMM yyyy, hh:mm a")
Tags: knowledge, unix
Observed on googoo (the oldest server):
- GPU 2 died while GPU 0 and GPU 1 were at maximum memory utilisation
- Calling “nvidia-smi” hangs without responding; calling “nvidia-smi -L” shows:

- To find the error I ran “dmesg”, scrolled down and saw:

- Xid 79 means the GPU has fallen off the bus; it typically indicates overheating or an insufficient/flaky power supply
- Shen mentioned that this is a known problem; the workaround is to use another, unused GPU
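
To avoid scrolling through “dmesg” by hand, the check above can be sketched as a small shell helper. This is only a rough sketch: the function name xid_codes is made up for this note, and it assumes the driver logs lines containing "Xid", a PCI address in parentheses, and then the numeric code before the first comma. The GPU index and script name in the last comment are placeholders, not values from googoo.

```shell
# Hypothetical helper: print the distinct Xid codes found in kernel log text on stdin.
# Assumes lines of the form "... Xid (<pci address>): <code>, ..." (an assumption).
xid_codes() {
    grep -Eo 'Xid[^,]*: *[0-9]+' | grep -Eo '[0-9]+$' | sort -un
}

# Usage (dmesg may need sudo on some systems):
#   dmesg | xid_codes
#
# To follow the workaround and run on a different, unused GPU,
# restrict the visible devices (index 3 and train.py are placeholders):
#   CUDA_VISIBLE_DEVICES=3 python train.py
```

If the helper prints 79, that matches the error described in this note; an empty result means no Xid lines were found in the current kernel log.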
