How we fixed CUDA Error 101: invalid device ordinal ... torch._C._cuda_getDeviceCount() > 0 🤯
In an AI developer's life, few things are more unsettling than strange CUDA errors (except maybe unsolicited Windows updates 🔥). But I digress. Our story begins on an otherwise ordinary day, when our trusty GPU server, a behemoth built to house 8 GPUs, decided to throw us a curveball.
A routine check with nvidia-smi revealed a peculiar anomaly: one of our GPUs had decided to play hide and seek. The output showed a lineup of only 7 attendees, with the 8th conspicuously absent 🏖️.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:05:00.0 Off |                  Off |
| N/A   33C    P0              50W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P40                      Off | 00000000:08:00.0 Off |                  Off |
| N/A   40C    P0              52W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P40                      Off | 00000000:09:00.0 Off |                  Off |
| N/A   39C    P0              51W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P40                      Off | 00000000:84:00.0 Off |                  Off |
| N/A   35C    P0              52W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla P40                      Off | 00000000:85:00.0 Off |                  Off |
| N/A   36C    P0              50W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla P40                      Off | 00000000:88:00.0 Off |                  Off |
| N/A   34C    P0              50W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla P40                      Off | 00000000:89:00.0 Off |                  Off |
| N/A   33C    P0              49W / 250W |      0MiB / 24576MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
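To pin down which card had gone missing, it helps to compare what the PCI bus reports with what the driver actually brought up. Here is a minimal sketch of that cross-check, assuming the usual lspci and nvidia-smi tools (10de is NVIDIA's PCI vendor ID; on this box the Teslas show up as "3D controller" devices):

# NVIDIA devices physically present on the PCI bus (the kernel's view)
lspci -d 10de:
# GPUs the NVIDIA driver actually initialized, with their bus addresses
nvidia-smi --query-gpu=index,name,pci.bus_id --format=csv,noheader

Any bus address that appears in the first list but not in the second belongs to the card the driver could not initialize.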
Our journey into the rabbit hole deepened when we discovered that despite the presence of these 7 GPUs, CUDA was throwing a tantrum, refusing to acknowledge any of them 😡:
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/home/joy/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
return torch._C._cuda_getDeviceCount() > 0
False
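Before suspecting hardware, it is worth ruling out the usual software culprit: a stale CUDA_VISIBLE_DEVICES value pointing at a device index that no longer exists will also break CUDA initialization. A quick sanity check (a sketch, not part of our original session):

# Show whether CUDA_VISIBLE_DEVICES is set in the current shell
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES-<unset>}"
# Retry the PyTorch check with the variable explicitly unset
env -u CUDA_VISIBLE_DEVICES python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"

If the error persists with the variable unset, the problem sits below PyTorch, in the driver or the hardware itself.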
Digging through kernel messages with sudo dmesg, we stumbled upon a clue: a power connector on one of the GPUs was loose 🔌.
[ 5362.995401] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[ 5362.995812] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1435)
[ 5362.995852] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 5368.246428] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[ 5368.246782] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1435)
[ 5368.246833] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 5368.603123] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[ 5368.603470] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1435)
[ 5368.603520] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 5373.879295] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[ 5373.879683] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1435)
[ 5373.879726] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
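The relevant lines are easy to miss in a full kernel log, so filtering on the driver's NVRM prefix gets straight to them. A small helper we could have used, assuming a util-linux dmesg:

# Show only the NVIDIA driver's kernel messages
sudo dmesg | grep -i nvrm
# Or keep watching for new ones while reseating cables or reloading the driver
sudo dmesg --follow | grep -i nvrm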
So, to use the rest of the GPUs, we somehow needed to disable this one device. Our first attempt was to tell the NVIDIA driver to claim only the healthy GPUs. With sudo nano /etc/modprobe.d/nvidia.conf we added the line
options nvidia NVreg_AssignGpus="pci:0000:05:00.0,pci:0000:08:00.0,pci:0000:09:00.0,pci:0000:84:00.0,pci:0000:85:00.0,pci:0000:88:00.0,pci:0000:89:00.0"
then ran sudo update-initramfs -u and rebooted the machine. But it did not work.
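When a modprobe-level tweak like this silently fails to take effect, it is worth confirming that the option ever reached the driver. A couple of checks we could have run after the reboot (a sketch for an Ubuntu-style initramfs-tools setup; registry key names vary between driver versions):

# Was our config file actually baked into the new initramfs?
lsinitramfs /boot/initrd.img-$(uname -r) | grep nvidia.conf
# Did the loaded driver pick up the AssignGpus registry key?
grep -i assign /proc/driver/nvidia/params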
Another option was to keep the NVIDIA driver away from the GPU at boot time via GRUB: edit /etc/default/grub with sudo vim /etc/default/grub, add
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci_stub.ids=10de:1b30"
and then run sudo update-grub && sudo reboot. But this seemed too complicated (and dangerous), and since pci_stub.ids matches every device with that vendor:device ID rather than a specific slot, it would most likely have disabled all of our identical GPUs, not just the faulty one 🧨.
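To see exactly what such a pci_stub.ids filter would catch, we could have listed the NVIDIA devices together with their numeric IDs (a sketch, using the ID from the GRUB line above):

# List NVIDIA devices with their [vendor:device] IDs
lspci -nn -d 10de:
# Count how many devices share that ID; all of them would be stubbed out
lspci -n -d 10de:1b30 | wc -l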
But there was a much easier and faster way that required no reboot at all. Thanks to ChatGPT, we simply unbound the faulty device from the NVIDIA driver at runtime:
# Inspect the faulty device's sysfs entry
cd /sys/bus/pci/devices/0000:04:00.0/
# Confirm which driver currently owns it (points at .../drivers/nvidia)
readlink driver
# Unbind the device from the nvidia driver
echo -n "0000:04:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/unbind
Here 0000:04:00.0 is the PCI address of the faulty GPU, which can be found with lspci -k or in the sudo dmesg output above.
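To double-check that the unbind took effect, and to undo it later once the power cable has been reseated, something along these lines should do (a sketch; the bind path mirrors the unbind path used above):

# The driver symlink should now be gone for the unbound device
readlink /sys/bus/pci/devices/0000:04:00.0/driver || echo "not bound to any driver"
# nvidia-smi should keep reporting the 7 healthy GPUs without errors
nvidia-smi -L
# Later, after fixing the power cable, hand the device back to the driver
echo -n "0000:04:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/bind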
And that's it, everything was working ✅!
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
Our workaround did the trick, and though we'll still need to physically reseat that loose power connector (the sysfs unbind is not persistent, so a reboot would bring the broken device back), the server is operational once again. This experience highlights how, sometimes, the solution doesn't require massive changes, just the right command at the right time.
If you're navigating your own GPU issues, remember that the answer often lies in a simple, direct approach, aided perhaps by a nudge in the right direction from AI or a timely piece of advice.
I appreciate you reading through this journey. May your computing be seamless and your hardware cooperative 🖖.