I was running a slightly elaborate thing on a cloud instance, and it looked like the GPU disappeared between commands.
Training a model ran smoothly for a few hours, but then, after it saved the model and I ran convert, I got the "No GPU detected" error:
Code:
An unhandled exception occured loading pynvml. Original error: RM has detected an NVML/RM version mismatch.
No GPU detected. Switching to CPU mode
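In case it helps anyone diagnose the same thing, this is roughly how I'd check whether the loaded kernel module and the userspace NVIDIA libraries actually disagree (assuming a Debian/Ubuntu-based image; package names and log paths may differ):

Code:
# Version of the kernel module that is currently loaded
cat /proc/driver/nvidia/version

# Version of the userspace driver/libraries; this prints
# "Driver/library version mismatch" if the two disagree
nvidia-smi

# Installed NVIDIA driver packages
dpkg -l | grep -i nvidia

# Recent package activity -- did something upgrade the driver mid-run?
grep -i nvidia /var/log/dpkg.log
If the versions really do differ, a reboot should reload the matching kernel module and clear the error, but that still wouldn't explain why it keeps happening.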
This is on a non-preemptible instance, so I assume the GPU isn't getting pulled out from under me.
I've seen this before, where the error shows up as soon as I run my first command, and I assumed it was a GCE bug where the GPU wasn't properly attached at setup... But lately I've been seeing it happen between commands (e.g. a successful train, then convert falls back to CPU).
I've never seen it crash (or slow down significantly) midway through a round of training, which leads me to believe it's a software problem, though I'm not sure which piece of software. Is something (e.g. an automatic system update) deleting or modifying the NVIDIA drivers and causing the driver version mismatch? If that's the problem, maybe I just need to refresh my image and/or configure it to disable all system updates.
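If it does turn out to be automatic upgrades swapping the driver out from under the loaded kernel module, something along these lines should pin things in place (again assuming a Debian/Ubuntu image with unattended-upgrades; adjust for whatever the image actually runs):

Code:
# Hold every installed NVIDIA package so apt won't upgrade it
dpkg -l | awk '/^ii.*nvidia/ {print $2}' | xargs sudo apt-mark hold

# And/or turn off unattended upgrades entirely
sudo systemctl disable --now unattended-upgrades
That's just a guess at a mitigation, though; I'd still like to know why the driver is changing in the first place.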
Has anyone else encountered this?