Description
Discussed in #1190
Originally posted by Biotot December 19, 2023
I've been banging my head against this for a couple of days and I'm still coming up empty.
I have multiple modules and multiple GPUs; however, the sequence below consistently fails. I've narrowed it down to a problem with the model itself: if I load it fresh from a file on each loop, the error goes away.
(Pseudocode)
ModuleA.to(torch.device("cuda:0"));
TrainLoop();  // works
ModuleA.to(torch.CPU);
ModuleA.to(torch.device("cuda:1"));
TrainLoop();  // fails
Exception: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1
The error consistently occurs in the loss's output.backward() call, if that helps.
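For reference, here is a minimal, self-contained sketch of the failing pattern. The Linear model, tensor shapes, learning rate, and dummy data are illustrative assumptions, not the actual code:

```csharp
using TorchSharp;
using static TorchSharp.torch;

// Illustrative stand-in for ModuleA; the real model is larger.
var moduleA = nn.Linear(10, 1);
var mse = nn.MSELoss();

foreach (var deviceName in new[] { "cuda:0", "cuda:1" })
{
    var dev = torch.device(deviceName);
    moduleA.to(dev);
    var optimizer = torch.optim.SGD(moduleA.parameters(), 0.01);

    // Dummy batch created directly on the target device.
    var input = torch.randn(8, 10, device: dev);
    var target = torch.randn(8, 1, device: dev);

    optimizer.zero_grad();
    var output = mse.forward(moduleA.forward(input), target);
    output.backward();   // reported to throw on the second (cuda:1) pass
    optimizer.step();

    moduleA.to(torch.CPU);   // move back to CPU before switching GPUs
}
```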
This error doesn't happen if I load the module from a file on each loop. The input data is not the issue; the model isn't correctly switching devices. I've tried many different combinations of code, including moving directly from cuda:0 to cuda:1, without luck.
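A sketch of the file round-trip workaround mentioned above, assuming the module is rebuilt with the same architecture (the file path is illustrative):

```csharp
// Hypothetical workaround: serialize the module, then load it into a fresh
// instance before switching devices ("model.bin" is an illustrative path).
moduleA.save("model.bin");
var fresh = nn.Linear(10, 1);    // rebuild with the same architecture
fresh.load("model.bin");         // Module.load reads TorchSharp's own save format
fresh.to(torch.device("cuda:1"));
```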
I'm not sure what is going wrong. I've been porting my code over from PyTorch and trying to get past this hurdle. Any help would be appreciated.
Running on TorchSharp-cuda-windows 0.101.4