RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 1) #28293
Comments
Hi @ryh95, thank you for opening the issue. This seems to be an issue with MAGMA, possibly relating to the correctness of |
It seems that
So, why does this relate to the correctness of |
Is there a temporary solution/work-around to this problem? |
I just move the tensor to CPU then use Someone has mentioned that |
@ryh95 I also want to back propagate the gradients through this computation. Converting it to cpu might break the computation graph (though I am not sure about this). |
@Hippogriff Did you manage to work around the issue? I am also running into the same issue and was thinking of computing on the cpu and moving it back to the gpu. |
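On the concern about the computation graph: device transfers such as `.cpu()` and `.to(device)` are differentiable operations in autograd, so a CPU round trip should not detach the result. A minimal sketch follows, assuming a PyTorch version that provides `torch.linalg.svd`; the helper name `svd_on_cpu` is hypothetical and not part of PyTorch.

```python
import torch


def svd_on_cpu(a: torch.Tensor):
    """Hypothetical helper: compute the SVD on the CPU and return the
    factors on the device of the input tensor."""
    # .cpu() and .to() are recorded by autograd, so gradients flow back to `a`.
    u, s, vh = torch.linalg.svd(a.cpu(), full_matrices=False)
    return u.to(a.device), s.to(a.device), vh.to(a.device)


# Falls back to the CPU when no GPU is present, so the sketch runs anywhere.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)
a = torch.randn(6, 4, device=device, requires_grad=True)

u, s, vh = svd_on_cpu(a)
s.sum().backward()  # differentiate the sum of singular values w.r.t. a
```

After `backward()`, `a.grad` is populated on the original device. The cost is the host/device copy on every call, which may matter inside a training loop.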
@Nav94 This can happen either because the matrix is severely ill-conditioned or because the singular values are very close or equal to each other.
`
` |
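The two conditions named above (severe ill-conditioning, and repeated or nearly repeated singular values) can be checked on the CPU before deciding on a workaround. This is only a diagnostic sketch and not the workaround code from the comment above. It assumes a 2-D input and a PyTorch version that provides `torch.linalg.svdvals`; the function name `svd_diagnostics` is hypothetical.

```python
import torch


def svd_diagnostics(a: torch.Tensor):
    """Hypothetical helper for a 2-D matrix: returns (condition number,
    smallest gap between adjacent singular values relative to the largest)."""
    # Compute in float64 on the CPU; svdvals returns values in descending order.
    s = torch.linalg.svdvals(a.detach().double().cpu())
    if s.numel() == 0 or s[0] == 0:
        return float("inf"), 0.0
    cond = (s[0] / s[-1]).item() if s[-1] > 0 else float("inf")
    if s.numel() > 1:
        min_rel_gap = ((s[:-1] - s[1:]).min() / s[0]).item()
    else:
        min_rel_gap = float("inf")
    return cond, min_rel_gap


well = torch.diag(torch.tensor([3.0, 2.0, 1.0]))  # distinct singular values
degenerate = torch.eye(3)                         # all singular values equal

cond_well, gap_well = svd_diagnostics(well)
cond_deg, gap_deg = svd_diagnostics(degenerate)
```

A very large condition number or a relative gap near zero suggests that the matrix falls into one of the two cases described above.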
Thanks for your solution! @Hippogriff |
Yes, as far as I know. |
@Hippogriff |
@Abdelpakey I am not keeping track of this yet. There are several TODOs; for example, this code doesn't compute the gradient with respect to the right and left eigenvectors. I will update this repo when I am done: |
I am using this, since it is not solved
|
@chenhao1umbc Thanks for that snippet! Is there a particular reason why you multiply with |
Yes, the main idea is that the convergence issue can be solved by adding a small perturbation. But the scale of L is unknown: L could be at the 1e4 scale or the 1e-4 scale. This means we cannot simply add a fixed "small" random number; the noise has to be small relative to the magnitude of L. |
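The scaling argument above can be illustrated with a short sketch. It uses `L.abs().mean()` as the scale, which is an assumption on my part: it avoids a near-zero or negative scale for matrices with mixed signs. The helper name `perturb_relative` and the default `rel=1e-4` are illustrative only.

```python
import torch


def perturb_relative(L: torch.Tensor, rel: float = 1e-4) -> torch.Tensor:
    """Hypothetical helper: add uniform noise scaled to the magnitude of L."""
    scale = L.abs().mean()  # assumption: mean absolute value as the scale
    return L + rel * scale * torch.rand_like(L)


torch.manual_seed(0)
base = torch.randn(5, 5, dtype=torch.float64)

# Relative size of the perturbation for the same matrix at two scales.
ratios = {}
for factor in (1e4, 1e-4):
    L = factor * base
    ratios[factor] = ((perturb_relative(L) - L).norm() / L.norm()).item()
```

Both entries of `ratios` stay below `rel`, whatever the scale of `L`. A fixed additive noise of, say, 1e-6 would be negligible at the 1e4 scale and would dominate the matrix at the 1e-8 scale.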
This affects torch.pinverse also, due to its underlying svd. |
For pytorch 1.8.1+cu101 the output of:
is:
The error message is unexpected (and misleading) and comes from a special pytorch/aten/src/ATen/native/LinearAlgebraUtils.h Lines 242 to 254 in ac67cda
If you look above and below e.g. batchCheckErrors(std::vector<int64_t>& infos, ...) and void singleCheckErrors() both have the case:
Earlier versions of PyTorch raised, for the same code, "RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 23)". I guess for CUDA tensors pytorch/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.cu Lines 456 to 490 in 3f0b081
btw if somebody wonders why the output is |
Thank you, @andreaskoepf, for reporting the issue of incorrect error messages! We will have it fixed in a future PyTorch release. Unfortunately, the bugfixes will not be backported to older versions. |
…ance (#64533) Summary: Fix #64237 Fix #28293 Fix #4689 See also #47953 cc ngimel jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano Pull Request resolved: #64533 Reviewed By: albanD Differential Revision: D31915794 Pulled By: ngimel fbshipit-source-id: 29ea48696531ced8a48474e891a9e2d5f11e9d7a
What is h in |
To add some noise you could use For support & coding questions please refer to the PyTorch Forums & see the docs, e.g. torch.rand. |
Thank you |
I had some luck with:

    import torch
    from tenacity import retry, stop_after_attempt

    @retry(stop=stop_after_attempt(32))  # Or some other value
    def func_with_svd(L: torch.Tensor):
        try:
            u, s, v = torch.svd(L)
        except RuntimeError:  # torch.svd may have convergence issues on both GPU and CPU.
            # Fall back to an SVD of L plus noise scaled to the magnitude of L.
            u, s, v = torch.svd(L + 1e-4 * L.mean() * torch.rand_like(L))
        ...
|
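For readers who prefer not to depend on `tenacity`, the same retry-with-perturbation idea can be sketched with a plain loop. This is a sketch under stated assumptions, not a definitive implementation: the function name `svd_with_retries`, the `svd_fn` injection point, the default values, and the use of `L.abs().mean()` as the noise scale are all my own choices. It also uses `torch.linalg.svd`, which returns `Vh` rather than `V`.

```python
import torch


def svd_with_retries(L, attempts=5, rel=1e-4, svd_fn=None):
    """Hypothetical helper: try the SVD, and on a RuntimeError retry on a
    freshly perturbed copy of L, up to `attempts` tries in total."""
    if svd_fn is None:
        svd_fn = lambda m: torch.linalg.svd(m, full_matrices=False)
    candidate = L
    last_err = None
    for _ in range(attempts):
        try:
            return svd_fn(candidate)
        except RuntimeError as err:  # convergence failures surface as RuntimeError
            last_err = err
            # Always perturb the original L so the noise does not accumulate.
            candidate = L + rel * L.abs().mean() * torch.rand_like(L)
    raise RuntimeError(f"SVD failed after {attempts} attempts") from last_err


torch.manual_seed(0)
L = torch.randn(4, 3)
u, s, vh = svd_with_retries(L)
```

Note that when a retry succeeds, the returned factors belong to the perturbed matrix rather than to `L` itself, so the result is approximate to roughly the order of `rel`.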
Hi,
When I run torch.svd() on a matrix on the GPU, it raises the error
RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 1)
However, torch.svd() has no problem with the same matrix on the CPU.
Can someone help me figure out the reason?
Thank you!
The matrix is attached
runtime_error_W.zip
cc @vishwakftw @ssnl @jianyuh