linalg matrix rank generates exception when calculating rank of a singular matrix #62166
Comments
Thanks for the suggestion @rustrust. Our linear algebra team will take a look |
I was not able to reproduce from the Python API. The following code works as expected:
import torch
a = torch.eye(1024, 1024, device=torch.device("cuda")).reshape(1, 1, 1024, 1024)
a[..., -1, -1] = 0
print(torch.linalg.matrix_rank(a))
Could you provide a small example in pure PyTorch, Python or C++, that reproduces this issue? I see that the error accesses the element |
Thanks for catching the out-of-bounds access, I didn't notice that. I managed to create a fairly minimal repro based on the tch-rs mnist_conv.rs example. I have not ported the repro to C++ or python so at the moment I am unsure of where the bug lives. You can read more about this, including the repro, in the tch-rs bug report. |
If you could provide a self-contained valid snippet that would be very helpful, @rustrust |
A follow-up on the script by @lezcano: I cannot reproduce it either.
In [1]: import torch
In [3]: a = torch.rand(1024, 1024).reshape(1, 1, 1024, 1024).cuda()
In [4]: a[..., -1, :] = 0
In [5]: a[..., :, -1] = 0
In [6]: a[0, :, -1]
Out[6]: tensor([[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')
In [7]: torch.linalg.matrix_rank(a)
Out[7]: tensor([[1023]], device='cuda:0') |
@mruberry, @lezcano, but:
In [25]: a = torch.rand(1, 1, 1024, 1024).cuda()
In [26]: a[..., -1, :] = 0
In [27]: a[..., :, -1] = 0
In [28]: torch.linalg.matrix_rank(a)
Out[28]: tensor([[1019]], device='cuda:0')
In [29]: a.svd()[1].abs().topk(k=5, largest=False)
Out[29]:
torch.return_types.topk(
values=tensor([[[0.0000, 0.0011, 0.0157, 0.0202, 0.0401]]], device='cuda:0'),
indices=tensor([[[1023, 1022, 1021, 1020, 1019]]], device='cuda:0'))
There is no way the rank is anything other than 1023, as the spectrum shows. |
It turns out that this is related to #54151, as torch.linalg.svdvals(a)[..., 0] * 1024 * torch.finfo(torch.float32).eps for a matrix like the one you sampled returns
I'd suggest posting these findings there; they'll be fixed whenever @IvanYashchuk resolves that issue. Thank you for finding this. It shows that the defaults we currently have are certainly no good. |
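For readers following along, here is a minimal sketch of how a tolerance of the form quoted above (largest singular value times the matrix dimension times float32 eps) turns small but nonzero singular values into a lower reported rank. It is plain Python, not the PyTorch implementation. The helper names `default_tolerance` and `numerical_rank` are hypothetical, and the largest singular value of 512.0 is an assumed, illustrative figure that does not appear in the thread. The five smallest singular values are the ones printed in the comment above.

```python
# Machine epsilon for float32 (2**-23), the value of torch.finfo(torch.float32).eps.
FLOAT32_EPS = 2.0 ** -23


def default_tolerance(sigma_max, m, n, eps=FLOAT32_EPS):
    """Tolerance of the form sigma_max * max(m, n) * eps (hypothetical helper)."""
    return sigma_max * max(m, n) * eps


def numerical_rank(singular_values, m, n, eps=FLOAT32_EPS):
    """Count the singular values strictly greater than the tolerance."""
    if not singular_values:
        return 0
    tol = default_tolerance(max(singular_values), m, n, eps)
    return sum(1 for s in singular_values if s > tol)


# The five smallest singular values printed earlier in the thread.
smallest = [0.0000, 0.0011, 0.0157, 0.0202, 0.0401]

# ASSUMPTION: 512.0 is an illustrative largest singular value, not a measured one.
tol = default_tolerance(512.0, 1024, 1024)
below = [s for s in smallest if s <= tol]

print(tol, len(below))  # prints: 0.0625 5
```

Under that assumed largest singular value, all five listed values fall below the tolerance, which would be consistent with a reported rank of 1024 - 5 = 1019 even though only one singular value is exactly zero.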
I am having problems with svd in addition to matrix rank. svd is also giving trouble here in that it is trying to hand me back a U which is 1025 by 1025 when I hand it a matrix of dimension 1024 by 1024. At the moment it's unclear to me whether this is an issue with pytorch or with tch-rs. In tch-rs this line gives the problem:
Please see further discussion here: LaurentMazare/tch-rs#393 (comment)
It should be trivial to get tch-rs working well enough to reproduce from the above comment. All you need is libtorch and the path variables configured; cargo will handle all of the other dependencies. Copy the tch-rs/examples/mnist/mnist_conv.rs example to tch-rs/examples/mnist/mnist_conv_svdbug.rs, apply the three simple edits in the above GitHub post, add an entry to the main file tch-rs/examples/mnist/main.rs, and run cargo run --example mnist conv_svdbug. The entry in main.rs should go below line 26 and read like this:
|
I'm sorry for the confusion caused by the following error.
The error message is just wrong. The "U" in the error message has nothing to do with the "U" that is the first tensor returned by svd. It should say that the algorithm didn't converge. It will be fixed when #63220 is resolved. PyTorch uses a faster divide-and-conquer algorithm for SVD (gesdd in LAPACK) or one based on Jacobi iterations (gesvdj in cuSOLVER), but both are less robust than the one based on QR iterations (gesvd in LAPACK). So it's expected that SVD sometimes fails to compute the decomposition. In the future, we might add a fallback to a more robust algorithm when the faster one fails. Could you please test one thing: if you pass a diagonal matrix ( |
I added the below lines which allocate an identity matrix of 1024 by 1024 above the match on svd statement for what I describe as mnist_conv_svdbug.rs and was not able to ever observe a failure of svd across several trials. However, during those trials, the fc1 svd broke.
|
This should be fixed in master, as we now fall back to a different algorithm once the main algorithm has convergence issues. Please feel free to comment and I'll reopen the issue if this is not the case. |
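The fallback idea can be sketched generically as below. This is only an illustration of the pattern, not the code that landed in PyTorch. `svd_with_fallback`, `flaky_fast_solver` and `toy_robust_solver` are hypothetical names, and the toy "robust" solver only handles diagonal matrices, whose singular values are the absolute values of the diagonal entries.

```python
def svd_with_fallback(matrix, fast_solver, robust_solver):
    """Try fast_solver first; if it raises RuntimeError, retry with robust_solver.

    Returns the solver result together with a label saying which path was taken.
    """
    try:
        return fast_solver(matrix), "fast"
    except RuntimeError:
        return robust_solver(matrix), "robust"


def flaky_fast_solver(matrix):
    # Hypothetical stand-in for a fast solver that fails to converge.
    raise RuntimeError("algorithm failed to converge")


def toy_robust_solver(matrix):
    # Hypothetical stand-in, valid only for diagonal matrices: the singular
    # values are the absolute diagonal entries, sorted in descending order.
    return sorted((abs(matrix[i][i]) for i in range(len(matrix))), reverse=True)


diag = [[2.0, 0.0], [0.0, -3.0]]
values, path = svd_with_fallback(diag, flaky_fast_solver, toy_robust_solver)
print(values, path)  # prints: [3.0, 2.0] robust
```

In real code the two callables would be two SVD backends of differing speed and robustness. The sketch catches only RuntimeError so that unrelated errors still surface.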
🐛 Bug
I'm using libtorch from tch-rs to interact with pytorch because writing performant and correct code in python is too difficult.
The problem I am encountering is that when I try to calculate the rank of a singular matrix, I never get a valid numeric result but always trigger a C++ exception. Given that the purpose of calculating the matrix rank is to provide a numeric value indicating the rank (rather than an exception), I don't think this makes sense as normal api behavior.
The api author of tch-rs reports that he is simply providing a thin wrapper over the libtorch C++ api and so the root of this issue is actually an exception in pytorch being triggered by the wrapper function.
LaurentMazare/tch-rs#393
Exception occurs here:
pytorch/aten/src/ATen/native/cuda/BatchLinearAlgebraLib.cu
Line 508 in 1ce3281
To Reproduce
Steps to reproduce the behavior:
Example error message:
Here is a snippet of the error message:
Environment
My libtorch version is 1.9.0+cu111 and I am using rustc 1.53.0 . Note that I am performing many other CUDA operations with libtorch and tch-rs just fine. I am pretty sure this is (so far) an intentionally generated exception.
Expected behavior
I expect that this function should return a numeric value for a matrix not of full rank rather than generate an exception.
Installed via (conda, pip, source): libtorch download and unzip
cc @yf225 @glaringlee @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @IvanYashchuk @xwang233 @lezcano