linalg matrix rank generates exception when calculating rank of a singular matrix · Issue #62166 · pytorch/pytorch · GitHub

Closed
rustrust opened this issue Jul 25, 2021 · 11 comments
Labels
  • feature: A request for a proper, new feature.
  • module: cpp: Related to the C++ API.
  • module: linear algebra: Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply (matmul).
  • needs reproduction: Someone else needs to try reproducing the issue given the instructions; no action needed from the user.
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.

Comments

@rustrust
rustrust commented Jul 25, 2021

🐛 Bug

I'm using libtorch from tch-rs to interact with PyTorch, because writing performant and correct code in Python is too difficult.

The problem I am encountering is that when I try to calculate the rank of a singular matrix, I never get a valid numeric result but always trigger a C++ exception. Given that the purpose of calculating the matrix rank is to provide a numeric value indicating the rank (rather than an exception), I don't think this makes sense as normal API behavior.

The author of tch-rs reports that it is simply a thin wrapper over the libtorch C++ API, so the root of this issue is an exception in PyTorch being triggered through the wrapper function.

LaurentMazare/tch-rs#393

Exception occurs here:

batchCheckErrors(infos, "svd_cuda");

To Reproduce

Steps to reproduce the behavior:

  1. Use tch-rs to generate a singular Tensor (for instance, generate an identity matrix and then zero out a row).
  2. Calculate the matrix rank. For example, all of the below will generate this exception:
my_tensor.reshape(&[1, 1, 1024, 1024]).linalg_matrix_rank(None, false);
my_tensor.reshape(&[1, 1, 1024, 1024]).f_linalg_matrix_rank(None, false);
my_tensor.reshape(&[1, 1, 1024, 1024]).matrix_rank(false);
my_tensor.reshape(&[1, 1, 1024, 1024]).f_matrix_rank(false);
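For reference, the expected behavior (a numeric rank rather than an exception) can be sketched in Python with NumPy, used here only as a stand-in since the reproduction above is tch-rs-specific:

```python
import numpy as np

# Build a singular matrix as in step 1 above: an identity
# matrix with its last row zeroed out.
a = np.eye(1024)
a[-1, :] = 0.0

# Computing the rank returns a number instead of raising an
# exception: one singular value is zero, so the rank is 1023.
print(np.linalg.matrix_rank(a))  # 1023
```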

Here is a snippet of the error message:

svd_cuda: For batch 0: U(1025,1025) is zero, singular U.
Exception raised from batchCheckErrors at /pytorch/aten/src/ATen/native/LinearAlgebraUtils.h:251 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f6259f4f1d9 in ...

Environment

My libtorch version is 1.9.0+cu111 and I am using rustc 1.53.0. Note that I am performing many other CUDA operations with libtorch and tch-rs just fine. I am pretty sure this is (so far) an intentionally generated exception.

Expected behavior

I expect that this function should return a numeric value for a matrix not of full rank rather than generate an exception.

  • PyTorch Version (e.g., 1.0): libtorch 1.9.0+cu111
  • OS (e.g., Linux): Ubuntu Linux 20.04 LTS
  • How you installed PyTorch (conda, pip, source): libtorch download and unzip
  • CUDA/cuDNN version: cuda_11.4.r11.4/compiler.30033411_0
  • GPU models and configuration: NVIDIA GPU

cc @yf225 @glaringlee @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @IvanYashchuk @xwang233 @lezcano

@mruberry added the labels module: cpp, module: linear algebra, feature, and triaged on Jul 27, 2021
@mruberry
Collaborator

Thanks for the suggestion @rustrust. Our linear algebra team will take a look.

@lezcano
Collaborator
lezcano commented Jul 29, 2021

I was not able to reproduce from the Python API. The following code works as expected:

import torch
a = torch.eye(1024, 1024, device=torch.device("cuda")).reshape(1, 1, 1024, 1024)
a[..., -1, -1] = 0
print(torch.linalg.matrix_rank(a))

Could you provide a small example in pure PyTorch, Python or C++, that reproduces this issue?

I see that the error accesses element 1025, which, a priori, should be out of bounds.

@rustrust
Author
rustrust commented Aug 5, 2021

Thanks for catching the out-of-bounds access, I didn't notice that. I managed to create a fairly minimal repro based on the tch-rs mnist_conv.rs example. I have not ported the repro to C++ or python so at the moment I am unsure of where the bug lives. You can read more about this, including the repro, in the tch-rs bug report.

@mruberry added the needs reproduction label on Aug 12, 2021
@mruberry
Collaborator

If you could provide a self-contained valid snippet that would be very helpful, @rustrust

@nikitaved
Collaborator
nikitaved commented Aug 12, 2021

A follow-up on the script by @lezcano: I cannot reproduce it either:

In [1]: import torch

In [3]: a = torch.rand(1024, 1024).reshape(1, 1, 1024, 1024).cuda()

In [4]: a[..., -1, :] = 0

In [5]: a[..., :, -1] = 0

In [6]: a[0, :, -1]
Out[6]: tensor([[0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')

In [7]: torch.linalg.matrix_rank(a)
Out[7]: tensor([[1023]], device='cuda:0')

@nikitaved
Collaborator
nikitaved commented Aug 12, 2021

@mruberry, @lezcano, matrix_rank seems to be very weird; have a look at this:

In [25]: a = torch.rand(1, 1, 1024, 1024).cuda()

In [26]: a[..., -1, :] = 0

In [27]: a[..., :, -1] = 0

In [28]: torch.linalg.matrix_rank(a)
Out[28]: tensor([[1019]], device='cuda:0')

In [29]: a.svd()[1].abs().topk(k=5, largest=False)
Out[29]: 
torch.return_types.topk(
values=tensor([[[0.0000, 0.0011, 0.0157, 0.0202, 0.0401]]], device='cuda:0'),
indices=tensor([[[1023, 1022, 1021, 1020, 1019]]], device='cuda:0'))

There is no way its rank is not 1023, as the spectrum shows.
Shall I file an issue?
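The undercount above is a tolerance effect. Here is a small NumPy illustration with hypothetical values, showing how an overly coarse cutoff discards genuinely nonzero singular values:

```python
import numpy as np

# This matrix has two nonzero singular values, so its true rank is 2.
a = np.diag([1.0, 0.05, 0.0])

# With the default (tight) tolerance, the small singular value survives.
assert np.linalg.matrix_rank(a) == 2

# With a coarse cutoff, the 0.05 singular value is discarded
# and the rank is undercounted.
assert np.linalg.matrix_rank(a, tol=0.0625) == 1
```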

@lezcano
Collaborator
lezcano commented Aug 12, 2021

It turns out that this is related to #54151 as

torch.linalg.svdvals(a)[..., 0] * 1024 * torch.finfo(torch.float32).eps

for a matrix like the one you sampled returns 0.0625, which is incredibly high.

I'd say you should post these findings there; they'll be fixed whenever @IvanYashchuk solves that issue.

Thank you for finding this; it points to the fact that the defaults we currently have are certainly not good.
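A NumPy sketch of that default-tolerance heuristic (largest singular value times the larger matrix dimension times machine epsilon), showing how a big leading singular value inflates the cutoff; the function name is hypothetical and the exact value varies with the sampled matrix:

```python
import numpy as np

def default_rank_tol(a, dtype=np.float32):
    # Heuristic cutoff: sigma_max * max(m, n) * eps for the working dtype.
    s = np.linalg.svd(a, compute_uv=False)
    return s[0] * max(a.shape[-2:]) * np.finfo(dtype).eps

# For a 1024x1024 matrix with entries uniform in [0, 1], the largest
# singular value is around 512, so the cutoff is roughly
# 512 * 1024 * 1.19e-7, i.e. on the order of 0.06: large enough to
# swallow small but genuine singular values.
rng = np.random.default_rng(0)
a = rng.random((1024, 1024))
print(default_rank_tol(a))
```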

@rustrust
Author

I am having problems with svd in addition to matrix rank. svd is also giving trouble here in that it is trying to hand me back a U which is 1025 by 1025 when I hand it a matrix of dimension 1024 by 1024. At the moment it's unclear to me whether this is an issue with pytorch or with tch-rs.

In tch-rs this line gives the problem:

 let (u, s, v) = my_tensor.reshape(&[1, 1, 1024, 1024]).svd(true, true);

Please see further discussion here:

LaurentMazare/tch-rs#393 (comment)

It should be fairly trivial to get tch-rs working well enough to reproduce this from the above comment. All you need to do is make sure you have libtorch and have the path variables configured; cargo will handle all of the other dependencies. Simply copy the tch-rs/examples/mnist/mnist_conv.rs example to tch-rs/examples/mnist/mnist_conv_svdbug.rs, apply the three simple edits in the above GitHub post, add an entry to the main file at tch-rs/mnist/example/main.rs, and run cargo run --example mnist conv_svdbug. The entry in main.rs should go below line 26 and read like this:

        Some("conv_svdbug") => mnist_conv_svdbug::run(),

@IvanYashchuk
Collaborator

I'm sorry for the confusion caused by the following error:

svd_cuda: For batch 0: U(1025,1025) is zero, singular U.

The error message is just wrong. The "U" in the error message has nothing to do with the U returned as the first tensor of svd. It should say that the algorithm didn't converge. It will be fixed when #63220 is resolved.

PyTorch uses a faster divide-and-conquer algorithm for SVD (gesdd in LAPACK) or a Jacobi-iteration-based one (gesvdj in cuSOLVER), but both are less robust than the algorithm based on QR iterations (gesvd in LAPACK). So it's expected that SVD sometimes fails to compute the decomposition. In the future, we might add a fallback to the more robust algorithm when the faster one fails.
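The fallback idea can be sketched in Python with NumPy; this is a hypothetical illustration, not PyTorch's actual implementation: try the fast driver first and, on a convergence failure, recompute the decomposition through a slower but more forgiving route:

```python
import numpy as np

def svd_with_fallback(a):
    """Hypothetical sketch: fast SVD first, robust fallback on failure."""
    try:
        # NumPy's svd uses the fast divide-and-conquer LAPACK driver.
        return np.linalg.svd(a)
    except np.linalg.LinAlgError:
        # Fallback sketch: singular values via the eigendecomposition of
        # the symmetric matrix a^T a, whose convergence is easier.
        w, v = np.linalg.eigh(a.T @ a)
        s = np.sqrt(np.clip(w[::-1], 0.0, None))  # descending order
        vt = v[:, ::-1].T                         # right singular vectors
        # Left singular vectors for nonzero singular values (illustrative:
        # columns for zero singular values are not orthonormalized here).
        u = (a @ vt.T) / np.where(s > 0.0, s, 1.0)
        return u, s, vt
```

Calling svd_with_fallback on a well-conditioned matrix takes the fast path; the except branch only runs when the fast driver raises.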

Could you please test one thing: if you pass a diagonal matrix (torch.eye) to SVD in tch-rs, does it work as expected?

@rustrust
Author
rustrust commented Aug 19, 2021

I added the lines below, which allocate a 1024 by 1024 identity matrix, above the match-on-svd statement in what I describe as mnist_conv_svdbug.rs, and was never able to observe a failure of svd across several trials. However, during those trials, the fc1 svd broke.

let my_test_tensor = Tensor::eye(1024_i64, (Kind::Float, vs.device()));

match my_test_tensor.reshape(&[1, 1, 1024, 1024]).f_svd(true, true) {
    Ok(_a) => {
        println!("my_test_tensor svd ok");
    }
    Err(e) => {
        println!("################ my_test_tensor fail #######################################");
        println!("{:?}", e);
        panic!("my_test_tensor panic");
    }
}

@lezcano
Collaborator
lezcano commented Feb 8, 2022

This should be fixed in master, as we now fall back to a different algorithm when we hit convergence issues in the main one.

Please feel free to comment, and I'll reopen it if this is not the case.
