linalg matrix rank generates exception when calculating rank of a singular matrix · Issue #62166 · pytorch/pytorch · GitHub

Closed
rustrust opened this issue Jul 25, 2021 · 11 comments
Labels
  • feature: A request for a proper, new feature.
  • module: cpp: Related to the C++ API.
  • module: linear algebra: Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply (matmul).
  • needs reproduction: Someone else needs to try reproducing the issue given the instructions; no action needed from the user.
  • triaged: This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.

Comments

@rustrust
rustrust commented Jul 25, 2021

🐛 Bug

I'm using libtorch from tch-rs to interact with PyTorch, because writing performant and correct code in Python is too difficult.

The problem I am encountering is that when I try to calculate the rank of a singular matrix, I never get a valid numeric result but always trigger a C++ exception. Given that the purpose of calculating the matrix rank is to provide a numeric value indicating the rank (rather than an exception), I don't think this makes sense as normal API behavior.

The author of tch-rs reports that it is simply a thin wrapper over the libtorch C++ API, so the root of this issue is an exception in PyTorch being triggered through the wrapper function.

LaurentMazare/tch-rs#393

Exception occurs here:

batchCheckErrors(infos, "svd_cuda");

To Reproduce

Steps to reproduce the behavior:

  1. Use tch-rs to generate a singular Tensor (for instance, generate an identity matrix and then zero out a row).
  2. Calculate the matrix rank. For example, all of the below will generate this exception:
my_tensor.reshape(&[1, 1, 1024, 1024]).linalg_matrix_rank(None, false);
my_tensor.reshape(&[1, 1, 1024, 1024]).f_linalg_matrix_rank(None, false);
my_tensor.reshape(&[1, 1, 1024, 1024]).matrix_rank(false);
my_tensor.reshape(&[1, 1, 1024, 1024]).f_matrix_rank(false);
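For reference, the expected behavior (a numeric rank rather than an exception) can be sketched in Python with NumPy, used here only as a stand-in since the reproduction above is tch-rs-specific:

```python
import numpy as np

# Build a singular matrix as in step 1 above: an identity
# matrix with its last row zeroed out.
a = np.eye(1024)
a[-1, :] = 0.0

# Computing the rank returns a number instead of raising an
# exception: one singular value is zero, so the rank is 1023.
print(np.linalg.matrix_rank(a))  # 1023
```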

Here is a snippet of the error message:

svd_cuda: For batch 0: U(1025,1025) is zero, singular U.
Exception raised from batchCheckErrors at /pytorch/aten/src/ATen/native/LinearAlgebraUtils.h:251 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x69 (0x7f6259f4f1d9 in ...

Environment

My libtorch version is 1.9.0+cu111 and I am using rustc 1.53.0. Note that I am performing many other CUDA operations with libtorch and tch-rs just fine. I am pretty sure this is (so far) an intentionally generated exception.

Expected behavior

I expect that this function should return a numeric value for a matrix not of full rank rather than generate an exception.

  • PyTorch Version (e.g., 1.0): libtorch 1.9.0+cu111
  • OS (e.g., Linux): Ubuntu Linux 20.04 LTS
  • How you installed PyTorch (conda, pip, source): libtorch download and unzip
  • CUDA/cuDNN version: cuda_11.4.r11.4/compiler.30033411_0
  • GPU models and configuration: NVIDIA GPU

cc @yf225 @glaringlee @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @IvanYashchuk @xwang233 @lezcano

@mruberry added the labels module: cpp, module: linear algebra, feature, and triaged on Jul 27, 2021
@mruberry
Collaborator

Thanks for the suggestion @rustrust. Our linear algebra team will take a look.

@lezcano
Collaborator
lezcano commented Jul 29, 2021

I was not able to reproduce from the Python API. The following code works as expected:

import torch
a = torch.eye(1024, 1024, device=torch.device("cuda")).reshape(1, 1, 1024, 1024)
a[..., -1, -1] = 0
print(torch.linalg.matrix_rank(a))

Could you provide a small example in pure PyTorch, Python or C++, that reproduces this issue?

I see that the error accesses element 1025, which, a priori, should be out of bounds.

@rustrust
Author
rustrust commented Aug 5, 2021

Thanks for catching the out-of-bounds access, I didn't notice that. I managed to create a fairly minimal repro based on the tch-rs mnist_conv.rs example. I have not ported the repro to C++ or python so at the moment I am unsure of where the bug lives. You can read more about this, including the repro, in the tch-rs bug report.

@mruberry added the needs reproduction label on Aug 12, 2021
@mruberry
Collaborator

If you could provide a self-contained valid snippet that would be very helpful, @rustrust

@nikitaved
Collaborator
nikitaved commented Aug 12, 2021

A follow-up on the script by @lezcano: I cannot reproduce it either:

In [1]: import torch

In [3]: a = torch.rand(1024, 1024).reshape(1, 1, 1024, 1024).cuda()

In [4]: a[..., -1, :] = 0

In [5]: a[..., :, -1] = 0

In [6]: a[0, :, -1]
Out[6]: tensor([[0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')

In [7]: torch.linalg.matrix_rank(a)
Out[7]: tensor([[1023]], device='cuda:0')

@nikitaved
Collaborator
nikitaved commented Aug 12, 2021

@mruberry, @lezcano, matrix_rank seems to be very weird; have a look at this:

In [25]: a = torch.rand(1, 1, 1024, 1024).cuda()

In [26]: a[..., -1, :] = 0

In [27]: a[..., :, -1] = 0

In [28]: torch.linalg.matrix_rank(a)
Out[28]: tensor([[1019]], device='cuda:0')

In [29]: a.svd()[1].abs().topk(k=5, largest=False)
Out[29]: 
torch.return_types.topk(
values=tensor([[[0.0000, 0.0011, 0.0157, 0.0202, 0.0401]]], device='cuda:0'),
indices=tensor([[[1023, 1022, 1021, 1020, 1019]]], device='cuda:0'))

There is no way its rank is not 1023, as the spectrum shows.
Shall I file an issue?
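The undercount above is a tolerance effect. Here is a small NumPy illustration with hypothetical values, showing how an overly coarse cutoff discards genuinely nonzero singular values:

```python
import numpy as np

# This matrix has two nonzero singular values, so its true rank is 2.
a = np.diag([1.0, 0.05, 0.0])

# With the default (tight) tolerance, the small singular value survives.
assert np.linalg.matrix_rank(a) == 2

# With a coarse cutoff, the 0.05 singular value is discarded
# and the rank is undercounted.
assert np.linalg.matrix_rank(a, tol=0.0625) == 1
```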

@lezcano
Collaborator
lezcano commented Aug 12, 2021

It turns out that this is related to #54151 as

torch.linalg.svdvals(a)[..., 0] * 1024 * torch.finfo(torch.float32).eps

for a matrix like the one you sampled returns 0.0625, which is incredibly high.

I'd say you should post these findings there; they'll be fixed whenever @IvanYashchuk solves that issue.

Thank you for finding this; it points to the fact that the defaults we currently have are certainly not good.
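A NumPy sketch of that default-tolerance heuristic (largest singular value times the larger matrix dimension times machine epsilon), showing how a big leading singular value inflates the cutoff; the function name is hypothetical and the exact value varies with the sampled matrix:

```python
import numpy as np

def default_rank_tol(a, dtype=np.float32):
    # Heuristic cutoff: sigma_max * max(m, n) * eps for the working dtype.
    s = np.linalg.svd(a, compute_uv=False)
    return s[0] * max(a.shape[-2:]) * np.finfo(dtype).eps

# For a 1024x1024 matrix with entries uniform in [0, 1], the largest
# singular value is around 512, so the cutoff is roughly
# 512 * 1024 * 1.19e-7, i.e. on the order of 0.06: large enough to
# swallow small but genuine singular values.
rng = np.random.default_rng(0)
a = rng.random((1024, 1024))
print(default_rank_tol(a))
```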

@rustrust
Author

I am having problems with svd in addition to matrix rank. svd is also giving trouble here in that it is trying to hand me back a U which is 1025 by 1025 when I hand it a matrix of dimension 1024 by 1024. At the moment it's unclear to me whether this is an issue with pytorch or with tch-rs.

In tch-rs this line gives the problem:

 let (u, s, v) = my_tensor.reshape(&[1, 1, 1024, 1024]).svd(true, true);

Please see further discussion here:

LaurentMazare/tch-rs#393 (comment)

It should be fairly trivial to get tch-rs working well enough to reproduce this from the above comment. All you need to do is make sure you have libtorch and have the path variables configured; cargo will handle all of the other dependencies. Simply copy the tch-rs/examples/mnist/mnist_conv.rs example to tch-rs/examples/mnist/mnist_conv_svdbug.rs, apply the three simple edits in the above GitHub post, add an entry to the main file at tch-rs/mnist/example/main.rs, and run cargo run --example mnist conv_svdbug. The entry in main.rs should go below line 26 and read like this:

        Some("conv_svdbug") => mnist_conv_svdbug::run(),

@IvanYashchuk
Collaborator

I'm sorry for the confusion caused by the following error:

svd_cuda: For batch 0: U(1025,1025) is zero, singular U.

The error message is just wrong. The "U" in the error message has nothing to do with the U returned as the first tensor of svd. It should say that the algorithm didn't converge. It will be fixed when #63220 is resolved.

PyTorch uses a faster divide-and-conquer algorithm for SVD (gesdd in LAPACK) or a Jacobi-iteration-based one (gesvdj in cuSOLVER), but both are less robust than the algorithm based on QR iterations (gesvd in LAPACK). So it's expected that SVD sometimes fails to compute the decomposition. In the future, we might add a fallback to the more robust algorithm when the faster one fails.
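The fallback idea can be sketched in Python with NumPy; this is a hypothetical illustration, not PyTorch's actual implementation: try the fast driver first and, on a convergence failure, recompute the decomposition through a slower but more forgiving route:

```python
import numpy as np

def svd_with_fallback(a):
    """Hypothetical sketch: fast SVD first, robust fallback on failure."""
    try:
        # NumPy's svd uses the fast divide-and-conquer LAPACK driver.
        return np.linalg.svd(a)
    except np.linalg.LinAlgError:
        # Fallback sketch: singular values via the eigendecomposition of
        # the symmetric matrix a^T a, whose convergence is easier.
        w, v = np.linalg.eigh(a.T @ a)
        s = np.sqrt(np.clip(w[::-1], 0.0, None))  # descending order
        vt = v[:, ::-1].T                         # right singular vectors
        # Left singular vectors for nonzero singular values (illustrative:
        # columns for zero singular values are not orthonormalized here).
        u = (a @ vt.T) / np.where(s > 0.0, s, 1.0)
        return u, s, vt
```

Calling svd_with_fallback on a well-conditioned matrix takes the fast path; the except branch only runs when the fast driver raises.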

Could you please test one thing: if you pass a diagonal matrix (torch.eye) to SVD in tch-rs, does it work as expected?

@rustrust
Author
rustrust commented Aug 19, 2021

I added the lines below, which allocate a 1024 by 1024 identity matrix, above the match-on-svd statement in what I describe as mnist_conv_svdbug.rs, and was never able to observe a failure of svd across several trials. However, during those trials, the fc1 svd broke.

let my_test_tensor = Tensor::eye(1024_i64, (Kind::Float, vs.device()));

match my_test_tensor.reshape(&[1, 1, 1024, 1024]).f_svd(true, true) {
    Ok(_a) => {
        println!("my_test_tensor svd ok");
    }
    Err(e) => {
        println!("################ my_test_tensor fail #######################################");
        println!("{:?}", e);
        panic!("my_test_tensor panic");
    }
}

@lezcano
Collaborator
lezcano commented Feb 8, 2022

This should be fixed in master, as we now fall back to a different algorithm when we hit convergence issues in the main one.

Please feel free to comment, and I'll reopen it if this is not the case.
