RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 1) · Issue #28293 · pytorch/pytorch · GitHub

RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 1) #28293


Closed
ryh95 opened this issue Oct 18, 2019 · 21 comments
Labels
module: cuda Related to torch.cuda, and CUDA support in general module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul module: numerical-stability Problems related to numerical stability of operations triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@ryh95
ryh95 commented Oct 18, 2019

Hi,

When I run torch.svd() on a matrix on the GPU, it raises the error
RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 1)

However, torch.svd() has no problem with the same matrix on the CPU.

Can someone help me figure out the reason?
Thank you!

The matrix is attached
runtime_error_W.zip
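
For reference, a minimal repro sketch (the filename is hypothetical; it assumes the zip contains a tensor saved with torch.save):

    import torch

    W = torch.load("runtime_error_W.pt")  # hypothetical name for the attached matrix

    u, s, v = torch.svd(W)          # fine on the CPU
    u, s, v = torch.svd(W.cuda())   # raises: svd_cuda ... SBDSDC did not converge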

cc @vishwakftw @ssnl @jianyuh

@vishwakftw
Contributor
vishwakftw commented Oct 18, 2019

Hi @ryh95, thank you for opening the issue.

This seems to be an issue with MAGMA, possibly relating to the correctness of gesdd vs gesvd. Note: the matrix is well-conditioned, but has too many repeated singular values.
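
For anyone who wants to check this on their own matrix, a quick diagnostic sketch (assuming W is the matrix from the attached zip, run on the CPU where the SVD converges):

    u, s, v = torch.svd(W)            # singular values come back sorted, descending
    cond = s.max() / s.min()          # condition number sigma_max / sigma_min
    min_gap = (s[:-1] - s[1:]).min()  # smallest gap between adjacent singular values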

@vishwakftw vishwakftw added module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul module: numerical-stability Problems related to numerical stability of operations labels Oct 18, 2019
@ryh95
Author
ryh95 commented Oct 19, 2019

It seems that gesdd is used for both CPU and GPU versions of torch.svd(), according to the documentation.

The implementation of SVD on CPU uses the LAPACK routine ?gesdd (a divide-and-conquer algorithm) instead of ?gesvd for speed. Analogously, the SVD on GPU uses the MAGMA routine gesdd as well.

So, why does this relate to the correctness of gesdd vs gesvd?

@pietern pietern added module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Oct 22, 2019
@Hippogriff

Is there a temporary solution/work-around to this problem?

@ryh95
Author
ryh95 commented Nov 26, 2019

I just move the tensor to the CPU and then use torch.svd to get the decomposition, moving the results back to the GPU if needed.

Someone has mentioned that gesvd is more accurate/robust (#25978 (comment)), so you can also move the tensor to the CPU, convert it to a NumPy array, and use SciPy if you prefer a more accurate result.
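
A minimal sketch of that CPU-fallback workaround (assuming W is a CUDA tensor):

    # Compute the SVD on the CPU, then move the factors back to W's device.
    u, s, v = torch.svd(W.cpu())
    u, s, v = u.to(W.device), s.to(W.device), v.to(W.device)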

@Hippogriff

@ryh95 I also want to backpropagate gradients through this computation. Converting it to CPU might break the computation graph (though I am not sure about this).

@Nav94
Nav94 commented Mar 25, 2020

@Hippogriff Did you manage to work around the issue? I am also running into the same issue and was thinking of computing on the CPU and moving it back to the GPU.

@Hippogriff

@Nav94 This can happen either when the matrix is severely ill-conditioned or when the singular values are very close or equal to each other.
There are the following solutions to the problem:

  1. For the ill-conditioned case, you can compute the condition number of the matrix on the CPU; if the condition number is very large, there is not much you can do. In this case, you can simply trivialize the solution.
  2. If the singular values are close to each other, you need to safeguard your backprop, that is, you need to write a new backward pass. You can use the custom_svd function below, which replaces torch's svd function.

import torch
from torch.autograd import Function


def compute_grad_V(U, S, V, grad_V):
    N = S.shape[0]
    K = svd_grad_K(S)
    # Build diag(S) on the same device as S.
    S = torch.eye(N).cuda(S.get_device()) * S.reshape((N, 1))
    inner = K.T * (V.T @ grad_V)
    inner = (inner + inner.T) / 2.0
    return 2 * U @ S @ inner @ V.T


def svd_grad_K(S):
    N = S.shape[0]
    s1 = S.view((1, N))
    s2 = S.view((N, 1))
    diff = s2 - s1
    plus = s2 + s1

    # TODO Look into it
    eps = torch.ones((N, N)) * 10 ** (-6)
    eps = eps.cuda(S.get_device())
    # Clamp the pairwise differences away from zero so 1/diff stays finite
    # even when two singular values coincide.
    max_diff = torch.max(torch.abs(diff), eps)
    sign_diff = torch.sign(diff)

    K_neg = sign_diff * max_diff

    # guard the matrix inversion
    K_neg[torch.arange(N), torch.arange(N)] = 10 ** (-6)
    K_neg = 1 / K_neg
    K_pos = 1 / plus

    ones = torch.ones((N, N)).cuda(S.get_device())
    rm_diag = ones - torch.eye(N).cuda(S.get_device())
    K = K_neg * K_pos * rm_diag
    return K


class CustomSVD(Function):
    """
    Custom SVD to deal with situations where the singular values are
    equal. In that case, if handled naively, the gradient w.r.t. the
    input goes to inf. To deal with this, we replace the entries of the
    K matrix from eq. 13 in https://arxiv.org/pdf/1509.07838.pdf with a
    high value.
    Note: only applicable to tall and square matrices; it does not give
    correct gradients for a fat matrix. Transposing the original matrix
    may be required to handle that case. Left for future work.
    """
    @staticmethod
    def forward(ctx, input):
        # Note: input is a matrix of size m x n with m >= n.
        # Note: if the above assumption is violated, the gradients
        # will be wrong.
        try:
            U, S, V = torch.svd(input, some=True)
        except RuntimeError:  # e.g. the convergence error above
            import ipdb; ipdb.set_trace()

        ctx.save_for_backward(U, S, V)
        return U, S, V

    @staticmethod
    def backward(ctx, grad_U, grad_S, grad_V):
        # NOTE: grad_U and grad_S are ignored; only the gradient flowing
        # through V is propagated back to the input.
        U, S, V = ctx.saved_tensors
        grad_input = compute_grad_V(U, S, V, grad_V)
        return grad_input


customsvd = CustomSVD.apply
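
For reference, a usage sketch (hypothetical tall input A; only the gradient flowing through V reaches the input):

    A = torch.randn(64, 32, device="cuda", requires_grad=True)  # tall: m >= n
    U, S, V = customsvd(A)
    loss = V.sum()
    loss.backward()  # gradient reaches A via compute_grad_V only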

@ryh95
Author
ryh95 commented Mar 28, 2020

Thanks for your solution! @Hippogriff
By the way, is it true that converting the tensor to the CPU will break backpropagation?

@Hippogriff

Yes, as far as I know.

@Abdelpakey

@Hippogriff
Thanks for the solution, is there any way to track your code in case you update it for the #TODO part?

@Hippogriff

@Abdelpakey I am not keeping track of this yet. There are several todos; for example, this code doesn't compute the gradient with respect to the right and left singular vectors. I will update this repo when I am done:
https://github.com/Hippogriff/SVD-Pytorch

@chenhao1umbc
chenhao1umbc commented Apr 14, 2020

I am using this, since it is not solved

    try:
        u, s, v = torch.svd(L)
    except:                     # torch.svd may have convergence issues for GPU and CPU.
        u, s, v = torch.svd(L + 1e-4*L.mean()*torch.rand(l, h))

@SebastianGrans

@chenhao1umbc Thanks for that snippet!

Is there a particular reason why you multiply by L.mean()?

@chenhao1umbc

Yes, the main idea is that the convergence issue can be solved by adding some perturbation. But the scale of L is unknown, which means L could be at 1e4 scale or 1e-4 scale. This means we cannot simply add a "small" absolute random number, but rather a relatively small one.
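
A sketch of that scale-relative jitter, using torch.rand_like so the noise matches L's shape and device; taking L.abs().mean() as the scale is my assumption (L.mean() itself can vanish for zero-centered matrices):

    jitter = 1e-4 * L.abs().mean() * torch.rand_like(L)  # relatively small at any scale of L
    u, s, v = torch.svd(L + jitter)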

@davidbau

This affects torch.pinverse also, due to its underlying svd.
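
For context, a sketch of that relationship, since torch.pinverse is computed from the SVD (simplified; a real pseudoinverse also zeroes singular values below a rank tolerance):

    A = torch.randn(64, 64, device="cuda")
    u, s, v = torch.svd(A)                   # the step that can fail to converge
    pinv = v @ torch.diag(1.0 / s) @ u.t()   # approx. torch.pinverse(A) for full-rank A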

@andreaskoepf
Collaborator
andreaskoepf commented May 29, 2021

For pytorch 1.8.1+cu101 the output of:

x = torch.randn(64, 64).cuda()
x[0,0] = float('nan')
torch.svd(x)

is:

RuntimeError: svd_cuda: For batch 0: U(65,65) is zero, singular U.

The error message is unexpected (and misleading) and comes from a special batchCheckErrors() info-tensor function overload that does not check for "svd" in the name string; see:

static inline void batchCheckErrors(const Tensor& infos, const char* name, bool allow_singular=false, int info_per_batch=1) {
  auto batch_size = infos.numel();
  auto infos_cpu = infos.to(at::kCPU);
  auto infos_data = infos_cpu.data_ptr<int>();
  for (int64_t i = 0; i < batch_size; i++) {
    auto info = infos_data[i];
    if (info < 0) {
      AT_ERROR(name, ": For batch ", i/info_per_batch, ": Argument ", -info, " has illegal value");
    } else if (!allow_singular && info > 0) {
      AT_ERROR(name, ": For batch ", i/info_per_batch, ": U(", info, ",", info, ") is zero, singular U.");
    }
  }
}

If you look above and below, both batchCheckErrors(std::vector<int64_t>& infos, ...) and void singleCheckErrors() have the case:

if (strstr(name, "svd")) {
     AT_ERROR(name, ": the updating process of SBDSDC did not converge (error: ", info, ")");
}

Earlier versions of pytorch raised "RuntimeError: svd_cuda: the updating process of SBDSDC did not converge (error: 23)" for the same code.

I guess for cuda tensors torch.svd() calls the tensor-info batchCheckErrors() overload from _svd_helper_cuda_lib():

std::tuple<Tensor, Tensor, Tensor> _svd_helper_cuda_lib(const Tensor& self, bool some, bool compute_uv) {
  const int64_t batch_size = batchCount(self);
  at::Tensor infos = at::zeros({batch_size}, self.options().dtype(at::kInt));
  const int64_t m = self.size(-2);
  const int64_t n = self.size(-1);
  const int64_t k = std::min(m, n);
  Tensor U_working_copy, S_working_copy, VT_working_copy;
  std::tie(U_working_copy, S_working_copy, VT_working_copy) = \
      _create_U_S_VT(self, some, compute_uv, /* svd_use_cusolver = */ true);
  // U, S, V working copies are already column majored now

  // heuristic for using `gesvdjBatched` over `gesvdj`
  if (m <= 32 && n <= 32 && batch_size > 1 && (!some || m == n)) {
    apply_svd_lib_gesvdjBatched(self, U_working_copy, S_working_copy, VT_working_copy, infos, compute_uv);
  } else {
    apply_svd_lib_gesvdj(self, U_working_copy, S_working_copy, VT_working_copy, infos, compute_uv, some);
  }

  // A device-host sync will be performed.
  batchCheckErrors(infos, "svd_cuda");

  if (!compute_uv) {
    VT_working_copy.zero_();
    U_working_copy.zero_();
  }
  if (some) {
    VT_working_copy = VT_working_copy.narrow(-2, 0, k);
  }

  // so far we have computed VT, but torch.svd returns V instead. Adjust accordingly.
  VT_working_copy.transpose_(-2, -1);
  return std::make_tuple(U_working_copy, S_working_copy, VT_working_copy);
}

Btw, if somebody wonders why the output is "U(65,65) is zero, singular U." for an SVD of a 64x64 matrix: it is simply a misinterpretation of the info feedback of cusolverDnSgesvdj() (see the CUDA docs). The docs say: "if info = 0, the operation is successful. if info = -i, the i-th parameter is wrong (not counting handle). if info = min(m,n)+1, gesvdj does not converge under given tolerance and maximum sweeps." So here info = min(64,64)+1 = 65 signals non-convergence, and the generic check misreads it as "U(65,65) is zero".

@IvanYashchuk
Collaborator

Thank you, @andreaskoepf, for reporting the issue of incorrect error messages! We will have it fixed in a future PyTorch release. Unfortunately, the bugfixes will not be backported to older versions.

facebook-github-bot pushed a commit that referenced this issue Oct 26, 2021
…ance (#64533)

Summary:
Fix #64237
Fix #28293
Fix #4689

See also #47953

cc ngimel jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: #64533

Reviewed By: albanD

Differential Revision: D31915794

Pulled By: ngimel

fbshipit-source-id: 29ea48696531ced8a48474e891a9e2d5f11e9d7a
@lhyfst
lhyfst commented Dec 11, 2021

I am using this, since it is not solved

    try:
        u, s, v = torch.svd(L)
    except:                     # torch.svd may have convergence issues for GPU and CPU.
        u, s, v = torch.svd(L + 1e-4*L.mean()*torch.rand(l, h))

What is h in torch.rand(l, h)?

@andreaskoepf
Collaborator

What is h in torch.rand(l, h)?

To add some noise you could use torch.rand_like(L).

For support & coding questions please refer to the PyTorch Forums & see the docs, e.g. torch.rand.

@lhyfst
lhyfst commented Dec 11, 2021

What is h in torch.rand(l, h)?

To add some noise you could use torch.rand_like(L).

For support & coding questions please refer to the PyTorch Forums & see the docs, e.g. torch.rand.

Thank you

@a-r-j
a-r-j commented Dec 9, 2022

I had some luck with:

from tenacity import retry, stop_after_attempt


@retry(stop=stop_after_attempt(32)) # Or some other value
def func_with_svd(L: torch.Tensor):
    try:
        u, s, v = torch.svd(L)
    except RuntimeError:        # torch.svd may have convergence issues on GPU and CPU.
        u, s, v = torch.svd(L + 1e-4*L.mean()*torch.rand_like(L))
    ...
