Fix the non-converging issue of SVD on GPU for large matrices · Issue #64237 · pytorch/pytorch · GitHub

Closed
xwang233 opened this issue Aug 31, 2021 · 5 comments
Labels
module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

xwang233 (Collaborator) commented Aug 31, 2021

🚀 Feature

SVD on GPU for a large matrix (usually with size > 1024) or an ill-conditioned matrix may throw a runtime error because the solver does not converge.

SVD on GPU currently uses the iterative Jacobi method of cusolverDn<T>gesvdj (note the trailing j). The PyTorch implementation uses the cuSOLVER default values of max sweeps = 100 and tolerance = machine precision. The cuSOLVER gesvdj engine stops when either the tolerance or the maximum number of sweeps is reached. According to the cuSOLVER documentation, ~15 sweeps are usually enough.

The gesvdj method works well and is much faster than the QR-based cusolverDn<T>gesvd method for small matrices (benchmark #48436 (comment)). However, for large matrices, we've received several reports of runtime errors because the SVD does not converge, e.g. #28293 (comment)
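For context, a minimal sketch of the failure mode. The matrix below is only illustrative and is not guaranteed to trigger the error on any given PyTorch / cuSOLVER version; the exact error message also varies.

```python
import torch

# Illustrative only: a large, deliberately ill-conditioned matrix on the GPU.
# Whether this actually fails depends on the PyTorch / cuSOLVER version and the GPU.
n = 4096
a = torch.randn(n, n, device="cuda")
a = a @ a.T        # make the spectrum decay quickly
a[:, 0] *= 1e-8    # worsen the conditioning

try:
    u, s, vh = torch.linalg.svd(a)
except RuntimeError as e:
    # With the gesvdj path, non-convergence typically surfaces as a
    # RuntimeError mentioning cusolver / convergence.
    print("SVD failed to converge:", e)
```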

To fix this issue, there are several possible approaches:

  1. Expose the max sweeps and tolerance of cuSOLVER gesvdj to the user, so that people can set them to stricter values as needed.
    Problem: what should the user interface look like? Extra kwargs for torch.linalg.svd? Note that the CPU LAPACK method doesn't have this issue and wouldn't use those parameters.

  2. Add the gesvd method to PyTorch.
    Problem: what should the user interface look like? The only entry point for SVD is torch.linalg.svd, so how should the user switch the backend engine? Extra kwargs or extra Python functions?

  3. Add the gesvd method to PyTorch, and use gesvd as a fallback when gesvdj doesn't converge (see the sketch after the note below).
    Problem: this would make SVD performance much worse in those cases.

Note that the QR-based gesvd method always converges. This issue doesn't affect the batched gesvdjBatched method, which only accepts matrices with size <= 32 and converges well in nearly all cases.
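For illustration, a user-level sketch of the approach-3 fallback. It assumes a PyTorch build where torch.linalg.svd accepts the CUDA-only driver= keyword (newer releases do; this issue predates it), and that an explicit non-converging gesvdj call surfaces as a RuntimeError:

```python
import torch

def svd_with_fallback(a: torch.Tensor):
    # Sketch of approach 3, expressed at the user level. Assumes the
    # CUDA-only driver= keyword of torch.linalg.svd (present in newer
    # PyTorch releases; not available when this issue was filed).
    try:
        # Fast iterative Jacobi driver; may fail to converge for large
        # or ill-conditioned matrices.
        return torch.linalg.svd(a, driver="gesvdj")
    except RuntimeError:
        # QR-based driver: slower, but always converges.
        return torch.linalg.svd(a, driver="gesvd")
```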

cc @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @IvanYashchuk @xwang233 @lezcano @ptrblck

@xwang233 xwang233 added the module: linear algebra label Aug 31, 2021
lezcano (Collaborator) commented Aug 31, 2021

In my opinion, the third option is the best one.

I used to face this problem very rarely (say while training a network once every 10 epochs) and it would make the training fail, which was very annoying. As such, if we just launch gesvd as a fallback, the amortised cost will be close to zero, and all this would be completely transparent to the user.

Note that the solution I used (and the one proposed in the mentioned post) is to just have a try-catch; in the catch, one perturbs the original matrix with some noise and computes the SVD of that. This effectively also incurs computing two SVDs, but, again, this error is so infrequent that it's fine.
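A minimal sketch of that try-catch workaround at the user level; the noise scale below is an arbitrary illustrative choice, not a recommended value.

```python
import torch

def robust_svd(a: torch.Tensor, noise_scale: float = 1e-7):
    """Workaround sketch: if the GPU SVD fails to converge, perturb the
    input slightly and retry. noise_scale is arbitrary and illustrative."""
    try:
        return torch.linalg.svd(a)
    except RuntimeError:
        # Perturb proportionally to the magnitude of the input so the
        # result stays close to the SVD of the original matrix.
        eps = noise_scale * a.abs().mean()
        return torch.linalg.svd(a + eps * torch.randn_like(a))
```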

lezcano (Collaborator) commented Aug 31, 2021

There is a similar problem with eigh #59720 (and I suspect also with eig), so we should see whether we have similar fallback algorithms for these other two. I'd assume there is one for eigh, but eig is infamous for not being the most stable of functions...

lezcano (Collaborator) commented Aug 31, 2021

This is a similar one, but it fails silently: #24466
For this second one we should probably just mention in the docs that single (float32) precision may not be enough for these algorithms when the matrices are ill-conditioned.

mruberry (Collaborator) commented Aug 31, 2021

Per offline discussion:

  • the fallback sounds great
  • a separate issue to add a "driver"-like argument (like lstsq has) would be interesting (we should consider giving the drivers meaningful names as well as their algorithmic names)
    • alternatively the keyword might be something like "exact"
  • we don't think we should add the more precise CUDA-specific kwargs to linalg.svd at this time (but maybe a "lower level" function in the future could expose them?)
    • structure work may be able to expose these options in the future, too
  • the fallback to the alternative "more precise" algorithm must be carefully documented, because users might complain about us being different from other libraries and be confused about the performance of linalg.svd
    • Let's add a warning when we fall back? SciPy has "no error" flags that suppress these warnings
    • TORCH_WARN_ONCE might be a happy compromise
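For illustration, a Python-level sketch of the warn-once idea discussed above; the real implementation would use the C++ TORCH_WARN_ONCE macro inside the operator, and the warning text here is hypothetical.

```python
import warnings

_warned_svd_fallback = False

def _warn_svd_fallback_once():
    """Sketch of warn-once semantics: emit the fallback warning at most
    once per process so users are informed without being spammed."""
    global _warned_svd_fallback
    if not _warned_svd_fallback:
        warnings.warn(
            "torch.linalg.svd: gesvdj did not converge; falling back to the "
            "slower QR-based gesvd driver.",
            RuntimeWarning,
        )
        _warned_svd_fallback = True
```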

@zou3519 zou3519 added the triaged label Sep 1, 2021
@xwang233 xwang233 self-assigned this Sep 3, 2021
o-alexandre-felipe commented Sep 14, 2021

Do you think a parallel QR iteration for tridiagonal systems would help with any of this?
As far as I know, the method used for large matrices is divide and conquer, in which sub-matrices are diagonalized and then combined with a rank-1 perturbation.
I have read that the QR iteration method is still the best on a serial computer.
I framed the sub-diagonal elimination step of the QR iteration as an associative operation (in an augmented space).
If the serial elimination of the sub-diagonal has complexity c1 * N on a serial computer, this would have complexity c2 * log(N) on a fully parallel computer, or c2 * N * log(N) / P when we have P processors.
I don't know if it is worth implementing or not :D

facebook-github-bot pushed a commit that referenced this issue Oct 26, 2021
…ance (#64533)

Summary:
Fix #64237
Fix #28293
Fix #4689

See also #47953

cc ngimel jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: #64533

Reviewed By: albanD

Differential Revision: D31915794

Pulled By: ngimel

fbshipit-source-id: 29ea48696531ced8a48474e891a9e2d5f11e9d7a