Fix the non-converging issue of SVD on GPU for large matrices #64237
Comments
In my opinion, the third option is the best one. I used to face this problem very rarely (say while training a network once every 10 epochs) and it would make the training fail, which was very annoying. As such, if we just launch Note that the solution I used to do (and one they propose in the mentioned post) is to just have a |
There is a similar problem with |
This is a similar one, but it fails silently: #24466 |
Per offline discussion:
|
Do you think a parallel QR iteration for tridiagonal systems would help any of this? |
…ance (#64533) Summary: Fix #64237 Fix #28293 Fix #4689 See also #47953 cc ngimel jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano Pull Request resolved: #64533 Reviewed By: albanD Differential Revision: D31915794 Pulled By: ngimel fbshipit-source-id: 29ea48696531ced8a48474e891a9e2d5f11e9d7a
🚀 Feature
SVD on GPU for a large matrix (usually with size > 1024) or an ill-conditioned matrix may throw a runtime error because the solver fails to converge.

SVD on GPU currently uses the iterative Jacobi method of `cusolverDn<T>gesvdj` (note the trailing `j`). The PyTorch implementation takes the cuSOLVER default values of `max sweeps = 100` and `tolerance = machine accuracy`. The cuSOLVER `gesvdj` engine stops when either the requested accuracy or the maximum number of sweeps is reached. According to the cuSOLVER documentation, ~15 sweeps are usually enough.

The `gesvdj` method works well and is much faster than the QR-based method of `cusolverDn<T>gesvd` for small matrices (benchmark #48436 (comment)). However, for large matrices, we've received several reports of runtime errors because SVD did not converge, e.g. #28293 (comment).

To fix this issue, there are several approaches:
1. Expose the `max sweeps` and `tolerance` of cuSOLVER `gesvdj` to the user, so that people can set them to stricter values as needed.
   Problem: what should the user interface look like? Extra kwargs for `torch.linalg.svd`? Note that the CPU LAPACK method doesn't have this issue and won't use those parameters.
2. Add the `gesvd` method to PyTorch.
   Problem: what should the user interface look like? The only entry point for SVD is `torch.linalg.svd`, so how should the user change the backend engine? Extra kwargs or extra Python functions?
3. Add the `gesvd` method to PyTorch, and use `gesvd` as a fallback when `gesvdj` doesn't converge.
   Problem: this would make SVD performance much worse in those cases.
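
To illustrate what the `max sweeps` and `tolerance` settings control, here is a minimal pure-Python sketch of a one-sided (Hestenes) Jacobi iteration. It is not the cuSOLVER implementation. The function name `jacobi_singular_values`, the default `tol=1e-12`, and the convergence measure (largest normalized inner product between column pairs) are assumptions made for illustration only.

```python
import math


def jacobi_singular_values(a, max_sweeps=100, tol=1e-12):
    """One-sided Jacobi on a dense m x n matrix given as a list of rows.

    Returns (singular_values, sweeps_used, converged).
    Illustrative only: names and defaults are assumptions, not cuSOLVER's.
    """
    m, n = len(a), len(a[0])
    # Column-major working copy: cols[j] is column j of the matrix.
    cols = [[float(a[i][j]) for i in range(m)] for j in range(n)]

    def norms():
        # Column norms; at convergence these are the singular values.
        return sorted((math.sqrt(sum(x * x for x in c)) for c in cols), reverse=True)

    for sweep in range(1, max_sweeps + 1):
        off = 0.0  # largest |cos(angle)| between any column pair in this sweep
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(x * x for x in cols[p])
                beta = sum(x * x for x in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                if alpha == 0.0 or beta == 0.0 or gamma == 0.0:
                    continue  # pair is already orthogonal
                off = max(off, abs(gamma) / math.sqrt(alpha * beta))
                # Plane rotation that makes columns p and q orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for i in range(m):
                    xp, xq = cols[p][i], cols[q][i]
                    cols[p][i] = c * xp - s * xq
                    cols[q][i] = s * xp + c * xq
        # Stopping criterion 1: every pair was orthogonal to within tol.
        if off <= tol:
            return norms(), sweep, True
    # Stopping criterion 2: sweep budget exhausted without reaching tol.
    return norms(), max_sweeps, False
```

The third return value is the condition that currently surfaces as a runtime error: the sweep budget ran out before the tolerance was met. Raising `max_sweeps` or loosening `tol` changes when that happens, which is the trade-off option 1 would hand to the user.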
Note that the QR-based `gesvd` method always converges. This issue doesn't affect the batched `gesvdjBatched` method, which only takes matrices with size <= 32 and converges well in nearly all cases.

cc @jianyuh @nikitaved @pearu @mruberry @heitorschueroff @walterddr @IvanYashchuk @xwang233 @lezcano @ptrblck
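
The fallback idea in the third approach can be sketched in a few lines of plain Python. Everything here is hypothetical: `NotConverged`, `svd_with_fallback`, and the two solver callables stand in for the `gesvdj` and `gesvd` code paths and are not PyTorch or cuSOLVER APIs.

```python
class NotConverged(RuntimeError):
    """Hypothetical error a fast iterative solver raises at its sweep limit."""


def svd_with_fallback(matrix, fast_solver, robust_solver):
    """Try the fast solver first; use the robust one only if it fails.

    Returns (result, path), where path is "fast" or "fallback".
    Both solvers are caller-supplied callables (an assumption of this sketch).
    """
    try:
        return fast_solver(matrix), "fast"
    except NotConverged:
        # The time spent in the failed fast attempt is lost, and the robust
        # solver is slower, so this branch is the expensive one.
        return robust_solver(matrix), "fallback"


# Stub solvers for demonstration only; they do not compute a real SVD.
def always_fails(matrix):
    raise NotConverged("sweep limit reached")


def always_works(matrix):
    return "decomposition of %d rows" % len(matrix)


result, path = svd_with_fallback([[1.0, 2.0], [3.0, 4.0]], always_fails, always_works)
```

With this shape, inputs that converge pay no extra cost, and only the failing inputs incur both the wasted fast attempt and the slower robust solve.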