[CUDA][Linalg] Add gesvd as SVD fallback; optimize SVD gesvdj performance by xwang233 · Pull Request #64533 · pytorch/pytorch

Closed
xwang233 wants to merge 19 commits

Conversation

@xwang233 (Collaborator) commented Sep 6, 2021

@facebook-github-bot (Contributor) commented Sep 6, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit af60e5c (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@lezcano (Collaborator) left a comment

Thank you for the PR, Xiao! Left a few comments.
Note that this PR also
Fixes #28293

// cusolver gesvd only supports m >= n, we transpose the whole svd calculation if m < n
// i.e.: A = U S V^H --> A^T = V^H^T S U^T
if (m < n) {
apply_svd_lib_gesvd(self.transpose(-2, -1), VT, S, U, infos, compute_uv, some, calculate_all_batches, batches);
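The wide-matrix fallback quoted above can be sketched in pure Python. Here `svd_tall` is a hypothetical stand-in for any SVD routine that only supports m >= n (like cusolver gesvd), not the actual cuSOLVER API; matrices are plain lists of lists for illustration:

```python
def transpose(A):
    """Transpose a list-of-lists matrix."""
    return [list(row) for row in zip(*A)]

def svd_any_shape(svd_tall, A):
    """SVD of A via svd_tall, which only accepts m >= n.

    For a wide A (m < n), use (A^T) = U S Vh with the tall routine;
    then A = (U S Vh)^T = Vh^T S U^T, so the U and V factors are
    swapped and transposed on the way back out.
    """
    m, n = len(A), len(A[0])
    if m >= n:
        return svd_tall(A)
    U, S, Vh = svd_tall(transpose(A))   # A^T = U S Vh
    return transpose(Vh), S, transpose(U)
```

With a tall-only routine plugged in, the wrapper returns factors that reconstruct the original wide matrix, which is exactly the trick the quoted C++ comment describes (in the real code, for complex inputs, conjugate transposes are involved).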
Collaborator:
Very slick!
If you want, via this trick you could implement #9083 for gesvdj which would save a copy in many cases.

@lezcano (Collaborator) commented Sep 6, 2021

This PR also partially addresses #4689. Perhaps, together with the previous work from @IvanYashchuk, it even fully addresses it?

@lezcano (Collaborator) commented Sep 6, 2021

@IvanYashchuk confirmed that this PR also
Fixes #4689

On a different note, if I'm not mistaken, when the input matrix contains NaNs the current behaviour is to execute both algorithms before failing with an uninformative error in the TORCH_CUSOLVER_CHECK. Since we never optimise for errors, executing both algorithms before failing is better than synchronising and checking for NaNs before launching the SVD, but the error reporting should hopefully be improved. Would it be fair to say that gesvd just returns CUSOLVER_STATUS_INTERNAL_ERROR when there are NaNs in the input? If so, we could check for that and then throw a better error message.

@lezcano (Collaborator) commented Sep 6, 2021

Also, should we document this behaviour? If so, how should we advertise this? cc @mruberry
I am tempted to say that it's fine not to write this in the docs as it's more of an "implementation detail" and (at the moment) there's not much that the user can do about it. Even more, the user will be informed of this behaviour once it happens (if it happens).

@mruberry mruberry requested a review from ngimel September 7, 2021 03:33
@IvanYashchuk (Collaborator) left a comment

I have only minor suggestions, such as using full_matrices for clarity and using const references. I'd also like to see a clarification on the view+copy in the gesvd path for the VT tensor.

@mrshenli added labels on Sep 7, 2021: module: cuda (Related to torch.cuda, and CUDA support in general), module: linear algebra (Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
@ngimel (Collaborator) commented Sep 8, 2021

> Also, should we document this behaviour? If so, how should we advertise this? cc @mruberry
> I am tempted to say that it's fine not to write this in the docs as it's more of an "implementation detail" and (at the moment) there's not much that the user can do about it. Even more, the user will be informed of this behaviour once it happens (if it happens).

@lezcano, I agree, it's OK not to write this in the docs; as you say, it's an implementation detail, and we don't optimize for fast throwing of errors.

@codecov (bot) commented Sep 8, 2021

Codecov Report

Merging #64533 (eac2a4a) into master (88c0ea9) will decrease coverage by 0.07%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #64533      +/-   ##
==========================================
- Coverage   66.73%   66.65%   -0.08%     
==========================================
  Files         710      714       +4     
  Lines       92435    92546     +111     
==========================================
+ Hits        61682    61685       +3     
- Misses      30753    30861     +108     

svd(a)
error_msg = r'\(Batch element 1\): The algorithm failed to converge' if self.device_type == 'cpu' \
else 'CUSOLVER_STATUS_EXECUTION_FAILED'
@xwang233 (Collaborator, Author) commented Sep 10, 2021:

I'm not quite certain how to change this test. cuSOLVER gesvd reports CUSOLVER_STATUS_EXECUTION_FAILED when the matrix contains NaN.

cc @IvanYashchuk

Collaborator:
Could we do something like this?
#64818 (comment)

@xwang233 (Author):

> About the error, a cool way to handle this would be to do an input.isnan().any() before parsing the error message in the function that parses the error

I don't think this is a good idea. First, Tensor.isnan() takes extra time. Besides that, if (Tensor.isnan().any()) causes a device-host sync, which further hurts performance.

Collaborator:

Note that I propose doing this only once we know that an error has occurred, so we would only be pessimising the code just before throwing an exception.
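The proposal above (scan for NaNs only on the failure path, so the successful path pays nothing) might look like this sketch; `run_solver` and `has_nan` are hypothetical stand-ins, not PyTorch or cuSOLVER APIs:

```python
def solve_with_better_error(run_solver, has_nan, A):
    """Run the solver; only if it fails, pay for a NaN scan (and, in
    the real CUDA setting, a device-host sync) to improve the error
    message. The happy path is unaffected."""
    status = run_solver(A)
    if status == "ok":
        return status
    # We are about to throw anyway, so extra work here is acceptable.
    if has_nan(A):
        raise ValueError("svd: input contains NaN values")
    raise RuntimeError("svd: solver failed with status " + status)
```

The design point is the same one made in the thread: errors are never the fast path, so pessimising the code between detecting a failure and raising costs nothing in practice.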

@@ -490,6 +669,8 @@ std::tuple<Tensor, Tensor, Tensor> _svd_helper_cuda_lib(const Tensor& self, bool
VT_working_copy.transpose_(-2, -1);
}

// VT_working_copy is row-major, actually "V" now

// heuristic for using `gesvdjBatched` over `gesvdj`
if (m <= 32 && n <= 32 && batch_size > 1 && (!some || m == n)) {
apply_svd_lib_gesvdjBatched(self, U_working_copy, S_working_copy, VT_working_copy, infos, compute_uv);
@xwang233 (Collaborator, Author) commented Sep 10, 2021:

Since we don't have a switch to select the backend among gesvd, gesvdj, and gesvdjBatched, I can't find a good way to test this gesvd implementation in the CI.

I tested this locally by replacing the apply_svd_lib_gesvdj below with apply_svd_lib_gesvd, and all linalg tests passed. Also tested numerical accuracy of various shapes with this script https://github.com/xwang233/code-snippet/blob/master/linalg/svd/svdprof.py.
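The dispatch heuristic quoted in the diff above can be expressed as a small predicate. This is a sketch mirroring the quoted condition, not PyTorch's actual API:

```python
def use_gesvdj_batched(m, n, batch_size, some):
    """Mirror of the quoted heuristic: prefer the batched Jacobi
    kernel (gesvdjBatched) only for small matrices in a genuine
    batch, and only when the 'some' (reduced-SVD) flag does not
    conflict with the batched kernel's square-output requirement."""
    return m <= 32 and n <= 32 and batch_size > 1 and (not some or m == n)
```

Everything that fails the predicate falls through to the per-matrix gesvdj path (with gesvd as the fallback this PR adds).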

@xwang233 xwang233 marked this pull request as draft September 10, 2021 05:50
@xwang233 xwang233 marked this pull request as ready for review September 10, 2021 07:36
@xwang233 (Collaborator, Author):

cc @mruberry @ngimel The PR is ready to go

@ngimel (Collaborator) commented Sep 29, 2021

@lezcano @IvanYashchuk do you have any additional comments on this PR?

@lezcano (Collaborator) left a comment

I think there's just this minor point outstanding: #64533 (comment)
Otherwise this LGTM.

@pytorch-probot (bot) commented Oct 21, 2021
CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/xwang233/pytorch/blob/af60e5c15bfae21146bc1e346704ce3e265c91d9/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default,ciflow/all

Workflows Labels (bold enabled) Status
Triggered Workflows
caffe2-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux ✅ triggered
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux ✅ triggered
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux ✅ triggered
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow ✅ triggered
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3-clang5-mobile-build ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-code-analysis ciflow/all, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-dynamic ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3-clang5-mobile-custom-build-static ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux ✅ triggered
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled ✅ triggered
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck ✅ triggered
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled ✅ triggered
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@IvanYashchuk (Collaborator):

@mruberry, @ngimel could you please take a look and help merge this PR?

@ngimel (Collaborator) commented Oct 22, 2021

ROCm error is real https://ci.pytorch.org/jenkins/job/pytorch-builds/job/pytorch-linux-bionic-rocm4.3.1-py3.6-test1/6123/console

@xwang233 (Collaborator, Author):

@ngimel ROCm test has been fixed. All tests passed and the PR is ready to go.

@facebook-github-bot (Contributor):

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request on Oct 26, 2021:
[CUDA][Linalg] Add gesvd as SVD fallback; optimize SVD gesvdj performance (#64533)

Summary:
Fix #64237
Fix #28293
Fix #4689

See also #47953

cc ngimel jianyuh nikitaved pearu mruberry walterddr IvanYashchuk xwang233 Lezcano

Pull Request resolved: #64533

Reviewed By: albanD

Differential Revision: D31915794

Pulled By: ngimel

fbshipit-source-id: 29ea48696531ced8a48474e891a9e2d5f11e9d7a
facebook-github-bot pushed a commit that referenced this pull request Nov 23, 2021
…68683)

Summary:
In SVD cusolverDnXgesvd computations,

When CUDA < 11.5, cusolver raises CUSOLVER_STATUS_EXECUTION_FAILED when the input contains NaN.
When CUDA >= 11.5, cusolver finishes execution normally and sets the info array to indicate a convergence issue.

Related: #68259 #64533

Pull Request resolved: #68683

Reviewed By: dagitses

Differential Revision: D32583576

Pulled By: mruberry

fbshipit-source-id: f732872522e0bda2703450ffcc64ae3a0d3f5bc0
zenoone added a commit to zenoone/deepqmc that referenced this pull request Jan 10, 2022
The issue with the evaluation of the svd on the gpu was fixed in
pytorch/pytorch#64533.