Add cuSOLVER path for torch.linalg.qr by IvanYashchuk · Pull Request #56256 · pytorch/pytorch · GitHub

Add cuSOLVER path for torch.linalg.qr #56256

Closed
wants to merge 14 commits into from

IvanYashchuk
Collaborator
@IvanYashchuk IvanYashchuk commented Apr 16, 2021

Stack from ghstack:

With the cuSOLVER path, running `pytest test/test_ops.py -k 'linalg_qr' --durations=5` takes about 1 minute less locally. See #56256 (comment).

Performance comparison: #56256 (comment).

Differential Revision: D27960154

Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally.
Ref. #51552

[ghstack-poisoned]
@facebook-github-bot
Contributor
facebook-github-bot commented Apr 16, 2021

💊 CI failures summary and remediations

As of commit 9cc5653 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 16, 2021
Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally.
Ref. pytorch#51552

ghstack-source-id: 4f0cbb7
Pull Request resolved: pytorch#56256
@IvanYashchuk IvanYashchuk removed the request for review from ezyang April 16, 2021 10:03
@IvanYashchuk IvanYashchuk added the module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul label Apr 16, 2021
@IvanYashchuk IvanYashchuk requested a review from mruberry April 16, 2021 10:04
@IvanYashchuk
Collaborator Author

Time spent running `pytest test/test_ops.py -k 'linalg_qr' --durations=5`:
cuSOLVER:

====================================================== slowest 5 durations =======================================================
8.03s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_complex64
2.67s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_float32
2.65s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_qr_cpu_complex128
1.73s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_complex128
1.37s call     test/test_ops.py::TestOpInfoCUDA::test_duplicate_method_tests_linalg_qr_cuda_float32
================================= 49 passed, 41 skipped, 12294 deselected, 5 warnings in 31.98s ==================================

MAGMA:

====================================================== slowest 5 durations =======================================================
39.57s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_complex128
11.12s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_float64
5.31s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_float32
5.28s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_complex64
2.75s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_qr_cpu_complex128
============================ 49 passed, 41 skipped, 12294 deselected, 5 warnings in 81.28s (0:01:21) =============================

@IvanYashchuk
Collaborator Author
IvanYashchuk commented Apr 16, 2021

Here is a MAGMA vs. cuSOLVER comparison for non-batched square inputs with modes 'complete', 'reduced', and 'r':

|                          | cuSOLVER, 'complete' | MAGMA, 'complete' | cuSOLVER, 'reduced' | MAGMA, 'reduced' | cuSOLVER, 'r' | MAGMA, 'r' |
|--------------------------|----------------------|-------------------|---------------------|------------------|---------------|------------|
| torch.Size([2, 2])       | 0.084                | 8.0               | 0.0774              | 7.6              | 0.0504        | 3.3        |
| torch.Size([8, 8])       | 0.0877               | 7.6               | 0.0872              | 8.1              | 0.0474        | 3.2        |
| torch.Size([16, 16])     | 0.158                | 7.6               | 0.1569              | 8.3              | 0.1577        | 3.3        |
| torch.Size([32, 32])     | 0.4164               | 7.6               | 0.413               | 8.5              | 0.2835        | 3.3        |
| torch.Size([64, 64])     | 0.9334               | 8.0               | 0.9257              | 8.4              | 0.6559        | 3.3        |
| torch.Size([128, 128])   | 2.0622               | 9.3               | 2.045               | 9.8              | 1.554         | 3.9        |
| torch.Size([256, 256])   | 3.5756               | 12.4              | 3.548               | 12.9             | 2.342         | 5.1        |
| torch.Size([512, 512])   | 8.6611               | 17.4              | 8.593               | 18.7             | 5.797         | 8.3        |
| torch.Size([1024, 1024]) | 23.4609              | 36.9              | 23.342              | 37.4             | 15.196        | 15.6       |
| torch.Size([2048, 2048]) | 92.3197              | 118.7             | 92.247              | 120.1            | 54.483        | 43.9       |
| torch.Size([4096, 4096]) | 497.0645             | 694.1             | 494.418             | 695.7            | 277.952       | 243.5      |
| torch.Size([8192, 8192]) | 3267.1995            | 4603.7            | 3250.727            | 4617.3           | 1713.537      | 1536.7     |

Times are in milliseconds (ms).

MAGMA is faster than cuSOLVER only for large inputs (2048×2048 and above in this table) with mode='r'. In all other cases cuSOLVER is faster.
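
To make the crossover point explicit, here is a small sketch that recomputes it from the timings in the table above. The numbers are copied from the table; the helper names `magma_wins` and `speedup` are made up for this illustration and are not part of the PR.

```python
# Matrix side lengths n for the n x n inputs in the table.
sizes = [2, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192]

# Timings in milliseconds, copied from the table above.
cusolver_ms = {
    "complete": [0.084, 0.0877, 0.158, 0.4164, 0.9334, 2.0622,
                 3.5756, 8.6611, 23.4609, 92.3197, 497.0645, 3267.1995],
    "reduced": [0.0774, 0.0872, 0.1569, 0.413, 0.9257, 2.045,
                3.548, 8.593, 23.342, 92.247, 494.418, 3250.727],
    "r": [0.0504, 0.0474, 0.1577, 0.2835, 0.6559, 1.554,
          2.342, 5.797, 15.196, 54.483, 277.952, 1713.537],
}
magma_ms = {
    "complete": [8.0, 7.6, 7.6, 7.6, 8.0, 9.3,
                 12.4, 17.4, 36.9, 118.7, 694.1, 4603.7],
    "reduced": [7.6, 8.1, 8.3, 8.5, 8.4, 9.8,
                12.9, 18.7, 37.4, 120.1, 695.7, 4617.3],
    "r": [3.3, 3.2, 3.3, 3.3, 3.3, 3.9,
          5.1, 8.3, 15.6, 43.9, 243.5, 1536.7],
}


def magma_wins(mode):
    """Return the sizes n where MAGMA was faster than cuSOLVER for this mode."""
    return [n for n, c, m in zip(sizes, cusolver_ms[mode], magma_ms[mode]) if m < c]


def speedup(mode):
    """Map n -> MAGMA time / cuSOLVER time; values above 1 favor cuSOLVER."""
    return {n: m / c for n, c, m in zip(sizes, cusolver_ms[mode], magma_ms[mode])}


if __name__ == "__main__":
    for mode in ("complete", "reduced", "r"):
        print(mode, "MAGMA faster at n =", magma_wins(mode))
```

For 'complete' and 'reduced' the list of sizes where MAGMA wins is empty; for 'r' it starts at n = 2048.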

Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally.
Ref. #51552, #47953

[ghstack-poisoned]
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 16, 2021
Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally.
Ref. pytorch#51552

ghstack-source-id: 4f5361a
Pull Request resolved: pytorch#56256
@@ -1777,7 +1777,7 @@ void linalg_qr_out_helper(const Tensor& input, const Tensor& Q, const Tensor& R,
orgqr_stub(input.device().type(), const_cast<Tensor&>(Q), tau);
}

-std::tuple<Tensor, Tensor> _linalg_qr_helper_cpu(const Tensor& input, std::string mode) {
+std::tuple<Tensor, Tensor> _linalg_qr_helper_default(const Tensor& input, std::string mode) {
Collaborator

Why "default" and not "cpu"?

Collaborator Author

We now have linalg_qr_helper_magma, which uses MAGMA for the QR decomposition. It can't be implemented using geqrf_stub + orgqr_stub because orgqr_stub only supports cuSOLVER for CUDA inputs. In addition, MAGMA doesn't follow the LAPACK API for the geqrf and orgqr operations that together form the QR decomposition. That's why we need a separate function for MAGMA.

The function is named _linalg_qr_helper_default, with "_default" rather than "_cpu", because it supports both CPU and CUDA inputs; for CUDA inputs, cuSOLVER and cuBLAS are used.
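
For readers unfamiliar with the geqrf + orgqr split, here is a minimal pure-Python sketch of the two-step idea for real matrices. It is an illustration only and not the PyTorch code: the functions `geqrf`, `orgqr` and `qr_two_step` are hypothetical stand-ins that borrow the LAPACK routine names, and they operate on plain lists of lists rather than tensors.

```python
import math


def geqrf(a):
    """Step 1: Householder factorization with LAPACK-style output.

    Returns (packed, tau): the upper triangle of `packed` holds R, the part
    below the diagonal holds the reflector vectors v (with implicit v[k] = 1),
    and H_k = I - tau[k] * v * v^T.
    """
    m, n = len(a), len(a[0])
    a = [row[:] for row in a]  # work on a copy
    tau = []
    for k in range(min(m, n)):
        alpha = a[k][k]
        xnorm = math.sqrt(sum(a[i][k] ** 2 for i in range(k + 1, m)))
        if xnorm == 0.0:
            tau.append(0.0)  # column already reduced, H_k = I
            continue
        beta = -math.copysign(math.hypot(alpha, xnorm), alpha)
        t = (beta - alpha) / beta
        scale = 1.0 / (alpha - beta)
        for i in range(k + 1, m):
            a[i][k] *= scale  # store v below the diagonal
        a[k][k] = beta
        # Apply H_k to the trailing columns.
        for j in range(k + 1, n):
            dot = a[k][j] + sum(a[i][k] * a[i][j] for i in range(k + 1, m))
            a[k][j] -= t * dot
            for i in range(k + 1, m):
                a[i][j] -= t * a[i][k] * dot
        tau.append(t)
    return a, tau


def orgqr(packed, tau):
    """Step 2: form the reduced Q = H_0 H_1 ... H_{k-1} applied to I[:, :k]."""
    m, k = len(packed), len(tau)
    q = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(m)]
    for r in reversed(range(k)):  # apply the last reflector first
        for j in range(k):
            dot = q[r][j] + sum(packed[i][r] * q[i][j] for i in range(r + 1, m))
            q[r][j] -= tau[r] * dot
            for i in range(r + 1, m):
                q[i][j] -= tau[r] * packed[i][r] * dot
    return q


def qr_two_step(a):
    """Reduced QR built from the two steps above (hypothetical helper)."""
    packed, tau = geqrf(a)
    k, n = len(tau), len(a[0])
    r = [[packed[i][j] if j >= i else 0.0 for j in range(n)] for i in range(k)]
    return orgqr(packed, tau), r
```

For example, `Q, R = qr_two_step([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])` gives a 3×2 `Q` with orthonormal columns and a 2×2 upper-triangular `R` whose product reproduces the input up to rounding.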

Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally. See #56256 (comment).

Performance comparison: #56256 (comment).

Differential Revision: [D27960154](https://our.internmc.facebook.com/intern/diff/D27960154)

[ghstack-poisoned]
IvanYashchuk added a commit that referenced this pull request Apr 26, 2021
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 26, 2021
Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally.
Ref. pytorch#51552

ghstack-source-id: 2f98cde
Pull Request resolved: pytorch#56256
@mruberry
Collaborator

Time to start landing the second part of this stack!

@xwang233 would you take a look at this PR in the stack?

Collaborator
@xwang233 xwang233 left a comment

The PR is very concise and LGTM.

IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 29, 2021
Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally.
Ref. pytorch#51552

ghstack-source-id: e94b357
Pull Request resolved: pytorch#56256
IvanYashchuk added a commit that referenced this pull request Apr 29, 2021
IvanYashchuk added a commit to IvanYashchuk/pytorch that referenced this pull request Apr 29, 2021
Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally.
Ref. pytorch#51552

ghstack-source-id: 574f15d
Pull Request resolved: pytorch#56256
Collaborator
@mruberry mruberry left a comment

Stamped

@facebook-github-bot
Contributor

@mruberry merged this pull request in ff59039.

@facebook-github-bot facebook-github-bot deleted the gh/ivanyashchuk/17/head branch May 4, 2021 14:16
krshrimali pushed a commit to krshrimali/pytorch that referenced this pull request May 19, 2021
Summary:
Pull Request resolved: pytorch#56256

Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr'
--durations=5` cuts the runtime for these tests by 1 minute locally. See pytorch#56256 (comment).

Performance comparison: pytorch#56256 (comment).

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960154

Pulled By: mruberry

fbshipit-source-id: 5312330d82337dec2856ec5527156a3a547a0b50
Labels
cla signed Merged module: linear algebra Issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul open source