Add cuSOLVER path for torch.linalg.qr #56256

IvanYashchuk · 2021-04-16T09:51:03Z

Stack from ghstack:

Add cuSOLVER path for torch.linalg.lstsq #57317
Add CUDA support for torch.ormqr #57316
Port CPU torch.ormqr to ATen #57315
Fix torch.ormqr for non Fortran-contiguous inputs #57314
Fix MAGMA qr for empty batched inputs #56257
Add cuSOLVER path for torch.linalg.qr #56256
Remove size arguments for internal orgqr and geqrf calls #56255
Add non-allocating helper function for torch.linalg.qr #56254

Using the cuSOLVER path cuts the runtime of `pytest test/test_ops.py -k 'linalg_qr' --durations=5` by close to a minute locally (81.28 s with MAGMA versus 31.98 s with cuSOLVER in the run reported below). See #56256 (comment).

Performance comparison: #56256 (comment).

Differential Revision: D27960154

facebook-github-bot · 2021-04-16T09:51:34Z

💊 CI failures summary and remediations

As of commit 9cc5653 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚

This comment was automatically generated by Dr. CI.

IvanYashchuk · 2021-04-16T10:06:39Z

Time spent running `pytest test/test_ops.py -k 'linalg_qr' --durations=5`.
cuSOLVER:

====================================================== slowest 5 durations =======================================================
8.03s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_complex64
2.67s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_float32
2.65s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_qr_cpu_complex128
1.73s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_complex128
1.37s call     test/test_ops.py::TestOpInfoCUDA::test_duplicate_method_tests_linalg_qr_cuda_float32
================================= 49 passed, 41 skipped, 12294 deselected, 5 warnings in 31.98s ==================================

MAGMA:

====================================================== slowest 5 durations =======================================================
39.57s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_complex128
11.12s call     test/test_ops.py::TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_float64
5.31s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_float32
5.28s call     test/test_ops.py::TestCommonCUDA::test_variant_consistency_jit_linalg_qr_cuda_complex64
2.75s call     test/test_ops.py::TestGradientsCPU::test_fn_gradgrad_linalg_qr_cpu_complex128
============================ 49 passed, 41 skipped, 12294 deselected, 5 warnings in 81.28s (0:01:21) =============================
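
As a quick check of the difference between the two runs, the totals can be pulled out of the pytest summary lines above. This is only a sketch: the `SUMMARIES` dictionary and the `total_seconds` helper are names made up for this illustration, and the strings are copied by hand from the output above.

```python
import re

# Summary lines copied from the two pytest runs above (decorative "=" stripped).
SUMMARIES = {
    "cuSOLVER": "49 passed, 41 skipped, 12294 deselected, 5 warnings in 31.98s",
    "MAGMA": "49 passed, 41 skipped, 12294 deselected, 5 warnings in 81.28s (0:01:21)",
}


def total_seconds(summary: str) -> float:
    """Extract the wall-clock total (in seconds) from a pytest summary line."""
    match = re.search(r" in ([0-9.]+)s", summary)
    if match is None:
        raise ValueError(f"no duration found in: {summary!r}")
    return float(match.group(1))


times = {backend: total_seconds(line) for backend, line in SUMMARIES.items()}
saved = times["MAGMA"] - times["cuSOLVER"]   # seconds saved by the cuSOLVER path
ratio = times["MAGMA"] / times["cuSOLVER"]   # how much longer the MAGMA run took

print(f"saved {saved:.2f}s, MAGMA run took {ratio:.2f}x as long")
```

For this particular run the saving works out to roughly 49 seconds; most of it comes from the two `TestGradientsCUDA::test_fn_grad_linalg_qr_cuda_*` tests that dominate the MAGMA list.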

IvanYashchuk · 2021-04-16T10:23:09Z

Here is a MAGMA vs. cuSOLVER comparison for non-batched square inputs in modes 'complete', 'reduced', and 'r':

|                          | cuSOLVER, 'complete' | MAGMA, 'complete' | cuSOLVER, 'reduced' | MAGMA, 'reduced' | cuSOLVER, 'r' | MAGMA, 'r' |
|--------------------------|----------------------|-------------------|---------------------|------------------|---------------|------------|
| torch.Size([2, 2])       | 0.084                | 8.0               | 0.0774              | 7.6              | 0.0504        | 3.3        |
| torch.Size([8, 8])       | 0.0877               | 7.6               | 0.0872              | 8.1              | 0.0474        | 3.2        |
| torch.Size([16, 16])     | 0.158                | 7.6               | 0.1569              | 8.3              | 0.1577        | 3.3        |
| torch.Size([32, 32])     | 0.4164               | 7.6               | 0.413               | 8.5              | 0.2835        | 3.3        |
| torch.Size([64, 64])     | 0.9334               | 8.0               | 0.9257              | 8.4              | 0.6559        | 3.3        |
| torch.Size([128, 128])   | 2.0622               | 9.3               | 2.045               | 9.8              | 1.554         | 3.9        |
| torch.Size([256, 256])   | 3.5756               | 12.4              | 3.548               | 12.9             | 2.342         | 5.1        |
| torch.Size([512, 512])   | 8.6611               | 17.4              | 8.593               | 18.7             | 5.797         | 8.3        |
| torch.Size([1024, 1024]) | 23.4609              | 36.9              | 23.342              | 37.4             | 15.196        | 15.6       |
| torch.Size([2048, 2048]) | 92.3197              | 118.7             | 92.247              | 120.1            | 54.483        | 43.9       |
| torch.Size([4096, 4096]) | 497.0645             | 694.1             | 494.418             | 695.7            | 277.952       | 243.5      |
| torch.Size([8192, 8192]) | 3267.1995            | 4603.7            | 3250.727            | 4617.3           | 1713.537      | 1536.7     |

Times are in milliseconds (ms).

MAGMA is faster than cuSOLVER only for large inputs (2048×2048 and larger in this table) with mode='r'. In all other cases cuSOLVER is faster.
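
To make the crossover explicit, the table can be re-read programmatically. The snippet below is a sketch, not part of the benchmark script: `TIMINGS_MS`, `magma_wins`, and `speedup` are hypothetical names, and the numbers are copied by hand from the table above.

```python
# Timings in ms copied from the table above.
# Layout: matrix size n (for an n x n input) -> {mode: (cuSOLVER, MAGMA)}.
TIMINGS_MS = {
    2:    {"complete": (0.084, 8.0),        "reduced": (0.0774, 7.6),      "r": (0.0504, 3.3)},
    8:    {"complete": (0.0877, 7.6),       "reduced": (0.0872, 8.1),      "r": (0.0474, 3.2)},
    16:   {"complete": (0.158, 7.6),        "reduced": (0.1569, 8.3),      "r": (0.1577, 3.3)},
    32:   {"complete": (0.4164, 7.6),       "reduced": (0.413, 8.5),       "r": (0.2835, 3.3)},
    64:   {"complete": (0.9334, 8.0),       "reduced": (0.9257, 8.4),      "r": (0.6559, 3.3)},
    128:  {"complete": (2.0622, 9.3),       "reduced": (2.045, 9.8),       "r": (1.554, 3.9)},
    256:  {"complete": (3.5756, 12.4),      "reduced": (3.548, 12.9),      "r": (2.342, 5.1)},
    512:  {"complete": (8.6611, 17.4),      "reduced": (8.593, 18.7),      "r": (5.797, 8.3)},
    1024: {"complete": (23.4609, 36.9),     "reduced": (23.342, 37.4),     "r": (15.196, 15.6)},
    2048: {"complete": (92.3197, 118.7),    "reduced": (92.247, 120.1),    "r": (54.483, 43.9)},
    4096: {"complete": (497.0645, 694.1),   "reduced": (494.418, 695.7),   "r": (277.952, 243.5)},
    8192: {"complete": (3267.1995, 4603.7), "reduced": (3250.727, 4617.3), "r": (1713.537, 1536.7)},
}


def magma_wins(mode):
    """Return the sizes for which MAGMA was faster than cuSOLVER in `mode`."""
    return [n for n, row in TIMINGS_MS.items() if row[mode][1] < row[mode][0]]


def speedup(n, mode):
    """Return MAGMA time divided by cuSOLVER time (>1 means cuSOLVER is faster)."""
    cusolver, magma = TIMINGS_MS[n][mode]
    return magma / cusolver


for mode in ("complete", "reduced", "r"):
    print(mode, "MAGMA faster for sizes:", magma_wins(mode))
```

The gap is largest for small matrices, where MAGMA's time stays at several milliseconds regardless of size while cuSOLVER finishes in well under a millisecond.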

mruberry · 2021-04-23T03:27:57Z

aten/src/ATen/native/BatchLinearAlgebra.cpp

@@ -1777,7 +1777,7 @@ void linalg_qr_out_helper(const Tensor& input, const Tensor& Q, const Tensor& R,
  orgqr_stub(input.device().type(), const_cast<Tensor&>(Q), tau);
 }

-std::tuple<Tensor, Tensor> _linalg_qr_helper_cpu(const Tensor& input, std::string mode) {
+std::tuple<Tensor, Tensor> _linalg_qr_helper_default(const Tensor& input, std::string mode) {


Why "default" and not "cpu"?

We now have linalg_qr_helper_magma, which uses MAGMA for the QR decomposition. It can't be implemented using geqrf_stub + orgqr_stub because orgqr_stub only supports cuSOLVER for CUDA inputs. In addition, MAGMA doesn't follow the LAPACK API for the geqrf and orgqr operations that together form the QR decomposition, which is why we need a separate function for MAGMA.

The other function is named _linalg_qr_helper_default, with "_default" rather than "_cpu", because it supports both CPU and CUDA inputs; for CUDA inputs, cuSOLVER and cuBLAS are used.
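
For readers unfamiliar with the two-step scheme, here is a minimal pure-Python sketch of how a geqrf step (Householder reflectors plus `tau`) followed by an orgqr step (building Q from those reflectors) yields a reduced QR decomposition. It follows the LAPACK storage convention but is not the ATen, LAPACK, or cuSOLVER code: the functions `geqrf`, `orgqr`, and `qr_reduced` below are illustrative helpers written for this example, and they assume a real m x n input with m >= n.

```python
import math


def geqrf(a):
    """Householder QR in LAPACK-style packed form (illustrative, real input only).

    Returns (packed, tau): R sits in the upper triangle of `packed`, the
    reflector vectors v_k (with an implicit leading 1) sit below the diagonal,
    and H_k = I - tau[k] * v_k * v_k^T.
    """
    m, n = len(a), len(a[0])
    a = [[float(x) for x in row] for row in a]  # work on a copy
    tau = []
    for k in range(min(m, n)):
        x0 = a[k][k]
        tail_sq = sum(a[i][k] ** 2 for i in range(k + 1, m))
        if tail_sq == 0.0:
            tau.append(0.0)  # column already reduced, H_k is the identity
            continue
        beta = -math.copysign(math.sqrt(x0 * x0 + tail_sq), x0)
        t = (beta - x0) / beta
        for i in range(k + 1, m):
            a[i][k] /= x0 - beta  # store v_k below the diagonal
        a[k][k] = beta
        # Apply H_k to the trailing columns.
        for j in range(k + 1, n):
            dot = a[k][j] + sum(a[i][k] * a[i][j] for i in range(k + 1, m))
            a[k][j] -= t * dot
            for i in range(k + 1, m):
                a[i][j] -= t * dot * a[i][k]
        tau.append(t)
    return a, tau


def orgqr(packed, tau):
    """Form the m x n matrix Q = H_1 H_2 ... H_k from the packed reflectors."""
    m, n = len(packed), len(packed[0])
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(m)]
    for k in reversed(range(len(tau))):
        for j in range(n):
            dot = q[k][j] + sum(packed[i][k] * q[i][j] for i in range(k + 1, m))
            q[k][j] -= tau[k] * dot
            for i in range(k + 1, m):
                q[i][j] -= tau[k] * dot * packed[i][k]
    return q


def qr_reduced(a):
    """Reduced QR: the geqrf step, then read off R, then the orgqr step."""
    packed, tau = geqrf(a)
    n = len(a[0])
    r = [[packed[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return orgqr(packed, tau), r


A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Q, R = qr_reduced(A)
```

A backend that exposes both steps with this packed layout can share one helper built from them. As described above, MAGMA does not follow that API, so it keeps its own function.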

mruberry · 2021-04-27T08:41:41Z

Time to start landing the second part of this stack!

@xwang233 would you take a look at this PR in the stack?

xwang233

The PR is very concise and LGTM.

mruberry

Stamped

facebook-github-bot · 2021-04-30T18:15:23Z

@mruberry merged this pull request in ff59039.

Summary:
Pull Request resolved: pytorch#56256

Using cuSOLVER path with `pytest test/test_ops.py -k 'linalg_qr' --durations=5` cuts the runtime for these tests by 1 minute locally. See pytorch#56256 (comment).

Performance comparison: pytorch#56256 (comment).

Test Plan: Imported from OSS

Reviewed By: ngimel

Differential Revision: D27960154

Pulled By: mruberry

fbshipit-source-id: 5312330d82337dec2856ec5527156a3a547a0b50

Add cuSOLVER path for torch.linalg.qr

0dbda48

IvanYashchuk requested a review from ezyang as a code owner April 16, 2021 09:51

facebook-github-bot added the cla signed label Apr 16, 2021

pytorchbot added the open source label Apr 16, 2021

IvanYashchuk removed the request for review from ezyang April 16, 2021 10:03

IvanYashchuk added the module: linear algebra label Apr 16, 2021

IvanYashchuk requested a review from mruberry April 16, 2021 10:04

Update on "Add cuSOLVER path for torch.linalg.qr"

b6901bc

Update on "Add cuSOLVER path for torch.linalg.qr"

415e924

Update on "Add cuSOLVER path for torch.linalg.qr"

703c7e0

IvanYashchuk added 2 commits April 19, 2021 12:10

Update on "Add cuSOLVER path for torch.linalg.qr"

9aabe59

Update on "Add cuSOLVER path for torch.linalg.qr"

0f6925f

mruberry reviewed Apr 23, 2021

xwang233 approved these changes Apr 27, 2021

This was referenced Apr 29, 2021

Fix torch.ormqr for non Fortran-contiguous inputs #57314

Closed

Port CPU torch.ormqr to ATen #57315

Closed

Add CUDA support for torch.ormqr #57316

Closed

Add cuSOLVER path for torch.linalg.lstsq #57317

Closed

IvanYashchuk mentioned this pull request Apr 29, 2021

Linear algebra GPU backend tracking issue [magma/cusolver/cublas] #47953

Open

mruberry approved these changes Apr 30, 2021

facebook-github-bot closed this in ff59039 Apr 30, 2021

facebook-github-bot added the Merged label Apr 30, 2021

facebook-github-bot deleted the gh/ivanyashchuk/17/head branch May 4, 2021 14:16

This was referenced Aug 30, 2021

Some test_qr CUDA tests in test_autograd are taking very long time #51552

Closed

switch CUDA svd and qr to using cuSolver #4689

Closed
