pinv: forward/backward AD which is Frechet-defined in a rank-preserving neighborhood. by nikitaved · Pull Request #66092 · pytorch/pytorch · GitHub

pinv: forward/backward AD which is Frechet-defined in a rank-preserving neighborhood. #66092


Closed
wants to merge 12 commits

Conversation

nikitaved
Collaborator
@nikitaved nikitaved commented Oct 4, 2021

Fixes #65911. Also enables complex support/tests for linalg_pinv in OpInfo.

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @jianyuh @mruberry @walterddr @IvanYashchuk @xwang233
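As a rough illustration of what the new support enables, here is a minimal sketch (not the PR's own tests); it assumes torch.linalg.pinv and torch.autograd.forward_ad, and uses a real-valued full-rank input for simplicity even though the PR also covers complex inputs:

```python
import torch
import torch.autograd.forward_ad as fwAD

# Minimal sketch (not the PR's tests): exercise reverse- and forward-mode AD
# of torch.linalg.pinv on a full-rank 5x3 matrix in double precision.
x = torch.randn(5, 3, dtype=torch.double, requires_grad=True)

# Reverse mode: numerical vs. analytical Jacobian check.
torch.autograd.gradcheck(torch.linalg.pinv, (x,))

# Forward mode: JVP of pinv at x along a random tangent, via dual tensors.
with fwAD.dual_level():
    tangent = torch.randn_like(x)
    dual = fwAD.make_dual(x.detach(), tangent)
    primal, jvp = fwAD.unpack_dual(torch.linalg.pinv(dual))

print(jvp.shape)  # torch.Size([3, 5])
```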

@nikitaved nikitaved added the labels module: autograd (related to torch.autograd, and the autograd engine in general), module: linear algebra (issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul), complex_autograd, and ci/slow-gradcheck on Oct 4, 2021
@pytorch-probot
pytorch-probot bot commented Oct 4, 2021
CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/2b5431eed4d51d4b7566af98116cbc3c649a0978/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

Workflows | Labels (bold = enabled) | Status
Triggered Workflows
linux-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla ✅ triggered
linux-vulkan-bionic-py3.6-clang9 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan ✅ triggered
linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-clang7-asan ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers ✅ triggered
linux-xenial-py3.6-clang7-onnx ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx ✅ triggered
linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
linux-xenial-py3.6-gcc7-bazel-test ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux ✅ triggered
win-vs2019-cpu-py3 ciflow/all, ciflow/cpu, ciflow/default, ciflow/win ✅ triggered
win-vs2019-cuda11.3-py3 ciflow/all, ciflow/cuda, ciflow/default, ciflow/win ✅ triggered
Skipped Workflows
libtorch-linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
libtorch-linux-xenial-cuda11.3-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux 🚫 skipped
linux-bionic-cuda10.2-py3.9-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
linux-xenial-cuda10.2-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow 🚫 skipped
parallelnative-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped
periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck 🚫 skipped
periodic-linux-xenial-cuda11.1-py3.6-gcc7 ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled 🚫 skipped
periodic-win-vs2019-cuda11.1-py3 ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win 🚫 skipped
puretorch-linux-xenial-py3.6-gcc5.4 ciflow/all, ciflow/cpu, ciflow/linux 🚫 skipped

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

Comment on lines +8950 to +8951
# Only large tensors show issues with implicit backward used prior to
# explicit backward implementation.
Collaborator Author
@nikitaved nikitaved Oct 4, 2021


A note: large tensors of low rank. In my environment I had to create a rank-1 30x30 matrix to see issues with repeated zero singular values in the backward of SVD.
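For context, a hypothetical repro along those lines (a sketch only; the seed, shapes, and normalization are illustrative, not taken from the PR):

```python
import torch

# Hypothetical repro sketch: a rank-1 30x30 matrix, i.e. 29 repeated zero
# singular values in its SVD, which is the regime where the old SVD-based
# implicit backward ran into trouble.
torch.manual_seed(0)
u = torch.randn(30, 1, dtype=torch.double)
v = torch.randn(30, 1, dtype=torch.double)
a = (u / u.norm()) @ (v / v.norm()).t()   # rank 1; the single nonzero singular value is 1
a.requires_grad_(True)

p = torch.linalg.pinv(a)
p.sum().backward()                        # exercises the explicit pinv backward
print(torch.isfinite(a.grad).all())       # expected: tensor(True)
```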

@ezyang ezyang removed their request for review October 4, 2021 23:16
@ezyang
Contributor
ezyang commented Oct 4, 2021

Not sure the appropriate FB reviewer has been tagged yet.

@facebook-github-bot
Contributor
facebook-github-bot commented Oct 5, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 2b5431e (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_xla_linux_bionic_py3_6_clang9_test (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

Oct 08 11:26:22 RuntimeError: tensorflow/compil...'rendezvous_test.0': Connection reset by peer (14)
Oct 08 11:26:22 Exception in device=CPU:1: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'rendezvous_test.0': Connection reset by peer (14)
Oct 08 11:26:22 Traceback (most recent call last):
Oct 08 11:26:22   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
Oct 08 11:26:22     _start_fn(index, pf_cfg, fn, args)
Oct 08 11:26:22   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
Oct 08 11:26:22     fn(gindex, *args)
Oct 08 11:26:22   File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 22, in _mp_fn
Oct 08 11:26:22     replicas=replicas)
Oct 08 11:26:22   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/core/xla_model.py", line 875, in rendezvous
Oct 08 11:26:22     return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
Oct 08 11:26:22 RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'rendezvous_test.0': Connection reset by peer (14)
Oct 08 11:26:23 Traceback (most recent call last):
Oct 08 11:26:23   File "/var/lib/jenkins/workspace/xla/test/test_mp_rendezvous.py", line 35, in <module>
Oct 08 11:26:23     xmp.spawn(_mp_fn, args=())
Oct 08 11:26:23   File "/opt/conda/lib/python3.6/site-packages/torch_xla-1.10-py3.6-linux-x86_64.egg/torch_xla/distributed/xla_multiprocessing.py", line 394, in spawn
Oct 08 11:26:23     start_method=start_method)
Oct 08 11:26:23   File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
Oct 08 11:26:23     while not context.join():
Oct 08 11:26:23   File "/opt/conda/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 144, in join
Oct 08 11:26:23     exit_code=exitcode
Oct 08 11:26:23 torch.multiprocessing.spawn.ProcessExitedException: process 3 terminated with exit code 17

XLA failure

Job pytorch_xla_linux_bionic_py3_6_clang9_test is failing. Please create an issue with a title prefixed by [PT_BREAK] in pytorch/xla and link to this PR. If you have questions, please reach out to @ailzhang / @dlibenzi / @JackCaoG.


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.


@nikitaved nikitaved changed the title from "pinv: forward/backward AD which is Frechet-differentiable in a rank-preserving neighborhood." to "pinv: forward/backward AD which is Frechet-defined in a rank-preserving neighborhood." on Oct 5, 2021
Collaborator
@lezcano lezcano left a comment


Nice! Thanks both for finding the slick formula for this backward and the rather compact implementation!
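For reference (the patch itself is not quoted in this thread), the classical rank-preserving differential of the pseudoinverse, from which such a backward can be derived, goes back to Golub and Pereyra:

$$
\mathrm{d}A^{+} = -A^{+}\,\mathrm{d}A\,A^{+}
  + A^{+} A^{+\mathrm{H}}\,\mathrm{d}A^{\mathrm{H}}\,(I - A A^{+})
  + (I - A^{+} A)\,\mathrm{d}A^{\mathrm{H}}\,A^{+\mathrm{H}} A^{+},
$$

valid whenever the perturbation keeps the rank constant; the VJP used in a backward follows by taking the adjoint of this linear map in $\mathrm{d}A$.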

Comment on lines 2120 to 2121
# Note that by making the columns of `a` and `b` orthonormal we make sure
# that the product matrix `a @ b.t()` has condition number 1.
Collaborator


Nice! This saves us a lot of pain in future debugging.

Now, this note is slightly incorrect. The resulting matrix will have singular values 0 and 1, so the condition number will be infinite! Perhaps you mean that it has condition number 1 when restricted to its image?

Collaborator Author
@nikitaved nikitaved Oct 5, 2021


Yes, exactly: condition number 1 when restricted to the image, so that pinv is stable.
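A small sketch of that construction (illustrative shapes and dtype, not the exact sample-input helper from the PR):

```python
import torch

# With orthonormal columns in `a` and `b`, the nonzero singular values of
# a @ b.t() are all exactly 1, so the matrix has condition number 1 when
# restricted to its image, which keeps pinv numerically stable there.
m, n, rank = 30, 20, 3
a, _ = torch.linalg.qr(torch.randn(m, rank, dtype=torch.double))
b, _ = torch.linalg.qr(torch.randn(n, rank, dtype=torch.double))
t = a @ b.t()

s = torch.linalg.svdvals(t)
print(s[:rank])   # all ~1.0; the remaining min(m, n) - rank singular values are ~0
```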

sample_inputs_func=sample_inputs_linalg_pinv_singular,
# Only large tensors show issues with implicit backward used prior to
# explicit backward implementation.
decorators=[slowTest, skipCUDAIfNoMagmaAndNoCusolver, skipCUDAIfRocm, skipCPUIfNoLapack],
Collaborator


Is the slowTest decorator working as expected here?

Collaborator Author


Yes!

Collaborator


@albanD It will apply the slowTest decorator to EVERY test generated by this OpInfo
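A toy illustration of that behavior (not the actual OpInfo machinery; all names here are made up):

```python
# Toy model: a per-op `decorators` list is applied to every test generated for
# the op, so a marker like `slowTest` ends up wrapping each of them.
def slow_test(fn):                      # stand-in for the real slowTest marker
    fn._slow = True
    return fn

def generate_tests(op_decorators):
    def grad_test(): ...
    def gradgrad_test(): ...
    tests = [grad_test, gradgrad_test]
    for dec in op_decorators:           # every generated test gets every decorator
        tests = [dec(t) for t in tests]
    return tests

tests = generate_tests([slow_test])
print(all(getattr(t, "_slow", False) for t in tests))   # True
```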

@mruberry
Collaborator
mruberry commented Oct 6, 2021

Cool! Do you have before/after perf numbers for the autograd, @nikitaved?

@nikitaved
Collaborator Author

@mruberry, I did run some benchmarks and, surprisingly, this PR also improves performance.
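The numbers below were gathered with a timing loop roughly like the following (a sketch only; the exact script is not part of this thread, and the helper name pinv_fwd_bwd, the shapes, and the use of torch.utils.benchmark are illustrative):

```python
import torch
import torch.utils.benchmark as benchmark

# Hypothetical harness: time forward + backward of linalg.pinv per shape.
def pinv_fwd_bwd(x):
    y = torch.linalg.pinv(x)
    y.sum().backward()
    x.grad = None           # don't let gradients accumulate across runs

for shape in [(10, 10), (1000, 10, 10), (100, 100), (1000, 100, 100)]:
    x = torch.randn(*shape, dtype=torch.float32, requires_grad=True)
    timer = benchmark.Timer(
        stmt="pinv_fwd_bwd(x)",
        globals={"pinv_fwd_bwd": pinv_fwd_bwd, "x": x},
    )
    print(shape, timer.blocked_autorange())
```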

This PR, cpu float32:

shape: (10, 10), device: cpu, dtype: torch.float32
29.8 µs ± 167 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

shape: (1000, 10, 10), device: cpu, dtype: torch.float32
797 µs ± 7.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (100, 100), device: cpu, dtype: torch.float32
273 µs ± 3.76 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 100, 100), device: cpu, dtype: torch.float32
56.4 ms ± 1.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

shape: (1000, 1000), device: cpu, dtype: torch.float32
11.7 ms ± 96.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (10, 1000, 1000), device: cpu, dtype: torch.float32
159 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Master, cpu float32:

shape: (10, 10), device: cpu, dtype: torch.float32
86.3 µs ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

shape: (1000, 10, 10), device: cpu, dtype: torch.float32
2.23 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (100, 100), device: cpu, dtype: torch.float32
535 µs ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 100, 100), device: cpu, dtype: torch.float32
174 ms ± 6.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

shape: (1000, 1000), device: cpu, dtype: torch.float32
26.9 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

shape: (10, 1000, 1000), device: cpu, dtype: torch.float32
392 ms ± 30.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This PR, cuda float32:

shape: (10, 10), device: cuda, dtype: torch.float32
111 µs ± 3.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

shape: (1000, 10, 10), device: cuda, dtype: torch.float32
332 µs ± 998 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (100, 100), device: cuda, dtype: torch.float32
111 µs ± 772 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

shape: (1000, 100, 100), device: cuda, dtype: torch.float32
7.25 ms ± 11.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (1000, 1000), device: cuda, dtype: torch.float32
3.21 ms ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (10, 1000, 1000), device: cuda, dtype: torch.float32
29.8 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Master, cuda float32:

shape: (10, 10), device: cuda, dtype: torch.float32
282 µs ± 15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 10, 10), device: cuda, dtype: torch.float32
565 µs ± 3.39 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (100, 100), device: cuda, dtype: torch.float32
312 µs ± 32.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape: (1000, 100, 100), device: cuda, dtype: torch.float32
11.8 ms ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (1000, 1000), device: cuda, dtype: torch.float32
4.72 ms ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

shape: (10, 1000, 1000), device: cuda, dtype: torch.float32
42.4 ms ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@lezcano
Collaborator
lezcano commented Oct 8, 2021

Faster and correct! There's no better combination than that :)

Collaborator
@albanD albanD left a comment


Looks good. Can you fix the last lint (EDIT: oh, it looks like the job itself failed...) and I'll merge this.

@facebook-github-bot
Contributor

@albanD has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Labels
cla signed, complex_autograd, module: autograd (related to torch.autograd, and the autograd engine in general), module: linear algebra (issues related to specialized linear algebra operations in PyTorch; includes matrix multiply matmul), open source
Development

Successfully merging this pull request may close these issues.

pinv could be differentiable on a wider range of inputs
7 participants