Improvements for associative_scan - Autograd #136966
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/136966.
Note: Links to docs will display an error until the docs builds have completed.
❗ 1 Active SEV: there is currently 1 active SEV; if your PR is affected, please view it below.
❌ 15 New Failures as of commit fc19798 with merge base a16476b.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed from a9366e6 to 3ab2330.
@pytorchbot label "topic: not user facing"
Force-pushed from 15de9c4 to 8d669c1.
Binary operator not working yet
*) Added more documentation
*) Created a function to compute the gradient for one leaf and vmapped it to compute all leaves in parallel
Force-pushed from 8d669c1 to fc19798.
results = [
-    torch.stack([e[leave_ind] for e in op(result_flat)], dim)
+    torch.concatenate([e[leave_ind] for e in op(result_flat)], dim)
Why is it a concatenate? If we associative_scan over a (4, 2, 3) tensor with dim=0, each subgraph should work on a slice of shape (2, 3), and the end result should be of shape (4, 2, 3). Is anything wrong with this interface? After the change, does the subgraph take (1, 2, 3), or does the result become (8, 3)?
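For reference, a minimal loop-based sketch of the shape semantics in question (the helper name reference_associative_scan and the use of torch.stack are illustrative assumptions, not the PR's implementation):

```python
import torch

def reference_associative_scan(combine_fn, x, dim=0):
    # Move the scan dim to the front; each combine_fn call then sees slices
    # of the remaining shape, e.g. (2, 3) for an input of shape (4, 2, 3).
    x = torch.movedim(x, dim, 0)
    outs = [x[0]]
    for i in range(1, x.shape[0]):
        outs.append(combine_fn(outs[-1], x[i]))
    # Stacking along a new dim keeps the scan dim: (4, 2, 3) in, (4, 2, 3) out.
    # Concatenating along an existing dim would instead yield e.g. (8, 3).
    return torch.movedim(torch.stack(outs, dim=0), 0, dim)

x = torch.arange(24, dtype=torch.float32).reshape(4, 2, 3)
out = reference_associative_scan(torch.add, x, dim=0)
print(out.shape)  # torch.Size([4, 2, 3])
```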
Oh, I feel this one should be deleted? _fake_scan is now in _higher_order_ops/scan.py.
from .builder import wrap_fx_proxy

args, kwargs = LazyVariableTracker.realize_all((args, kwargs))
-def arg_extractor(combine_fn, xs, dim):
-    return combine_fn, xs, dim
+def arg_extractor(combine_fn, xs):
I feel we should split this diff into two: first the dim change, then the autograd.
dim = utils.canonicalize_dim(ndim, dim)
# Move scan dim to 0 and always perform scan on dim 0
orig_scan_dim = dim
leaves = [shift_source_dim_to_target_dim(elem, int(dim), 0) for elem in leaves]
We might replace shift_source_dim_to_target_dim with torch.movedim(elem, int(dim), 0).
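A minimal sketch of the suggested replacement, assuming elem and dim as in the diff above (the scan call itself is elided):

```python
import torch

elem = torch.randn(4, 2, 3)
dim = 1

# torch.movedim is a built-in equivalent of the shift_source_dim_to_target_dim helper:
moved = torch.movedim(elem, int(dim), 0)  # bring the scan dim to the front
# ... perform the scan on dim 0 here ...
restored = torch.movedim(moved, 0, dim)   # move the scan dim back afterwards
assert torch.equal(restored, elem)
```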
result_flat = [torch.flip(elem, [0]) for elem in result_flat]

result_flat = [
    shift_source_dim_to_target_dim(elem, 0, orig_scan_dim) for elem in result_flat
]
shift_source_dim_to_target_dim -> movedim
return pytree.tree_unflatten(result_flat, spec)

# TODO: Provide inductor support for generic scan
What's missing in inductor for generic scan? Is it the test failure we talked about?
return (*outs,)

@staticmethod
def backward(ctx, *flat_grads_unmasked):
I trust you on this.
I didn't look into the details of the backward implementation. Some general thoughts for testing this better: can we use scan to implement a baseline version first, and add many more tests to verify correctness (e.g. nesting cond, scan, and associative_scan with autograd; more types of ops inside the body of associative_scan, such as different kinds of view ops, non-contiguous inputs and outputs)?
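A minimal sketch of the kind of baseline check suggested here, assuming torch.add as the combine_fn so that torch.cumsum (whose autograd is already well tested) can serve as the baseline; the function name baseline_cumsum_scan is illustrative, not part of the PR's test suite:

```python
import torch

def baseline_cumsum_scan(x, dim=0):
    # For combine_fn = torch.add, an associative scan reduces to a cumulative sum.
    return torch.cumsum(x, dim=dim)

# gradcheck needs double-precision inputs with requires_grad=True.
x = torch.randn(4, 2, 3, dtype=torch.float64, requires_grad=True)
assert torch.autograd.gradcheck(lambda inp: baseline_cumsum_scan(inp, dim=0), (x,))

# The same gradcheck call can then be pointed at associative_scan with
# combine_fn = torch.add to compare the two implementations.
```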
In this PR, the combine_fn is consistently called with a slice along the scan dim. It implements part of #136966. Pull Request resolved: #138858. Approved by: https://github.com/ydwu4
This is part of a series of PRs to improve the functionality of associative_scan. This specific PR implements Autograd support for associative_scan. It has been derived from #129307.
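A hedged usage sketch of what the Autograd support enables; the import path and signature of associative_scan are assumptions based on this PR's frontend and may differ across PyTorch versions (the call may also need to run under torch.compile depending on the build):

```python
import torch
from torch._higher_order_ops.associative_scan import associative_scan

def combine_fn(a, b):
    # Element-wise addition: the scan becomes a cumulative sum.
    return a + b

x = torch.randn(4, 2, 3, requires_grad=True)
out = associative_scan(combine_fn, x, dim=0)  # shape (4, 2, 3), same as x
out.sum().backward()                          # backward through the scan (this PR)
print(x.grad.shape)                           # torch.Size([4, 2, 3])
```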
@ydwu4
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @rec