Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs by izdeby · Pull Request #42533 · pytorch/pytorch · GitHub

Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs #42533


Closed
wants to merge 23 commits

Conversation

izdeby
Contributor
@izdeby izdeby commented Aug 4, 2020

Stack from ghstack:

First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar).

Motivation
GitHub issue
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors: launching a separate kernel per tensor slows down the whole step, so we need to reduce the number of kernel launches.
As an example, we should be looking at NVIDIA's Apex.
To track progress, we will take PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark it against the original model, against Apex's version with the original Adam optimizer, and against Apex's FusedAdam optimizer.

Current API restrictions

  • List can't be empty (will be fixed in upcoming PRs).
  • All tensors in the list must have the same dtype, device and size.

Broadcasting
At this point we don't support broadcasting.

What are the 'fast' and 'slow' routes
In some cases we can't process an op with the fast fused CUDA kernel for tensor lists. We can still fall back to a regular for-loop in which the op is applied to each tensor individually through the usual dispatch mechanism. A few checks decide whether the op is performed via the 'fast' or the 'slow' path (see the sketch after the list below).
To go the fast route,

  • All tensors must have strided layout
  • All tensors must be dense and not have overlapping memory
  • The resulting tensor type must be the same.
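As a rough illustration from the Python side (a minimal sketch with made-up shapes; it assumes a CUDA device is available, and which path a given input actually takes is an internal detail, the results are the same either way):

import torch

# Dense, same-shape, same-dtype CUDA tensors: candidates for the fast fused kernel.
a = [torch.randn(10, 10, device="cuda") for _ in range(5)]
b = [torch.randn(10, 10, device="cuda") for _ in range(5)]
res = torch._foreach_add(a, b)

# Strided but non-dense views (every other column) fail the density check,
# so the same call falls back to a per-tensor loop through regular dispatch.
a_views = [t[:, ::2] for t in a]
b_views = [t[:, ::2] for t in b]
res_views = torch._foreach_add(a_views, b_views)

# Either way, the results match plain per-tensor addition.
assert all(torch.allclose(r, x + y) for r, x, y in zip(res, a, b))
assert all(torch.allclose(r, x + y) for r, x, y in zip(res_views, a_views, b_views))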

In this PR

  • Adding a _foreach_add(TensorList tl1, TensorList tl2) API
  • Adding a _foreach_add_(TensorList tl1, TensorList tl2) API

Tests
Tested via unit tests
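For reference, a minimal sketch of the expected in-place behaviour (made-up values, mirroring what the unit tests check; CPU tensors shown for brevity):

import torch

tensors1 = [torch.zeros(10, 10) for _ in range(5)]
tensors2 = [torch.ones(10, 10) for _ in range(5)]

# In-place variant: tensors1 is mutated, no new output tensors are allocated.
torch._foreach_add_(tensors1, tensors2)
assert torch.equal(tensors1[0], torch.ones(10, 10))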

TODO

  1. Properly handle empty lists

Plan for the next PRs

  1. APIs
  • Binary Ops for list with Scalar
  • Binary Ops for list with list
  • Unary Ops for list
  • Pointwise Ops
  2. Complete tasks from TODO
  3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Differential Revision: D23331894

[ghstack-poisoned]
@dr-ci
dr-ci bot commented Aug 4, 2020

💊 CI failures summary and remediations

As of commit 729d53f (more details on the Dr. CI page):


  • 2/6 failures possibly* introduced in this PR
    • 1/2 non-CircleCI failure(s)
  • 4/6 broken upstream at merge base eb4199b on Aug 31 from 10:12am to 12:18pm PDT (16 commits; eb4199b - f7bae5b)

🕵️ 1 new failure recognized by patterns

The following CI failures do not appear to be due to upstream breakages:

See CircleCI build pytorch_windows_vs2019_py36_cuda10.1_test2 (1/1)

Step: "Test" (full log | diagnosis details | 🔁 rerun)

RuntimeError: test_foreach failed!
FAILED (errors=2) 
 
Generating XML reports... 
Generated XML report: test-reports\python-unittest\TEST-TestForeachCPU-20200901202623.xml 
Generated XML report: test-reports\python-unittest\TEST-TestForeachCUDA-20200901202623.xml 
Traceback (most recent call last): 
  File "run_test.py", line 734, in <module> 
    main() 
  File "run_test.py", line 717, in main 
    raise RuntimeError(err_message) 
RuntimeError: test_foreach failed! 
 
(base) circleci@PACKER-5F0EEC91 C:\Users\circleci\project\test>if ERRORLEVEL 1 exit /b 1  
+ cleanup
+ retcode=1
+ set +x

🚧 4 fixed upstream failures:

These were probably caused by upstream breakages that were already fixed.

Please rebase on the viable/strict branch:

Since your merge base is older than viable/strict, run these commands:

git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD

Check out the recency history of this "viable master" tracking branch.


ci.pytorch.org: 1 failed


This comment was automatically generated by Dr. CI.

Please report bugs/suggestions on the GitHub issue tracker or post in the (internal) Dr. CI Users group.

See how this bot performed.

This comment has been revised 83 times.

@izdeby izdeby changed the title [IGNORE] Added add_list APIs [IGNORE] Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) Aug 4, 2020
@izdeby izdeby changed the title [IGNORE] Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) [IGNORE] Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs Aug 4, 2020
…and _foreach_add_(TensorList tl1, TensorList tl2) APIs"

[ghstack-poisoned]
@izdeby izdeby changed the title [IGNORE] Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs Add _foreach_add(TensorList tl1, TensorList tl2) and _foreach_add_(TensorList tl1, TensorList tl2) APIs Aug 4, 2020
…ach_add_(TensorList tl1, TensorList tl2) APIs"



[ghstack-poisoned]
@gchanan
Contributor
gchanan commented Aug 10, 2020

which optimizer needs this?

@ngimel
Collaborator
ngimel commented Aug 10, 2020

Out-of-place list add, e.g. SGD: d_p = d_p.add(buf, alpha=momentum); in-place list add, e.g. Adam and AdamW: exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1).
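A rough sketch of the transformation these APIs enable (hypothetical parameter/gradient lists; only the plain list-with-list add from this PR is shown, since the scalar/alpha variants are follow-up work per the plan above):

import torch

params = [torch.randn(3, 3) for _ in range(100)]
grads = [torch.randn(3, 3) for _ in range(100)]

# Current optimizer style: one kernel launch per parameter tensor.
for p, g in zip(params, grads):
    p.add_(g)

# With the foreach API: a single call, and far fewer launches on the fast path.
torch._foreach_add_(params, grads)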

@gchanan
Contributor
gchanan commented Aug 10, 2020

broadcasting?

@ngimel
Collaborator
ngimel commented Aug 10, 2020

exp_avg and grad have the same size. If broadcasting happens due to grad being expanded (that should not be common) then fast path won't be taken.

torch._foreach_add_(tensors1, tensors2)
self.assertEqual(res, tensors1)
self.assertEqual(res[0], torch.ones(10, 10, device=device, dtype=dtype))

Contributor

It would be nice to test some edge cases (e.g., number of tensors in the list >= 100).
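For example, a hypothetical large-list case in the same style as the test above (it assumes the test class's device/dtype parametrization):

N = 100  # hypothetical edge-case size
tensors1 = [torch.zeros(10, 10, device=device, dtype=dtype) for _ in range(N)]
tensors2 = [torch.ones(10, 10, device=device, dtype=dtype) for _ in range(N)]
torch._foreach_add_(tensors1, tensors2)
for t in tensors1:
    self.assertEqual(t, torch.ones(10, 10, device=device, dtype=dtype))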


auto expected_dtype = tensors1[0].dtype();

for (int i = 0; i < tensors1.size(); i++) {
Contributor

The PR body implies that "All tensors in the list must have the same [device]". Should that be checked here?

Contributor Author

I have this check in check_fast_route, as we don't want to fail the op if tensors are on different devices, but just go via the regular route.

std::vector<Tensor> foreach_tensor_add_list_kernel_cuda(TensorList tensors1, TensorList tensors2) {
verify_list(tensors1, tensors2);

if (!check_fast_route(tensors1, tensors2)) {
Contributor

nit: As a function name, "check_fast_route" doesn't sound like it would return a boolean (it sounds like it would return nothing). Something like "can_use_fast_route" sounds like it would return a boolean.

Contributor Author

agree. renamed.

return at::native::foreach_tensor_add_list_kernel_slow(tensors1, tensors2);
}

std::vector<std::vector<at::Tensor>> tensor_lists;
Contributor

(no action required) We know exactly how many lists there needs to be, so it would be nice if we could avoid the dynamic allocation here. This would require us to revisit the first argument to multi_tensor_apply.

Contributor Author

Added to TODO as a follow up

}
}
else {
// Non-divergent exit condition for __syncthreads, not necessary here
Contributor

Where is the __syncthreads this is referring to? What exactly is "not necessary here"?

Contributor Author

irrelevant in our case. removed



n -= chunk_idx * chunk_size;

T r_x[kILP];
Contributor

What is kILP? It looks like it is equal to 4, but I'm not sure what that means

Contributor Author

Instruction-level parallelism.
Please note that this code is 95% taken from Apex, so most of the comments/constants come from there.

}
#pragma unroll
for(int ii = 0; ii < kILP; ii++) {
r_out[ii] = static_cast<T>(r_x[ii]) + static_cast<T>(r_y[ii]);
Contributor

It seems like the important part of the Functor is r_out[ii] = static_cast<T>(r_x[ii]) + static_cast<T>(r_y[ii]); and that the other logic is pretty standard between Functors (from me comparing AddListFunctor and AddListFunctor_). Do we ever want functors that don't have similar logic? If not, it might be worth trying to abstract away the rest of the code via templates or macros so that someone implementing a new Functor doesn't have to copy & paste code that is considered to be boilerplate.

Contributor

(to clarify, no action necessary, but I am curious about if we can clean this up in the future)

Contributor Author

Yes, I thought about this. I had a hard time figuring out what all the functors will look like, so I decided to add them all as-is and refactor later once I have all the requirements and corner cases. But I'm open for discussion here.

Contributor

I'm also not sure what the functors will look like in the end state so I agree it makes sense to refactor later

Contributor
@zou3519 zou3519 left a comment

I need to go read through how multi_tensor_apply and functors work, but the code looks like it is following in the footsteps of the last PRs so I don't have any major comments. Added some minor comments about the testing and some questions in-line

…ach_add_(TensorList tl1, TensorList tl2) APIs"



[ghstack-poisoned]
std::vector<Tensor> foreach_tensor_add_list_kernel_slow(TensorList tensors1, TensorList tensors2) {
verify_list(tensors1, tensors2);

std::vector<Tensor> result;
Contributor

nit: result.reserve(tensors1.size())

Contributor Author

Done

}
#pragma unroll
for(int ii = 0; ii < kILP; ii++) {
r_out[ii] = static_cast<T>(r_x[ii]) + static_cast<T>(r_y[ii]);
Contributor

I'm also not sure what the functors will look like in the end state so I agree it makes sense to refactor later

TORCH_CHECK(t.dtype() == expected_dtype, "All tensors in the tensor list must have the same dtype.");
}
}

// To go via 'fast' path, several conditions must be satisfied
void verify_list(TensorList tensors1, TensorList tensors2) {
Contributor

A comment here that these are the restrictions for the foreach TensorList APIs would be nice

Contributor Author

Done

# different devices
tensor1 = torch.zeros(10, 10, device="cuda:0")
tensor2 = torch.ones(10, 10, device="cuda:1")
with self.assertRaisesRegex(RuntimeError, "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!"):
Contributor

Feel free to just check part of the string, e.g. "Expected all tensors to be on the same device". The goal of checking assertRaisesRegex is to make sure the error message isn't something completely unreadable like "TORCH_INTERNAL_ASSERT(false)".
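For instance, a sketch of the looser check (the actual call under test is elided in the diff above, so the one shown here is illustrative):

# Matching only a stable prefix keeps the test robust to wording changes
# while still catching unreadable internal-assert messages.
with self.assertRaisesRegex(RuntimeError, "Expected all tensors to be on the same device"):
    torch._foreach_add_([tensor1], [tensor2])  # illustrative call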

# Corresponding tensors with different sizes
tensors1 = [torch.zeros(10, 10, device=device) for _ in range(10)]
tensors2 = [torch.ones(11, 11, device=device) for _ in range(10)]
with self.assertRaisesRegex(RuntimeError, "Corresponding tensors in lists must have the same size, got \[10, 10\] and \[11, 11\]"):
Contributor

'\[' isn't a valid escape sequence for regular strings, you have to append an r to the front of the string:

r"Corresponding tensors in lists..."

Contributor Author

Done

TORCH_CHECK(t.dtype() == expected_dtype, "All tensors in the tensor list must have the same dtype.");
}
}
// To go via 'fast' path, several conditions must be satisfied
void verify_list(TensorList tensors1, TensorList tensors2) {
Contributor

nit: this is in the at::native namespace. To avoid name collision, it might be nice to name this something more detailed, like "check_foreach_api_restrictions"

Contributor Author

Done

Iurii Zdebskyi added 2 commits September 1, 2020 12:00
…ach_add_(TensorList tl1, TensorList tl2) APIs"



[ghstack-poisoned]
…ach_add_(TensorList tl1, TensorList tl2) APIs"



[ghstack-poisoned]
@facebook-github-bot
Contributor

@izdeby merged this pull request in 297c938.
