Add _foreach_add_(TensorList tensors, Scalar scalar) API #42531
Conversation
Can you briefly describe why changes to autograd and jit are necessary?
```
@@ -5450,6 +5450,13 @@
    CPU: foreach_add_scalar_kernel_fallback
    CUDA: foreach_tensor_add_scalar_kernel_cuda

- func: _foreach_add_.Scalar(Tensor[](a!) self, Scalar scalar) -> Tensor[](a!)
```
We don't have in-place functions, only methods. Is it feasible to make this an out-variant function?
Actually, this isn't quite true: a bunch of stuff in nn::functional is defined this way (although those functions aren't typically directly accessible).
I thought about this as well, but if I change it to a method and try building, I get:

> RuntimeError: Function 'foreach_add' starts with a single underscore and is configured to have a method on Tensor. Functions that start with a single underscore should only be functions in the at:: namespace and not methods on Tensor!

which made me follow other examples in native_functions.yaml.
sounds good then!
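To illustrate the constraint in that error message, here is a minimal sketch, assuming the op ends up bound under the private `torch` namespace in Python: an underscore-prefixed op is callable as a free function, not as a Tensor method.

```python
import torch

ts = [torch.zeros(2) for _ in range(3)]
torch._foreach_add_(ts, 1.0)   # OK: free function in the torch/at:: namespace
# ts[0]._foreach_add_(1.0)     # would fail: underscore-prefixed ops get no Tensor method
```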
Wouldn't `Tensor[](a!)` mean that the list is being mutated, and `Tensor(a!)[]` be a list of mutable tensors?

@suo is that right? Is this safe to add?
Coming in here late: I noticed this problem too while working on the new codegen at #42629. I also agree that `Tensor(a!)[]` (a list of mutable tensors) is the correct type, and that what was landed (`Tensor[](a!)`) is not correct.
I added a small fix for the problem in the new codegen. Besides

```diff
- - func: _foreach_add_.Scalar(Tensor[](a!) self, Scalar scalar) -> ()
-   device_guard: False
+ - func: _foreach_add_.Scalar(Tensor(a!)[] self, Scalar scalar) -> ()
+   device_guard: False
```

I also had to add a special case to make a void return OK for trailing-underscore (in-place) functions:

```python
if self.name.name.inplace:
    # TODO: fixme
    if str(self.name) not in [
            '_amp_non_finite_check_and_unscale_',
            '_foreach_add_.Scalar']:
        assert len(self.returns) == 1
```
… API" **Motivation** [GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer. [First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554). **In this PR** - Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API - Resolving some additional comments from previous [PR](#41554). **Tests** Tested via unit tests **TODO** 1. Properly handle empty lists **Plan for the next PRs** 1. APIs - Binary Ops for list with Scalar - Binary Ops for list with list - Unary Ops for list 2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains. [ghstack-poisoned]
… API" **Motivation** [GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer. [First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554). **In this PR** - Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API - Resolving some additional comments from previous [PR](#41554). **Tests** Tested via unit tests **TODO** 1. Properly handle empty lists **Plan for the next PRs** 1. APIs - Binary Ops for list with Scalar - Binary Ops for list with list - Unary Ops for list 2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains. [ghstack-poisoned]
… API" **Motivation** [GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer. [First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554). **In this PR** - Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API - Resolving some additional comments from previous [PR](#41554). **Tests** Tested via unit tests **TODO** 1. Properly handle empty lists **Plan for the next PRs** 1. APIs - Binary Ops for list with Scalar - Binary Ops for list with list - Unary Ops for list 2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains. [ghstack-poisoned]
… API" **Motivation** [GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer. [First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554). **In this PR** - Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API - Resolving some additional comments from previous [PR](#41554). **Tests** Tested via unit tests **TODO** 1. Properly handle empty lists **Plan for the next PRs** 1. APIs - Binary Ops for list with Scalar - Binary Ops for list with list - Unary Ops for list 2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains. [ghstack-poisoned]
… API" [First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554). **Motivation** [GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient in cases when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process. We need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIAs Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorchs DCGAN model with Adam optimizer and once the optimizer is reimplemented with tensor lists, benchmark the model performance against original model version, Apexs version with original Adam optimizer and it’s FusedAdam optimizer. **Current API restrictions** - List can't be empty (will fixed in upcoming PRs). - All tensors in the list must have the same dtype, device and size. **Broadcasting** At this point we don't support broadcasting. **What is 'Fast' and 'Slow' route** In particular cases, we cant process an op with a fast list CUDA kernel. Still, we can do with a regular for-loop where the op will be applied to each tensor individually through the dispatch mechanisms. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path. To go the fast route, - All tensors must have strided layout - All tensors must be dense and not have overlapping memory - The resulting tensor type must be the same. --------------- **In this PR** - Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API - Resolving some additional comments from previous [PR](#41554). **Tests** Tested via unit tests **TODO** 1. Properly handle empty lists **Plan for the next PRs** 1. APIs - Binary Ops for list with Scalar - Binary Ops for list with list - Unary Ops for list - Pointwise Ops 2. Complete tasks from TODO 3. Rewrite PyTorch optimizers to use for-each operators for performance gains. [ghstack-poisoned]
@gchanan, @ngimel, @cpuhrsch
Codecov Report

```
@@            Coverage Diff             @@
##   gh/izdeby/35/base   #42531   +/- ##
====================================================
  Coverage           ?   69.34%
====================================================
  Files              ?      378
  Lines              ?    46674
  Branches           ?        0
====================================================
  Hits               ?    32364
  Misses             ?    14310
  Partials           ?        0
```

Continue to review the full report at Codecov.
Stack from ghstack:

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554)

**Motivation**

[GitHub issue](#38655)

Current PyTorch optimizer implementations are inefficient when we work with many small feature tensors: launching a separate kernel for each tensor slows the whole process down, so we need to reduce the number of kernel launches. As a reference, we are looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). To track progress, we will take PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark its performance against the original model, Apex's version with the original Adam optimizer, and Apex's FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**

At this point we don't support broadcasting.

**What are the 'fast' and 'slow' routes**

In some cases we can't process an op with the fast list CUDA kernel, but we can still fall back to a regular for-loop in which the op is applied to each tensor individually through the regular dispatch mechanism. A few checks decide whether an op takes the 'fast' or the 'slow' path (see the sketch after this list). To go the fast route:
- All tensors must have strided layout.
- All tensors must be dense and must not have overlapping memory.
- The resulting tensor type must be the same.
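For illustration, here is a minimal Python sketch of that routing logic. The helper names (`can_use_fast_route`, `slow_path_add_`) are hypothetical and the checks are simplified stand-ins; the real checks and kernels are implemented in C++/CUDA in this PR.

```python
import torch

def can_use_fast_route(tensors):
    # Simplified stand-ins for the checks listed above.
    return (
        all(t.layout == torch.strided for t in tensors)  # strided layout
        and all(t.is_contiguous() for t in tensors)      # dense, non-overlapping (approximation)
        and len({t.dtype for t in tensors}) == 1         # same resulting type (approximation)
    )

def slow_path_add_(tensors, scalar):
    # Slow route: apply the op to each tensor individually through the
    # regular dispatch mechanism, i.e. one kernel launch per tensor.
    for t in tensors:
        t.add_(scalar)
```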
**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from the previous [PR](#41554)

**Tests**

Tested via unit tests.

**TODO**
1. Properly handle empty lists.

**Plan for the next PRs**
1. APIs
   - Binary ops for a list with a Scalar
   - Binary ops for a list with a list
   - Unary ops for a list
   - Pointwise ops
2. Complete tasks from the TODO list.
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Differential Revision: [D23331892](https://our.internmc.facebook.com/intern/diff/D23331892)
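For completeness, a usage sketch: assuming the op is exposed in Python as `torch._foreach_add_` (the binding name here is an assumption based on the C++ function name), a single call updates every tensor in the list in place.

```python
import torch

params = [torch.randn(3, 3) for _ in range(10)]
expected = [p + 1.0 for p in params]

# One API call instead of ten separate p.add_(1.0) calls; on CUDA the
# fast route services the whole list with far fewer kernel launches.
torch._foreach_add_(params, 1.0)

assert all(torch.equal(p, e) for p, e in zip(params, expected))
```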