Provide efficient implementation for operations on lists of tensors. · Issue #38655 · pytorch/pytorch · GitHub
Provide efficient implementation for operations on lists of tensors. #38655


Closed
ngimel opened this issue May 18, 2020 · 22 comments
Labels
feature A request for a proper, new feature. module: optimizer Related to torch.optim triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@ngimel
Collaborator
ngimel commented May 18, 2020

We should have efficient implementations for a small subset of operations on lists of tensors, such as

_tensor_list_add.Tensor(Tensor[] self, Tensor[] other, *, Scalar alpha=1)

Motivation

For many GPU training workloads, the performance of the optimizer becomes the bottleneck. Currently there are two extremes of GPU optimizer performance: very flexible and hackable implementations in torch.optim, and fully fused but rigid implementations in apex (github.com/nvidia/apex), available for SGD, Adam, and, as of a couple of days ago, Adagrad. People
who want to experiment with optimizers either have to accept the poor performance that typically comes from implementing an optimizer the usual way, iterating over lists of tensors, or, if they are up to it, locally modify apex to add a fast fused kernel. The latter can be quite challenging.

A typical optimizer launches 5-10 kernels per parameter tensor. Models can have hundreds of parameters, so 500-1000 kernels have to be launched. Those kernels are often very small, and the operations become CPU-bound. If we can launch a single kernel that operates on all the parameter tensors at the same time, we can reduce the number of kernel launches by a factor of ~100 and relieve the CPU bottleneck.
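For illustration, a minimal sketch of the difference in launch count, written against foreach-style list ops of the kind proposed here (the torch._foreach_* names below match the private API that was later added; treat the exact signatures as an assumption in the context of this proposal):

```python
import torch

params = [torch.randn(64, 64, device="cuda") for _ in range(100)]
grads = [torch.randn_like(p) for p in params]
lr = 0.1

# Usual approach: at least one kernel launch per parameter tensor.
for p, g in zip(params, grads):
    p.add_(g, alpha=-lr)

# Foreach-style approach: a handful of launches covering the whole list.
torch._foreach_add_(params, grads, alpha=-lr)
```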

Other situations where efficient implementations of operations on tensor lists are helpful include the .zero_grad() function, unscaling gradients and checking for nan values in mixed precision training, computing the norm of parameters for clip_grad_norm, and converting parameters/gradients between low- and high-precision types for mixed precision training.
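As a rough sketch of two of these use cases expressed with list ops (again assuming the later torch._foreach_* naming):

```python
import torch

grads = [torch.randn(128, device="cuda") for _ in range(50)]

# Unscale all gradients with a couple of launches instead of one mul_ per tensor.
torch._foreach_mul_(grads, 1.0 / 1024.0)

# zero_grad()-style clearing of every gradient in a single foreach call.
torch._foreach_zero_(grads)
```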

Nvidia has kernels in apex that can efficiently operate on disjoint lists of tensors, but it's impossible to use those kernels outside of apex, and each particular operation has to be implemented separately (currently those kernels are used for mixed precision gradient scaling and a few fused optimizers). If efficient list operations were available in pytorch core, that would enable much easier experimentation.

Proposal

Expose experimental tensor list operations for arithmetic operations, fill_, and norm, backed by efficient implementations.
The initial list of operations can be discussed, but tentatively it should be enough to implement the operations used in existing optimizers. In particular, we should support in-place variants and operations between a tensor list and a scalar. E.g., for add we want the
following variants:

_add_tensor_list.Tensor(Tensor[] self, Tensor[] other, *, Scalar alpha=1) -> Tensor[]
_add_tensor_list.Scalar(Tensor[] self, Scalar other, Scalar alpha=1) -> Tensor[]
_add_tensor_list_.Tensor(Tensor[](a!) self, Tensor[] other, *, Scalar alpha=1) -> Tensor[](a!) #inplace
_add_tensor_list_.Scalar(Tensor[](a!) self, Scalar other, Scalar alpha=1) -> Tensor[](a!) #inplace scalar variant
_add_tensor_list.out(Tensor[] self, Tensor[] other, *, Scalar alpha=1, Tensor[](a!) out) -> Tensor[](a!)

Type promotion should be supported as for regular operations.
Shapes in the list arguments should be the same; error out on mismatch or if broadcasting is required (this is open for debate).
Backends that don't have efficient implementations, or don't need them, fall back to implementations that loop over the tensors in the list (a sketch of such a fallback follows below).
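A minimal sketch of what such a fallback could look like for the out-of-place add variant (hypothetical helper, not an actual implementation):

```python
from typing import List

import torch

def _foreach_add_fallback(self: List[torch.Tensor],
                          other: List[torch.Tensor],
                          alpha: float = 1) -> List[torch.Tensor]:
    # Slow path: dispatch one ordinary add per pair of tensors.
    assert len(self) == len(other), "tensor lists must have the same length"
    return [a.add(b, alpha=alpha) for a, b in zip(self, other)]
```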

After this is done, existing optimizers can be rewritten to use these ops, leading to better performance. The optimizer implementations themselves will still remain readable, accessible and hackable.
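As a rough sketch of what a rewritten optimizer step could look like (using the torch._foreach_* names that were later adopted; the exact API surface is still open at the point of this proposal):

```python
import torch

def sgd_momentum_step(params, grads, momentum_bufs, lr=0.01, momentum=0.9):
    # buf = momentum * buf + grad, for every tensor in the lists
    torch._foreach_mul_(momentum_bufs, momentum)
    torch._foreach_add_(momentum_bufs, grads)
    # p = p - lr * buf
    torch._foreach_add_(params, momentum_bufs, alpha=-lr)
```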

Backward formulas can be provided for operations on the list. However, given that the primary focus is use in optimizers, backward support should not be a blocker.

Alternatives

This proposal has a significant overlap with NestedTensor proposal, but we understand that it is hard to implement NestedTensors to cover all the relevant scenarios. This proposal is much more limited in scope, and efficient implementations that are put in the core as part of this proposal can be reused for NestedTensor.
Similarly, once efficient implementations are in the core, TorchScript can have passes that will convert operations that loop over tensors in the list to call into these efficient implementations.

cc @cpuhrsch @vincentqb @mcarilli @soumith @dzhulgakov @ezyang

@ngimel ngimel added feature A request for a proper, new feature. module: optimizer Related to torch.optim labels May 18, 2020
@mcarilli
Collaborator
mcarilli commented May 18, 2020

Awesome, I've been looking forward to getting this in core for a long time. Do we want the implementations to handle arbitrary tensor layouts, or layouts that don't match within or between lists? I think they must, for consistency with TensorIterator-based APIs, which do allow e.g. a+b where a and b have different layouts.

For reference my current harness is here: https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_apply.cuh. It accepts a functor that allows arbitrary pointwise ops on lists of tensors, and processes them in one or a few launches with bandwidth comparable to one pointwise launch on a single flattened tensor. It can also be used for reductions. Several people (@slayton58, @FDecaYed, @thorjohnsen, @tulloch, @crcrpar) have used it to create custom fused optimizers.

@ngimel
Collaborator Author
ngimel commented May 18, 2020

For consistency it should handle non-matching layouts, but it's ok to call .contiguous() on problematic tensors rather than torture the kernels themselves. Matching memory-dense (but not necessarily contiguous) tensors in the lists should not be a problem, right?

@vadimkantorov
Contributor
vadimkantorov commented May 18, 2020

Tangentially related: reductions on lists of tensors #27522

@soumith
Member
soumith commented May 18, 2020

_add_tensor_list as a private function makes sense, and from what I can see the work on the kernels can be reused in any final / better implementation.

Let's do this!

@dzhulgakov
Collaborator

Yay, let's do it!

What would the rewritten optimizers look like? Explicit calls to torch._tensor_list_add and others might be a bit of a mouthful. And if we wrap it in some Python wrapper with operator overloading, it might not work with TorchScript.

@agolynski agolynski added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label May 19, 2020
@mcarilli
Collaborator

@dzhulgakov On the optimizer side, the only minor annoyance is organizing params+grads+state buffers into lists according to how flexible the multi tensor apply backend is. For example, the Apex implementation only handles lists where all tensors in a given list are the same dtype**, so params+grads need to be split accordingly (see @ajtulloch's code, and the sketch below). I think it remains legible. The split across dtypes could even be moved to the backend so the user can pass any tensor list.

** for ops that act on several lists, the dtype must be the same within each list but may differ across lists. This enables me to effectively fuse casting with any pointwise op, e.g., read a list of fp16 tensors, perform an arbitrary pointwise op, and write out into a list of fp32 tensors. Also, fun fact: if the output tensor list was created as sliced views of a large empty tensor, what you've done is a fused cast + pointwise op + cat. Multi tensor apply is a great way to do cat, cleaner than the existing implementation in that no auxiliary metadata tensors need to be allocated. Metadata is passed via kernel args.
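For illustration, a minimal sketch of the kind of dtype bucketing described above (a hypothetical helper; names and structure are assumptions, not the Apex or PyTorch implementation):

```python
from collections import defaultdict

import torch

def bucket_by_dtype(params):
    # Group params and their grads so that every list handed to the fast
    # kernels contains tensors of a single dtype.
    buckets = defaultdict(lambda: ([], []))
    for p in params:
        if p.grad is not None:
            ps, gs = buckets[p.dtype]
            ps.append(p)
            gs.append(p.grad)
    return buckets
```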

@gchanan
Contributor
gchanan commented May 19, 2020

Awesome, I've been looking forward to getting this in core for a long time. Do we want the implementations to handle arbitrary tensor layouts, or layouts that don't match within or between lists? I think they must, for consistency with TensorIterator-based APIs, which do allow e.g. a+b where a and b have different layouts.

what definition of layout are we using here? strides?

Shapes in the list arguments should be the same, error out on mismatch or if broadcasting is required (this is open for debate).

this is a bit of bike-shedding, but it's not obvious what the semantics are when we call things like _add_tensor_list -- it sounds like tensor_list is some opaque thing with its own semantics (like NestedTensor), when I don't think we want to start getting into that general game.

AFAICT, up until @mcarilli's point about doing "fused cast + pointwise op + cat" we are really just talking about doing a "for each" operator application -- and even @mcarilli's optimization would maybe fit (although I'm not sure whether autograd would actually work given the limitations on views).

So, long story short, it seems nicer to call things "_add_foreach.Tensor[]" or "_add_foreach.TensorList" or similar.

@ngimel
Collaborator Author
ngimel commented May 19, 2020

Agreed, _add_foreach is a better name, we definitely don't want opaque things.
Layout here means strides, yes.

@vincentqb vincentqb added the module: nestedtensor NestedTensor tag see issue #25032 label May 19, 2020
@cpuhrsch
Contributor

This is great! NestedTensor can lower a lot of operations to TensorList. In fact, the TensorList operations NestedTensor stands to launch will be a subset of the ones planned to be supported here. NestedTensor is a Tensor-esque type and as such must have a consistent dtype, device, dimension and layout. So it's not the right container for this case if we want to support mixed dtypes, layouts, etc.

One comment here: you might want the Tensors in a given TensorList to be part of a large contiguous piece of memory. This is a condition that could be detected at runtime based on data_ptr, offsets, numel(), etc., but there could also be a place for a PackedTensorList structure or such that has some additional properties along these lines. This is again something NestedTensor could dispatch to.

@FDecaYed
Contributor
FDecaYed commented May 20, 2020

This is great news!
One thing to note is that launching a single kernel over a list of tensors is not only a CPU-side optimization, but also helps achieve higher bandwidth on the device side.
So even workloads that are not CPU-bound, probably thanks to the async nature of work scheduling, can still benefit.

@mcarilli
Collaborator
mcarilli commented May 20, 2020

@cpuhrsch I don't think Greg means "TensorList" will be its own class, if that's what you meant. This is not something intended to rival NestedTensor, it's lower-level and can serve as an implementation detail of NestedTensor someday. I think Greg means _add_foreach.TensorList is what the overload declaration will look like in native_functions.yaml and TensorList is ArrayRef<Tensor>.

@vadimkantorov
Contributor

One comment here, you might want the Tensors in a given TensorList to be part of a large contiguous piece of memory.

If flatten_parameters was supported on general models, that could be the result of it :)

@ngimel
Collaborator Author
ngimel commented May 20, 2020

With the current multi_tensor_apply implementation in apex there's very little difference between operating on a large contiguous span of memory or on tensors in different memory locations. It is still good to keep in mind as a potential optimization, but the general idea is to not require flatten_parameters or something similar (which has memory and performance overhead) and still be able to operate on parameters/gradients in an efficient way.

@vincentqb vincentqb removed the module: nestedtensor NestedTensor tag see issue #25032 label May 20, 2020
@cpuhrsch cpuhrsch reopened this May 21, 2020
@izdeby izdeby self-assigned this Jun 24, 2020
izdeby pushed a commit that referenced this issue Aug 4, 2020
**Motivation**
[GitHub issue](#38655) 
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Launching a lot of kernels slows down the whole process; we need to reduce the number of kernels we launch.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API

**Tests**
Tested via unit tests

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list

2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this issue Aug 4, 2020
…ors, Scalar scalar) (#41554)

Summary:
Initial PR for the Tensor List functionality.

**Motivation**
[GitHub issue](#38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Launching a lot of kernels slows down the whole process; we need to reduce the number of kernels we launch.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding `multi_tensor_apply` mechanism which will help to efficiently apply passed functor on a given list of tensors on CUDA.
- Adding a first private API - `std::vector<Tensor> _foreach_add(TensorList tensors, Scalar scalar)`

**Tests**
Tested via unit tests

**Plan for the next PRs**

1. Cover these ops with `multi_tensor_apply` support
- exponent
- division
- mul_
- add_
- addcmul_
- addcdiv_
- Sqrt

2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

Pull Request resolved: #41554

Reviewed By: cpuhrsch

Differential Revision: D22829724

Pulled By: izdeby

fbshipit-source-id: 47febdbf7845cf931958a638567b7428a24782b1
izdeby pushed a commit that referenced this issue Aug 4, 2020
…ar scalar) "


**Motivation**
[GitHub issue](#38655) 
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Launching a lot of kernels slows down the whole process; we need to reduce the number of kernels we launch.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554).

**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from previous [PR](#41554).

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
- Binary Ops for list with Scalar
- Binary Ops for list with list
- Unary Ops for list

2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

[ghstack-poisoned]
facebook-github-bot pushed a commit that referenced this issue Sep 8, 2020
Summary:
Pull Request resolved: #42537

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554).

**Motivation**
[GitHub issue](#38655)
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Launching a lot of kernels slows down the whole process; we need to reduce the number of kernels we launch.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------
**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS

Reviewed By: cpuhrsch

Differential Revision: D23331889

Pulled By: izdeby

fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
izdeby pushed a commit that referenced this issue Sep 8, 2020
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554).

**Motivation**
[GitHub issue](#38655) 
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Launching a lot of kernels slows down the whole process; we need to reduce the number of kernels we launch.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting. 

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same. 

----------------
**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)

torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div_(TensorList self, Scalar scalar)
```

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
- Unary Ops for list
- Pointwise Ops

2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Differential Revision: [D23331891](https://our.internmc.facebook.com/intern/diff/D23331891)

[ghstack-poisoned]
izdeby pushed a commit that referenced this issue Sep 8, 2020
… lists"


Differential Revision: [D23331896](https://our.internmc.facebook.com/intern/diff/D23331896)

**Motivation**
[GitHub issue](#38655) 
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Launching a lot of kernels slows down the whole process; we need to reduce the number of kernels we launch.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting. 

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same dtype.
- All Tensors must be on the same device. 


----------------
**In this PR**
Added alpha overloads for add/sub ops with lists

**Tests**
Tested via unit tests


[ghstack-poisoned]
izdeby pushed a commit that referenced this issue Sep 8, 2020
Differential Revision: [D23331893](https://our.internmc.facebook.com/intern/diff/D23331893)

**Motivation**
[GitHub issue](#38655) 
Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Launching a lot of kernels slows down the whole process; we need to reduce the number of kernels we launch.
As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex).
In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model's performance against the original model, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- List can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting. 

**What is 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can fall back to a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via a 'fast' or 'slow' path.
To go the fast route,
- All tensors must have strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same dtype.
- All Tensors must be on the same device. 


----------------
**In this PR**
- We are introducing a new namespace under torch.optim - torch.optim.multi_tensor - where we will have optimizers rewritten with _foreach_* APIs.
- Rewriting the Adam optimizer with _foreach_* APIs (a simplified sketch follows below)

[ghstack-poisoned]
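As a rough illustration of the kind of rewrite described in this commit, a simplified Adam-style update expressed with the _foreach_* APIs (bias correction and other details are omitted; this is a sketch, not the actual torch.optim.multi_tensor implementation):

```python
import torch

def adam_like_step(params, grads, exp_avgs, exp_avg_sqs,
                   lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # exp_avg = beta1 * exp_avg + (1 - beta1) * grad
    torch._foreach_mul_(exp_avgs, beta1)
    torch._foreach_add_(exp_avgs, grads, alpha=1 - beta1)
    # exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    torch._foreach_mul_(exp_avg_sqs, beta2)
    torch._foreach_addcmul_(exp_avg_sqs, grads, grads, value=1 - beta2)
    # param -= lr * exp_avg / (sqrt(exp_avg_sq) + eps)
    denom = torch._foreach_sqrt(exp_avg_sqs)
    torch._foreach_add_(denom, eps)
    torch._foreach_addcdiv_(params, exp_avgs, denom, value=-lr)
```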
@ssnl
Collaborator
ssnl commented Dec 14, 2020

@ngimel I'm curious about the performance tradeoff between this and performing single kernel ops on a giant flat vector containing all parameters (i.e., the actual parameters would be non-overlapping views of it).

@ngimel
Collaborator Author
ngimel commented Dec 14, 2020

The benchmarks show that multi-tensor-apply kernels achieve a high fraction of peak bandwidth (@mcarilli will know more). The problem with a giant flat vector containing the params is that it's usually pretty fragile (unless you have a simple training loop, always use all the parameters, have no parameter sharing, etc.).

@vadimkantorov
Contributor
vadimkantorov commented Dec 14, 2020

One way to support flatten_parameters would be to check whether all inputs to multi_tensor_apply are indeed contiguous in memory (a complication may be if there are any memory alignment gutters; one could sort tensors by memory address and check that the difference between addresses is numel * itemsize), and then launch a single kernel; see the sketch below. An optional flatten_parameters could then be supported, to be used explicitly only for advanced cases.

But as @ngimel mentioned in #38655 (comment), the perf sacrifice is maybe not big anyway.
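A minimal sketch of the contiguity check described above (hypothetical helper; it ignores alignment gutters and assumes all tensors are already contiguous views):

```python
def is_contiguous_slab(tensors):
    # True if the tensors, sorted by base address, tile one contiguous block of memory.
    ts = sorted(tensors, key=lambda t: t.data_ptr())
    for prev, nxt in zip(ts, ts[1:]):
        if prev.data_ptr() + prev.numel() * prev.element_size() != nxt.data_ptr():
            return False
    return all(t.is_contiguous() for t in ts)
```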

@ssnl
Collaborator
ssnl commented Dec 14, 2020

@ngimel Nice, thanks for responding quickly. Since the conditions for the giant flat vector are actually met in most use cases, it is great to hear that we will have a solution that doesn't sacrifice much performance for generality.

@mcarilli
Collaborator
mcarilli commented Dec 14, 2020

I'm curious about the performance tradeoff between this and performing single kernel ops on a giant flat vector containing all parameters (i.e., the actual parameters would be non-overlapping views of it).

On the GPU, MTA kernels should achieve bandwidth comparable to acting on a giant flat vector. If you find a realistic case where they don't, let us know.

The main perf difference vs acting on a flat buffer occurs on the CPU side: packing tensor lists for MTA launches in Python can be expensive (example: we observe over 3 msec to build lists from the entire parameter set in BERT, in some cases). To reduce this overhead, MTA clients (e.g. optimizers) could maintain such lists persistently where possible, and where necessary, filter them on the C++ side (e.g. for optimizers, with a helper like the following):

#include <ATen/ATen.h>
#include <tuple>
#include <vector>

// Split a parameter list into (params_with_grads, grads), keeping only
// the parameters whose .grad() is defined.
std::tuple<std::vector<at::Tensor>,
           std::vector<at::Tensor>>
filter_params_with_grad(const std::vector<at::Tensor>& params) {
  std::vector<at::Tensor> params_with_grads;
  std::vector<at::Tensor> grads;
  params_with_grads.reserve(params.size());
  grads.reserve(params.size());
  for (const auto& p : params) {
    if (p.grad().defined()) {
      params_with_grads.push_back(p);
      grads.push_back(p.grad());
    }
  }
  return {params_with_grads, grads};
}

@ssnl
Collaborator
ssnl commented Dec 14, 2020

@mcarilli That's great to hear. Will do.

@mcarilli
Collaborator

Optimizers should robustly handle the cases where not all parameters receive grads every iteration.
Updated with a simple filter-helper example I proposed to our mlperf people to reduce MTA list-packing overhead for their fused optimizers.
I'll probably PR a C++ list-packing helper like that for MTA amp.GradScaler.unscale_ soon.

@vadimkantorov
Contributor

There's also now some FlatParameter in FSDP. Is it related to general parameter flattening?

@awgu
Collaborator
awgu commented Oct 29, 2022

There's also now some FlatParameter in FSDP. Is it related to general parameter flattening?

I do not think that FlatParameter in FSDP is related to general parameter flattening. It is actually just a normal nn.Parameter (that does not even override a single method).

class FlatParameter(nn.Parameter):

We have specialized FlatParameter to represent something in the context of FSDP (adding attributes to it and using it in a specific way). However, with respect to torch functions, it is just a 1D nn.Parameter.

@fegin fegin assigned fegin and unassigned izdeby and fegin Mar 30, 2023
@ngimel ngimel closed this as completed May 30, 2023