Provide efficient implementation for operations on lists of tensors. #38655
Comments
Awesome, been looking forward to getting this in core for a long time. Do we want the implementations to handle arbitrary tensor layouts, or layouts that don't match across or between lists? I think they must, for consistency with TensorIterator-based APIs, which do allow e.g. a + b where a and b have different layouts. For reference, my current harness is here: https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_apply.cuh. It accepts a functor that allows arbitrary pointwise ops on lists of tensors, and processes them in one or a few launches with bandwidth comparable to one pointwise launch on a single flattened tensor. It can also be used for reductions. Several people (@slayton58, @FDecaYed, @thorjohnsen, @tulloch, @crcrpar) have used it to create custom fused optimizers.
For consistency it should handle non-matching layouts, but it's OK to call .contiguous() on problematic tensors rather than torture the kernels themselves. Matching memory-dense (but not necessarily contiguous) tensors in the lists should not be a problem, right?
Tangentially related: reductions on lists of tensors #27522
Let's do this!
Yay, let's do it! What would the rewritten optimizers look like? Explicit calls to
@dzhulgakov On the optimizer side, the only minor annoyance is organizing params+grads+state buffers into lists according to how flexible the multi tensor apply backend is. For example, the Apex implementation only handles lists as long as all tensors in each list are the same dtype**, so params+grads need to be split accordingly (see @ajtulloch's code). I think it remains legible. The split across dtypes could even be moved to the backend so the user can pass any tensor list. ** for ops that act on several lists, the dtype must be the same within each list but may differ across lists. This enables me to effectively fuse casting with any pointwise op, i.e., read a list of fp16 tensors, perform an arbitrary pointwise op, and write out into a list of fp32 tensors. Also, fun fact, if the output tensor list was created as sliced views of a large empty tensor, what you've done is a fused cast + pointwise op + cat. Multi tensor apply is a great way to do cat, cleaner than the existing implementation in that no auxiliary metadata tensors need to be allocated. Metadata is passed via kernel args.
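To make the "sliced views of a large empty tensor" trick concrete, here is a rough Python sketch using the foreach-style ops proposed in this issue rather than Apex's multi_tensor_apply directly; the sizes and the pointwise op are illustrative, and in a real MTA kernel the cast and the op would be fused into a single launch:

```python
import torch

# fp16 inputs living at arbitrary, disjoint memory locations
fp16_tensors = [torch.randn(n, device="cuda", dtype=torch.float16) for n in (3, 5, 7)]

# One flat fp32 buffer; the outputs are non-overlapping views (slices) of it.
flat = torch.empty(sum(t.numel() for t in fp16_tensors), device="cuda")
outs = list(flat.split([t.numel() for t in fp16_tensors]))

# copy_ casts fp16 -> fp32 into the views; an MTA kernel would fuse this with the op.
for dst, src in zip(outs, fp16_tensors):
    dst.copy_(src)

torch._foreach_mul_(outs, 2.0)  # arbitrary pointwise op applied list-wise

# Because the outputs are views of `flat`, the results are already concatenated:
# a fused cast + pointwise op + cat, with no torch.cat and no metadata tensors.
```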
what definition of layout are we using here? strides?
this is a bit of bike-shedding, but it's not obvious what the semantics are when we call things like AFAICT, up until @mcarilli's point about doing "fused cast + pointwise op + cat" we are really just talking about doing a "for each" operator application -- and even @mcarilli's optimization would maybe fit (although I'm not sure whether autograd would actually work given the limitations on views). So, long story short, it seems nicer to call things "_add_foreach.Tensor[]" or "_add_foreach.TensorList" or similar.
Agreed, _add_foreach is a better name, we definitely don't want opaque things.
This is great! NestedTensor can lower a lot of operations to TensorList. In fact, the TensorList operations NestedTensor stands to launch will be a subset of the ones planned to be supported here. NestedTensor is a Tensor-esque type and as such must have a consistent dtype, device, dimension and layout. So it's not the right container for this case if we want to support mixed dtypes, layouts, etc. One comment here: you might want the Tensors in a given TensorList to be part of a large contiguous piece of memory. This is a condition that could be detected at runtime based on data_ptr, offsets, numel(), etc., but there could also be a place for a PackedTensorList structure or such that has some additional properties along these lines. This is again something NestedTensor could dispatch to.
This is great news!
@cpuhrsch I don't think Greg means "TensorList" will be its own class, if that's what you meant. This is not something intended to rival NestedTensor, it's lower-level and can serve as an implementation detail of NestedTensor someday. I think Greg means
With the current multi_tensor_apply implementation in apex there's very little difference between operating on a large contiguous region of memory or on tensors in different memory locations. It is still good to keep in mind as a potential optimization, but the general idea is to not require
**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API

**Tests**
Tested via unit tests

**Plan for the next PRs**
1. APIs
   - Binary Ops for list with Scalar
   - Binary Ops for list with list
   - Unary Ops for list
2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

[ghstack-poisoned]
…ors, Scalar scalar) (#41554)

Summary: Initial PR for the Tensor List functionality.

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding the `multi_tensor_apply` mechanism, which helps efficiently apply a passed functor to a given list of tensors on CUDA.
- Adding a first private API - `std::vector<Tensor> _foreach_add(TensorList tensors, Scalar scalar)`

**Tests**
Tested via unit tests

**Plan for the next PRs**
1. Cover these ops with `multi_tensor_apply` support
   - exponent
   - division
   - mul_
   - add_
   - addcmul_
   - addcdiv_
   - sqrt
2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

Pull Request resolved: #41554
Reviewed By: cpuhrsch
Differential Revision: D22829724
Pulled By: izdeby
fbshipit-source-id: 47febdbf7845cf931958a638567b7428a24782b1
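For a sense of how the private API is called from Python, a minimal usage sketch (a CUDA device is assumed here, though the slow path covers other cases too):

```python
import torch

tensors = [torch.randn(100, device="cuda") for _ in range(50)]

# One call covers the whole list instead of 50 separate elementwise launches.
result = torch._foreach_add(tensors, 1.0)   # out-of-place: returns new tensors
torch._foreach_add_(tensors, 1.0)           # in-place variant mutates the inputs
```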
…ar scalar)"

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554).

**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from the previous [PR](#41554).

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
   - Binary Ops for list with Scalar
   - Binary Ops for list with list
   - Unary Ops for list
2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

[ghstack-poisoned]
Summary: Pull Request resolved: #42537

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554).

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is the 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can do it with a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path. To go the fast route,
- All tensors must have a strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------

**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
   - Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331889
Pulled By: izdeby
fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554).

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is the 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can do it with a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path. To go the fast route,
- All tensors must have a strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------

**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)
torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div_(TensorList self, Scalar scalar)
```

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
   - Unary Ops for list
   - Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Differential Revision: [D23331891](https://our.internmc.facebook.com/intern/diff/D23331891)

[ghstack-poisoned]
… lists"

Differential Revision: [D23331896](https://our.internmc.facebook.com/intern/diff/D23331896)

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is the 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can do it with a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path. To go the fast route,
- All tensors must have a strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensors must have the same dtype.
- All tensors must be on the same device.

----------------

**In this PR**
Added alpha overloads for add/sub ops with lists

**Tests**
Tested via unit tests

[ghstack-poisoned]
Differential Revision: [D23331893](https://our.internmc.facebook.com/intern/diff/D23331893)

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is the 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can do it with a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path. To go the fast route,
- All tensors must have a strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensors must have the same dtype.
- All tensors must be on the same device.

----------------

**In this PR**
- We are introducing a new namespace under torch.optim - torch.optim.multi_tensor - where we will have optimizers rewritten with _foreach_* APIs.
- Rewriting the Adam optimizer with _foreach_* APIs

[ghstack-poisoned]
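As an illustration of what the rewrite looks like (a hedged sketch, not the actual torch.optim.multi_tensor code; it assumes the addcmul_/addcdiv_ foreach variants from the plan above), an Adam step over whole parameter lists might be:

```python
import math
import torch

def adam_step(params, grads, exp_avgs, exp_avg_sqs, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment updates, each as a single list-wise call.
    torch._foreach_mul_(exp_avgs, beta1)
    torch._foreach_add_(exp_avgs, grads, alpha=1 - beta1)
    torch._foreach_mul_(exp_avg_sqs, beta2)
    torch._foreach_addcmul_(exp_avg_sqs, grads, grads, value=1 - beta2)

    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step

    # denom = sqrt(v) / sqrt(bias_correction2) + eps, computed list-wise.
    denom = torch._foreach_sqrt(exp_avg_sqs)
    torch._foreach_div_(denom, math.sqrt(bias_correction2))
    torch._foreach_add_(denom, eps)

    # params -= (lr / bias_correction1) * exp_avg / denom
    torch._foreach_addcdiv_(params, exp_avgs, denom, value=-lr / bias_correction1)
```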
@ngimel I'm curious about the performance tradeoff between this and performing single kernel ops on a giant flat vector containing all parameters (i.e., the actual parameters would be non-overlapping views of it).
The benchmarks show that multi-tensor-apply kernels achieve a high fraction of peak bandwidth (@mcarilli will know more). The problem with a giant flat vector containing the params is that it's usually pretty fragile (unless you have a simple training loop, always use all the parameters, no parameter sharing, etc).
One way to support flatten_parameters would be checking if all inputs to multi_tensor_apply are indeed contiguous in memory (a complication may be if there are any memory alignment gutters; one could sort tensors by memory address and check that the difference between addresses is numel * itemsize), and then launching a single kernel. Then an optional flatten_parameters may be supported, to be explicitly used only for advanced cases. But as @ngimel had mentioned in #38655 (comment), the perf sacrifice may not be big anyway.
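A minimal sketch of that detection, assuming the tensors in the list are dense; the helper name is made up here, not an existing PyTorch function:

```python
import torch

def backed_by_one_flat_buffer(tensors):
    # Sort by address and check each tensor ends exactly where the next begins;
    # any alignment gutter, gap, or overlap rules out a single flat launch.
    ts = sorted(tensors, key=lambda t: t.data_ptr())
    return all(
        a.data_ptr() + a.numel() * a.element_size() == b.data_ptr()
        for a, b in zip(ts, ts[1:])
    )
```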
@ngimel Nice, thanks for responding quickly. Since the conditions for the giant flat vector are actually met for most use cases, it is great to hear that we will have a solution that doesn't sacrifice much performance for generality.
On the GPU, MTA kernels should achieve bandwidth comparable to acting on a giant flat vector. If you find a realistic case where they don't, let us know. The main perf difference vs acting on a flat buffer occurs on the CPU side: packing tensor lists for MTA launches in Python can be expensive (example: we observe over 3 msec to build lists from the entire parameter set in BERT, in some cases). To reduce this overhead, MTA clients (e.g. optimizers) could maintain such lists persistently where possible, and where necessary, filter them on the C++ side (e.g. for optimizers, with a helper like
@mcarilli That's great to hear. Will do.
Optimizers should robustly handle the cases where not all parameters receive grads every iteration. |
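In practice that mostly means building each step's lists only from parameters that actually got a gradient; a small sketch (the weight_decay handling is illustrative):

```python
import torch

def gather_live_params_and_grads(params, weight_decay=0.0):
    # Skip parameters whose grads were not produced this iteration.
    params_with_grad = [p for p in params if p.grad is not None]
    grads = [p.grad for p in params_with_grad]
    if weight_decay != 0.0 and grads:
        # Apply L2 regularization to the whole list in one call,
        # out-of-place so .grad itself is left untouched.
        grads = list(torch._foreach_add(grads, params_with_grad, alpha=weight_decay))
    return params_with_grad, grads
```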
There's also now some FlatParameter in FSDP. Is it related to general parameter flattening?
I do not think that (pytorch/torch/distributed/fsdp/flat_param.py, Line 123 in 14d5f13)
We have specialized
We should have efficient implementations for a small subset of operations on lists of tensors, such as
Motivation
For a lot of GPU training workloads, the performance of the optimizer becomes the bottleneck. Currently there are two extreme cases of GPU optimizer performance - very flexible and hackable implementations in torch.optim, and fully fused, rigid implementations in apex (github.com/nvidia/apex), available for SGD, Adam, and, as of a couple of days back, Adagrad. People who want to experiment with optimizers either have to suffer through the bad performance that typically comes from implementing an optimizer the usual way, iterating over lists of tensors, or, if they are up to it, locally modify apex to have a fast fused kernel. The latter can be pretty challenging.
A typical optimizer launches 5-10 kernels per parameter tensor. Models can have hundreds of parameters, so 500-1000 kernels have to be launched. Those kernels are often very small, and the operations become CPU-bound. If we can launch a single kernel that operates on all the parameter tensors at the same time, we can reduce the number of kernel launches by a factor of 100 and relieve the CPU-boundedness.
Other situations where efficient implementations of operations on tensor lists are helpful include the `.zero_grad()` function, unscaling gradients and checking for `nan` values in mixed precision training, computing the norm of parameters for `clip_grad_norm`, and converting parameters/gradients between low- and high-precision types for mixed precision training. Nvidia has kernels in apex that can efficiently operate on a disjoint list of tensors, but it's impossible to use those kernels outside of apex, and each particular operation has to be implemented separately (currently those kernels are used for mixed precision gradient scaling and a few fused optimizers). If efficient list operations were available in pytorch core, that would enable much easier experimentation.
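For example, the gradient-related cases above could look roughly like this, assuming `_foreach_zero_` and `_foreach_norm` counterparts end up among the supported ops (the names here are hypothetical):

```python
import torch

def foreach_zero_grad(params):
    # .zero_grad() as one list-wise call instead of a per-tensor Python loop.
    grads = [p.grad for p in params if p.grad is not None]
    if grads:
        torch._foreach_zero_(grads)

def foreach_clip_grad_norm_(params, max_norm):
    # clip_grad_norm_: per-tensor norms in one pass, then one list-wise scale.
    grads = [p.grad for p in params if p.grad is not None]
    norms = torch._foreach_norm(grads)
    total_norm = torch.linalg.vector_norm(torch.stack(norms))
    if total_norm > max_norm:
        clip_coef = (max_norm / (total_norm + 1e-6)).item()
        torch._foreach_mul_(grads, clip_coef)
    return total_norm
```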
Proposal
Expose experimental tensor list operations for arithmetic operations, `fill_`, `norm`, backed by efficient implementations. The initial list of operations can be discussed, but tentatively it should be enough to implement the operations in the existing optimizers. In particular, we should support in-place variants, and operations between a tensor list and a scalar. E.g., for `add` we want the following variants:
Type promotion should be supported as for regular operations.
Shapes in the list arguments should be the same; error out on mismatch or if broadcasting is required (this is open for debate).
For backends that don't have efficient implementations or don't need them, fall back to implementations that loop over the tensors in the list.
After this is done, existing optimizers can be rewritten to use these ops, leading to better performance. The optimizer implementations themselves will remain readable, accessible and hackable.
Backward formulas can be provided for operations on the list. However, given that the primary focus is use in optimizers, backward support should not be a blocker.
Alternatives
This proposal has significant overlap with the NestedTensor proposal, but we understand that it is hard to implement NestedTensor to cover all the relevant scenarios. This proposal is much more limited in scope, and the efficient implementations that go into core as part of this proposal can be reused for NestedTensor.
Similarly, once efficient implementations are in the core, TorchScript can have passes that will convert operations that loop over tensors in the list to call into these efficient implementations.
cc @cpuhrsch @vincentqb @mcarilli @soumith @dzhulgakov @ezyang