Provide efficient implementation for operations on lists of tensors. #38655
Comments
Awesome, been looking forward to getting this in core for a long time. Do we want the implementations to handle arbitrary tensor layouts, or layouts that don't match across or between lists? I think they must, for consistency with TensorIterator-based APIs, which do allow e.g. a + b where a and b have different layouts. For reference, my current harness is here: https://github.com/NVIDIA/apex/blob/master/csrc/multi_tensor_apply.cuh. It accepts a functor that allows arbitrary pointwise ops on lists of tensors, and processes them in one or a few launches with bandwidth comparable to one pointwise launch on a single flattened tensor. It can also be used for reductions. Several people (@slayton58, @FDecaYed, @thorjohnsen, @tulloch, @crcrpar) have used it to create custom fused optimizers.
For consistency it should handle non-matching layouts, but it's OK to call .contiguous() on problematic tensors rather than torture the kernels themselves. Matching memory-dense (but not necessarily contiguous) tensors in the lists should not be a problem, right?
Tangentially related: reductions on lists of tensors #27522
Let's do this!
Yay, let's do it! What would the rewritten optimizers look like? Explicit calls to
@dzhulgakov On the optimizer side, the only minor annoyance is organizing params+grads+state buffers into lists according to how flexible the multi tensor apply backend is. For example, the Apex implementation only handles lists as long as all tensors in each list are the same dtype**, so params+grads need to be split accordingly (see @ajtulloch's code). I think it remains legible. The split across dtypes could even be moved to the backend so the user can pass any tensor list. ** for ops that act on several lists, the dtype must be the same within each list but may differ across lists. This enables me to effectively fuse casting with any pointwise op, i.e., read a list of fp16 tensors, perform an arbitrary pointwise op, and write out into a list of fp32 tensors. Also, fun fact, if the output tensor list was created as sliced views of a large empty tensor, what you've done is a fused cast + pointwise op + cat. Multi tensor apply is a great way to do cat, cleaner than the existing implementation in that no auxiliary metadata tensors need to be allocated. Metadata is passed via kernel args.
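To make the "sliced views of a large empty tensor" trick concrete, here is a rough Python sketch using the foreach-style ops proposed in this issue rather than Apex's multi_tensor_apply directly; the sizes and the pointwise op are illustrative, and in a real MTA kernel the cast and the op would be fused into a single launch:

```python
import torch

# fp16 inputs living at arbitrary, disjoint memory locations
fp16_tensors = [torch.randn(n, device="cuda", dtype=torch.float16) for n in (3, 5, 7)]

# One flat fp32 buffer; the outputs are non-overlapping views (slices) of it.
flat = torch.empty(sum(t.numel() for t in fp16_tensors), device="cuda")
outs = list(flat.split([t.numel() for t in fp16_tensors]))

# copy_ casts fp16 -> fp32 into the views; an MTA kernel would fuse this with the op.
for dst, src in zip(outs, fp16_tensors):
    dst.copy_(src)

torch._foreach_mul_(outs, 2.0)  # arbitrary pointwise op applied list-wise

# Because the outputs are views of `flat`, the results are already concatenated:
# a fused cast + pointwise op + cat, with no torch.cat and no metadata tensors.
```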
what definition of layout are we using here? strides?
this is a bit of bike-shedding, but it's not obvious what the semantics are when we call things like AFAICT, up until @mcarilli's point about doing "fused cast + pointwise op + cat" we are really just talking about doing a "for each" operator application -- and even @mcarilli's optimization would maybe fit (although I'm not sure whether autograd would actually work given the limitations on views). So, long story short, it seems nicer to call things "_add_foreach.Tensor[]" or "_add_foreach.TensorList" or similar.
Agreed, _add_foreach is a better name, we definitely don't want opaque things.
This is great! NestedTensor can lower a lot of operations to TensorList. In fact, the TensorList operations NestedTensor stands to launch will be a subset of the ones planned to be supported here. NestedTensor is a Tensor-esque type and as such must have a consistent dtype, device, dimension and layout. So it's not the right container for this case if we want to support mixed dtypes, layouts, etc. One comment here: you might want the Tensors in a given TensorList to be part of a large contiguous piece of memory. This is a condition that could be detected at runtime based on data_ptr, offsets, numel(), etc., but there could also be a place for a PackedTensorList structure or such that has some additional properties along these lines. This is again something NestedTensor could dispatch to.
This is great news!
@cpuhrsch I don't think Greg means "TensorList" will be its own class, if that's what you meant. This is not something intended to rival NestedTensor, it's lower-level and can serve as an implementation detail of NestedTensor someday. I think Greg means
With the current multi_tensor_apply implementation in apex there's very little difference between operating on a large contiguous region of memory or on tensors in different memory locations. It is still good to keep in mind as a potential optimization, but the general idea is to not require
**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API

**Tests**
Tested via unit tests

**Plan for the next PRs**
1. APIs
   - Binary Ops for list with Scalar
   - Binary Ops for list with list
   - Unary Ops for list
2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

[ghstack-poisoned]
…ors, Scalar scalar) (#41554)

Summary: Initial PR for the Tensor List functionality.

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**In this PR**
- Adding the `multi_tensor_apply` mechanism, which helps efficiently apply a passed functor to a given list of tensors on CUDA.
- Adding a first private API - `std::vector<Tensor> _foreach_add(TensorList tensors, Scalar scalar)`

**Tests**
Tested via unit tests

**Plan for the next PRs**
1. Cover these ops with `multi_tensor_apply` support
   - exponent
   - division
   - mul_
   - add_
   - addcmul_
   - addcdiv_
   - sqrt
2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

Pull Request resolved: #41554
Reviewed By: cpuhrsch
Differential Revision: D22829724
Pulled By: izdeby
fbshipit-source-id: 47febdbf7845cf931958a638567b7428a24782b1
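For a sense of how the private API is called from Python, a minimal usage sketch (a CUDA device is assumed here, though the slow path covers other cases too):

```python
import torch

tensors = [torch.randn(100, device="cuda") for _ in range(50)]

# One call covers the whole list instead of 50 separate elementwise launches.
result = torch._foreach_add(tensors, 1.0)   # out-of-place: returns new tensors
torch._foreach_add_(tensors, 1.0)           # in-place variant mutates the inputs
```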
…ar scalar)"

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554).

**In this PR**
- Adding a `std::vector<Tensor> _foreach_add_(TensorList tensors, Scalar scalar)` API
- Resolving some additional comments from the previous [PR](#41554).

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists

**Plan for the next PRs**
1. APIs
   - Binary Ops for list with Scalar
   - Binary Ops for list with list
   - Unary Ops for list
2. Rewrite PyTorch optimizers to use for-each operators in order to get performance gains.

[ghstack-poisoned]
Summary: Pull Request resolved: #42537

[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554).

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is the 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can do it with a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path. To go the fast route,
- All tensors must have a strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------

**In this PR**
Adding APIs:
```
torch._foreach_exp(TensorList tl1)
torch._foreach_exp_(TensorList tl1)
torch._foreach_sqrt(TensorList tl1)
torch._foreach_sqrt_(TensorList tl1)
```

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
   - Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Test Plan: Imported from OSS
Reviewed By: cpuhrsch
Differential Revision: D23331889
Pulled By: izdeby
fbshipit-source-id: 8b04673b8412957472ed56361954ca3884eb9376
[First PR: Add private API to support tensor lists: _foreach_add(TensorList tensors, Scalar scalar)](#41554).

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is the 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can do it with a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path. To go the fast route,
- All tensors must have a strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensor type must be the same.

----------------

**In this PR**
Adding APIs:
```
torch._foreach_sub(TensorList tl1, TensorList tl2)
torch._foreach_sub_(TensorList self, TensorList tl2)
torch._foreach_mul(TensorList tl1, TensorList tl2)
torch._foreach_mul_(TensorList self, TensorList tl2)
torch._foreach_div(TensorList tl1, TensorList tl2)
torch._foreach_div_(TensorList self, TensorList tl2)
torch._foreach_sub(TensorList tl1, Scalar scalar)
torch._foreach_sub_(TensorList self, Scalar scalar)
torch._foreach_mul(TensorList tl1, Scalar scalar)
torch._foreach_mul_(TensorList self, Scalar scalar)
torch._foreach_div(TensorList tl1, Scalar scalar)
torch._foreach_div_(TensorList self, Scalar scalar)
```

**Tests**
Tested via unit tests

**TODO**
1. Properly handle empty lists
2. Properly handle bool tensors

**Plan for the next PRs**
1. APIs
   - Unary Ops for list
   - Pointwise Ops
2. Complete tasks from TODO
3. Rewrite PyTorch optimizers to use for-each operators for performance gains.

Differential Revision: [D23331891](https://our.internmc.facebook.com/intern/diff/D23331891)

[ghstack-poisoned]
… lists"

Differential Revision: [D23331896](https://our.internmc.facebook.com/intern/diff/D23331896)

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is the 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can do it with a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path. To go the fast route,
- All tensors must have a strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensors must have the same dtype.
- All tensors must be on the same device.

----------------

**In this PR**
Added alpha overloads for add/sub ops with lists

**Tests**
Tested via unit tests

[ghstack-poisoned]
Differential Revision: [D23331893](https://our.internmc.facebook.com/intern/diff/D23331893)

**Motivation**
[GitHub issue](#38655) Current PyTorch optimizer implementations are not efficient when we work with a lot of small feature tensors. Starting a lot of kernels slows down the whole process; we need to reduce the number of kernels that we start. As an example, we should be looking at [NVIDIA's Apex](https://github.com/NVIDIA/apex). In order to track progress, we will pick PyTorch's DCGAN model with the Adam optimizer and, once the optimizer is reimplemented with tensor lists, benchmark the model performance against the original model version, Apex's version with the original Adam optimizer, and its FusedAdam optimizer.

**Current API restrictions**
- The list can't be empty (will be fixed in upcoming PRs).
- All tensors in the list must have the same dtype, device and size.

**Broadcasting**
At this point we don't support broadcasting.

**What is the 'Fast' and 'Slow' route**
In particular cases, we can't process an op with a fast list CUDA kernel. Still, we can do it with a regular for-loop where the op is applied to each tensor individually through the dispatch mechanism. There are a few checks that decide whether the op will be performed via the 'fast' or 'slow' path. To go the fast route,
- All tensors must have a strided layout
- All tensors must be dense and not have overlapping memory
- The resulting tensors must have the same dtype.
- All tensors must be on the same device.

----------------

**In this PR**
- We are introducing a new namespace under torch.optim - torch.optim.multi_tensor - where we will have optimizers rewritten with _foreach_* APIs.
- Rewriting the Adam optimizer with _foreach_* APIs

[ghstack-poisoned]
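As an illustration of what the rewrite looks like (a hedged sketch, not the actual torch.optim.multi_tensor code; it assumes the addcmul_/addcdiv_ foreach variants from the plan above), an Adam step over whole parameter lists might be:

```python
import math
import torch

def adam_step(params, grads, exp_avgs, exp_avg_sqs, step,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment updates, each as a single list-wise call.
    torch._foreach_mul_(exp_avgs, beta1)
    torch._foreach_add_(exp_avgs, grads, alpha=1 - beta1)
    torch._foreach_mul_(exp_avg_sqs, beta2)
    torch._foreach_addcmul_(exp_avg_sqs, grads, grads, value=1 - beta2)

    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step

    # denom = sqrt(v) / sqrt(bias_correction2) + eps, computed list-wise.
    denom = torch._foreach_sqrt(exp_avg_sqs)
    torch._foreach_div_(denom, math.sqrt(bias_correction2))
    torch._foreach_add_(denom, eps)

    # params -= (lr / bias_correction1) * exp_avg / denom
    torch._foreach_addcdiv_(params, exp_avgs, denom, value=-lr / bias_correction1)
```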
@ngimel I'm curious about the performance tradeoff between this and performing single kernel ops on a giant flat vector containing all parameters (i.e., the actual parameters would be non-overlapping views of it).
The benchmarks show that multi-tensor-apply kernels achieve a high fraction of peak bandwidth (@mcarilli will know more). The problem with a giant flat vector containing the params is that it's usually pretty fragile (unless you have a simple training loop, always use all the parameters, no parameter sharing, etc).
One way to support flatten_parameters would be checking if all inputs to multi_tensor_apply are indeed contiguous in memory (a complication may be if there are any memory alignment gutters; one could sort tensors by memory address and check that the difference between addresses is numel * itemsize), and then launching a single kernel. Then an optional flatten_parameters may be supported, to be explicitly used only for advanced cases. But as @ngimel had mentioned in #38655 (comment), the perf sacrifice may not be big anyway.
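A minimal sketch of that detection, assuming the tensors in the list are dense; the helper name is made up here, not an existing PyTorch function:

```python
import torch

def backed_by_one_flat_buffer(tensors):
    # Sort by address and check each tensor ends exactly where the next begins;
    # any alignment gutter, gap, or overlap rules out a single flat launch.
    ts = sorted(tensors, key=lambda t: t.data_ptr())
    return all(
        a.data_ptr() + a.numel() * a.element_size() == b.data_ptr()
        for a, b in zip(ts, ts[1:])
    )
```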
@ngimel Nice, thanks for responding quickly. Since the conditions for the giant flat vector are actually met for most use cases, it is great to hear that we will have a solution that doesn't sacrifice much performance for generality.
On the GPU, MTA kernels should achieve bandwidth comparable to acting on a giant flat vector. If you find a realistic case where they don't, let us know. The main perf difference vs acting on a flat buffer occurs on the CPU side: packing tensor lists for MTA launches in Python can be expensive (example: we observe over 3 msec to build lists from the entire parameter set in BERT, in some cases). To reduce this overhead, MTA clients (e.g. optimizers) could maintain such lists persistently where possible, and where necessary, filter them on the C++ side (e.g. for optimizers, with a helper like
@mcarilli That's great to hear. Will do.
Optimizers should robustly handle the cases where not all parameters receive grads every iteration. |
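In practice that mostly means building each step's lists only from parameters that actually got a gradient; a small sketch (the weight_decay handling is illustrative):

```python
import torch

def gather_live_params_and_grads(params, weight_decay=0.0):
    # Skip parameters whose grads were not produced this iteration.
    params_with_grad = [p for p in params if p.grad is not None]
    grads = [p.grad for p in params_with_grad]
    if weight_decay != 0.0 and grads:
        # Apply L2 regularization to the whole list in one call,
        # out-of-place so .grad itself is left untouched.
        grads = list(torch._foreach_add(grads, params_with_grad, alpha=weight_decay))
    return params_with_grad, grads
```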
There's also now some FlatParameter in FSDP. Is it related to general parameter flattening?
I do not think that (pytorch/torch/distributed/fsdp/flat_param.py, Line 123 in 14d5f13)
We have specialized
We should have efficient implementations for a small subset of operations on lists of tensors, such as
Motivation
For a lot of GPU training workloads, the performance of the optimizer becomes the bottleneck. Currently there are two extreme cases of GPU optimizer performance - very flexible and hackable implementations in torch.optim, and fully fused, rigid implementations in apex (github.com/nvidia/apex), available for SGD, Adam, and, as of a couple of days back, Adagrad. People who want to experiment with optimizers either have to suffer through the bad performance that typically comes from implementing an optimizer the usual way, iterating over lists of tensors, or, if they are up to it, locally modify apex to have a fast fused kernel. The latter can be pretty challenging.
A typical optimizer launches 5-10 kernels per parameter tensor. Models can have hundreds of parameters, so 500-1000 kernels have to be launched. Those kernels are often very small, and the operations become CPU-bound. If we can launch a single kernel that operates on all the parameter tensors at the same time, we can reduce the number of kernel launches by a factor of 100 and relieve the CPU-boundedness.
Other situations where efficient implementations of operations on tensor lists are helpful include the `.zero_grad()` function, unscaling gradients and checking for `nan` values in mixed precision training, computing the norm of parameters for `clip_grad_norm`, and converting parameters/gradients between low- and high-precision types for mixed precision training. Nvidia has kernels in apex that can efficiently operate on a disjoint list of tensors, but it's impossible to use those kernels outside of apex, and each particular operation has to be implemented separately (currently those kernels are used for mixed precision gradient scaling and a few fused optimizers). If efficient list operations were available in pytorch core, that would enable much easier experimentation.
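For example, the gradient-related cases above could look roughly like this, assuming `_foreach_zero_` and `_foreach_norm` counterparts end up among the supported ops (the names here are hypothetical):

```python
import torch

def foreach_zero_grad(params):
    # .zero_grad() as one list-wise call instead of a per-tensor Python loop.
    grads = [p.grad for p in params if p.grad is not None]
    if grads:
        torch._foreach_zero_(grads)

def foreach_clip_grad_norm_(params, max_norm):
    # clip_grad_norm_: per-tensor norms in one pass, then one list-wise scale.
    grads = [p.grad for p in params if p.grad is not None]
    norms = torch._foreach_norm(grads)
    total_norm = torch.linalg.vector_norm(torch.stack(norms))
    if total_norm > max_norm:
        clip_coef = (max_norm / (total_norm + 1e-6)).item()
        torch._foreach_mul_(grads, clip_coef)
    return total_norm
```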
Proposal
Expose experimental tensor list operations for arithmetic operations, `fill_`, `norm`, backed by efficient implementations. The initial list of operations can be discussed, but tentatively it should be enough to implement the operations in the existing optimizers. In particular, we should support in-place variants, and operations between a tensor list and a scalar. E.g., for `add` we want the following variants:
Type promotion should be supported as for regular operations.
Shapes in the list arguments should be the same; error out on mismatch or if broadcasting is required (this is open for debate).
For backends that don't have efficient implementations or don't need them, fall back to implementations that loop over the tensors in the list.
After this is done, existing optimizers can be rewritten to use these ops, leading to better performance. The optimizer implementations themselves will remain readable, accessible and hackable.
Backward formulas can be provided for operations on the list. However, given that the primary focus is use in optimizers, backward support should not be a blocker.
Alternatives
This proposal has significant overlap with the NestedTensor proposal, but we understand that it is hard to implement NestedTensor to cover all the relevant scenarios. This proposal is much more limited in scope, and the efficient implementations that go into core as part of this proposal can be reused for NestedTensor.
Similarly, once efficient implementations are in the core, TorchScript can have passes that will convert operations that loop over tensors in the list to call into these efficient implementations.
cc @cpuhrsch @vincentqb @mcarilli @soumith @dzhulgakov @ezyang