[feature request] Reduction (torch.add / torch.logaddexp / torch.max / torch.min / torch.mean) of several tensors without extra copies / allocations / memory accesses / TensorList inputs support #27522
Comments
I agree that would be nice. Would you be willing to submit a PR for this? @gchanan What would this take on the dispatch side? I imagine this would require changing the core op to take N inputs, so it might be more complicated than it seems. |
Also wondering how hard it would be to automatically parallelize this with streams (when suited) in the backend. |
do you mean |
fwiw NumPy supports |
This is not equivalent to
In fact, NumPy converts its input into an array in
But:

```python
import numpy as np

a = [np.random.rand(2, 2, 2) for _ in range(5)]
print(sum(a).shape)
# gives (2, 2, 2)
print(np.sum(a).shape)
# gives ()
```

TF has a dedicated function for that. The JIT can definitely optimize the particular pattern |
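For context, a sketch of the PyTorch side of the same comparison (my own illustration; the exact error depends on the PyTorch version): Python's builtin sum works over a list of tensors but builds N-1 temporaries, while torch.sum does not accept a list at all.

```python
import torch

a = [torch.rand(2, 2, 2) for _ in range(5)]

print(sum(a).shape)                 # torch.Size([2, 2, 2]) -- builtin sum, allocates N-1 temporaries
print(torch.stack(a).sum(0).shape)  # torch.Size([2, 2, 2]) -- current idiom, materializes one big copy

try:
    torch.sum(a)                    # unlike np.sum, torch.sum does not take a list of tensors
except (TypeError, RuntimeError) as e:
    print('torch.sum(list) raises:', e)
```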
@fmassa can we adapt |
@gchanan I meant equivalent of |
@vadimkantorov thanks for clarifying. I think all the dispatch building blocks are there for this, it's the same as |
we could, but |
@fmassa One way is to factor out the fused multiply-add into a separate method |
It would also be nice to have torch.logsumexp support multiple input tensors, so that the prior torch.stack can be avoided. Actually, this kind of API makes sense for all reduction functions. Alternatively, maybe this could be supported without an API change if a NestedTensor becomes reality (so that torch.stack is done only logically, without materialization). E.g. consider a simplistic impl of CTC:

```python
import torch

@torch.jit.script
def ctc_loss(log_probs, targets, input_lengths, target_lengths, blank : int = 0, reduction : str = 'none'):
    targets_ = torch.full((targets.shape[0], 2 * targets.shape[-1] + 1), blank, device = targets.device, dtype = targets.dtype)
    temporal_mask = torch.arange(targets.shape[-1], device = input_lengths.device, dtype = input_lengths.dtype).unsqueeze(0) < target_lengths.unsqueeze(1)
    targets_[:, 1::2] = temporal_mask * targets + (~temporal_mask) * targets_[:, 1::2]
    max_target_length = int(target_lengths.max())
    max_target_length_ = 2 * max_target_length + 1
    targets_ = targets_[:, :max_target_length_]
    neginf = torch.as_tensor([float('-inf')], device = log_probs.device, dtype = log_probs.dtype)
    batch_size = targets.shape[0]
    log_alpha = torch.empty(batch_size, log_probs.shape[0], 2 + max_target_length_, device = log_probs.device, dtype = log_probs.dtype)
    log_alpha[:, :3].fill_(neginf.sum())
    log_alpha[:, 0, 2 + 0] = log_probs[0, :, blank]
    log_alpha[:, 0, 2 + 1] = log_probs[0, torch.arange(batch_size), targets_[:, 1]]
    log_probs_ = log_probs.gather(-1, targets_.expand(len(log_probs), -1, -1))
    la3_ = torch.cat([torch.as_tensor([[True, True]], device = targets.device).expand(batch_size, -1), targets_[:, 2:] != targets_[:, :-2]], dim = 1)
    for t in range(1, len(log_probs)):
        la3 = log_alpha[:, t - 1, 0:-2]
        la2 = log_alpha[:, t - 1, 1:-1]
        la1 = log_alpha[:, t - 1, 2:]
        log_alpha[:, t, 2:] = log_probs_[t] + torch.logsumexp(torch.stack([la1, la2, torch.where(la3_, la3, neginf)]), dim = 0)
    l1 = log_alpha[:, input_lengths - 1, 2 + target_lengths * 2].diag()
    l2 = log_alpha[:, input_lengths - 1, 2 + target_lengths * 2 - 1].diag()
    return -torch.logsumexp(torch.stack([l1, l2]), dim = 0)
```

In order to do logsumexp of several tensors I have to first stack them. |
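A possible workaround sketch for that last point (my own illustration, not an existing TensorList API): reduce the list with pairwise torch.logaddexp, so the torch.stack materialization is avoided at the cost of N-1 kernel launches.

```python
import functools
import torch

def logsumexp_list(tensors):
    # log(sum_i exp(t_i)) computed pairwise, without stacking the inputs into one big tensor
    return functools.reduce(torch.logaddexp, tensors)

la1, la2, la3 = (torch.randn(4, 7) for _ in range(3))
out = logsumexp_list([la1, la2, la3])
assert torch.allclose(out, torch.logsumexp(torch.stack([la1, la2, la3]), dim=0), atol=1e-6)
```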
In #32100 it seems that NumPy actually supports lists of tensors as input to reduction functions, instead of a stacked tensor |
Related: #38377 |
Given that logaddexp is merged, I thought of bringing this up again: maybe via nested tensor. Another instance that could be useful: torch.argmax across two tensors (without stacking them first). E.g. if we pass two tensors in that dimension, it would return for every element 0 or 1. NumPy seems to support it for arbitrary functions via ufuncs: |
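A small sketch of the two-tensor argmax idea (my own illustration): the element-wise 0/1 index can be obtained from a comparison, without stacking.

```python
import torch

a, b = torch.randn(3, 4), torch.randn(3, 4)
which = (b > a).long()  # 0 where a wins, 1 where b wins (random floats make ties vanishingly unlikely)
assert torch.equal(which, torch.stack([a, b]).argmax(dim=0))
```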
Yay! NestedTensor seems to support reductions now: https://github.com/pytorch/nestedtensor/blob/master/examples/naryops_and_reduce.ipynb

Some API musings:

```python
# if the default nested dim is 0?
torch.add(NestedTensor([a, b]), dim = 0)

# maybe we can specify it semantically?
torch.add(NestedTensor([a, b]), dim = torch.nested_dimension)

# or have torch.stack return a nested tensor, since it's already an established way to bind tensors together, and new dim = 0 is expected
torch.add(torch.stack([a, b], layout = torch.nested), dim = 0)

# or maybe we can ask any op to do torch.as_tensor, and torch.as_nested_tensor if it's a list of tensors
torch.add([a, b], dim = torch.nested_dimension)
```
|
@izdeby Maybe foreach-related? |
Maybe if your test doesn't require grads they will be deallocated quickly. With gradients I think it should do N copies |
Yes, not O(N), but just O(2). Feel free to experiment:
and then 4 jupyter notebook cells:
|
From what I understand, in your test elements in |
That's a great point, Vadim. Let's try with grads. If I use:
then everything changes depending on the reduction function from which backward is called, e.g.
and with
note that in both cases, your version doesn't fare very differently from normal
If you want to break it down: |
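Since the actual notebook cells aren't shown here, a minimal sketch of the kind of experiment being discussed (my own, assuming a CUDA device): compare peak memory of the three reduction idioms with backward included.

```python
import functools
import torch

def peak_mem_mib(fn, tensors):
    # reset per-run stats, run forward + backward, and report the peak allocation in MiB
    for t in tensors:
        t.grad = None
    torch.cuda.reset_peak_memory_stats()
    fn(tensors).sum().backward()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

tensors = [torch.randn(1024, 1024, device='cuda', requires_grad=True) for _ in range(16)]

variants = {
    'python sum':     lambda ts: sum(ts),
    'inplace reduce': lambda ts: functools.reduce(lambda acc, x: acc.add_(x), ts, torch.zeros_like(ts[0])),
    'stack + sum':    lambda ts: torch.stack(ts).sum(dim=0),
}
for name, fn in variants.items():
    print(f'{name:15s} {peak_mem_mib(fn, tensors):8.1f} MiB')
```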
Is it at all realistic to use Triton here? Surely their tutorial on summing vectors can be extended to multiple vectors. You'd then be left with implementing autograd for this operation, which can be achieved via autograd.Function. |
I think it doesn't make much sense to take a dependency on Triton just for this summation optimization. If it's used for all other multi-tensor reductions, it's more reasonable... Ideally, these multi-tensor reductions would be supported for all reductions via NestedTensor bindings (or just accepting tensorlists in place of a tensor) ... |
@vadimkantorov - You could guard against the presence of Triton and fall back to a less efficient implementation if it's not present. It should also allow you to unblock your perf explorations, such as measuring the overall influence on the model and the potential gains. |
I think for summation specifically manual |
But as #70533 mentions, probably the sum / functools.reduce workarounds aren't very TorchScript-able... |
If you need TorchScript support, you're unlikely to get around using the C++ extensibility features and shipping your own binary code at this point. Neither Triton nor functools is JIT-able. |
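To make the C++-extension route concrete, a minimal hedged sketch (my own example; requires a working C++ toolchain, and the name sum_list is arbitrary) using torch.utils.cpp_extension.load_inline:

```python
import torch
from torch.utils.cpp_extension import load_inline

# One in-place accumulation over a TensorList, compiled as a tiny extension.
cpp_src = r"""
#include <vector>
torch::Tensor sum_list(std::vector<torch::Tensor> tensors) {
  TORCH_CHECK(!tensors.empty(), "expected a non-empty list of tensors");
  auto out = torch::zeros_like(tensors[0]);
  for (const auto& t : tensors) out.add_(t);  // single allocation, no O(N) temporaries
  return out;
}
"""
ext = load_inline(name="sum_list_ext", cpp_sources=cpp_src, functions=["sum_list"])
print(ext.sum_list([torch.randn(3, 3) for _ in range(4)]).shape)
```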
@cpuhrsch a real-world use-case for torch.sum(tensor_list, dim = 0): https://github.com/pytorch/pytorch/blob/master/torch/distributed/nn/functional.py#L298

Also, could there be an improvement from using a single allocation instead of [torch.empty_like(tensor) for tensor in grad_outputs]? E.g. using nestedtensor (a _foreach_empty_like, or a new general multi/foreach-allocation utility?). Elements in the list do not have equal shapes, right? This should be quite simple if the tensors are all dense / not meta-tensors.

For torch.sum to benefit from more efficiency, nestedtensor/tensorlist should not force a copy (especially if the number of tensors to be summed is small). Ideally there should also be an in-place version of sum/add, torch.add_(res, *[list of extra summands]); this would be the most idiomatic replacement for the manual loop |
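A sketch of the single-allocation idea (my own illustration; _foreach_empty_like above is hypothetical): hand out views of one flat buffer, assuming the tensors are dense and share a dtype and device.

```python
import torch
from typing import List

def empty_like_list(tensors: List[torch.Tensor]) -> List[torch.Tensor]:
    # one flat allocation, sliced into per-tensor views (shapes may differ, dtype/device must match)
    flat = torch.empty(sum(t.numel() for t in tensors),
                       dtype=tensors[0].dtype, device=tensors[0].device)
    outs, offset = [], 0
    for t in tensors:
        outs.append(flat[offset:offset + t.numel()].view(t.shape))
        offset += t.numel()
    return outs

grad_outputs = [torch.randn(3, 4), torch.randn(5), torch.randn(2, 2, 2)]
buffers = empty_like_list(grad_outputs)
print([b.shape for b in buffers])
```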
Another use case / argument for studying ops on multiple tensors and using or avoiding stack/copy: 5654e63#diff-8333640cff01dcc6a2ab20085003e8c998db905ad85c09603a20a6e363aff2e2R135 |
@cpuhrsch Also, one could imagine reduction across multiple tensors of different sizes without prior padding, e.g. using neutral elements as imputation: 0 for sum. Also might be possible for mean. Motivating example: #27522 (comment) where one computes logsumexp of tensors of slightly different sizes: N, N-1, N-2. Also related to #77876 |
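A sketch of the neutral-element idea for the logsumexp case above (my own illustration): pad the shorter tensors with -inf (the neutral element for logsumexp) before stacking; a TensorList-aware reduction could skip both the padding and the copy.

```python
import torch
import torch.nn.functional as F

xs = [torch.randn(n) for n in (10, 9, 8)]  # lengths N, N-1, N-2
max_len = max(x.shape[-1] for x in xs)
padded = torch.stack([F.pad(x, (0, max_len - x.shape[-1]), value=float('-inf')) for x in xs])
out = torch.logsumexp(padded, dim=0)  # shape (max_len,); the -inf entries contribute nothing
print(out.shape)
```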
@cpuhrsch A recent Apple transformers optimization stresses reducing memory copies and also uses softmax on a tensorlist: So supporting a copy-avoiding TensorList (in addition to the always-copying NestedTensor), at least for some ops, may be a good idea |
@vadimkantorov - while I agree, I'm wondering whether we couldn't achieve the same by mostly sticking to NestedTensors across the operations? As in, let's look at where the TensorList comes from to begin with and use a NestedTensor right away. |
Copy semantics may be unwanted overhead for some things (e.g. the optimizers' existing multi-tensor-apply, which may try to do things inplace as much as possible?). Regarding the frontend, allowing regular Python lists of tensors may be terser syntax, e.g. torch.softmax([...my tensor list], dim = -1) -> tensor list (but computed with parallelization if needed, especially on CPU, or with CUDA streams; related: #78507) |
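A sketch of that frontend (my own hypothetical helper, not a torch API): softmax over a plain Python list, optionally running each tensor on its own CUDA stream so the independent kernels can overlap; full stream-safety bookkeeping (record_stream etc.) is glossed over here.

```python
import torch
from typing import List

def softmax_list(tensors: List[torch.Tensor], dim: int = -1) -> List[torch.Tensor]:
    if tensors and tensors[0].is_cuda:
        outs = []
        for t in tensors:
            s = torch.cuda.Stream()
            s.wait_stream(torch.cuda.current_stream())  # inputs were produced on the current stream
            with torch.cuda.stream(s):
                outs.append(torch.softmax(t, dim=dim))
        torch.cuda.synchronize()  # make the results safe to use from the default stream
        return outs
    return [torch.softmax(t, dim=dim) for t in tensors]

outs = softmax_list([torch.randn(3, 5), torch.randn(2, 7)], dim=-1)
```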
Another useful method to support for TensorList inputs is torch.max / torch.min |
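A workaround sketch for that as well (my own, not a TensorList API): element-wise max/min across equal-shaped tensors via pairwise torch.maximum / torch.minimum, again without stacking.

```python
import functools
import torch

ts = [torch.randn(4, 4) for _ in range(5)]
max_out = functools.reduce(torch.maximum, ts)
min_out = functools.reduce(torch.minimum, ts)
assert torch.equal(max_out, torch.stack(ts).amax(dim=0))
assert torch.equal(min_out, torch.stack(ts).amin(dim=0))
```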
I guess there would be a perf difference due to the re-allocations / copies that NestedTensor currently forces |
If I understand correctly,
sum(tensor_list)
will allocate and keep O(N) intermediate tensors (same with a for loop), where N is the number of tensors, which can be quite large in the case of a big DenseNet. I propose to maybe generalize torch.add to support more than two tensors as input.

Currently one can do:
functools.reduce(lambda acc, x: acc.add_(x), tensor_list, torch.zeros_like(tensor_list[0]))
so it's not super-urgent, but a more idiomatic, TorchScript-able way may be nice.
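For reference, a runnable version of the workaround above plus a TorchScript-able equivalent (a sketch under the assumption that a plain loop is acceptable; functools itself is not scriptable):

```python
import functools
import torch
from typing import List

tensor_list = [torch.randn(3, 3) for _ in range(8)]

# the workaround from the issue text: O(1) extra memory, but not scriptable
acc = functools.reduce(lambda a, x: a.add_(x), tensor_list, torch.zeros_like(tensor_list[0]))

# a scriptable equivalent using a plain loop and in-place accumulation
@torch.jit.script
def sum_list(tensors: List[torch.Tensor]) -> torch.Tensor:
    out = torch.zeros_like(tensors[0])
    for t in tensors:
        out += t
    return out

assert torch.allclose(acc, sum_list(tensor_list))
```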