Reset mask for torch.cumsum? · Issue #53095 · pytorch/pytorch · GitHub

Reset mask for torch.cumsum? #53095


Open
18jeffreyma opened this issue Mar 2, 2021 · 7 comments
Labels
enhancement (Not as big of a feature, but technically not a bug. Should be easy to fix)
module: numpy (Related to numpy support, and also numpy compatibility of our operators)
module: reductions
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

18jeffreyma commented Mar 2, 2021

🚀 Feature

torch.cumsum currently performs a cumulative sum over a given axis without ever resetting the running sum (i.e. the final term is the sum of all terms in the input vector). I'm proposing adding an argument mask, a vector of 0s and 1s, where a 1 indicates that the cumulative sum should reset at that position, allowing multiple cumulative sums to be computed within a single vector.

Motivation

I'm currently looking to compute a per-trajectory cumulative sum of log probabilities for a reinforcement learning task, and I find it generally easier to keep all trajectories in a single vector (for ease of transfer between CPU and GPU) along with a separate vector that marks the boundaries between trajectories.

Pitch

I'd like something like an argument to torch.cumsum called mask, which has the same length as the axis being summed over and indicates where to reset the cumulative sum. For example, if my vector were x = [a, b, c, d, e] and my mask were m = [1, 0, 0, 1, 0], then torch.cumsum(x, mask=m) should return something like [a, a+b, a+b+c, d, d+e].
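
For what it's worth, this behavior can already be emulated with two plain cumsums plus an index lookup. A minimal sketch, assuming a 1-D input, an integer mask, and mask[0] == 1 (the helper name masked_cumsum is hypothetical, not an existing PyTorch op):

import torch

def masked_cumsum(x, mask):
    # mask[i] == 1 means "start a new cumulative sum at position i"; assumes mask[0] == 1
    full = torch.cumsum(x, dim=0)
    starts = mask.nonzero(as_tuple=True)[0]            # indices where each segment begins
    offsets = torch.zeros(len(starts), dtype=x.dtype, device=x.device)
    offsets[1:] = full[starts[1:] - 1]                 # running total accumulated before each later segment
    seg_id = torch.cumsum(mask, dim=0) - 1             # segment index of every position
    return full - offsets[seg_id]

x = torch.tensor([1., 2., 3., 4., 5.])
m = torch.tensor([1, 0, 0, 1, 0])
print(masked_cumsum(x, m))                             # tensor([1., 3., 6., 4., 9.])

Autograd differentiates through both cumsums, and the gradient contributions of earlier segments cancel in the subtraction, though the forward pass can lose precision for very long sequences with large running totals.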

Alternatives

My current implementation is here (it manually iterates through a vector of log probabilities and uses an inverted mask to reset the cumulative sum):

https://github.com/18jeffreyma/multi_cmd/blob/3106417eb4acb9230576e78e7267a24b2efa6b3d/multi_cmd/rl_utils/multi_copg.py#L179

However, this implementation does not seem to play very well with CUDA acceleration under autograd: I see zero performance benefit on an NVIDIA DGX versus the CPU, which is the primary reason I'm bringing up this feature request. Alternatively, is there some autograd issue with implementing this masked-reset cumulative sum using a for-loop?

cc @mruberry @rgommers @heitorschueroff

@zou3519 zou3519 added feature A request for a proper, new feature. module: reductions triage review labels Mar 2, 2021
zou3519 (Contributor) commented Mar 2, 2021

Marking this as "triage review" because I'm not sure what to label this as.

This reminds me a lot of the "reduction with splits" that Caffe2 has (where one provides some "splits" for a vector and the reduction reduces each section of a vector as defined by the splits).

From some googling, I haven't been able to find a precedent for this kind of functionality; numpy's cumsum doesn't have a reset mask. However, this sounds like a useful operation to have.

18jeffreyma (Author)

Gotcha, thanks for confirming! In terms of a temporary workaround while this is implemented, is manually iterating through each element and tracking the cumulative sum slower, in terms of autodifferentiation, than slicing, computing the cumulative sum on each slice, then concatenating? I.e., something like this:

for i in range(slp.size(0)):
     if i == 0:
         slp[0] = 0.
     else:
         slp[i] = torch.add(slp[i-1], lp[i-1]) * mask[i-1]

18jeffreyma (Author)

In a similar vein, would you happen to know where exactly in the repo torch.cumsum is implemented? I'd really appreciate it, to understand a bit further why slicing and doing cumsum is more efficient than the implementation above.

zou3519 (Contributor) commented Mar 8, 2021

In a similar vein, would you happen to know where exactly in the repo torch.cumsum is implemented? I'd really appreciate it, to understand a bit further why slicing and doing cumsum is more efficient than the implementation above.

It's a little complicated, but cumsum is implemented here:

Tensor cumsum(const Tensor& self, int64_t dim, c10::optional<ScalarType> dtype) {
  auto result = [&]() {
    NoNamesGuard guard;
    return at::_cumsum(integer_upcast(self, dtype), dim);
  }();
  namedinference::propagate_names(result, self);
  return result;
}

This calls into _cumsum, which calls into either _cumsum_cpu or _cumsum_cuda depending on the backend:

CPU: _cumsum_cpu
CUDA: _cumsum_cuda

zou3519 (Contributor) commented Mar 8, 2021

Gotcha, thanks for confirming! In terms of a temporary workaround while this is implemented, is manually iterating through each element and tracking the cumulative sum slower, in terms of autodifferentiation, than slicing, computing the cumulative sum on each slice, then concatenating? I.e., something like this:

for i in range(slp.size(0)):
     if i == 0:
         slp[0] = 0.
     else:
         slp[i] = torch.add(slp[i-1], lp[i-1]) * mask[i-1]

It's hard to say without benchmarking and knowing the input sizes. But generally, manually iterating through each element is not very efficient. The more operations that can be batched together, the better. So I'd expect the latter (i.e. slicing, computing the cumulative sum on slices, then concatenating) to be faster.
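
A rough sketch of that slice-and-concatenate approach (variable names lp and mask are placeholders; it assumes a 1-D tensor and a reset mask whose first entry is 1):

import torch

lp = torch.tensor([1., 2., 3., 4., 5.])
mask = torch.tensor([1, 0, 0, 1, 0])                   # 1 = reset the running sum here

starts = mask.nonzero(as_tuple=True)[0]                # positions where a new segment begins
lengths = torch.diff(starts, append=torch.tensor([lp.numel()])).tolist()
chunks = torch.split(lp, lengths)                      # one chunk per segment
out = torch.cat([torch.cumsum(c, dim=0) for c in chunks])
print(out)                                             # tensor([1., 3., 6., 4., 9.])

The Python loop here runs only over the number of segments, not over individual elements, so the bulk of the work stays in batched cumsum and cat calls.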

@ngimel ngimel added enhancement Not as big of a feature, but technically not a bug. Should be easy to fix and removed feature A request for a proper, new feature. triage review labels Mar 15, 2021
@jbschlosser jbschlosser added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Mar 15, 2021
@heitorschueroff heitorschueroff added module: numpy Related to numpy support, and also numpy compatibility of our operators and removed module: reductions labels Jul 9, 2021
sergiynesterenko90

+1 for this; it would be helpful for reinforcement learning applications. I suspect the original post has the same use case in mind (it mentions a vector of log probs).

biuq commented May 9, 2024

This can be implemented without Python loops in the current version of PyTorch (tested on 2.2). Below is a slightly modified implementation based on heinsen_sequence; it evaluates the linear recurrence x[i] = coeffs[i] * x[i-1] + values[i] in log space, so a 0 in coeffs resets the running sum (hence the inverted mask below):

import torch

def complex_log(float_input, eps=1e-6):
    # log of a real tensor, keeping the sign in the imaginary part (log(-x) = log|x| + i*pi)
    eps = float_input.new_tensor(eps)
    real = float_input.abs().maximum(eps).log()
    imag = (float_input < 0).to(float_input.dtype) * torch.pi
    return torch.complex(real, imag)

def associative_scan(values: torch.Tensor, coeffs: torch.Tensor, dim: int):
    # parallel evaluation of x[i] = coeffs[i] * x[i-1] + values[i] via cumulative ops in log space
    log_values = complex_log(values.float())
    log_coeffs = complex_log(coeffs.float())
    a_star = torch.cumsum(log_coeffs, dim=dim)
    log_x0_plus_b_star = torch.logcumsumexp(log_values - a_star, dim=dim)
    log_x = a_star + log_x0_plus_b_star
    return torch.exp(log_x).real

input = torch.tensor([1, 2, 3, 4, 5])
inverted_reset_mask = torch.tensor([0, 1, 1, 0, 1])
output = associative_scan(input, inverted_reset_mask, dim=0)

print(output)
tensor([1.0000, 3.0000, 6.0000, 4.0000, 9.0000])
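
For reference, this matches the mask convention from the original pitch (1 = reset at that position) if the coefficients are taken as the inverted mask; a small usage sketch continuing from the code above:

reset_mask = torch.tensor([1, 0, 0, 1, 0])             # 1 = start a new cumulative sum here
coeffs = 1 - reset_mask                                # 0 zeroes out the carried-over sum
out = associative_scan(torch.tensor([1, 2, 3, 4, 5]), coeffs, dim=0)
print(out)                                             # tensor([1., 3., 6., 4., 9.])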

Related #95408
