
Unnecessary cuda synchronizations that we should remove in PyTorch #108968


Open · 2 of 6 tasks
Chillee opened this issue Sep 10, 2023 · 6 comments
Labels
module: cuda (Related to torch.cuda, and CUDA support in general)
module: performance (Issues related to performance, either of kernel code or framework glue)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

Chillee (Collaborator) commented Sep 10, 2023

🚀 The feature, motivation and pitch

There are a number of unnecessary CUDA synchronizations in PyTorch ops, and I think we should endeavor to remove them whenever possible.
To check for syncs, you can use torch.cuda.set_sync_debug_mode("warn").
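For instance, a minimal way to surface these warnings might look like the following (nonzero() is just an illustrative op that needs its result count on the host):

import torch

# Warn whenever an op triggers an implicit CUDA synchronization
# (passing "error" would raise instead of warning).
torch.cuda.set_sync_debug_mode("warn")

x = torch.randn(4, device='cuda')
idx = x.nonzero()  # copies the number of nonzero elements back to the CPU, so this warns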

I'm creating this issue to track ones that I've seen/found.

  • torch.multinomial with a CUDA tensor forces a synchronization:
A = torch.rand(10, device='cuda')
torch.multinomial(A, num_samples=1)
  • repeat_interleave with a tensor number of repeats requires a synchronization: repeats cannot be a non-CUDA tensor when the input is on CUDA, so computing the output shape forces a device-to-host sync. For this I think we should add a list-of-ints overload or allow passing a CPU tensor for repeats. (See the note after the example below.)
A = torch.randn(3, device='cuda')
num_repeats = torch.tensor([2, 3, 5])
out = torch.repeat_interleave(A, num_repeats.cuda(), dim=0)
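As a side note, when the total output length is already known, the existing output_size keyword of repeat_interleave lets the kernel skip the device-to-host copy otherwise needed to compute the output shape; a minimal sketch:

A = torch.randn(3, device='cuda')
num_repeats = torch.tensor([2, 3, 5], device='cuda')
# output_size is the sum of the repeats (10 here); providing it avoids the
# stream synchronization needed to calculate the output shape.
out = torch.repeat_interleave(A, num_repeats, dim=0, output_size=10)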

Alternatives

No response

Additional context

No response

cc @ptrblck

vadimkantorov (Contributor) commented Sep 10, 2023

@Chillee maybe that's also part of why repeat_interleave is slow: #31980; also a bit related: #73175

lezcano (Collaborator) commented Sep 10, 2023

On point 2, see data-apis/array-api#654. The array API will have repeats be a tuple.

vadimkantorov (Contributor) commented Sep 10, 2023

I hope that a tensor/array can also be accepted where a tuple is expected. This might be important for PyTorch, where we can get some minor improvements if the counts are already a tensor (obtained from some computation with arrays) and are already on the target device.

Chillee (Collaborator, Author) commented Sep 10, 2023

@vadimkantorov repeat_interleave currently takes in a tensor but not tuples. In general, it's not always a good idea to take in tensors where we currently take tuples, since they might require device-to-host synchronizations anyways (for example, view).
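A rough illustration of that point (hypothetical usage, not an existing overload): even if shape-like arguments could be CUDA tensors, the host would still need their concrete values to build the output's sizes and strides, so a device-to-host copy happens somewhere:

x = torch.randn(10, device='cuda')
shape = torch.tensor([2, 5], device='cuda')
# view() needs concrete integers on the host to compute sizes/strides, so
# reading them out of a CUDA tensor (here via tolist()) is itself a sync.
y = x.view(*shape.tolist())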

drisspg added the module: performance, module: cuda, and triaged labels Sep 11, 2023
kgryte commented Sep 21, 2023

The array API will have repeats be a tuple.

@lezcano Currently, most array libraries support a one-dimensional array for repeats. For that proposal, based on recent feedback, I think we're leaning toward supporting both sequences and arrays.

DeNeutoy (Contributor) commented Oct 20, 2023

+1 to this issue - this is causing some non-trivial performance issues for us:

[Screenshot attached in the original comment]

For context, this repeat_interleave is very large (e.g. the repeats tensor may be of size ~128, with values up to 4000).

I don't think this is causing slowdowns exactly, but we do train multi-GPU models, where random syncing inside CUDA ops is presumably more of a problem.
