You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[Reland] Launch kernel on current stream & remove record_stream entirely
Relanding #148590 due to merge conflict.
This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on "current" stream, instead of a trampoline stream and join back.
- Resolves#147729
- Resolves#146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves#147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](#147168 (comment)).
5. Profile in async_op=False mode would look different -- collective kernels would show up in the same line and compute kernels.
Joint work with @cenzhaometa who wants to remove the event sync overhead.
Squashed contents:
* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
such stream synchronization become expensive in Inference world (cpu overhead: 70us vs GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
this helps shave off 50% CPU overhead **(70us -> 35us)**, which reduce total CPU/GPU from **230us to 195us by 15%**
Differential Revision: D70135605
* [PGNCCL] Make avoid-record-stream default
* [c10d] Add asyncOp argument to Ops
* Change python side wait
* Pass asyncOp at ProcessGroup level
* Watchdog unstashing tensors as a safety net
* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: #149753
* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: #150079
* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: #150130
[ghstack-poisoned]
0 commit comments