[WIP][ptd][nccl] use current-stream as nccl-stream under async=False mode by cenzhaometa · Pull Request #147820 · pytorch/pytorch · GitHub

[WIP][ptd][nccl] use current-stream as nccl-stream under async=False mode #147820


Draft
wants to merge 1 commit into base: main

Conversation

cenzhaometa
Contributor
@cenzhaometa cenzhaometa commented Feb 25, 2025

Summary:
PTD current workflow:

  • PTD creates its own dedicated ncclStream for comm operation
  • it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).

This diff:

  • introduces a new env TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1
  • when it's specified, PTD uses current-stream as the nccl-stream and avoids stream sync

This helps shave off 50% of the CPU overhead (70us -> 35us), which reduces the total CPU+GPU time from 230us to 195us, about 15%.
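For illustration, a minimal sketch of enabling the knob in a plain `torch.distributed` script (assuming a standard launcher such as `torchrun` or MPI provides RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT; only the env variable name comes from this PR, the rest is illustrative):

```python
import os
# Opt in before the process group is created (env knob introduced by this PR).
os.environ["TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM"] = "1"

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

x = torch.ones(10 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")  # ~20 MB payload

# With the knob set, the collective is enqueued on the current (compute) stream,
# so no extra stream dependency / cudaStreamSync is issued per call.
dist.all_reduce(x)

torch.cuda.synchronize()
dist.destroy_process_group()
```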

Test Plan:

before

[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;" --size-start-profiler 20M

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-0.Feb_24_16_19_28.354787.pt.trace.json.gz&bucket=hpc_traces

{F1975408857}

  • c10d::allreduce_(69us)
    • cudaStreamSync (23us)
    • nccl::all_reduce(26us)

after

[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-4.Feb_24_16_22_56.534269.pt.trace.json.gz&bucket=hpc_traces

{F1975408962}

  • c10d::allreduce_(37us)
    • cudaStreamSync (gone)
    • nccl::all_reduce(20us)

Differential Revision: D70135605

Resolves #147729

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Feb 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147820

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit fa36ae3 with merge base 10ffd94:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Feb 25, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70135605


cenzhaometa added a commit to cenzhaometa/pytorch that referenced this pull request Feb 27, 2025
…rch#147820)

Summary:

PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).

This diff:
- introduces a new env `TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1`
- when it's specified, PTD uses current-stream as the nccl-stream and avoids stream sync

This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU+GPU time from **230us to 195us**, about 15%.

Test Plan:
# AMD
> before
```
[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-0.Feb_24_16_19_28.354787.pt.trace.json.gz&bucket=hpc_traces

 {F1975408857} 
- c10d::allreduce_(69us)
  - cudaStreamSync (23us)
  - nccl::all_reduce(26us)

> after
```
[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-4.Feb_24_16_22_56.534269.pt.trace.json.gz&bucket=hpc_traces

 {F1975408962} 

- c10d::allreduce_(37us)
  - cudaStreamSync (gone)
  - nccl::all_reduce(20us)

# NV
> before
```
[cenzhao@devgpu019.prn3 ~/fbsource/fbcode (e3f64263c)]$ buck2 run @//mode/opt -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu019.prn3.facebook.com/rank-2.Feb_25_11_11_28.3328768.pt.trace.json.gz&bucket=hpc_traces

 {F1975437097} 
- c10d::allreduce_ (62us)
  - cudaStreamWait (0us)
  - nccl::all_reduce (47us)

> after
```
[cenzhao@devgpu019.prn3 ~/fbsource/fbcode (e3f64263c)]$ buck2 run @//mode/opt -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu019.prn3.facebook.com/rank-4.Feb_25_11_17_05.3469865.pt.trace.json.gz&bucket=hpc_traces

 {F1975437192} 

- c10d::allreduce_ (62us)
  - cudaStreamWait (gone)
  - nccl::all_reduce (53us)

Differential Revision: D70135605
…ytorch#147820)

Summary:

PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).

This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG

This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU+GPU time from **230us to 195us**, about 15%.

Differential Revision: D70135605
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70135605

@kwen2501 kwen2501 changed the title [ptd][nccl] provide a knob to use current-stream as nccl-stream [ptd][nccl] use current-stream as nccl-stream under async=False mode Feb 27, 2025
@pytorch-bot pytorch-bot bot added the ciflow/trunk label Feb 27, 2025
Contributor
@wconstab wconstab left a comment


LGTM except the sparseIndices line.

@@ -134,6 +134,8 @@ class BroadcastOptions:
class AllreduceOptions:
    reduceOp: ReduceOp
    timeout: timedelta
    asyncOp: bool
    sparseIndices: Optional[Tensor]
Contributor


Is this line added by mistake?

@skyw
Contributor
skyw commented Feb 28, 2025

Does "current stream" also include stream context under the with clause? Which sets "current stream" under the context manager.

@kwen2501
Contributor
kwen2501 commented Mar 3, 2025

This PR is still WIP; it is not functional yet.

@kwen2501 kwen2501 changed the title [ptd][nccl] use current-stream as nccl-stream under async=False mode [WIP][ptd][nccl] use current-stream as nccl-stream under async=False mode Mar 3, 2025
@kwen2501 kwen2501 marked this pull request as draft March 3, 2025 16:36
@kwen2501
Contributor
kwen2501 commented Mar 3, 2025

Does "current stream" also include stream context under the with clause? Which sets "current stream" under the context manager.

Yes. One caution: the user would need to manage tensor lifetime themselves if a tensor created on the main stream is used on this "with" context stream and is not used back on the main stream afterwards.
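As a rough illustration of that caution (a minimal sketch, assuming an initialized NCCL process group and the current-stream behavior from this PR being active):

```python
import torch
import torch.distributed as dist  # assumes dist.init_process_group("nccl") was already called

main = torch.cuda.current_stream()
side = torch.cuda.Stream()

x = torch.randn(1 << 20, device="cuda")  # produced on the main stream

side.wait_stream(main)                   # make `side` wait until x is ready
with torch.cuda.stream(side):
    # With this PR, the collective runs on the *current* stream, i.e. `side` here.
    dist.all_reduce(x)
    # If x is never used again on the main stream, tell the caching allocator that
    # `side` used it, so its memory is not recycled before the collective finishes.
    x.record_stream(side)
```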

@ngimel
Collaborator
ngimel commented Mar 3, 2025

Generally, relying on env variables to control behavior is very coarse-grained and not particularly convenient (e.g., when spawning processes, the environment has to be passed along to them). Can we make this a property of the PG?
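For reference, the direction the later commits in this thread take (see the relanded change below) is to key the behavior off the per-call `async_op` argument rather than an env variable; a minimal usage sketch, assuming an initialized NCCL process group:

```python
import torch
import torch.distributed as dist  # assumes an initialized NCCL process group

x = torch.randn(1024, device="cuda")

# async_op=False (the default): with the relanded change, the collective launches
# directly on the current stream, so no extra stream sync is needed.
dist.all_reduce(x, async_op=False)

# async_op=True: keeps the existing behavior -- a dedicated comm stream that waits
# on the current stream; the returned work handle should be waited on.
work = dist.all_reduce(x, async_op=True)
work.wait()
```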

kwen2501 added a commit that referenced this pull request Mar 5, 2025
…147820)

Summary:

PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).

This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG

This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU+GPU time from **230us to 195us**, about 15%.

Differential Revision: D70135605

[PGNCCL] Make avoid-record-stream default

[c10d] Add asyncOp argument to Ops

Change python side wait

Pass asyncOp at ProcessGroup level

Watchdog unstashing tensors as a safety net

lint

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 5, 2025
ghstack-source-id: 4793680
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 6, 2025
ghstack-source-id: ac5295d
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 6, 2025
ghstack-source-id: 0f222d3
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 6, 2025
ghstack-source-id: 8306cce
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 7, 2025
ghstack-source-id: c31ca32
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 7, 2025
ghstack-source-id: dfc758a
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 8, 2025
ghstack-source-id: baec42c
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 8, 2025
ghstack-source-id: 96d18e0
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 9, 2025
ghstack-source-id: 27b228c
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 21, 2025
ghstack-source-id: f391590
Pull Request resolved: #148590

Stash tensors for reduce_scatter_v and all_gather_v

ghstack-source-id: f391590
Pull Request resolved: #149753
kwen2501 added a commit that referenced this pull request Mar 28, 2025
ghstack-source-id: e4b48e5
Pull Request resolved: #148590

Stash tensors for reduce_scatter_v and all_gather_v

ghstack-source-id: e4b48e5
Pull Request resolved: #149753

[c10d] Move unstashing from watchdog to main thread

ghstack-source-id: e4b48e5
Pull Request resolved: #150079

[PGNCCL][BE] Merge mutex into TensorShelf for encapsulation

ghstack-source-id: e4b48e5
Pull Request resolved: #150130
kwen2501 added a commit that referenced this pull request Apr 1, 2025
…irely

Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of on a trampoline stream that is joined back afterwards.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](#147168 (comment)).
5. Profiles in async_op=False mode will look different -- collective kernels show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU+GPU time from **230us to 195us**, about 15%.
Differential Revision: D70135605

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: #149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: #150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: #150130

[ghstack-poisoned]
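To illustrate point 2 of the description above (replacing `record_stream` with CPU-side stashing), here is a minimal sketch of the idea -- not the actual c10d `TensorShelf` implementation, just the pattern it describes:

```python
import threading

class TensorShelf:
    """Sketch of CPU-side stashing: hold references to collective inputs until the
    corresponding work is waited on, so the caching allocator cannot recycle their
    memory while the collective may still be reading or writing it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._shelf = {}  # work id -> list of stashed tensors

    def stash(self, work_id, tensors):
        with self._lock:
            self._shelf[work_id] = list(tensors)

    def unstash(self, work_id):
        # Normally called from work.wait(); the watchdog also unstashes after it
        # observes completion, as a safety net if the user never calls wait().
        with self._lock:
            self._shelf.pop(work_id, None)
```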
kwen2501 added a commit that referenced this pull request Apr 1, 2025
ghstack-source-id: ce103fc
Pull Request resolved: #150398
pytorchmergebot pushed a commit that referenced this pull request Apr 1, 2025
Pull Request resolved: #150398
Approved by: https://github.com/atalman
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Pull Request resolved: pytorch#150398
Approved by: https://github.com/atalman
Contributor
github-actions bot commented May 2, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 2, 2025
github-merge-queue bot pushed a commit to intel/torch-xpu-ops that referenced this pull request May 16, 2025
Refer to pytorch/pytorch#147820 and pytorch/pytorch#150398.
To launch kernels on the current stream and reduce the CPU overhead
introduced by `recordStream`, an `async` option is introduced.

For example, in an `allreduce` operation between two ranks:

- `rank0` corresponds to `device0`, using the current device's `stream0`
to create the communicator and preserving `stream0`.

When `async = true`:
- Both `rank0` and `rank1` perform the collective using `stream0`, which
is associated with the communicator.
- To prevent potential reads by `stream0` from unready tensors (e.g.,
from `rank1`), synchronization with the current stream is required.
- After the collective completes, to prevent premature release of the
input tensors, `recordStream` must be used for stream tracking, or the
tensors need to be temporarily stored (e.g., in `reduce_scatter` or
`all2all`).

When `async = false`:
- Both `rank0` and `rank1` use their respective **current streams** for
collectives (i.e., `rank0` uses `stream0`, `rank1` uses `stream1`).
- In this case, the collective op handles synchronization implicitly.

Previously, we defaulted to `async = true`. Now, the `async` option is
explicitly introduced and set to `false` by default, leveraging the
current stream to avoid the overhead of stream synchronization.
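A rough Python-level sketch of the two paths described above (not the actual ProcessGroupNCCL/XCCL implementation, just the stream bookkeeping each path implies; the function and parameter names are illustrative):

```python
import torch

def collective_async_true(tensor, comm_stream, launch_collective):
    # Dedicated comm stream: wait on the producer (current) stream, run the
    # collective there, and record the tensor so the caching allocator does not
    # reuse its memory while the comm stream is still using it.
    current = torch.cuda.current_stream()
    comm_stream.wait_stream(current)
    with torch.cuda.stream(comm_stream):
        launch_collective(tensor)
        tensor.record_stream(comm_stream)

def collective_async_false(tensor, launch_collective):
    # Current-stream path: the collective is enqueued on whatever stream is
    # current, so ordering with producers/consumers is implicit and no
    # record_stream or stashing is needed in the default-stream case.
    launch_collective(tensor)
```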

---------

Co-authored-by: mengfei25 <mengfei.li@Intel.com>
Labels
ciflow/trunk, fb-exported, oncall: distributed, release notes: distributed (c10d), Stale
Development

Successfully merging this pull request may close these issues.

[RFE][Distributed][NCCL] A feature request for stream management API in PG NCCL
6 participants