[WIP][ptd][nccl] use current-stream as nccl-stream under async=False mode by cenzhaometa · Pull Request #147820 · pytorch/pytorch · GitHub

[WIP][ptd][nccl] use current-stream as nccl-stream under async=False mode #147820


Draft
wants to merge 1 commit into base: main

Conversation

cenzhaometa
Contributor
@cenzhaometa cenzhaometa commented Feb 25, 2025

Summary:
PTD current workflow:

  • PTD creates its own dedicated ncclStream for comm operation
  • it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).

This diff:

  • introduces a new env TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1
  • when it's specified, PTD uses current-stream as the nccl-stream and avoids stream sync

This helps shave off 50% of the CPU overhead (70us -> 35us), which reduces the total CPU+GPU time from 230us to 195us, about 15%.
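For illustration, a minimal sketch of enabling the knob in a plain `torch.distributed` script (assuming a standard launcher such as `torchrun` or MPI provides RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT; only the env variable name comes from this PR, the rest is illustrative):

```python
import os
# Opt in before the process group is created (env knob introduced by this PR).
os.environ["TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM"] = "1"

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

x = torch.ones(10 * 1024 * 1024, dtype=torch.bfloat16, device="cuda")  # ~20 MB payload

# With the knob set, the collective is enqueued on the current (compute) stream,
# so no extra stream dependency / cudaStreamSync is issued per call.
dist.all_reduce(x)

torch.cuda.synchronize()
dist.destroy_process_group()
```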

Test Plan:

before

[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;" --size-start-profiler 20M

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-0.Feb_24_16_19_28.354787.pt.trace.json.gz&bucket=hpc_traces

{F1975408857}

  • c10d::allreduce_(69us)
    • cudaStreamSync (23us)
    • nccl::all_reduce(26us)

after

[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-4.Feb_24_16_22_56.534269.pt.trace.json.gz&bucket=hpc_traces

{F1975408962}

  • c10d::allreduce_(37us)
    • cudaStreamSync (gone)
    • nccl::all_reduce(20us)

Differential Revision: D70135605

Resolves #147729

cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

pytorch-bot bot commented Feb 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147820

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit fa36ae3 with merge base 10ffd94:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels Feb 25, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70135605


cenzhaometa added a commit to cenzhaometa/pytorch that referenced this pull request Feb 27, 2025
…rch#147820)

Summary:

PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).

This diff:
- introduces a new env `TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1`
- when it's specified, PTD uses current-stream as the nccl-stream and avoids stream sync

This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU+GPU time from **230us to 195us**, about 15%.

Test Plan:
# AMD
> before
```
[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-0.Feb_24_16_19_28.354787.pt.trace.json.gz&bucket=hpc_traces

 {F1975408857} 
- c10d::allreduce_(69us)
  - cudaStreamSync (23us)
  - nccl::all_reduce(26us)

> after
```
[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-4.Feb_24_16_22_56.534269.pt.trace.json.gz&bucket=hpc_traces

 {F1975408962} 

- c10d::allreduce_(37us)
  - cudaStreamSync (gone)
  - nccl::all_reduce(20us)

# NV
> before
```
[cenzhao@devgpu019.prn3 ~/fbsource/fbcode (e3f64263c)]$ buck2 run @//mode/opt -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu019.prn3.facebook.com/rank-2.Feb_25_11_11_28.3328768.pt.trace.json.gz&bucket=hpc_traces

 {F1975437097} 
- c10d::allreduce_ (62us)
  - cudaStreamWait (0us)
  - nccl::all_reduce (47us)

> after
```
[cenzhao@devgpu019.prn3 ~/fbsource/fbcode (e3f64263c)]$ buck2 run @//mode/opt -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M
```
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu019.prn3.facebook.com/rank-4.Feb_25_11_17_05.3469865.pt.trace.json.gz&bucket=hpc_traces

 {F1975437192} 

- c10d::allreduce_ (62us)
  - cudaStreamWait (gone)
  - nccl::all_reduce (53us)

Differential Revision: D70135605
…ytorch#147820)

Summary:

PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).

This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG

This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU+GPU time from **230us to 195us**, about 15%.

Differential Revision: D70135605
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D70135605

@kwen2501 kwen2501 changed the title [ptd][nccl] provide a knob to use current-stream as nccl-stream [ptd][nccl] use current-stream as nccl-stream under async=False mode Feb 27, 2025
@pytorch-bot pytorch-bot bot added the ciflow/trunk label Feb 27, 2025
Contributor
@wconstab wconstab left a comment


LGTM except the sparseIndices line.

@@ -134,6 +134,8 @@ class BroadcastOptions:
class AllreduceOptions:
    reduceOp: ReduceOp
    timeout: timedelta
    asyncOp: bool
    sparseIndices: Optional[Tensor]
Contributor


Is this line added by mistake?

@skyw
Contributor
skyw commented Feb 28, 2025

Does "current stream" also include stream context under the with clause? Which sets "current stream" under the context manager.

@kwen2501
Contributor
kwen2501 commented Mar 3, 2025

This PR is still WIP; it is not functional yet.

@kwen2501 kwen2501 changed the title [ptd][nccl] use current-stream as nccl-stream under async=False mode [WIP][ptd][nccl] use current-stream as nccl-stream under async=False mode Mar 3, 2025
@kwen2501 kwen2501 marked this pull request as draft March 3, 2025 16:36
@kwen2501
Contributor
kwen2501 commented Mar 3, 2025

Does "current stream" also include stream context under the with clause? Which sets "current stream" under the context manager.

Yes. One caution: the user would need to manage tensor lifetime themselves if a tensor created on the main stream is used on this "with" context stream and is not used back on the main stream afterwards.
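As a rough illustration of that caution (a minimal sketch, assuming an initialized NCCL process group and the current-stream behavior from this PR being active):

```python
import torch
import torch.distributed as dist  # assumes dist.init_process_group("nccl") was already called

main = torch.cuda.current_stream()
side = torch.cuda.Stream()

x = torch.randn(1 << 20, device="cuda")  # produced on the main stream

side.wait_stream(main)                   # make `side` wait until x is ready
with torch.cuda.stream(side):
    # With this PR, the collective runs on the *current* stream, i.e. `side` here.
    dist.all_reduce(x)
    # If x is never used again on the main stream, tell the caching allocator that
    # `side` used it, so its memory is not recycled before the collective finishes.
    x.record_stream(side)
```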

@ngimel
Collaborator
ngimel commented Mar 3, 2025

Generally, relying on env variables to control behavior is very coarse-grained and not particularly convenient (e.g., when spawning processes, the environment has to be passed along to them). Can we make this a property of the PG?
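For reference, the direction the later commits in this thread take (see the relanded change below) is to key the behavior off the per-call `async_op` argument rather than an env variable; a minimal usage sketch, assuming an initialized NCCL process group:

```python
import torch
import torch.distributed as dist  # assumes an initialized NCCL process group

x = torch.randn(1024, device="cuda")

# async_op=False (the default): with the relanded change, the collective launches
# directly on the current stream, so no extra stream sync is needed.
dist.all_reduce(x, async_op=False)

# async_op=True: keeps the existing behavior -- a dedicated comm stream that waits
# on the current stream; the returned work handle should be waited on.
work = dist.all_reduce(x, async_op=True)
work.wait()
```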

kwen2501 added a commit that referenced this pull request Mar 5, 2025
…147820)

Summary:

PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).

This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG

This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU+GPU time from **230us to 195us**, about 15%.

Differential Revision: D70135605

[PGNCCL] Make avoid-record-stream default

[c10d] Add asyncOp argument to Ops

Change python side wait

Pass asyncOp at ProcessGroup level

Watchdog unstashing tensors as a safety net

lint

[ghstack-poisoned]
kwen2501 added a commit that referenced this pull request Mar 5, 2025
ghstack-source-id: 4793680
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 6, 2025
ghstack-source-id: ac5295d
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 6, 2025
ghstack-source-id: 0f222d3
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 6, 2025
ghstack-source-id: 8306cce
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 7, 2025
ghstack-source-id: c31ca32
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 7, 2025
ghstack-source-id: dfc758a
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 8, 2025
ghstack-source-id: baec42c
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 8, 2025
ghstack-source-id: 96d18e0
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 9, 2025
ghstack-source-id: 27b228c
Pull Request resolved: #148590
kwen2501 added a commit that referenced this pull request Mar 21, 2025
ghstack-source-id: f391590
Pull Request resolved: #148590

Stash tensors for reduce_scatter_v and all_gather_v

ghstack-source-id: f391590
Pull Request resolved: #149753
kwen2501 added a commit that referenced this pull request Mar 28, 2025
ghstack-source-id: e4b48e5
Pull Request resolved: #148590

Stash tensors for reduce_scatter_v and all_gather_v

ghstack-source-id: e4b48e5
Pull Request resolved: #149753

[c10d] Move unstashing from watchdog to main thread

ghstack-source-id: e4b48e5
Pull Request resolved: #150079

[PGNCCL][BE] Merge mutex into TensorShelf for encapsulation

ghstack-source-id: e4b48e5
Pull Request resolved: #150130
kwen2501 added a commit that referenced this pull request Apr 1, 2025
…irely

Relanding #148590 due to merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of on a trampoline stream that is joined back afterwards.
- Resolves #147729
- Resolves #146881
- Also saves two event syncs (which have overhead in case of HIP) and one pybind when we call `work.wait()` in distributed_c10d.py on behalf of user.
2. Entirely remove `record_stream` and use CPU-side stashing for managing tensor lifetime against recycling.
- Resolves #147168
3. Remove tensor life management when async_op=False; only use it when async_op=True.
4. To guard against user not calling `work.wait()`, we ask watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding reference to tensors forever. This is a safety net, rather than a service guarantee, see discussion [here](#147168 (comment)).
5. Profiles in async_op=False mode will look different -- collective kernels show up on the same line as compute kernels.

Joint work with @cenzhaometa who wants to remove the event sync overhead.

Squashed contents:

* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operation
- it will first add a dependency on current-stream (typically the compute stream) to ensure tensors are ready before invoking collective
Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs GPU kernel time: 160us).
This diff:
- async=False [default], will use current-stream as nccl-stream and avoid the stream-sync overhead
- async=True, will retain existing logic: create new nccl-stream, let it wait on current-stream to ensure tensors are ready
- pass down async from c10d down to NCCL-PG
This helps shave off 50% of the CPU overhead **(70us -> 35us)**, which reduces the total CPU+GPU time from **230us to 195us**, about 15%.
Differential Revision: D70135605

* [PGNCCL] Make avoid-record-stream default

* [c10d] Add asyncOp argument to Ops

* Change python side wait

* Pass asyncOp at ProcessGroup level

* Watchdog unstashing tensors as a safety net

* Stash tensors for reduce_scatter_v and all_gather_v
Pull Request approved: #149753

* [c10d] Move unstashing from watchdog to main thread
Pull Request approved: #150079

* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
Pull Request approved: #150130

[ghstack-poisoned]
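To illustrate point 2 of the description above (replacing `record_stream` with CPU-side stashing), here is a minimal sketch of the idea -- not the actual c10d `TensorShelf` implementation, just the pattern it describes:

```python
import threading

class TensorShelf:
    """Sketch of CPU-side stashing: hold references to collective inputs until the
    corresponding work is waited on, so the caching allocator cannot recycle their
    memory while the collective may still be reading or writing it."""

    def __init__(self):
        self._lock = threading.Lock()
        self._shelf = {}  # work id -> list of stashed tensors

    def stash(self, work_id, tensors):
        with self._lock:
            self._shelf[work_id] = list(tensors)

    def unstash(self, work_id):
        # Normally called from work.wait(); the watchdog also unstashes after it
        # observes completion, as a safety net if the user never calls wait().
        with self._lock:
            self._shelf.pop(work_id, None)
```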
kwen2501 added a commit that referenced this pull request Apr 1, 2025
ghstack-source-id: ce103fc
Pull Request resolved: #150398
pytorchmergebot pushed a commit that referenced this pull request Apr 1, 2025
Pull Request resolved: #150398
Approved by: https://github.com/atalman
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
Pull Request resolved: pytorch#150398
Approved by: https://github.com/atalman
Contributor
github-actions bot commented May 2, 2025

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label May 2, 2025
github-merge-queue bot pushed a commit to intel/torch-xpu-ops that referenced this pull request May 16, 2025
Refer to pytorch/pytorch#147820 and pytorch/pytorch#150398.
To launch kernels on the current stream and reduce the CPU overhead
introduced by `recordStream`, an `async` option is introduced.

For example, in an `allreduce` operation between two ranks:

- `rank0` corresponds to `device0`, using the current device's `stream0`
to create the communicator and preserving `stream0`.

When `async = true`:
- Both `rank0` and `rank1` perform the collective using `stream0`, which
is associated with the communicator.
- To prevent potential reads by `stream0` from unready tensors (e.g.,
from `rank1`), synchronization with the current stream is required.
- After the collective completes, to prevent premature release of the
input tensors, `recordStream` must be used for stream tracking, or the
tensors need to be temporarily stored (e.g., in `reduce_scatter` or
`all2all`).

When `async = false`:
- Both `rank0` and `rank1` use their respective **current streams** for
collectives (i.e., `rank0` uses `stream0`, `rank1` uses `stream1`).
- In this case, the collective op handles synchronization implicitly.

Previously, we defaulted to `async = true`. Now, the `async` option is
explicitly introduced and set to `false` by default, leveraging the
current stream to avoid the overhead of stream synchronization.
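A rough Python-level sketch of the two paths described above (not the actual ProcessGroupNCCL/XCCL implementation, just the stream bookkeeping each path implies; the function and parameter names are illustrative):

```python
import torch

def collective_async_true(tensor, comm_stream, launch_collective):
    # Dedicated comm stream: wait on the producer (current) stream, run the
    # collective there, and record the tensor so the caching allocator does not
    # reuse its memory while the comm stream is still using it.
    current = torch.cuda.current_stream()
    comm_stream.wait_stream(current)
    with torch.cuda.stream(comm_stream):
        launch_collective(tensor)
        tensor.record_stream(comm_stream)

def collective_async_false(tensor, launch_collective):
    # Current-stream path: the collective is enqueued on whatever stream is
    # current, so ordering with producers/consumers is implicit and no
    # record_stream or stashing is needed in the default-stream case.
    launch_collective(tensor)
```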

---------

Co-authored-by: mengfei25 <mengfei.li@Intel.com>
Labels
ciflow/trunk, fb-exported, oncall: distributed, release notes: distributed (c10d), Stale
Development

Successfully merging this pull request may close these issues.

[RFE][Distributed][NCCL] A feature request for stream management API in PG NCCL
6 participants