[WIP][ptd][nccl] use current-stream as nccl-stream under async=False mode #147820
Conversation
This pull request was exported from Phabricator. Differential Revision: D70135605
Branch force-pushed from b402084 to 9bac479.
…rch#147820)

Summary:

PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operations
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).

This diff:
- introduces a new env var `TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1`
- when it is specified, PTD uses the current stream as the NCCL stream and avoids the stream sync

This shaves off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU/GPU time from **230us to 195us** (~15%).

Test Plan:

# AMD

> before

```
[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;" --size-start-profiler 20M
```

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-0.Feb_24_16_19_28.354787.pt.trace.json.gz&bucket=hpc_traces {F1975408857}
- c10d::allreduce_ (69us)
- cudaStreamSync (23us)
- nccl::all_reduce (26us)

> after

```
[cenzhao@devgpu039.atn3 ~/fbsource/fbcode (2265d32f0)]$ buck2 run @//mode/opt-amd-gpu -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;MSCCL_ALGO_DIR=/data/users/${USER}/fbsource/third-party/rccl/develop/tools/msccl-algorithms;RCCL_MSCCLPP_THRESHOLD=$((128*1024*1024));RCCL_MSCCLPP_ENABLE=1;TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M
```

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-4.Feb_24_16_22_56.534269.pt.trace.json.gz&bucket=hpc_traces {F1975408962}
- c10d::allreduce_ (37us)
- cudaStreamSync (gone)
- nccl::all_reduce (20us)

# NV

> before

```
[cenzhao@devgpu019.prn3 ~/fbsource/fbcode (e3f64263c)]$ buck2 run @//mode/opt -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;" --size-start-profiler 20M
```

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu019.prn3.facebook.com/rank-2.Feb_25_11_11_28.3328768.pt.trace.json.gz&bucket=hpc_traces {F1975437097}
- c10d::allreduce_ (62us)
- cudaStreamWait (0us)
- nccl::all_reduce (47us)

> after

```
[cenzhao@devgpu019.prn3 ~/fbsource/fbcode (e3f64263c)]$ buck2 run @//mode/opt -c fbcode.split-dwarf=True //param_bench/train/comms/pt:launcher -- --launcher mpi --nnode 1 --collective all_reduce --b 20M --e 20M --data-type bfloat16 --backend nccl --n 100 --w 5 --envs "NCCL_DEBUG_FILE=/tmp/dedicated_log_rccl.%h.%p.log;NCCL_DEBUG=INFO;NCCL_DEBUG_SUBSYS=INIT,COLL;TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1" --size-start-profiler 20M
```

https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu019.prn3.facebook.com/rank-4.Feb_25_11_17_05.3469865.pt.trace.json.gz&bucket=hpc_traces {F1975437192}
- c10d::allreduce_ (62us)
- cudaStreamWait (gone)
- nccl::all_reduce (53us)

Differential Revision: D70135605
…ytorch#147820)

Summary:

PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operations
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).

This diff:
- async=False [default]: use the current stream as the NCCL stream and avoid the stream-sync overhead
- async=True: retain the existing logic, i.e. create a new NCCL stream and let it wait on the current stream to ensure tensors are ready
- pass `async` down from c10d to the NCCL PG

This shaves off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU/GPU time from **230us to 195us** (~15%).

Test Plan: same AMD and NV benchmarks and traces as above.

Differential Revision: D70135605
Branch force-pushed from 9bac479 to fa36ae3.
Lgtm except the sparseIndices line
@@ -134,6 +134,8 @@ class BroadcastOptions:
 class AllreduceOptions:
     reduceOp: ReduceOp
     timeout: timedelta
+    asyncOp: bool
+    sparseIndices: Optional[Tensor]
Is this line added by mistake?
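For context on this struct, a hedged sketch of how these options objects are consumed through the ProcessGroup binding (the import path can vary by version, and the `asyncOp` field only exists with this PR's diff, so it is left commented out):

```python
from datetime import timedelta

import torch
import torch.distributed as dist
from torch.distributed.distributed_c10d import AllreduceOptions

# assumes an initialized NCCL process group and a current CUDA device
pg = dist.group.WORLD
t = torch.ones(1024, device="cuda")

opts = AllreduceOptions()
opts.reduceOp = dist.ReduceOp.SUM
opts.timeout = timedelta(seconds=60)
# opts.asyncOp = False  # field added in this diff; sparseIndices is the one questioned above

work = pg.allreduce([t], opts)  # the c10d op signature the options feed into
work.wait()
```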
Does "current stream" also include stream context under the with clause? Which sets "current stream" under the context manager. |
This PR is still WIP; it is not functional yet.
Yes. One caution: the user would need to manage tensor lifetime themselves if a tensor created on the main stream is used on this `with`-context stream and is not used again on the main stream.
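A minimal sketch of the pattern being discussed (variable names and sizes are illustrative): run the collective under a non-default stream context, and either keep using the tensor on the main stream afterwards or call `record_stream` so the caching allocator does not recycle its memory while the side stream may still be using it.

```python
import torch
import torch.distributed as dist

# assumes an initialized NCCL process group and a current CUDA device
comm_stream = torch.cuda.Stream()
x = torch.ones(1 << 20, device="cuda")  # allocated on the default (main) stream

comm_stream.wait_stream(torch.cuda.current_stream())  # make x's writes visible to comm_stream
with torch.cuda.stream(comm_stream):
    # inside the context, comm_stream is the "current stream"; under this PR's
    # async=False path, the collective would be launched directly on it
    dist.all_reduce(x)
    # if x is never used on the main stream again, tell the allocator that
    # comm_stream is still using its memory before x goes out of scope
    x.record_stream(comm_stream)

torch.cuda.current_stream().wait_stream(comm_stream)  # rejoin before reusing x on main
```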
Generally, relying on environment variables to control behavior is very coarse-grained and not particularly convenient (e.g., when spawning processes, the environment has to be passed along with them). Can we make this a property of the PG?
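To illustrate the reviewer's point about coarse granularity, a sketch of how the env-var form of the switch has to be threaded through process spawning (the rendezvous address/port and process count are placeholders); a per-process-group option would instead be passed at group construction time.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # spawned children inherit the parent's environment, including the flag set below
    torch.cuda.set_device(rank)
    dist.init_process_group(
        "nccl",
        rank=rank,
        world_size=world_size,
        init_method="tcp://127.0.0.1:29500",  # placeholder rendezvous address
    )
    dist.all_reduce(torch.ones(1, device="cuda"))
    dist.destroy_process_group()


if __name__ == "__main__":
    # process-wide, coarse-grained switch: it has to be set before spawning workers
    os.environ["TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM"] = "1"
    mp.spawn(worker, args=(2,), nprocs=2)
```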
…irely (#150398)

Relanding #148590 due to a merge conflict.

This PR has multiple changes to `ProcessGroupNCCL` (which unfortunately are related):
1. When async_op=False, we directly launch the collective on the "current" stream, instead of using a trampoline stream and joining back.
   - Resolves #147729
   - Resolves #146881
   - Also saves two event syncs (which have overhead in the case of HIP) and one pybind call when we call `work.wait()` in distributed_c10d.py on behalf of the user.
2. Entirely remove `record_stream` and use CPU-side stashing to manage tensor lifetime against recycling.
   - Resolves #147168
3. Remove tensor lifetime management when async_op=False; only use it when async_op=True.
4. To guard against the user not calling `work.wait()`, we ask the watchdog to unstash tensors after detecting completion of collectives, to prevent us from holding references to tensors forever. This is a safety net rather than a service guarantee; see the discussion [here](#147168 (comment)).
5. Profiles in async_op=False mode will look different: collective kernels show up on the same line as compute kernels.

Joint work with @cenzhaometa, who wants to remove the event-sync overhead.

Squashed contents:
* [ptd][nccl] use current-stream as nccl-stream under async=False mode (#147820)
  PTD current workflow:
  - PTD creates its own dedicated `ncclStream` for comm operations
  - it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective
  Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
  This diff:
  - async=False [default]: use the current stream as the NCCL stream and avoid the stream-sync overhead
  - async=True: retain the existing logic, i.e. create a new NCCL stream and let it wait on the current stream to ensure tensors are ready
  - pass `async` down from c10d to the NCCL PG
  This shaves off 50% of the CPU overhead **(70us -> 35us)**, which reduces total CPU/GPU time from **230us to 195us** (~15%).
* [PGNCCL] Make avoid-record-stream default
* [c10d] Add asyncOp argument to Ops
* Change python side wait
* Pass asyncOp at ProcessGroup level
* Watchdog unstashing tensors as a safety net
* Stash tensors for reduce_scatter_v and all_gather_v
  Pull Request approved: #149753
* [c10d] Move unstashing from watchdog to main thread
  Pull Request approved: #150079
* [PGNCCL][BE] Merge mutex into TensorShelf for encapsulation
  Pull Request approved: #150130

Pull Request resolved: #150398
Approved by: https://github.com/atalman
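A minimal sketch (not the actual `TensorShelf` implementation) of the CPU-side stashing idea described in point 2 above: hold references to the collective's inputs until the work is known to be complete, instead of calling `record_stream` on them.

```python
import threading

import torch


class TensorShelfSketch:
    """Illustrative only: keep tensors alive until the collective completes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._shelf: list[torch.Tensor] = []

    def stash(self, tensors: list[torch.Tensor]) -> None:
        # Holding a Python reference prevents the caching allocator from
        # recycling the memory while the NCCL kernel may still be reading it.
        with self._lock:
            self._shelf.extend(tensors)

    def unstash(self) -> None:
        # Called once the work is complete, e.g. from wait() or, as a safety
        # net, from a watchdog that observes the work's completion.
        with self._lock:
            self._shelf.clear()
```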
Looks like this PR hasn't been updated in a while, so we're going to go ahead and mark this as Stale.
Refer to pytorch/pytorch#147820 and pytorch/pytorch#150398.

To launch kernels on the current stream and reduce the CPU overhead introduced by `recordStream`, an `async` option is introduced. For example, in an `allreduce` operation between two ranks:
- `rank0` corresponds to `device0`, using the current device's `stream0` to create the communicator and preserving `stream0`.

When `async = true`:
- Both `rank0` and `rank1` perform the collective using `stream0`, which is associated with the communicator.
- To prevent potential reads by `stream0` from unready tensors (e.g., from `rank1`), synchronization with the current stream is required.
- After the collective completes, to prevent premature release of the input tensors, `recordStream` must be used for stream tracking, or the tensors need to be temporarily stored (e.g., in `reduce_scatter` or `all2all`).

When `async = false`:
- Both `rank0` and `rank1` use their respective **current streams** for collectives (i.e., `rank0` uses `stream0`, `rank1` uses `stream1`).
- In this case, the collective op handles synchronization implicitly.

Previously, we defaulted to `async = true`. Now, the `async` option is explicitly introduced and set to `false` by default, leveraging the current stream to avoid the overhead of stream synchronization.

---------

Co-authored-by: mengfei25 <mengfei.li@Intel.com>
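A stream-level sketch of the two modes described above (assumes a CUDA device; no collective is actually issued, the comments mark where the NCCL kernel would be enqueued in each mode):

```python
import torch

compute = torch.cuda.current_stream()
comm = torch.cuda.Stream()

# async = true style: a dedicated comm stream first waits on the producer stream
comm.wait_stream(compute)              # input tensors must be ready before the collective
with torch.cuda.stream(comm):
    pass                               # collective kernel would be enqueued on `comm` here
compute.wait_stream(comm)              # consumers on the compute stream wait for the result
# ...and inputs must be kept alive (recordStream or stashing) until `comm` is done with them

# async = false style: the collective is enqueued directly on the compute (current)
# stream, so no cross-stream waits or extra lifetime tracking are needed
```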
Summary:
PTD current workflow:
- PTD creates its own dedicated `ncclStream` for comm operations
- it first adds a dependency on the current stream (typically the compute stream) to ensure tensors are ready before invoking the collective

Such stream synchronization becomes expensive in the inference world (CPU overhead: 70us vs. GPU kernel time: 160us).
This diff:
- introduces a new env var `TORCH_NCCL_USE_CURRENT_STREAM_AS_NCCL_STREAM=1`
- when it is specified, PTD uses the current stream as the NCCL stream and avoids the stream sync
This shaves off 50% of the CPU overhead (70us -> 35us), which reduces total CPU/GPU time from 230us to 195us (~15%).
Test Plan:
before
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-0.Feb_24_16_19_28.354787.pt.trace.json.gz&bucket=hpc_traces
{F1975408857}
after
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devgpu039.atn3.facebook.com/rank-4.Feb_24_16_22_56.534269.pt.trace.json.gz&bucket=hpc_traces
{F1975408962}
Differential Revision: D70135605
Resolves #147729
cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o