[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style #119421

kwen2501 · 2024-02-07T23:21:21Z

Part 2 and last part of #118674:
Introduce actual "single-device" code change to ProcessGroupNCCL.

assert size == 1 and test refactor have been done in #119099.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225

pytorch-bot · 2024-02-07T23:21:24Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119421

✅ No Failures

As of commit 9b72df7 with merge base cd9a193:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

wconstab · 2024-02-08T04:37:03Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

@@ -416,13 +413,16 @@ class TORCH_API ProcessGroupNCCL : public Backend {

  c10::intrusive_ptr<Work> endCoalescing() override;

+  // For specifying a composite optype, such as ALLGATHER and REDUCE_SCATTER
+  c10::intrusive_ptr<Work> endCoalescing(OpType optype);


hmm this one is not an obvious devices->device side effect

You are right.
At one point, I was testing the non-even all-gather with this PR, which translates into coalesced broadcasts. I found that the flight recorder needs an OpType to name every work, even coalesced work. So this change is there to be friendlier to the flight recorder.
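
The overload discussed above can be pictured with a minimal, self-contained sketch. The `OpType` enumerators, the `Work` struct, the `FakeProcessGroup` class, and the choice of `COALESCED` as the fallback label are stand-ins assumed for illustration; they are not the actual c10d definitions.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for c10d's OpType; only a few illustrative values.
enum class OpType { ALLGATHER, REDUCE_SCATTER, COALESCED };

// Stand-in for a Work handle that remembers which op it represents,
// which is what a tracing tool would use to name the entry.
struct Work {
  OpType opType;
};

std::string opTypeName(OpType t) {
  switch (t) {
    case OpType::ALLGATHER:
      return "ALLGATHER";
    case OpType::REDUCE_SCATTER:
      return "REDUCE_SCATTER";
    case OpType::COALESCED:
      return "COALESCED";
  }
  return "UNKNOWN";
}

// Hypothetical process group with both entry points.
class FakeProcessGroup {
 public:
  // Overload that lets the caller name the composite op.
  std::shared_ptr<Work> endCoalescing(OpType optype) {
    return std::make_shared<Work>(Work{optype});
  }

  // No-argument entry point; assumed here to fall back to a generic label.
  std::shared_ptr<Work> endCoalescing() {
    return endCoalescing(OpType::COALESCED);
  }
};
```

With this shape, a non-even all-gather built from coalesced broadcasts could end with `endCoalescing(OpType::ALLGATHER)` so that the recorded work carries the name of the composite op rather than a generic one.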

wconstab · 2024-02-08T04:38:21Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

      Fn fn,
      PreProcess pre,
      PostProcess post,
      OpType opType,
      const char* profilingTitle = nullptr,
      bool avoidRecordStreams = false);

+  template <typename Fn>
+  c10::intrusive_ptr<Work> collectiveCoalesced(


also this one.. are you sneakin something in ?

No. Long story short: the previous collective(vector) function was used in two ways: (1) multi-device collectives and (2) coalesced collectives. While I can refactor (1) into a single-device signature, I have to keep one helper for (2), which is the collectiveCoalesced(vector) here. Btw, the previous dual use was not clean: it made the meaning of the function vague and required some if-else conditions inside it.
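
The split into two helpers can be sketched as follows. This is a simplified illustration only: `Tensor`, `LaunchStats`, and the function bodies are assumptions made for the example, and the real helpers take more parameters (pre/post processing, `OpType`, profiling title) as the diff hunk shows.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a tensor; only a value is needed for the illustration.
struct Tensor {
  int value = 0;
};

// Counts launches; stands in for the group start/end calls of a real backend.
struct LaunchStats {
  int groups = 0;
  int ops = 0;
};

// Single-device helper: one input, one output, one op.
template <typename Fn>
void collective(Tensor& input, Tensor& output, Fn fn, LaunchStats& stats) {
  fn(input, output);
  ++stats.ops;
}

// Coalesced helper: several tensors on the same device, issued as one group.
template <typename Fn>
void collectiveCoalesced(
    std::vector<Tensor>& inputs,
    std::vector<Tensor>& outputs,
    Fn fn,
    LaunchStats& stats) {
  assert(inputs.size() == outputs.size());
  ++stats.groups; // where a real backend would open a group
  for (std::size_t i = 0; i < inputs.size(); ++i) {
    fn(inputs[i], outputs[i]);
    ++stats.ops;
  }
  // where a real backend would close the group
}
```

Keeping the two paths separate means neither helper needs to branch on whether the vector means "many devices" or "many tensors".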

I have a follow-up plan regarding these two helpers: don't use a helper at all.
That means:
(1) stop using lambda functions, which do not actually make our code shorter but do make it harder to read. The lambda approach also requires packing pre- and post-processing into lambdas. These can be flattened into normal logic flow if we stop using the collective helper.
(2) decompose the functionality in collective() into smaller utilities, such as getComm, getStream and createWork, which can be shared by all collectives. With that, an all-reduce can just flatten into:

comm = getComm(); stream = getStream(); ncclAllreduce(output, input, comm, stream); return createWork();

With this change, both helpers can be gone.
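
A self-contained sketch of what such a flattened all-reduce might look like is given below. The names `getComm`, `getStream` and `createWork` come from the comment above; their bodies, the `Comm`, `Stream` and `Work` structs, and `fakeAllReduce` (a one-rank stand-in for the NCCL call that simply copies input to output) are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct Comm { int rank = 0; };           // stand-in for a communicator handle
struct Stream { int id = 0; };           // stand-in for a CUDA stream
struct Work { bool launched = false; };  // stand-in for the returned work handle

// Stand-in for the NCCL call: with a single rank, an all-reduce
// leaves the data unchanged, so this just copies input to output.
void fakeAllReduce(
    const std::vector<float>& input,
    std::vector<float>& output,
    const Comm& /*comm*/,
    const Stream& /*stream*/) {
  output = input;
}

class FlatProcessGroup {
 public:
  std::shared_ptr<Work> allreduce(
      const std::vector<float>& input,
      std::vector<float>& output) {
    // No lambda and no generic helper: the steps read top to bottom.
    Comm& comm = getComm();
    Stream& stream = getStream();
    fakeAllReduce(input, output, comm, stream);
    return createWork();
  }

 private:
  Comm& getComm() { return comm_; }
  Stream& getStream() { return stream_; }
  std::shared_ptr<Work> createWork() {
    auto work = std::make_shared<Work>();
    work->launched = true;
    return work;
  }

  Comm comm_;
  Stream stream_;
};
```

Any pre- or post-processing would then sit as ordinary statements before or after the launch, rather than being packed into lambdas.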

wconstab · 2024-02-08T04:41:24Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

@@ -137,14 +135,11 @@ ncclRedOpRAII getNcclReduceOp(
 #ifdef ENABLE_NCCL_PREMUL_SUM_SUPPORT
      switch (dataType) {
        case ncclHalf:
-          return unpackPreMulSum<at::Half, ncclHalf>(
-              reduceOp, comm, dev_in_group);
+          return unpackPreMulSum<at::Half, ncclHalf>(reduceOp, comm);


hmm whats unpackPreMulSum

PreMulSum is a ncclReduceOp. Compared to a regular sum reduce, it multiplies each operand by a scalar of the user's choice before summing them. It is useful when gradients might overflow.

unpackPreMulSum here extracts that scalar value from the composite ncclReduceOp datatype.
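
The arithmetic difference between a plain sum and a pre-multiplied sum can be shown on the host with a few lines of C++. This is only a numeric illustration with made-up function names; it does not call NCCL, and the vector stands in for one value contributed by each rank.

```cpp
#include <cassert>
#include <vector>

// Plain sum across ranks: result = x_0 + x_1 + ...
float sumReduce(const std::vector<float>& perRank) {
  float acc = 0.0f;
  for (float x : perRank) {
    acc += x;
  }
  return acc;
}

// Pre-multiplied sum: each operand is scaled before it is added,
// so result = scale * x_0 + scale * x_1 + ...
float preMulSumReduce(const std::vector<float>& perRank, float scale) {
  float acc = 0.0f;
  for (float x : perRank) {
    acc += scale * x;
  }
  return acc;
}
```

For example, with four ranks and a scale of 0.25, the pre-multiplied sum yields the average while keeping every partial sum smaller than in the plain sum, which is the property that helps with overflow in low-precision types.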

wconstab

ok so far so good but i am giving up for tonight. reviewed 7/8 files and up to line 600 in PGNCCL.cpp

suggest you split the PR into 2 or 3 chunks for easier reviewing and lower risk of landing?

kwen2501 · 2024-02-08T15:51:45Z

> ok so far so good but i am giving up for tonight. reviewed 7/8 files and up to line 600 in PGNCCL.cpp
>
> suggest you split the PR into 2 or 3 chunks for easier reviewing and lower risk of landing?

Thanks! Really appreciate that!

Once I change a collective, I have to change the collective() helper; once I change the collective() helper, I have to change all other collectives... Let me know if you have better ideas for splitting.

shuqiangzhang · 2024-02-08T17:27:32Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp

@@ -288,8 +286,7 @@ class TORCH_API ProcessGroupNCCL : public Backend {

    // Wrapper method for the static checkForNCCLErrors which can be overridden
    // for tests.
-    virtual std::exception_ptr checkForNCCLErrors(
-        const std::vector<std::shared_ptr<NCCLComm>>& ncclComms) const;
+    virtual std::exception_ptr checkForNCCLErrors();


nit: why remove 'const'?

Because the compilation would otherwise fail.
Note that we removed the ncclComms argument here, and will be checking self.ncclComm_ instead (which is the same thing). When we touch self.ncclComm_, it seems the compiler complains if const is there.

Yes, since the member variable is passed to a static func, that's fine
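
The exact compiler error is not quoted in the thread. One common way such an error arises, shown here as a generic C++ sketch with made-up names (`Comm`, `checkComm`, `Worker`), is that inside a `const` member function every data member is treated as `const`, so it cannot be passed to a helper that takes a non-const reference.

```cpp
#include <cassert>

// Stand-in for a communicator object.
struct Comm {
  int errorCount = 0;
};

// Free helper that takes the communicator by non-const reference.
int checkComm(Comm& comm) {
  return comm.errorCount;
}

class Worker {
 public:
  // Non-const member function: comm_ has type Comm here, so the call compiles.
  int checkForErrors() {
    return checkComm(comm_);
  }

  // If this were declared `int checkForErrors() const`, comm_ would be seen
  // as `const Comm` inside the body, and binding it to `Comm&` would be a
  // compile error.

 private:
  Comm comm_;
};
```

Whether this is the precise cause in the PR is an assumption; the sketch only shows why dropping `const` is a typical fix once a method reads a member directly instead of receiving it as an argument.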

shuqiangzhang · 2024-02-08T17:52:00Z

torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp

-  }
-
-  return !checkForNCCLErrors(ncclComms_) && finishedGPUExecutionInternal();
+  C10_THROW_ERROR(NotImplementedError, "WorkNCCL::isSuccess() is deprecated");


I am sure you have checked no users are using this API ^_^

It is because of this:

pytorch/torch/csrc/distributed/c10d/init.cpp

Lines 2555 to 2561 in 7315ec7

      .def(
          "is_success",
          [](::c10d::Work& work) -> bool {
            TORCH_WARN_ONCE(
                fmt::format(kDeprecationWarning, "Work::is_success"));
            return work.isSuccess();
          })

That is, it has already been announced as deprecated at the binding site.

See the other reason here: #119099 (comment)

shuqiangzhang · 2024-02-08T20:05:00Z

Long diff, and LGTM; I will let @wconstab take another look. I think this PR also has implications beyond ProcessGroupNCCL, e.g., we might want to extend the API change to the base ProcessGroup layer and apply it to other backends such as Gloo as well, since I see we are doing these Tensor to {Tensor} conversions just to fit the ProcessGroup layer API.

kwen2501 · 2024-02-08T20:34:42Z

That's right!
The signatures of the collective APIs still conform to the multi-device form because those are defined by the Backend class. The Backend class is a shared API definition for all backends (UCC, XLA, Gloo, etc.). Due to its shared nature, changing it would be slower, since it needs agreement from other backend developers. Alternatively, once more backends make the move, we can change it. AFAIK, ProcessGroupUCC intends to make the move too (cc @Aidyn-A).
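
The situation described here, a single-device backend sitting behind a list-based shared interface, can be sketched as below. The `Backend` and `SingleDeviceBackend` classes and the `Tensor` struct are simplified stand-ins assumed for illustration, and the "all-reduce" just doubles the value to mimic two ranks holding equal data.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Stand-in for a tensor.
struct Tensor {
  float value = 0.0f;
};

// Stand-in for a shared backend interface that still takes a list of tensors.
class Backend {
 public:
  virtual ~Backend() = default;
  virtual void allreduce(std::vector<Tensor>& tensors) = 0;
};

// Hypothetical single-device backend: accepts the list-based signature,
// checks that the list has exactly one entry, then works on that tensor.
class SingleDeviceBackend : public Backend {
 public:
  void allreduce(std::vector<Tensor>& tensors) override {
    if (tensors.size() != 1) {
      throw std::invalid_argument("expected exactly one tensor");
    }
    allreduceImpl(tensors[0]);
  }

 private:
  // Pretend two ranks hold the same data, so the sum doubles the value.
  void allreduceImpl(Tensor& tensor) {
    tensor.value *= 2.0f;
  }
};
```

The caller has to wrap its tensor in a one-element vector purely to satisfy the shared signature, which is the Tensor to {Tensor} conversion mentioned in the review.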

kwen2501 · 2024-02-09T17:55:06Z

@pytorchbot merge

DanilBaibak · 2024-02-12T07:42:09Z

Hi @kwen2501! Sorry, I need to revert your PR because it breaks the trunk. Here you can find more information about the issue - distributed/test_c10d_nccl.py::NcclErrorHandlingTest::test_nccl_errors_blocking_nonzero_exit

kwen2501 · 2024-02-12T15:38:37Z

@pytorchbot merge

pytorchmergebot · 2024-02-12T15:40:39Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).



kunalb · 2024-02-13T14:54:56Z

https://github.com/pytorch/pytorch/blame/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp on main seems to be fairly broken (see line 2799), where a function definition looks malformed.
Did this merge badly? (The PR diff and the raw file I see on main look very different.)

@wujingyue

* print bandwidth when perf_debug_verbose is true (NVIDIA#1689)

print bandwidth when `perf_debug_verbose` is true.

* in vectorization validation, add err msg if tv has no definition (NVIDIA#1690)

check the existence of tv definition in vectorization validation

* Accomodate Reduction IterDomains when concretizing reshape extents (NVIDIA#1692)

We register extents for concretization when we concretize reshape. In order to do that, we line up `IterDomain`s in the symbolic reshaped TV and the new, concretized one. In cases where the concretized reshape is trivial, such as when the output shape is the same as the input, we do not create a new TV. In those cases, we will have the input to the original `ViewOp` as the concretized output. That input TV might have reduction domains, as in the provided test, in which case we need to filter those out when doing this alignment. This small PR just implements that filtering.

Fixes NVIDIA#1691.

* `MmaOp::evaluate` method (NVIDIA#1675)

* Fix some typos. (NVIDIA#1700)

* `torch.compile` and `eager` benchmarks for `softmax` (NVIDIA#1670)

Adds `torch.compile` and `eager` baseline benchmarks to be used in weekly benchmark runs. Issue NVIDIA#1668.

* Add a test for fusions with no inputs. (NVIDIA#1709)

As a follow up to NVIDIA#1696 (comment).

* Double the size of the fusion cache to workaround a CI issue. (NVIDIA#1702)

By just removing entries when it fills up.

* Check that the reduced axis is sharded on producer in isLowerableToCommunication (NVIDIA#1695)

Currently, a reduction is lowerable to a communication iff only one axis is reduced and this axis is sharded across devices on the **producer** side. Before this patch, we would mistakenly check that the axis is sharded on **consumer** side, which led to some runtime assert error.

* Add blank impl of isLowerableToCommunication. (NVIDIA#1698)

isLowerableToCommunication is used in a few places to print error messages or short-circuit loops. Those places appear to be places that are intended to largely be used behind the distributed path. It's easier to just define the API instead of trying to conditionalize all the use sites and invent non-USE_DISTRIBUTED behavior.

* Multidevice segmenter (NVIDIA#1696)

# What

Add an option in the segmenter to segment resharding Expr in separate singleton segment.
To trigger it, set the segmenter's options as follows:
```
SegmentCandidateFinderOptions options{
    .run_translate_welford = false,
    .run_combine_reductions = false,
    .run_herrmann_merge = true,
    .run_final_merge = true,
    .only_segment_resharding_exprs = true};
```
and use the segmenter as follows with any (possibly dummy) inputs:
```
KernelArgumentHolder dummy_inputs;
auto segmented_fusion = SegmentCandidateFinder::segment(std::move(fusion), dummy_inputs, options);
```
If `only_segment_resharding_exprs` is set to `false` (which is the case by default), the behavior of the segmenter is unchanged.

We also provide a quite wide testing suite to validate our implementation.

# Why

Resharding Exprs need to be handled differently than other Exprs because we want them to result in posting a network collective from the host. Therefore those expressions cannot (for now) be fused to any kernel. For this reason, we need those Expr to be segmented before and after.

# How

_**Remark:** For now, the segmenter is only used [at one place before scheduling and compiling the fusion](https://github.com/NVIDIA/Fuser/blob/1603f39bab8c1bbe12e38f2b5de53dec3b7cc373/csrc/kernel_cache.cpp#L990)._

Recall that the segmenter first creates as many segments as there are Expr and then tries to merge the neighbour segments incrementally in an eager manner. The method
```
bool SegmentCandidateFinder::codeGenSupportedMerge(
    SegmentedGroup* group1,
    SegmentedGroup* group2)
```
returns whether two groups can be merged (i.e. fused into one kernel).

With the current patch, if `SegmentCandidateFinderOptions::only_segment_resharding_exprs` is set to `true`, then the usual behavior of `codeGenSupportedMerge` is bypassed and the function returns whether one Expr among the groups is resharding.

Because this segmentation shouldn't depend on the inputs data, we use default (aka empty) `KernelArgumentHolder`, from which it is invalid to instantiate a `SchedulerRuntimeInfo runtime_info_`. For this reason, we had to make the latter attribute optional.

# Future/other directions

Another way to achieve the same result is to manually add segment bounds surrounding the resharding Exprs as was suggested by @wujingyue here NVIDIA#1571

The current implementation looks a bit "hacky" and should be integrated more properly once multidevice schedulers are implemented and/or the segmenter is refactored.

Later, we might wanna be able to fuse communications and computes and also communications between them. This would require a more advanced segmenter and scheduler, but hopefully this patch could serve as a good basis

# Example:

consider the fusion:
```
auto fusion = std::make_unique<Fusion>();
FusionGuard fg(fusion.get());

TensorView* tv0 = makeContigTensor({4});
fusion->addInput(tv0);
TensorView* tv1 = sum(tv0,{3});
TensorView* tv2 = set(tv1);
TensorView* tv3 = sum(tv2, {2});
fusion->addOutput(tv3);
```
Manually scheduled as follows:
```
DeviceMesh mesh ({0,1,2,3});
for (auto tv : {tv0, tv1, tv2, tv3}) {
  tv->setDeviceMesh(mesh);
}
tv0->axis(0)->parallelize(ParallelType::DIDx);
tv1->axis(0)->parallelize(ParallelType::DIDx);
```
This scheduling implies that
- `tv0` and `tv1` are fully sharded on the devices {0,1,2,3}
- `tv2` and `tv3` are fully replicated on those same devices
- consequently, the "set" operation on the line `tv2 = set(tv1)` actually embeds an "AllGather" network collective. This Expr is resharding while all the other exprs are not. We thus expect this expression to constitute an unmergeable segment.

The segmenter in this situation with the option `SegmentCandidateFinderOptions::only_segment_resharding_exprs` set to `true` will result in three segments:
- Compute segment 1: with the expr `tv1 = sum(tv0,{3})`
- Communication segment 1: with the expr `tv2 = set(tv1)`
- Compute segment 2: with the expr `tv3 = sum(tv2, {2})`

* Vectorization Factor patch for computeInfoC2P with Broadcast in mapped IterDomain (NVIDIA#1625)

Fixes NVIDIA#1567

This PR patches vectorization factor in `ContiguousInnerDimensionsMapper::computeInfoC2P`.

Handling of resolved broadcast dimension should be made on mapped consumer tensors' from_ids, instead of the root_domain order. Added a few tests per @zasdfgbnm 's suggestion:
```
Case 0:
T2[1024, 2, 512] = T0[1024, 2, 1] + T1[1024, 2, 512]
allocation = rfactor
--> T0 has no vectorization

Case 1:
T2[1024, 512, 2] = T0[1024, 1, 2] + T1[1024, 512, 2]
allocation = rfactor
--> T0 has vectorization 2

Case 2:
T2[1024, 512, 2] = T0[1024, 1, 2] + T1[1024, 512, 2];
T3[512, 1024, 2] = transpose(T2[1024, 512, 2])
allocation = rfactor
*except T1 has stride_order {1, 2, 0}
--> T0 has vectorization 4

Case 3:
T2[512, 1024, 2] = T0[1, 1024, 2] + T1[512, 1024, 2]
T3[1024, 512, 2] = transpose(T2[512, 1024, 2])
allocation = rfactor
--> T0 has vectorization 2
```

---------

Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>

* transpose scheduler fix: reduction IterDomain on input tensors (NVIDIA#1661)

Fixes NVIDIA#1659

Reorders reduction IterDomain so it won't interfere with scheduling tiling from transpose scheduler.

* Convert reduction of expanded dims to squeeze (NVIDIA#1679)

See comment in arith.cpp for details.

One controversial change here is to allow squeezing expanded dimensions, both in our IR's `SqueezeOp` and in the user-facing functions `squeeze`. This results in actually removing those dimensions. This behavior diverges from PyTorch, whose `squeeze` command will ignore requested squeezes if the size is not 1 regardless of whether that dimension is expanded. I'm happy to discuss this change and potentially take another course, but I think we do need to be able to remove expanded axes (see NVIDIA#1174 (comment) for another case where I encountered this limitation).

Fixes NVIDIA#1678

* Make sure ValGraphs are created deterministically (NVIDIA#1714)

While I was working on NVIDIA#32, I sometimes saw non-deterministic results. Hope this is the only source of non-determinism.

* Fix squeeze-related errors (NVIDIA#1717)

This fixes current failures in `pytest_ops.py -k squeeze` and some integration failures.

This restores our previous semantics for squeeze, which **do not match PyTorch**. Namely, if squeeze is provided a dimension that cannot be squeezed, we will always raise an error.

* NVFUSER_DISTRIBUTED instead of USE_DISTRIBUTED (NVIDIA#1711)

* Add the missing `clang-format on` and reformat. (NVIDIA#1722)

* Print a newline before the header. (NVIDIA#1720)

* Associate each fusion cache with its local rank in distributed setting. (NVIDIA#1699)

### Problem:

Currently, automatic serialization saves a single cache regardless of the number of devices. In a distributed setting, each process restores its fusion cache from the same common workspace. However, this workspace only contains the CUDA kernels for a single device. The remaining processes must recompile the kernels for their devices.

### Solution:

A separate process is created for each device with `ddp` or `fsdp` and each process contains a separate `FusionCache`. This PR associates each fusion cache with its local rank in a distributed setting, allowing automatic serialization to create a separate workspace for each device. During deserialization, each process loads the workspace associated with its local rank.

* Vectorized serial grid reduction (NVIDIA#1528)

This change allows us to use vectorized loads/stores in `serialReductionStep`. The generated kernel now looks like
```c++
NVFUSER_UPDATE_MAGIC_ZERO;
grid_sync::blockSerializeWait<false, false, true>(&T5[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
#pragma unroll
for(nvfuser_index_t i16 = 0; i16 < 4LL; ++i16) {
  nvfuser_index_t i17;
  i17 = 32LL * i16;
  nvfuser_index_t i18;
  i18 = 4096LL * i16;
  nvfuser_index_t i19;
  i19 = i5 + i18;
  nvfuser_index_t i20;
  i20 = -i18;
  #pragma unroll
  for(nvfuser_index_t i21 = 0; i21 < 8LL; ++i21) {
    nvfuser_index_t i22;
    i22 = 512LL * (i21 + nvfuser_zero);
    Array<float, 4LL, 4> T3;
    T3.set(float(0.000000000e+00f));
    reduction::serialReductionStep</*vec_size=*/4>(
      &T3[0LL],
      &T2[(i17 + (4LL * i21))],
      0.000000000e+00f,
      &T6[(i19 + i22)],
      [](float &a, float b) { a = a + b; },
      index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == 0,
      index_utils::maskedOffset<false, false, true>(blockIdx, gridDim) == index_utils::maskedSize<false, false, true>(gridDim) - 1,
      true,
      true);
    if ((b7 && (i6 < (i20 - i22)))) {
      loadLocalToGlobal<float, /*vec_size=*/4, /*is_volatile=*/false>( &T1[(i19 + i22)], &T3[0LL]);
    }
  }
}
grid_sync::blockSerializeRelease<false, false, true>(&T5[index_utils::maskedOffset<true, true, false>(blockIdx, gridDim)]);
NVFUSER_UPDATE_MAGIC_ZERO;
```

* removing out-dated assert on python API (NVIDIA#1724)

removing out-dated asserts in python API `define_vector`; adding a tests verifying the behavior

* make ci green again (NVIDIA#1730)

skip failing test. Please enable it once we patch NVIDIA#1728

* Remove unnecessary `MATCHER_P`. (NVIDIA#1729)

* Fix Issue NVIDIA#1734 (NVIDIA#1735)

Closes Issue NVIDIA#1734

* Rename `AliasType` -> `AllocationType` (NVIDIA#1732)

* Skip executing a kernel if it's empty. (NVIDIA#1723)

I could change `compileFusion` to skip compilation as well. It turned out to be more complicated than I expected, so I took the easier route to skip just execution, which is at least an incremental improvement.

* don't cache slice input tv (NVIDIA#1705)

If the input tv is used by slice, don't cache it. Fix NVIDIA#1697

* Make `MmaOp::evaluate` return output of the same dtype as `MmaOp` (NVIDIA#1733)

* Turing/Ampere Mma tests without `BroadcastOp` (NVIDIA#1672)

This PR renames `matmulAtInput` into `matmulAtInput2D`, explicitly showing that it generates 2D inputs. This PR also adds a `matmulAtInput3DTuring`, which is used to generate the 3D fusion inputs (for example `[M, 1, K]` and `[1, K, N]`) for matmul. The `MmaTest` for Turing and Ampere is modified to exclude the `BroadcastOp` and use the 3D version for generating fusion inputs. This is only the initial step for making `scheduleMatmul` schedule a fusion not containing `BroadcastOp`, I intentionally keep it small. Other changes will be added in followup PRs.

Fixes NVIDIA#1628

* io_alias_ const update (NVIDIA#1740)

* Add benchmarks for RoPE. (NVIDIA#1739)

This PR adds two implementations of the RoPE module and benchmarks them for NVIDIA#1597.

`rope_with_cat_fusion` mimics the Hugging Face implementation. `rope_without_cat_fusion` implements an idea from @nikitaved to avoid concatenation. Even though it looks difficult for the compiler to do it all automatically, it's still useful to keep a record of the idea.

As a side change, I made `fd.define_tensor` to accept empty contiguity.

* Make nvfuser matmul benchmarks HSH instead of HSS (NVIDIA#1712)

This matches the `at::matmul` baselines.

This PR also adds a few more problem sizes, and runs each eagermode baseline with and without FP16 reduction allowed.

* Reduce number of `MmaTest`s (NVIDIA#1738)

This PR is stacked on top of NVIDIA#1672

Turing/Ampere mma is only TN, so it makes no sense to test other layouts in `MmaTest`s. These tests are intended to test mma instructions, `ldmatrix` and `ldmatrix.trans` is tested separately in other unit tests. Similar for `HopperRS` tests.

* Weekly Benchmarks Input Range (NVIDIA#1708)

* Rename axes= to dims= in frontend (NVIDIA#1741)

Currently we accept `axes=` for some ops like `fd.ops.sum` and `dims=` for others like `fd.ops.squeeze`. This is a small attempt to make the frontend arguments more consistent. This change renames the `axis=` kwarg to `dim=` and the same for `axes=` -> `dims=`.

I think we're free to set our own convention, but for reference:
- PyTorch uses `dim=` in most places and accepts either a single dim or multiple using that same argument name, where applicable.
- Numpy uses `axis=` and, like PyTorch, accepts a list where applicable.
- `jax.lax` uses `dimensions=`

* Avoid unused smem workspace for serial grid reductions (NVIDIA#1727)

GridReduction can be lowered to either `gridReduce` or `serialReductionStep`. `gridReduce` requires a smem workspace in order to use multiple threads to aggregate partial sums. However, `serialReductionStep` does not coordinate among threads and has no use for a workspace. This change simply disables allocating that little bit of extra shared memory if our only grid reductions are serial, which currently only happens in split-K GEMM.

This reduces the smem allocated in a simple test from 16896 B to 16384 B (about 97%). More importantly, this makes the computation in `mma_utils::generateSharedMemoryEpilogueHeuristics()` more accurate. Tests are updated to check that this computation is accurate.

The change in `kernel.cpp` is responsible for reducing actual smem usage for split-K. The changes to `mma_utils` and `test_gpu_tensorcore.cpp` are needed for adding testing that our expected smem usage matches the actual usage.

* Issue NVIDIA#1748 (NVIDIA#1749)

Closes Issue NVIDIA#1748. Apart from `c10::cuda::GetDevice`, no other functionality seems affected.

* Rename `axes` to `dims` in benchmarks fusion definitions (NVIDIA#1751)

Changes the kwarg `axes` to `dims` following the API change in PR NVIDIA#1741.

* Bump matmul benchmark checkMatch() tolerance (NVIDIA#1747)

This is necessary due to recent switch to HSH

Fixes NVIDIA#1746

* linter

* change guard USE_DISTRIBUTED to NVFUSER_DISTRIBUTED in test/test_multidevice_sharding.cpp

* linting

* linter and cleanup

* remove allocator.h/cpp files

* Device index patch (NVIDIA#1752)

Fixes NVIDIA#1748

guard c10::cuda::GetDevice API change on TORCH_VERSION

with this change, it ensures that we can build against stable release `< 2.2.0`, as well as TOT after pytorch/pytorch#119142

For 2.3.0 nightly, if someone accidentally checkout a commit before the patch, the build will still fail.

* fixing multidevice build (NVIDIA#1753)

API change coming from pytorch/pytorch#119421

* patching API GUARD (NVIDIA#1754)

patching API version guard so we'll still be able to build against older pytorch version.

* Add a visitor for ValGraph (NVIDIA#1713)

Used in the loop promotion analysis. Extracted from NVIDIA#32

* empty commit for triggering CI

---------

Co-authored-by: Liqiang Lu <116412316+liqiangxl@users.noreply.github.com>
Co-authored-by: Jacob Hinkle <1454944+jacobhinkle@users.noreply.github.com>
Co-authored-by: Priya Mishra <52657555+Priya2698@users.noreply.github.com>
Co-authored-by: Jingyue Wu <wujingyue@gmail.com>
Co-authored-by: Tom Fogal <60981+tfogal@users.noreply.github.com>
Co-authored-by: jjsjann123 <jiej@nvidia.com>
Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
Co-authored-by: Naoya Maruyama <naoyam@users.noreply.github.com>
Co-authored-by: Meghan Cowan <mcowan@nvidia.com>
Co-authored-by: Ryan Spring <rspring@nvidia.com>

kwen2501 · 2024-02-13T18:14:47Z

@kunalb Thanks much for catching that! I am fixing it via #119805

@kunalb

Fixes issue pointed out in #119421 (comment)

When refactoring ProcessGroupNCCL, some code in the NCCL_EXP path wasn't done cleanly.

Cc: @kunalb @H-Huang
Pull Request resolved: #119805
Approved by: https://github.com/H-Huang

…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang

… single-device style (#119421)" This reverts commit f3e7d80. Reverted #119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](#119421 (comment)))

…device style (#119421) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: #119421 Approved by: https://github.com/shuqiangzhang

…y for all collectives (#135049)

We found that we currently pass only one input and one output tensor to the function `collective`, which makes NaNCheck, work numel stats and FR input/output sizes inaccurate for all-to-all, scatter and reduce. So we want to let `collective` take in a list of tensors to ensure it works for all collectives inside PGNCCL. This partially reverts what we did in #119421; down the road we will have another round of cleanup on `collective` to make it cleaner. For now, at least for the sake of correctness, we changed it back.

Pull Request resolved: #135049
Approved by: https://github.com/kwen2501


kwen2501 added module: c10d Issues/PRs related to collective communications and process groups release notes: distributed (c10d) release notes category labels Feb 7, 2024

github-actions bot added the oncall: distributed Add this issue/PR to distributed oncall triage queue label Feb 7, 2024

wconstab reviewed Feb 8, 2024


shuqiangzhang reviewed Feb 8, 2024


kwen2501 added 13 commits February 8, 2024 11:57

79ec54c Simplify ProcessGroupNCCL into single-device style
6c4e05c Refactor overriden functions in tests
c16ec89 Fix segv
8bbaf7c Correctly print device
36ce680 Fix NcclErrorHandlingTest
c25468d Add logging in barrier wait
98fb9b6 Fix AG and RS recording
9884725 Make flight recorder happy
edf9cbb Bump seq number once only for batch p2p
8b249cd Lint
242b2a5 Re-enable flatten path of AG and RS
4de6892 Lint
45f0b15 Swap check_gpu_tensors_different_devices with check_gpu_single_tensor

kwen2501 force-pushed the single_dev_part2 branch from d6d893b to 45f0b15 Compare February 8, 2024 19:58

kwen2501 mentioned this pull request Feb 8, 2024

avoid redundant event records and event blocks #119359

Closed

shuqiangzhang approved these changes Feb 9, 2024


pytorchmergebot added the Reverted label Feb 12, 2024

pytorchmergebot reopened this Feb 12, 2024

Generalize error message to make both CI and devGPU happy

9b72df7

pytorchmergebot added the merging label Feb 12, 2024

pytorchmergebot closed this in b2043c0 Feb 12, 2024

pytorchmergebot removed the merging label Feb 12, 2024

jjsjann123 added a commit to NVIDIA/Fuser that referenced this pull request Feb 12, 2024

fixing multidevice build

5f8c84b

API change coming from pytorch/pytorch#119421

jjsjann123 mentioned this pull request Feb 12, 2024

fixing multidevice build NVIDIA/Fuser#1753

Merged

jjsjann123 added a commit to NVIDIA/Fuser that referenced this pull request Feb 12, 2024

fixing multidevice build (#1753)

1830238

API change coming from pytorch/pytorch#119421

kwen2501 mentioned this pull request Feb 13, 2024

[c10d] Fix compilation of NCCL_EXP path #119805

Closed

kwen2501 mentioned this pull request Feb 13, 2024

Flight Recorder dump_entries() segfaults when used with coalesced operations #119758

Open

github-actions bot deleted the single_dev_part2 branch March 15, 2024 01:52

Aidyn-A mentioned this pull request Mar 26, 2024

[c10d][NCCL] Refactor coalesced storage #122651

Closed

oraluben mentioned this pull request Jul 10, 2024

[c10d] collective failure with single-device style (#119421) and nccl <= 2.12 #130414

Closed

oraluben mentioned this pull request Jul 17, 2024

Correctly set CUDAGuard in nccl collectives #130921

Closed

fduwjj mentioned this pull request Sep 5, 2024

[c10d] Change collective to take in a list of tensors so it work fully for all collectives #135049

Closed
