Convert reduction of expanded dims to squeeze #1679

jacobhinkle · 2024-01-26T15:18:35Z

See comment in arith.cpp for details.

One controversial change here is to allow squeezing expanded dimensions, both in our IR's `SqueezeOp` and in the user-facing `squeeze` functions. This results in actually removing those dimensions. This behavior diverges from PyTorch, whose `squeeze` ignores a requested squeeze if the size is not 1, regardless of whether that dimension is expanded. I'm happy to discuss this change and potentially take another course, but I think we do need to be able to remove expanded axes (see #1174 (comment) for another case where I encountered this limitation).

Fixes #1678

jacobhinkle · 2024-01-26T15:18:48Z

!build --diff

jacobhinkle · 2024-01-26T16:06:09Z

csrc/ops/arith.cpp

-        !id->hasExpandedExtent() && id->extent()->isConstInt() &&
-        id->extent()->evaluate().as<int64_t>() == 1;


We were checking that extent is 1. I don't think we should need to do that, since we should not have any way to create a broadcast dim with unexpanded extent that's not 1.

jacobhinkle · 2024-01-26T16:41:39Z

There are currently two test failures. The first is

unknown file: Failure                                                                                            
C++ exception with description "shape '[8]' is invalid for input of size 96" thrown in the test body.            
To reproduce: NVFUSER_TEST_RANDOM_SEED=1706286828 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='NVFuserTest.FusionExpandReduce_CUDA'                                                                              
[  FAILED  ] NVFuserTest.FusionExpandReduce_CUDA (360 ms)

The fusion is defined like this:

  auto tv0 = makeConcreteTensor({1, 8});
  fusion->addInput(tv0);

  auto tv1 =
      expand(tv0, {IrBuilder::create<Val>(12L), IrBuilder::create<Val>(8L)});

  auto tv2 = sum(tv1, {0});
  fusion->addOutput(tv2);

resulting in this IR:

Inputs:
  T0_g[ bS0{1}, iS1{8} ], float
Outputs:
  T3_g[ iS5{8} ], float

%kernel_math {
T1_l[ bS2{1 ex 12}, iS3{8} ] = expand( T0_g[ bS0{1}, iS1{8} ], {12, 8} )
T2_l[ iS4{8} ]
   = squeeze( T1_l[ bS2{1 ex 12}, iS3{8} ] )
T3_g[ iS5{8} ]
   = T2_l[ iS4{8} ]
   * float(12);
}

It is properly computed, but the evaluate method of SqueezeOp translates to at::squeeze which does not squeeze the expanded axis. If we go this route we can check for expanded axes and first slice them before squeezing.
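
The rewritten IR above relies on a simple identity: summing over an expanded broadcast axis gives the same values as dropping that axis and multiplying by the expanded extent. Below is a minimal pure-Python sketch of that identity. It uses nested lists rather than nvFuser or ATen, and the helper names `expand_rows`, `sum_axis0`, and `squeeze_and_scale` are hypothetical, introduced only for this illustration.

```python
def expand_rows(row, extent):
    """Model expanding a [1, k] tensor to [extent, k]: every row aliases the same values."""
    return [row] * extent


def sum_axis0(matrix):
    """Reduce over axis 0, like sum(tv1, {0}) in the fusion above."""
    return [sum(column) for column in zip(*matrix)]


def squeeze_and_scale(row, extent):
    """Drop the expanded axis and multiply by its extent instead of reducing."""
    return [value * extent for value in row]


row = [float(i) for i in range(8)]        # stands in for T0 with shape [1, 8]
expanded = expand_rows(row, 12)           # stands in for T1 with shape [12, 8]

reduced = sum_axis0(expanded)             # the original reduction
rewritten = squeeze_and_scale(row, 12)    # squeeze followed by * 12

assert reduced == rewritten
```

The values here are small integers stored as floats, so both paths are exact; with arbitrary floating-point inputs the two results could differ by rounding.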

jacobhinkle · 2024-01-26T16:43:30Z

The other test failure is

unknown file: Failure
C++ exception with description "Splitting an axis of non-Serial parallel type is not supported at this time. Parallelization strategy must be set after calling split.. Tensor: T3_g[ iblockIdx.y8{gridDim.y}, iS9{( ceilDiv(( ceilDiv(4, blockDim.y) ), gridDim.y) )}, ithreadIdx.y7{blockDim.y} ]
Exception raised from split at /opt/pytorch/nvfuser/csrc/tensor_view.cpp:671 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x7f (0x7f50df36577e in /opt/pytorch/nvfuser/build/libnvfuser_codegen.so)
frame #1: nvfuser::TensorView::split(int, nvfuser::Val*, bool, bool) + 0x3fe (0x7f50dfa84c22 in /opt/pytorch/nvfuser/build/libnvfuser_codegen.so)
frame #2: <unknown function> + 0x79024a (0x55e21c44324a in build/nvfuser_tests)
frame #3: <unknown function> + 0xc982cd (0x55e21c94b2cd in build/nvfuser_tests)
frame #4: <unknown function> + 0xc9157f (0x55e21c94457f in build/nvfuser_tests)
frame #5: <unknown function> + 0xc66be2 (0x55e21c919be2 in build/nvfuser_tests)
frame #6: <unknown function> + 0xc67670 (0x55e21c91a670 in build/nvfuser_tests)
frame #7: <unknown function> + 0xc67f77 (0x55e21c91af77 in build/nvfuser_tests)
frame #8: <unknown function> + 0xc77e89 (0x55e21c92ae89 in build/nvfuser_tests)
frame #9: <unknown function> + 0xc991ac (0x55e21c94c1ac in build/nvfuser_tests)
frame #10: <unknown function> + 0xc92543 (0x55e21c945543 in build/nvfuser_tests)
frame #11: <unknown function> + 0xc765f1 (0x55e21c9295f1 in build/nvfuser_tests)
frame #12: <unknown function> + 0xc4a93c (0x55e21c8fd93c in build/nvfuser_tests)
frame #13: <unknown function> + 0xc4a8b5 (0x55e21c8fd8b5 in build/nvfuser_tests)
frame #14: <unknown function> + 0x29d90 (0x7f50a7f05d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #15: __libc_start_main + 0x80 (0x7f50a7f05e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #16: <unknown function> + 0x3ebf45 (0x55e21c09ef45 in build/nvfuser_tests)
" thrown in the test body.
To reproduce: NVFUSER_TEST_RANDOM_SEED=1706286737 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='NVFuserTest.FusionExpandReduce2_CUDA'

That test uses a similar fusion but is doing some manual scheduling, which assumes we have not translated the reduction to a squeeze already, so it breaks before compilation.

Since this fusion is now translated to pointwise ops, it is no longer relevant.

This fixes ExpandReduce_CUDA test in test_gpu3.cpp

jacobhinkle · 2024-01-26T17:19:42Z

csrc/ir/nodes.cpp

-  return {at::expand_copy(in, expanded_size)};
+  return {in.expand(expanded_size)};


I believe this change is warranted regardless of whether we go forward with squeezing expanded ops. It was necessary in this case to fix broken test ExpandReduce_CUDA. cc @Priya2698

jacobhinkle · 2024-01-26T17:21:37Z

test/test_gpu3.cpp

-  // tv2[r{3}, i{4}]
-  tv2->split(0, NamedScalar::getParallelDim(ParallelType::TIDy));
-  tv2->axis(1)->parallelize(ParallelType::TIDy);
-  tv2->split(0, NamedScalar::getParallelDim(ParallelType::BIDy), false);
-  tv2->axis(0)->parallelize(ParallelType::BIDy);
-  tv2->split(-1, NamedScalar::getParallelDim(ParallelType::TIDx));
-  tv2->axis(-1)->parallelize(ParallelType::TIDx);
-  tv2->axis(-2)->parallelize(ParallelType::BIDx);
-  // [rBIDy, rO, rTIDy, iBIDx, iTIDx]
-  tv2->reorder({{-2, 0}, {-1, 1}, {2, 2}});
-  // [iBIDx, iTIDx, rTIDy, rBIDy, rO]
-  auto tv3 = tv2->rFactor({-1});


This test was meant to exercise reduction of expanded domains, which can always be translated to pointwise ops. I removed it because it seems no longer relevant to me.

jacobhinkle · 2024-01-26T17:23:25Z

Tests are passing now. One was fixed by updating the evaluate methods in SqueezeOp and ExpandOp and the other test was removed.

naoyam · 2024-01-30T00:14:44Z

> This behavior diverges from PyTorch

Does it mean end results would not match if expand and squeeze are used?

> I think we do need to be able to remove expanded axes (see #1174 (comment))

Could you please explain the reasoning again? Looked at the linked comment, still not really clear to me.

jacobhinkle · 2024-01-30T00:56:16Z

> Does it mean end results would not match if expand and squeeze are used?

Possibly, which is the main downside I see to the current approach in this PR. To see the torch behavior, consider this example:

In [1]: import torch
In [2]: x = torch.randn((2, 1)).expand([2, 3])
In [3]: x.squeeze(1)
Out[3]: 
tensor([[2.0809, 2.0809, 2.0809],
        [2.1042, 2.1042, 2.1042]])

In this case, torch silently refuses to squeeze the second axis because it has extent > 1. In order to remove that axis you need to slice it instead.

In [4]: x[:, 0]
Out[4]: tensor([2.0809, 2.1042])
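
The squeeze-versus-slice distinction shown above can also be modeled without PyTorch. The sketch below is a toy strided-view model, not PyTorch's implementation: `View`, `expand`, `squeeze`, `select`, and `to_list_1d` are hypothetical names for this illustration. It assumes only what PyTorch documents about `expand`, namely that an expanded dimension gets stride 0 and no data is copied.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class View:
    """Toy strided view: element (i0, i1, ...) is data[offset + sum(i_k * strides[k])]."""
    data: Tuple[float, ...]
    shape: Tuple[int, ...]
    strides: Tuple[int, ...]
    offset: int = 0


def _drop(values, dim):
    """Return the tuple with entry `dim` removed."""
    return values[:dim] + values[dim + 1:]


def expand(view, dim, size):
    """Expand a size-1 dimension by giving it stride 0; no data is copied."""
    assert view.shape[dim] == 1, "only size-1 dimensions can be expanded"
    shape = view.shape[:dim] + (size,) + view.shape[dim + 1:]
    strides = view.strides[:dim] + (0,) + view.strides[dim + 1:]
    return View(view.data, shape, strides, view.offset)


def squeeze(view, dim):
    """PyTorch-like squeeze: silently a no-op unless the size is exactly 1."""
    if view.shape[dim] != 1:
        return view
    return View(view.data, _drop(view.shape, dim), _drop(view.strides, dim), view.offset)


def select(view, dim, index):
    """Like x[:, index] for dim=1: removes the dimension regardless of its size."""
    assert 0 <= index < view.shape[dim]
    offset = view.offset + index * view.strides[dim]
    return View(view.data, _drop(view.shape, dim), _drop(view.strides, dim), offset)


def to_list_1d(view):
    """Materialize a 1-D view as a Python list."""
    assert len(view.shape) == 1
    return [view.data[view.offset + i * view.strides[0]] for i in range(view.shape[0])]


base = View(data=(2.0809, 2.1042), shape=(2, 1), strides=(1, 1))
x = expand(base, dim=1, size=3)

assert squeeze(x, 1).shape == (2, 3)    # the expanded axis survives squeeze
assert select(x, 1, 0).shape == (2,)    # slicing removes it
assert to_list_1d(select(x, 1, 0)) == [2.0809, 2.1042]
```

Because the expanded axis has stride 0, any index along it reads the same element, so selecting index 0 loses no information.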

If we stuck with that behavior in our IR, we would probably want special handling for SliceOp when operating on Broadcast input axes. Instead, this PR is currently just allowing the first behavior to squeeze an expanded axis. I suppose another way to go would be to introduce an UnexpandOp which could undo expansions, then that could be followed by SqueezeOp when we want to remove an expanded broadcast.

A middle ground might be to keep this behavior in squeeze, but hide it behind a default-disabled option which would still allow us to use it when needed like squeeze(tv0, {1, 2}, /*squeeze_expanded=*/true).
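
To make the opt-in idea concrete, here is a hypothetical Python model of a `squeeze` with a default-disabled `squeeze_expanded` flag. It is a sketch only: `Dim` is a toy stand-in for an IterDomain, the function is not the C++ signature in alias.h, and raising an error when the flag is off is an assumption made for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Dim:
    """Toy stand-in for an IterDomain: an extent plus broadcast/expand metadata."""
    extent: int
    is_broadcast: bool = False
    expanded_extent: Optional[int] = None  # e.g. 12 for bS2{1 ex 12}


def squeeze(dims: Sequence[Dim], to_squeeze: Sequence[int],
            squeeze_expanded: bool = False) -> List[Dim]:
    """Remove the requested broadcast dims; expanded ones only when opted in."""
    out = []
    for position, dim in enumerate(dims):
        if position not in to_squeeze:
            out.append(dim)
            continue
        if not dim.is_broadcast:
            raise ValueError(f"dim {position} is not a broadcast and cannot be squeezed")
        if dim.expanded_extent is not None and not squeeze_expanded:
            # Assumption for this sketch: refuse loudly rather than silently ignore.
            raise ValueError(f"dim {position} is expanded; pass squeeze_expanded=True")
        # Otherwise the dim is simply dropped from the output.
    return out


# Mirrors T1_l[ bS2{1 ex 12}, iS3{8} ] from the IR earlier in this thread.
t1 = [Dim(1, is_broadcast=True, expanded_extent=12), Dim(8)]

assert squeeze(t1, [0], squeeze_expanded=True) == [Dim(8)]

try:
    squeeze(t1, [0])  # flag left at its default
except ValueError as err:
    print(err)
```

Keeping the flag off by default means existing callers see no change in behavior, while internal passes such as the reduction-to-squeeze translation can opt in explicitly.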

> I think we do need to be able to remove expanded axes (see #1174 (comment))
>
> Could you please explain the reasoning again? Looked at the linked comment, still not really clear to me.

In #1174, we are encountering reshapes that need to split an expanded dimension. It's unnecessary to try and track a split expand back to the original broadcast. We could just forget about the original expanded broadcast and introduce two new ones. This seems simplest to me, and matches our current treatment of unexpanded broadcasts in reshape; we currently squeeze all pre-existing broadcasts and use a BroadcastOp after the ViewOp to introduce new ones. We could do the same with expand if we were able to remove the expanded broadcasts.

naoyam

Can you please add the repro as a test?

csrc/ops/arith.cpp

naoyam · 2024-01-30T18:54:09Z

csrc/ops/alias.cpp

@@ -244,8 +244,6 @@ TensorView* squeeze(TensorView* x, const std::vector<int64_t>& dims) {
          out_domain.push_back(id->cloneWithoutRFactor());
          continue;
        }
-        NVF_CHECK(


Please leave a comment about the deviation of behavior

Noted in the docstrings for both signatures of squeeze in alias.h.

jacobhinkle · 2024-01-30T20:42:56Z

> Can you please add the repro as a test?

I can. I had left it out since it is essentially the same as this existing test

Fuser/test/test_gpu3.cpp

Line 3477 in 1603f39

TEST_F(NVFuserTest, FusionExpandReduce_CUDA) {

naoyam

LGTM

jacobhinkle requested a review from naoyam January 26, 2024 15:18

jacobhinkle marked this pull request as draft January 26, 2024 15:19

Guard against Xor special (uncommon) case

e49a940

jacobhinkle commented Jan 26, 2024

jacobhinkle added 3 commits January 26, 2024 17:16

Remove ExpandReduce2_CUDA test

ca1b203

Since this fusion is now translated to pointwise ops, it is no longer relevant.

Use expand instead of expand_copy in ExpandOp::evaluate

8457b1c

Slice before squeezing expanded dims in SqueezeOp::evaluate

64172b5

This fixes ExpandReduce_CUDA test in test_gpu3.cpp

jacobhinkle commented Jan 26, 2024

jacobhinkle marked this pull request as ready for review January 26, 2024 17:38

naoyam reviewed Jan 30, 2024

Add squeeze_expanded option

eff321d

jacobhinkle added 2 commits January 30, 2024 20:45

Rename is_broadcast_reduction -> is_squeeze

2a538e2

Add repro as python test

1b1e943

jacobhinkle requested a review from naoyam February 2, 2024 13:41

naoyam approved these changes Feb 2, 2024

Merge remote-tracking branch 'origin/main' into convert_expand_reduction_to_squeeze

850e5a4

jacobhinkle merged commit 91e56b0 into main Feb 2, 2024

jacobhinkle deleted the convert_expand_reduction_to_squeeze branch February 2, 2024 19:51

naoyam mentioned this pull request Feb 2, 2024

Make sure ValGraphs are created deterministically #1714

Merged

jacobhinkle mentioned this pull request Feb 8, 2024

python_tests.test_normalization.test_instance_norm_multigpu failure #1728

Open

jacobhinkle mentioned this pull request Jun 7, 2024

Move checkConcretization for reshapes #2363

Merged
