[amd] fix tunableop gemm (#153764) · pytorch/pytorch@6f835a4

Commit 6f835a4

mxz297 authored and pytorchmergebot committed
[amd] fix tunableop gemm (#153764)
Summary: TunableOp on AMD has had a perf regression for a while. It turns out that the TunableOp code path would first run the tuned GEMM and then also run the heuristic GEMM (so it runs two GEMMs per call).

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 buck test @//mode/opt-amd-gpu -c fbcode.rocm_arch=mi300 -c fbcode.rocm_ck_rtz=true fbcode//accelerators/workloads/microbench/RE:test_emu_v1p4 -- --exact 'accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)' --run-disabled
```

Before the diff:
```
  File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/ecc11ed52295855f/accelerators/workloads/microbench/RE/__test_emu_v1p4__/test_emu_v1p4#link-tree/accelerators/workloads/microbench/RE/test_emu_v1p4.py", line 47, in test_gemm
    self.assertTrue(result < AMD_GEMM_BASELINE * AMD_GEMM_THRESHOLD)
Buck UI: https://www.internalfb.com/buck2/b4b8dfca-0301-4c5d-83d6-d866d840c42d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14355223896396807
Network: Up: 10MiB  Down: 1.9GiB  (reSessionID-23b213fe-a460-4788-86c6-a52343ff10f4)
Loading targets.   Remaining 0/5144  93161 dirs read, 753263 targets declared
Analyzing targets. Remaining 0/70523  2837379 actions, 3262810 artifacts declared
Executing actions. Remaining 0/472286  217:26:58.1s exec time total
Command: test.     Finished 122 local, 522 remote, 199785 cache (99% hit)  211:26:30.5s exec time cached (97%)
Time elapsed: 12:50.2s
Test execution completed but the tests failed
Tests finished: Pass 0. Fail 1. Fatal 0. Skip 0. Build failure 0
1 TESTS FAILED
  ✗ accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)
Run $ fdb buck test <args> to debug accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)
    ^^^ just prefix your previous command! ($ fdb !!)
Learn more at https://fburl.com/fdb
```

After the diff:
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: henryoier, henryhu6

Differential Revision: D74910115

Pull Request resolved: #153764

Approved by: https://github.com/yangsiyu007, https://github.com/xw285cornell
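The fix is a small control-flow change: the fallback `gemm_and_bias` call was not guarded by an `else`, so after the tuned GEMM ran, the heuristic GEMM ran as well. Below is a minimal standalone sketch of that bug pattern and its fix; `run_tuned_gemm` and `run_heuristic_gemm` are hypothetical stand-ins for the two code paths, not the actual ATen functions.

```cpp
#include <iostream>

// Hypothetical stand-ins for the two GEMM paths; the real code calls
// tuned TunableOp kernels and at::cuda::blas::gemm_and_bias respectively.
bool run_tuned_gemm()     { std::cout << "tuned GEMM\n";     return true; }
bool run_heuristic_gemm() { std::cout << "heuristic GEMM\n"; return true; }

int main() {
  const bool tuned_path_taken = true;
  bool okay = false;

  // Before the fix: the heuristic call sits after the if-block,
  // so it runs unconditionally -- two GEMMs per call.
  if (tuned_path_taken) {
    okay = run_tuned_gemm();
  }
  okay = run_heuristic_gemm();  // always executes: the regression

  // After the fix: the heuristic GEMM is the else branch,
  // so exactly one of the two paths executes.
  if (tuned_path_taken) {
    okay = run_tuned_gemm();
  } else {
    okay = run_heuristic_gemm();
  }
  return okay ? 0 : 1;
}
```

The diff below makes exactly this change in `addmm_out_cuda_impl`.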
1 parent 2ade886 · commit 6f835a4

File tree

1 file changed: +4 −4 lines changed


aten/src/ATen/native/cuda/Blas.cpp

Lines changed: 4 additions & 4 deletions
```diff
@@ -467,9 +467,8 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
         alpha,
         (&result != &self) ? self.const_data_ptr<scalar_t>() : nullptr,
         activation_to_gemm_and_blas_arg(activation));
-      }
-
-      okay = at::cuda::blas::gemm_and_bias<scalar_t>(
+      } else {
+        okay = at::cuda::blas::gemm_and_bias<scalar_t>(
         args.transa == 't',
         args.transb == 't',
         args.m,
@@ -486,7 +485,8 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
         args.result->data_ptr<scalar_t>(),
         args.result_ld,
         activation_to_gemm_and_blas_arg(activation)
-      );
+        );
+      }
       });
   }
   if (!okay) {
```
