[amd] fix tunableop gemm (#153764) · pytorch/pytorch@6f835a4

Commit 6f835a4

mxz297 authored and pytorchmergebot committed
[amd] fix tunableop gemm (#153764)
Summary: TunableOp on AMD has had a perf regression for a while. It turns out that the TunableOp code path would first run the tuned GEMM and then also run the heuristic GEMM (so it runs two GEMMs per call).

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 buck test @//mode/opt-amd-gpu -c fbcode.rocm_arch=mi300 -c fbcode.rocm_ck_rtz=true fbcode//accelerators/workloads/microbench/RE:test_emu_v1p4 -- --exact 'accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)' --run-disabled
```

Before the diff:
```
  File "/data/users/mxz/fbsource/buck-out/v2/gen/fbcode/ecc11ed52295855f/accelerators/workloads/microbench/RE/__test_emu_v1p4__/test_emu_v1p4#link-tree/accelerators/workloads/microbench/RE/test_emu_v1p4.py", line 47, in test_gemm
    self.assertTrue(result < AMD_GEMM_BASELINE * AMD_GEMM_THRESHOLD)
Buck UI: https://www.internalfb.com/buck2/b4b8dfca-0301-4c5d-83d6-d866d840c42d
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14355223896396807
Network: Up: 10MiB  Down: 1.9GiB  (reSessionID-23b213fe-a460-4788-86c6-a52343ff10f4)
Loading targets.   Remaining 0/5144  93161 dirs read, 753263 targets declared
Analyzing targets. Remaining 0/70523  2837379 actions, 3262810 artifacts declared
Executing actions. Remaining 0/472286  217:26:58.1s exec time total
Command: test.     Finished 122 local, 522 remote, 199785 cache (99% hit)  211:26:30.5s exec time cached (97%)
Time elapsed: 12:50.2s
Test execution completed but the tests failed
Tests finished: Pass 0. Fail 1. Fatal 0. Skip 0. Build failure 0
1 TESTS FAILED
  ✗ accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)
Run $ fdb buck test <args> to debug accelerators/workloads/microbench/RE:test_emu_v1p4 - test_gemm (accelerators.workloads.microbench.RE.test_emu_v1p4.EMUv1p4PerfTest)
    ^^^ just prefix your previous command! ($ fdb !!)
Learn more at https://fburl.com/fdb
```

After the diff:
```
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Reviewed By: henryoier, henryhu6

Differential Revision: D74910115

Pull Request resolved: #153764

Approved by: https://github.com/yangsiyu007, https://github.com/xw285cornell
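The fix is a small control-flow change: the fallback `gemm_and_bias` call was not guarded by an `else`, so after the tuned GEMM ran, the heuristic GEMM ran as well. Below is a minimal standalone sketch of that bug pattern and its fix; `run_tuned_gemm` and `run_heuristic_gemm` are hypothetical stand-ins for the two code paths, not the actual ATen functions.

```cpp
#include <iostream>

// Hypothetical stand-ins for the two GEMM paths; the real code calls
// tuned TunableOp kernels and at::cuda::blas::gemm_and_bias respectively.
bool run_tuned_gemm()     { std::cout << "tuned GEMM\n";     return true; }
bool run_heuristic_gemm() { std::cout << "heuristic GEMM\n"; return true; }

int main() {
  const bool tuned_path_taken = true;
  bool okay = false;

  // Before the fix: the heuristic call sits after the if-block,
  // so it runs unconditionally -- two GEMMs per call.
  if (tuned_path_taken) {
    okay = run_tuned_gemm();
  }
  okay = run_heuristic_gemm();  // always executes: the regression

  // After the fix: the heuristic GEMM is the else branch,
  // so exactly one of the two paths executes.
  if (tuned_path_taken) {
    okay = run_tuned_gemm();
  } else {
    okay = run_heuristic_gemm();
  }
  return okay ? 0 : 1;
}
```

The diff below makes exactly this change in `addmm_out_cuda_impl`.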
1 parent 2ade886 · commit 6f835a4

File tree

1 file changed: +4 −4 lines changed


aten/src/ATen/native/cuda/Blas.cpp

Lines changed: 4 additions & 4 deletions
```diff
@@ -467,9 +467,8 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
         alpha,
         (&result != &self) ? self.const_data_ptr<scalar_t>() : nullptr,
         activation_to_gemm_and_blas_arg(activation));
-      }
-
-      okay = at::cuda::blas::gemm_and_bias<scalar_t>(
+      } else {
+        okay = at::cuda::blas::gemm_and_bias<scalar_t>(
         args.transa == 't',
         args.transb == 't',
         args.m,
@@ -486,7 +485,8 @@ Tensor& addmm_out_cuda_impl(Tensor& result, const Tensor& self, const Tensor& ma
         args.result->data_ptr<scalar_t>(),
         args.result_ld,
         activation_to_gemm_and_blas_arg(activation)
-      );
+        );
+      }
       });
   }
   if (!okay) {
```
