[ROCm] MI300X FP8 scaled_mm is extremely slow on nightly #143465
Comments
Are you able to try building the latest tip of the hipblaslt develop branch and rerunning your numbers?

Since you're using the nightly wheel, it will likely take some manual hacking of your torch install to use the latest hipblaslt. You might be able to build and install the latest hipblaslt to some other location on your system and then use LD_LIBRARY_PATH to point to its lib directory instead of the hipblaslt lib that is bundled in the nightly wheel. In case you want to try the hack of copying the newer hipblaslt into your torch install location, note that hipblaslt has both the libhipblaslt.so component and auxiliary files containing the GPU code objects (kernels), stored relative to libhipblaslt.so under hipblaslt/library/ as *.co and *.dat files. You'll need to copy all of it.
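A minimal sketch of that LD_LIBRARY_PATH workaround, launching the reproducer from Python (the install prefix and script name are placeholders, not paths from this thread):

```python
import os
import subprocess

# Placeholder prefix where a locally built hipBLASLt was installed.
HIPBLASLT_PREFIX = "/opt/hipblaslt-develop"

env = dict(os.environ)
# Prepend the custom lib dir so the dynamic loader resolves libhipblaslt.so
# (and the hipblaslt/library/*.co and *.dat files next to it) from the local
# build before the copy bundled inside the torch wheel.
env["LD_LIBRARY_PATH"] = f"{HIPBLASLT_PREFIX}/lib:" + env.get("LD_LIBRARY_PATH", "")

# Run the reproducer in a fresh process so the override takes effect at load time.
subprocess.run(["python", "reprod.py"], env=env, check=True)
```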
Hi @OrenLeung, we are on this. Thank you @jeffdaily for your response.
@OrenLeung It looks like e4m3 is more optimized while e5m2 isn't. By the way, your code mixes e5m2 and e4m3; I have reported this to the library owner. Thank you, Oren.

Probably. I checked the hipblaslt commits, and I do see more f8 (e4m3) commits than bf8 (e5m2). :-)
e5m2 support is being handled by the hipblaslt team. So this is being closed.
@OrenLeung I will follow up tomorrow. Thank you. Even with e4m3 * e4m3 I got m=16384 n=8192 k=1280: 101.02149096903179 TFLOP/s. The performance is really bad.
@OrenLeung we have 3 teams on this issue internally. I will keep you posted through other channels. Thank you.
Setting the env var PYTORCH_TUNABLEOP_ENABLED=1 provides significantly improved performance for the reproducers provided in this issue.
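For reference, a minimal sketch of enabling TunableOp from Python; the env var is the one named above, and setting it before importing torch (so it is read at initialization) is an assumption about how the benchmark script is structured:

```python
import os

# Enable TunableOp so hipBLASLt GEMM solutions are benchmarked and the fastest
# one is selected for each shape. Set before torch initializes.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"

import torch

m, n, k = 8192, 8192, 8192
a = torch.randn(m, k, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(n, k, device="cuda").to(torch.float8_e4m3fn).t()  # column-major operand
one = torch.tensor(1.0, device="cuda")
out = torch._scaled_mm(a, b, scale_a=one, scale_b=one, out_dtype=torch.bfloat16)
```

The first run pays a one-time tuning cost; TunableOp caches its results to a CSV file so subsequent runs reuse the selected kernels.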
@jeffdaily thanks for the tip. Any chance that torch._scaled_mm will be tuned out of the box?
Hi, we are also seeing this issue. Is there a fix coming that does not require TunableOp?
Hi @deke997, many F8B8 GEMMs have been optimized and checked in: https://github.com/ROCm/hipBLASLt/commits/develop/ We are actively working on this. If you wish to test, please compile the latest hipblaslt. Thank you.
@OrenLeung This is with a relatively recent version of hipblasLt develop (maybe a month old):
Let us know if you want us to continue to keep this issue open or if it can be closed.
Hi @naromero77amd, thanks for working on this! Let's close this issue once the current PyPI torch nightly is able to get above ~1000 TFLOP/s too!
The performance improvements come from the version of hipblasLt that is bundled with the ROCm stack; the particular version of PyTorch is not as relevant. In the latest ROCm 6.4 docker images, the performance issue is resolved:
@naromero77amd On CUDA, the PyPI torch wheel pulls in cuBLAS automatically via the PyPI nvidia-* dependencies. Would the ROCm PyPI torch wheel do the same? Sorry if this is a dumb question.

pip list
Package                  Version
------------------------ ------------
nvidia-cublas-cu12       12.8.3.14
nvidia-cuda-cupti-cu12   12.8.57
nvidia-cuda-nvrtc-cu12   12.8.61
nvidia-cuda-runtime-cu12 12.8.57
nvidia-cudnn-cu12        9.7.1.26
nvidia-cufft-cu12        11.3.3.41
nvidia-cufile-cu12       1.13.0.11
nvidia-curand-cu12       10.3.9.55
nvidia-cusolver-cu12     11.7.2.55
nvidia-cusparse-cu12     12.5.7.53
nvidia-cusparselt-cu12   0.6.3
nvidia-nccl-cu12         2.26.2
nvidia-nvjitlink-cu12    12.8.61
nvidia-nvtx-cu12         12.8.55
torch                    2.7.0+cu128
torchaudio               2.7.0+cu128
torchvision              0.22.0+cu128
triton                   3.3.0
...
@OrenLeung At this point, we have bundled the ROCm acceleration libraries (hipBLASLt, MIOpen, etc.) inside the PyTorch wheel package. The torch domain libraries are bundled in the same way as the NV packages at the Python level.
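A quick way to check what an installed wheel actually bundles (a sketch that assumes the usual layout of shipped shared libraries under torch/lib):

```python
import glob
import os

import torch

# Shared libraries shipped with the wheel live under torch/lib; on ROCm builds
# the bundled hipBLASLt (and its kernel files under hipblaslt/library/) should
# show up there as well.
torch_lib = os.path.join(os.path.dirname(torch.__file__), "lib")
for path in sorted(glob.glob(os.path.join(torch_lib, "*hipblaslt*"))):
    print(path)
```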
Thanks for the explanation @sunway513. I was able to get the performance improvement inside the ROCm docker container, but for the PyPI torch wheel installation the performance improvements still aren't there, for either the stable or the nightly wheel. I am following https://pytorch.org/get-started/locally/ Can it be updated so that the updated ROCm libraries are bundled inside the torch PyPI wheel package too?

Docker

Pypi Stable
Following the installation instructions from https://pytorch.org/get-started/locally/

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.3

user@tw031:~$ python ./reprod.py
m=16384 n=8192 k=1280: 102.20135224999864
m=16384 n=1024 k=8192: 102.85467618252625
m=16384 n=8192 k=7168: 104.92880160773375
m=16384 n=3584 k=8192: 105.19174292269251
m=8192 n=8192 k=8192: 104.96762894820529
user@tw031:~$ pip list | grep torch
pytorch-triton-rocm 3.3.0
torch 2.7.0+rocm6.3

Pypi Nightly
Following the installation instructions from https://pytorch.org/get-started/locally/

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3

user@tw031:~$ python ./reprod.py
m=16384 n=8192 k=1280: 101.688998175193
m=16384 n=1024 k=8192: 102.83025828797749
m=16384 n=8192 k=7168: 104.56994939801766
m=16384 n=3584 k=8192: 104.70265188641152
m=8192 n=8192 k=8192: 104.52714592170736
user@tw031:~$ pip list | grep torch
pytorch-triton-rocm 3.3.0+git96316ce5
torch 2.8.0.dev20250426+rocm6.3
torchaudio 2.6.0.dev20250426+rocm6.3
torchvision 0.22.0.dev20250426+rocm6.3
Thanks for trying out the wheel packages, @OrenLeung. Also, if you would like to get the PyTorch wheels with a ROCm 6.4 base early, you can try out the ones that ROCm has been hosting for fast availability:
Nightly wheels are at ROCm 6.4 now, but the get-started chooser hasn't been updated yet.
Thanks for fixing this issue! I see that the get-started website's nightly is now ROCm 6.4 with the updated hipBLASLt.
🐛 Describe the bug
Hi AMD Team,
torch._scaled_mm is extremely slow on MI300X at ~100 TFLOP/s versus ~1200 TFLOP/s on H100. Can you look into this?
cc: @hliuca
ROCm
H100
Reprod
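The original reprod.py is not reproduced in this extract. A rough sketch of an equivalent torch._scaled_mm TFLOP/s benchmark (the shapes are taken from the results quoted elsewhere in this thread; the dtypes, scaling, and timing loop are assumptions, not the author's script):

```python
import time

import torch


def bench_scaled_mm(m, n, k, dtype_a=torch.float8_e4m3fn, dtype_b=torch.float8_e4m3fn,
                    warmup=10, iters=50):
    # Note: depending on the ROCm/PyTorch version, the fnuz FP8 dtypes
    # (torch.float8_e4m3fnuz / torch.float8_e5m2fnuz) may be required instead.
    a = torch.randn(m, k, device="cuda").to(dtype_a)
    # _scaled_mm expects the second operand column-major, hence the transpose.
    b = torch.randn(n, k, device="cuda").to(dtype_b).t()
    # Per-tensor scales are float32 scalar tensors.
    scale_a = torch.tensor(1.0, device="cuda")
    scale_b = torch.tensor(1.0, device="cuda")

    for _ in range(warmup):
        torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
    torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / iters

    return 2 * m * n * k / elapsed / 1e12  # TFLOP/s


# Shapes from the results quoted in this thread.
for m, n, k in [(16384, 8192, 1280), (16384, 1024, 8192), (16384, 8192, 7168),
                (16384, 3584, 8192), (8192, 8192, 8192)]:
    print(f"m={m} n={n} k={k}: {bench_scaled_mm(m, n, k)}")
```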
Versions
pip list | grep torch
pytorch-triton-rocm 3.2.0+git35c6c7c6
torch               2.6.0.dev20241216+rocm6.2.4
cc @msaroufim @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd