
Updated Scaled_mm to support more scaling formats via CuBlas #153555


Open

drisspg opened this issue May 14, 2025 · 0 comments
Labels
  • Blackwell - Specific failures or issues related to sm100 + Cuda arches
  • enhancement - Not as big of a feature, but technically not a bug. Should be easy to fix
  • module: cuda - Related to torch.cuda, and CUDA support in general
  • module: float8 - For torch.float8_e5m2 and torch.float8_e4m3
  • topic: performance - topic category
  • triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@drisspg (Contributor) commented May 14, 2025

Summary

In CUDA 12.9, cuBLAS added support for an expanded set of scaling strategies beyond just per-tensor: https://developer.nvidia.com/blog/boosting-matrix-multiplication-speed-and-flexibility-with-nvidia-cublas-12-9/
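
For reference, a minimal sketch (not from this issue; the exact shape/dtype requirements are assumptions to check against the op's current signature) of the two granularities _scaled_mm already exposes today, assuming an SM89+ GPU:

```python
import torch

# Minimal sketch: the two scaling granularities _scaled_mm handles today.
# Shapes/dtypes below are assumptions; verify against the current op signature.
M, K, N = 128, 256, 64
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# The second operand is expected in column-major layout, so build (N, K) and transpose.
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()

# Per-tensor scaling: one fp32 scalar scale per operand -> cuBLASLt path today.
out_tensorwise = torch._scaled_mm(
    a, b,
    scale_a=torch.tensor(1.0, device="cuda"),
    scale_b=torch.tensor(1.0, device="cuda"),
    out_dtype=torch.bfloat16,
)

# Per-row scaling: one fp32 scale per row of `a` and per column of `b`
# -> rowwise CUTLASS kernel today; the new cuBLAS scale modes could cover this.
out_rowwise = torch._scaled_mm(
    a, b,
    scale_a=torch.ones(M, 1, device="cuda"),
    scale_b=torch.ones(1, N, device="cuda"),
    out_dtype=torch.bfloat16,
)
```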

Currently on CUDA:

SM89

_scaled_mm dispatches to one of 2 backends on SM89:

  • Per-Tensor scaling -> CublasLT
  • Per-Row scaling -> RowWise Cutlass kernel
  • GroupWise Scaling -> Not supported | some support in AO
  • BlockWise Scaling -> Not supported | some support in AO

H100

_scaled_mm dispatches to one of 2 backends on H100:

  • Per-Tensor scaling -> CublasLT
  • Per-Row scaling -> RowWise Cutlass kernel
  • GroupWise Scaling -> Not supported | some support in AO
  • BlockWise Scaling -> Not supported | some support in AO

B200

_scaled_mm dispatches to one of 2 backends on B200:

  • Per-Tensor scaling -> CublasLT
  • Per-Row scaling -> RowWise Cutlass kernel (template is not optimal)
  • GroupWise Scaling -> MXFP8 BlockWise scaling is supported via CublasLT
  • BlockWise Scaling -> Not supported

We should add new cuBLAS bindings to enable these more performant code paths.
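
Purely for illustration, a hypothetical sketch of the first decision such a binding layer has to make (ScalingType and classify_scaling are made-up names, not PyTorch APIs): infer the requested granularity from the scale tensor's shape before choosing a cuBLASLt scale mode or a CUTLASS fallback.

```python
from enum import Enum, auto

import torch


class ScalingType(Enum):
    TENSORWISE = auto()   # single scalar scale for the whole operand
    ROWWISE = auto()      # one scale per row (outer-vector scaling)
    BLOCKWISE = auto()    # e.g. MXFP8: one scale per 1x32 block along K
    UNSUPPORTED = auto()


def classify_scaling(mat: torch.Tensor, scale: torch.Tensor, block_k: int = 32) -> ScalingType:
    """Guess the scaling granularity for `mat` (shape [M, K]) from `scale`'s shape."""
    m, k = mat.shape
    if scale.numel() == 1:
        return ScalingType.TENSORWISE
    if tuple(scale.shape) in ((m, 1), (m,)):
        return ScalingType.ROWWISE
    if tuple(scale.shape) == (m, k // block_k):
        return ScalingType.BLOCKWISE
    return ScalingType.UNSUPPORTED
```

Each branch would then map to whichever cuBLAS scale mode the toolkit/hardware supports, with the existing CUTLASS kernels kept as the fallback for older arches.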

Blockers

Ideally we would remove the CUTLASS templates, since cuBLAS appears (per NVIDIA's claims) to be universally more performant. The main blocker is that we would lose support for SM89 hardware.

We don't currently ship a prebuilt version of PyTorch for CUDA 12.9.

cc @ptrblck @msaroufim @eqy @jerryzh168 @yanbing-j @vkuzo @albanD @kadeng @penguinwu @ngimel, @lw

@drisspg drisspg added module: cuda Related to torch.cuda, and CUDA support in general topic: performance topic category Blackwell Specific failures or issues related to sm100 + Cuda arches labels May 14, 2025
@malfet malfet added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module enhancement Not as big of a feature, but technically not a bug. Should be easy to fix module: float8 For torch.float8_e5m2 and torch.float8_e4m3 labels May 14, 2025