Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 #128989


Closed
wants to merge 7 commits from the add-row-wise-scaling-2 branch

Conversation

drisspg
Contributor
@drisspg drisspg commented Jun 18, 2024

Summary

The first PR (#125204) was reverted, so this is a redo.

This pull request introduces an fp8 rowwise-scaling kernel as an optional implementation for scaled_mm. Kernel selection is based on the scaling tensors of the inputs. For inputs x and y of shape [M, K] and [K, N] respectively, the following conditions must be met:

  • x's scale should be a 1-dimensional tensor of length M.
  • y's scale should be a 1-dimensional tensor of length N.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for y are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
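For illustration, here is a minimal sketch of that dispatch condition; the function name use_rowwise_kernel and its exact arguments are assumptions for this sketch, not the PR's actual code:

#include <ATen/ATen.h>

// Sketch: select the rowwise kernel iff both scales are 1-D tensors whose
// lengths match M (rows of x) and N (columns of y); anything else falls
// back to the existing tensorwise-scaled path.
static bool use_rowwise_kernel(
    const at::Tensor& x,       // [M, K] fp8 operand
    const at::Tensor& y,       // [K, N] fp8 operand (TN layout)
    const at::Tensor& scale_x, // expected: 1-D, length M
    const at::Tensor& scale_y) // expected: 1-D, length N
{
  const int64_t M = x.size(0);
  const int64_t N = y.size(1);
  return scale_x.dim() == 1 && scale_x.size(0) == M &&
         scale_y.dim() == 1 && scale_y.size(0) == N;
}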

Two additional PRs were required to enable local builds.

Todo

We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace sm_90 with sm_90a?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

ifdef

I tried to gate the building of the kernel with:

#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900

I was having a hell of a time with this, so I am not really sure of the right way to do it.
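One likely reason this is painful: __CUDA_ARCH__ is only defined during the device compilation passes, so a guard like the one above silently evaluates to false when the host compiler processes the same translation unit. A minimal sketch of a host-side gate instead; deriving BUILD_ROWWISE_FP8_KERNEL this way is an assumption for illustration (in the snippet quoted later in this thread, that macro is expected to be set by the build system):

// Sketch: gate on toolkit version (plus, optionally, a flag the build
// system sets only when targeting sm_90a) rather than on __CUDA_ARCH__,
// which is undefined in host code.
#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000
#define BUILD_ROWWISE_FP8_KERNEL
#endif

#if defined(BUILD_ROWWISE_FP8_KERNEL)
// ... definition and launch of the rowwise-scaled kernel go here ...
#endif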

Kernel Credit:
@jwfromm

cc @yanbing-j @vkuzo @albanD @kadeng

@drisspg drisspg requested a review from eqy as a code owner June 18, 2024 20:04
pytorch-bot bot commented Jun 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128989

Note: Links to docs will display an error until the docs builds have been completed.

❌ 25 New Failures, 2 Unrelated Failures

As of commit 4c6fe75 with merge base 9a7e251:

NEW FAILURES: 25 jobs failed on this PR.

FLAKY: 2 jobs failed but were likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@drisspg drisspg requested review from vkuzo and yangsiyu007 June 18, 2024 20:09
@drisspg drisspg force-pushed the add-row-wise-scaling-2 branch 2 times, most recently from e6d341a to 4c6fe75, on June 18, 2024 20:33
@drisspg drisspg added ciflow/trunk Trigger trunk jobs on your pull request topic: not user facing topic category module: float8 For torch.float8_e5m2 and torch.float8_e4m3 labels Jun 18, 2024
@drisspg drisspg changed the title Enable fp8 rowwise scaling kernel on cuda, TAKE 2 Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 Jun 18, 2024
@drisspg drisspg added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Jun 18, 2024
@drisspg drisspg force-pushed the add-row-wise-scaling-2 branch from 252c489 to 4c6fe75 on June 18, 2024 22:10
@drisspg
Contributor Author
drisspg commented Jun 18, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)
Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@drisspg
Contributor Author
drisspg commented Jun 19, 2024

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort; consider -i/--ignore-current instead to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@xuzhao9
Contributor
xuzhao9 commented Jun 27, 2024

This PR seems to break FBGEMM runtime: https://github.com/pytorch/benchmark/actions/runs/9704891003/job/26785961181

Error message:

/home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: cannot open shared object file: No such file or directory
/home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai_py.so: undefined symbol: cuTensorMapEncodeTiled
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/__init__.py", line 27, in <module>
    torch.ops.load_library(
  File "/home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/_ops.py", line 1298, in load_library
    ctypes.CDLL(path)
  File "/home/runner/miniconda3/envs/torchbench/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai_py.so: undefined symbol: cuTensorMapEncodeTiled

@drisspg
Contributor Author
drisspg commented Jun 28, 2024

@malfet
Contributor
malfet commented Jun 28, 2024

@xuzhao9 the following fixes the linker error for PyTorch; it arises because cutlass uses the driver API:

#if defined(BUILD_ROWWISE_FP8_KERNEL)
// We are going to override the cuTensorMapEncodeTiled driver api with our lazy loader
static CUresult CUDAAPI nvrtc_cuTensorMapEncodeTiled(
    CUtensorMap* tensorMap,
    CUtensorMapDataType tensorDataType,
    cuuint32_t tensorRank,
    void* globalAddress,
    const cuuint64_t* globalDim,
    const cuuint64_t* globalStrides,
    const cuuint32_t* boxDim,
    const cuuint32_t* elementStrides,
    CUtensorMapInterleave interleave,
    CUtensorMapSwizzle swizzle,
    CUtensorMapL2promotion l2Promotion,
    CUtensorMapFloatOOBfill oobFill) {
  return at::globalContext().getNVRTC().cuTensorMapEncodeTiled(
      tensorMap,
      tensorDataType,
      tensorRank,
      globalAddress,
      globalDim,
      globalStrides,
      boxDim,
      elementStrides,
      interleave,
      swizzle,
      l2Promotion,
      oobFill);
}

Perhaps fbgemm needs to add something similar? We cannot link PyTorch against libcuda.
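For context, a sketch of how a shim like this is typically wired in; the #define below is an assumption about the mechanism, not a quote from this PR:

#if defined(BUILD_ROWWISE_FP8_KERNEL)
// Redirect later uses of the driver symbol in this translation unit to the
// wrapper above, so cutlass's TMA setup resolves to the lazy loader in
// at::globalContext().getNVRTC() instead of a direct import from libcuda.
#define cuTensorMapEncodeTiled nvrtc_cuTensorMapEncodeTiled
#endif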

Labels

  • ciflow/binaries (Trigger all binary build and upload jobs on the PR)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: float8 (For torch.float8_e5m2 and torch.float8_e4m3)
  • topic: not user facing (topic category)