Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 #128989


Closed
wants to merge 7 commits from the add-row-wise-scaling-2 branch

Conversation

drisspg
Contributor
@drisspg drisspg commented Jun 18, 2024

Summary

The first PR (#125204) was reverted, so this is a redo.

This pull request introduces an fp8 rowwise-scaling kernel as an optional implementation for scaled_mm. Kernel selection is based on the scaling tensors of the inputs. For inputs x and y of shape [M, K] and [K, N] respectively, the following conditions must be met:

  • x's scale should be a 1-dimensional tensor of length M.
  • y's scale should be a 1-dimensional tensor of length N.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for y are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
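For illustration, here is a minimal sketch of that dispatch condition; the function name use_rowwise_kernel and its exact arguments are assumptions for this sketch, not the PR's actual code:

#include <ATen/ATen.h>

// Sketch: select the rowwise kernel iff both scales are 1-D tensors whose
// lengths match M (rows of x) and N (columns of y); anything else falls
// back to the existing tensorwise-scaled path.
static bool use_rowwise_kernel(
    const at::Tensor& x,       // [M, K] fp8 operand
    const at::Tensor& y,       // [K, N] fp8 operand (TN layout)
    const at::Tensor& scale_x, // expected: 1-D, length M
    const at::Tensor& scale_y) // expected: 1-D, length N
{
  const int64_t M = x.size(0);
  const int64_t N = y.size(1);
  return scale_x.dim() == 1 && scale_x.size(0) == M &&
         scale_y.dim() == 1 && scale_y.size(0) == N;
}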

Two additional PRs were required to enable local builds.

Todo

We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace sm_90 with sm_90a?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

ifdef

I tried to gate the building of the kernel with:

#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900

I was having a hell of a time with this, so I am not really sure of the right way to do it.
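One likely reason this is painful: __CUDA_ARCH__ is only defined during the device compilation passes, so a guard like the one above silently evaluates to false when the host compiler processes the same translation unit. A minimal sketch of a host-side gate instead; deriving BUILD_ROWWISE_FP8_KERNEL this way is an assumption for illustration (in the snippet quoted later in this thread, that macro is expected to be set by the build system):

// Sketch: gate on toolkit version (plus, optionally, a flag the build
// system sets only when targeting sm_90a) rather than on __CUDA_ARCH__,
// which is undefined in host code.
#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000
#define BUILD_ROWWISE_FP8_KERNEL
#endif

#if defined(BUILD_ROWWISE_FP8_KERNEL)
// ... definition and launch of the rowwise-scaled kernel go here ...
#endif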

Kernel Credit:
@jwfromm

cc @yanbing-j @vkuzo @albanD @kadeng

@drisspg drisspg requested a review from eqy as a code owner June 18, 2024 20:04
pytorch-bot bot commented Jun 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128989

Note: Links to docs will display an error until the docs builds have been completed.

❌ 25 New Failures, 2 Unrelated Failures

As of commit 4c6fe75 with merge base 9a7e251:

NEW FAILURES: 25 jobs failed on this PR.

FLAKY: 2 jobs failed but were likely due to flakiness present on trunk.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@drisspg drisspg requested review from vkuzo and yangsiyu007 June 18, 2024 20:09
@drisspg drisspg force-pushed the add-row-wise-scaling-2 branch 2 times, most recently from e6d341a to 4c6fe75, on June 18, 2024 20:33
@drisspg drisspg added ciflow/trunk Trigger trunk jobs on your pull request topic: not user facing topic category module: float8 For torch.float8_e5m2 and torch.float8_e4m3 labels Jun 18, 2024
@drisspg drisspg changed the title Enable fp8 rowwise scaling kernel on cuda, TAKE 2 Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 Jun 18, 2024
@drisspg drisspg added the ciflow/binaries Trigger all binary build and upload jobs on the PR label Jun 18, 2024
@drisspg drisspg force-pushed the add-row-wise-scaling-2 branch from 252c489 to 4c6fe75 on June 18, 2024 22:10
@drisspg
Contributor Author
drisspg commented Jun 18, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)
Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@drisspg
Contributor Author
drisspg commented Jun 19, 2024

@pytorchbot merge -f "unrelated failures"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f only as a last resort; consider -i/--ignore-current instead to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@xuzhao9
Contributor
xuzhao9 commented Jun 27, 2024

This PR seems to break FBGEMM runtime: https://github.com/pytorch/benchmark/actions/runs/9704891003/job/26785961181

Error message:

/home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/fbgemm_gpu/fbgemm_gpu_py.so: cannot open shared object file: No such file or directory
/home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai_py.so: undefined symbol: cuTensorMapEncodeTiled
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/__init__.py", line 27, in <module>
    torch.ops.load_library(
  File "/home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/torch/_ops.py", line 1298, in load_library
    ctypes.CDLL(path)
  File "/home/runner/miniconda3/envs/torchbench/lib/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: /home/runner/miniconda3/envs/torchbench/lib/python3.11/site-packages/fbgemm_gpu/experimental/gen_ai/fbgemm_gpu_experimental_gen_ai_py.so: undefined symbol: cuTensorMapEncodeTiled

@drisspg
Contributor Author
drisspg commented Jun 28, 2024

@malfet
Contributor
malfet commented Jun 28, 2024

@xuzhao9 the following fixes the linker error for PyTorch; it arises because cutlass uses the driver API:

#if defined(BUILD_ROWWISE_FP8_KERNEL)
// We are going to override the cuTensorMapEncodeTiled driver api with our lazy loader
static CUresult CUDAAPI nvrtc_cuTensorMapEncodeTiled(
    CUtensorMap* tensorMap,
    CUtensorMapDataType tensorDataType,
    cuuint32_t tensorRank,
    void* globalAddress,
    const cuuint64_t* globalDim,
    const cuuint64_t* globalStrides,
    const cuuint32_t* boxDim,
    const cuuint32_t* elementStrides,
    CUtensorMapInterleave interleave,
    CUtensorMapSwizzle swizzle,
    CUtensorMapL2promotion l2Promotion,
    CUtensorMapFloatOOBfill oobFill) {
  return at::globalContext().getNVRTC().cuTensorMapEncodeTiled(
      tensorMap,
      tensorDataType,
      tensorRank,
      globalAddress,
      globalDim,
      globalStrides,
      boxDim,
      elementStrides,
      interleave,
      swizzle,
      l2Promotion,
      oobFill);
}

Perhaps fbgemm needs to add something similar? We cannot link PyTorch against libcuda.
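For context, a sketch of how a shim like this is typically wired in; the #define below is an assumption about the mechanism, not a quote from this PR:

#if defined(BUILD_ROWWISE_FP8_KERNEL)
// Redirect later uses of the driver symbol in this translation unit to the
// wrapper above, so cutlass's TMA setup resolves to the lazy loader in
// at::globalContext().getNVRTC() instead of a direct import from libcuda.
#define cuTensorMapEncodeTiled nvrtc_cuTensorMapEncodeTiled
#endif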

Labels

  • ciflow/binaries (Trigger all binary build and upload jobs on the PR)
  • ciflow/trunk (Trigger trunk jobs on your pull request)
  • Merged
  • module: float8 (For torch.float8_e5m2 and torch.float8_e4m3)
  • topic: not user facing (topic category)