[FP8][CUTLASS] xFail `honor_sm_carveout` on `sm100` by eqy · Pull Request #152378 · pytorch/pytorch · GitHub
Open: eqy wants to merge 2 commits into main
Conversation

eqy (Collaborator) commented Apr 28, 2025

CUTLASS only supports SM carveout via green contexts on sm100

cc @ptrblck @msaroufim @jerryzh168 @yanbing-j @vkuzo @albanD @kadeng @penguinwu


@eqy eqy added module: cuda Related to torch.cuda, and CUDA support in general open source topic: not user facing topic category matrix multiplication module: float8 For torch.float8_e5m2 and torch.float8_e4m3 labels Apr 28, 2025
@eqy eqy requested a review from lw April 28, 2025 23:33
pytorch-bot bot commented Apr 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/152378

Note: Links to docs will display an error until the docs builds have been completed.

❌ 50 New Failures, 6 Cancelled Jobs, 2 Unrelated Failures

As of commit 9d23557 with merge base 119f64d:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jerryzh168 jerryzh168 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 29, 2025
albanD (Collaborator) left a comment


I'm confused. I would expect the scaled_mm op to still work, even if it is at lower performance?

eqy (Collaborator, Author) commented Apr 29, 2025

> I'm confused. I would expect the scaled_mm op to still work, even if it is at lower performance?

This test isn't really about exercising scaled_mm, which is still expected to work. Rather, it's about gating the number of SMs that the kernel launches on, which CUTLASS currently only respects on sm90. In fact, the SM carveout setting in the kernel params is expected to emit a compile-time warning that it will have no effect.
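The gating pattern described above can be sketched with a conditional xfail decorator. This is a minimal illustration, not the PR's actual test code: `device_capability` is a hardcoded stand-in for `torch.cuda.get_device_capability()` so the example runs without a GPU, and `xfail_on_sm100` is a hypothetical helper name.

```python
import unittest

def device_capability():
    # Hypothetical stand-in for torch.cuda.get_device_capability(),
    # which returns (major, minor) compute capability of the current GPU.
    # Hardcoded to sm100 (10, 0) for illustration.
    return (10, 0)

def xfail_on_sm100(test_func):
    """Mark a test as expected-to-fail on sm100, where CUTLASS only
    supports SM carveout via green contexts and the carveout setting
    in the kernel params is ignored."""
    if device_capability() == (10, 0):
        return unittest.expectedFailure(test_func)
    return test_func

class TestCarveout(unittest.TestCase):
    @xfail_on_sm100
    def test_honor_sm_carveout(self):
        # A real test would check that limiting the SM count changes the
        # number of SMs the CUTLASS kernel occupies; on sm100 the setting
        # has no effect, so the check fails there as expected.
        self.assertNotEqual(device_capability(), (10, 0))

if __name__ == "__main__":
    unittest.main()
```

On sm90 the decorator is a no-op and the assertion is exercised normally; on sm100 the failure is recorded as an expected failure rather than an error.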

Labels
matrix multiplication · module: cuda (Related to torch.cuda, and CUDA support in general) · module: float8 (For torch.float8_e5m2 and torch.float8_e4m3) · open source · topic: not user facing · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)