Add more GPU architectures support by RayWang96 · Pull Request #112 · deepseek-ai/DeepGEMM · GitHub

Conversation

RayWang96
Collaborator

Hi DeepGEMM Team,

This PR is submitted by NVIDIA and introduces support for more GPU architectures.

Key Points

  • MXFP8 support
    • Same 1x128 by 128x128 SF (scaling factor) recipe as the original DeepGEMM, but using UE8M0 instead of FP32 as the SF data format (see the shape sketch after this list).
    • Same SF input layout as the original DeepGEMM.
  • Multi-architecture support
    • Refactored the file structure for better extensibility; we are working on merging other GPU architectures (e.g. Ada) into DeepGEMM.
    • Notice: the structural code changes are significant, so we'd like to align with you on a multi-architecture support strategy before proceeding further.
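
A minimal shape sketch of the SF recipe above (illustrative only; the sizes and tensor names are assumptions, and the scales are shown as power-of-two FP32 values rather than the packed UE8M0 encoding):

    import torch

    m, n, k = 4096, 7168, 2048

    # FP8 (E4M3) operands in the NT layout used by DeepGEMM.
    a = torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn)   # activations
    b = torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn)   # weights

    # Scaling factors: one per 1x128 activation block and one per 128x128 weight block.
    # The original recipe stores these as FP32; the MXFP8 path keeps the same shapes
    # and layout but restricts the values to powers of two (UE8M0 exponents).
    a_sf = torch.full((m, k // 128), 2.0 ** -7, device='cuda')        # shape (m, k/128)
    b_sf = torch.full((n // 128, k // 128), 2.0 ** -7, device='cuda') # shape (n/128, k/128)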

Ongoing and coming soon

  • Small M optimization (~128)
  • Very small M values (<64), with swap AB optimization
  • Masked grouped GEMM optimization

Long-term optimization

  • Other architecture support (e.g. Ada)
  • Add more input format support such as offset layout used by TRT-LLM

We are actively working on these optimizations. The goal is to continuously support NVIDIA's latest GPU hardware in DeepGEMM, ensuring the community can run efficient and scalable training and inference with DeepSeek models for next-generation AI workloads.

Thanks!

@LyricZhao
Collaborator

Thanks for this! As we don't have SM100 GPUs, let's make it a draft and wait for community feedback :)

@zhyncs
Contributor
zhyncs commented Jun 12, 2025

I'll help verify. Stay tuned.

CMakeLists.txt Outdated

set(CMAKE_CXX_STANDARD 20)
set(CMAKE_CUDA_STANDARD 20)
set(CMAKE_CXX_STANDARD 17)
Collaborator

Why change back to 17 from 20?

Collaborator Author

This is a conflict after rebasing. I'll resolve this later.

@zhyncs
Contributor
zhyncs commented Jun 12, 2025

The Blackwell part looks good to me. I believe the remaining work involves making it compatible with Hopper. Thanks for this great work!

return sf


def transform_sf_into_required_layout(sf: torch.Tensor,
Contributor
@fzyzcjy fzyzcjy Jun 12, 2025

It would be great to allow skipping the (INT, 128, 128) case where the data has already been pre-transformed by other fused kernels. For example, my patch:

    should_skip_transform = (
            (sf.dtype == torch.int and gran == (1, 128) and get_device_arch() == '100a')
            or (sf.dtype == torch.int and gran == (128, 128) and get_device_arch() == '100a')
    )

    if not should_skip_transform:
        # Pre-transform checks
        check_sf_layout(sf, mn=mn, k=k, gran=gran, num_groups=num_groups)

...

    if should_skip_transform:
        return check_sf_layout(sf, mn=mn, k=k, gran=(1, 128), num_groups=num_groups, tma_stride_check=True,
                               type_check=torch.int)

...

Contributor

update: please refer to sgl-project#1

Collaborator Author

That makes sense. I'm still thinking about the most appropriate way to skip the transform, but for now, let me merge your patch as a workaround.

Contributor
@fzyzcjy fzyzcjy Jun 12, 2025

Yes, there may be better methods; my patch is just a quick hack to make sure SGLang can work.

k_grouped_wgrad_gemm_fp8_fp8_fp32_nt,
ceil_div,
set_num_sms, get_num_sms,
get_col_major_tma_aligned_tensor,


get_col_major_tma_aligned_tensor: is this function deprecated?


Also, if we don't import the function here, users will have to refactor their code, e.g.:

from deep_gemm import calc_diff, ceil_div

->

from deep_gemm.testing.numeric import calc_diff
from deep_gemm.utils.math import ceil_div

Is this expected?

Collaborator Author

get_col_major_tma_aligned_tensor: is this function deprecated?

It's not used on SM100; it will be added to utils/layout.py as part of the SM90 support.

Is this expected?

Yes. I think these are not core functions of DeepGEMM and should not be exposed at the top level. Both functions are simple, so I don't think the change will have much impact on users.


Makes sense to me, thanks!
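
For downstream code, a small compatibility shim along these lines could smooth the migration (a sketch, assuming the new module paths quoted above):

    # Hypothetical compatibility shim for code that still imports these helpers
    # from the deep_gemm top level; falls back to the new module paths from this PR.
    try:
        from deep_gemm import calc_diff, ceil_div            # pre-refactor layout
    except ImportError:
        from deep_gemm.testing.numeric import calc_diff      # new layout
        from deep_gemm.utils.math import ceil_div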


@RayWang96 When you said it's not used on SM100, does this mean it will be handled automatically on SM100, so we don't need code like https://github.com/vllm-project/vllm/blob/dac8cc49f43f7d2639d873532a408949169821a9/vllm/model_executor/layers/quantization/fp8.py#L654?

Collaborator Author

fp8_gemm_nt can handle the transformation implicitly, or this step can be skipped so that only the layout is validated. It's recommended to use an upstream kernel to transform to the specified layout first, and then call the GEMM.


So do you mean it is recommended that we call this get_col_major_tma_aligned_tensor each time before we call any gemm operation?
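
A rough sketch of the pattern described above, i.e. preparing the scales once upstream and then calling the GEMM repeatedly; the shapes and the (tensor, scale) tuple arguments are illustrative assumptions, not a definitive API reference:

    import torch
    import deep_gemm

    m, n, k = 128, 4096, 7168
    a = torch.randn(m, k, device='cuda').to(torch.float8_e4m3fn)
    b = torch.randn(n, k, device='cuda').to(torch.float8_e4m3fn)
    d = torch.empty(m, n, device='cuda', dtype=torch.bfloat16)

    # Prepare the scales once, ideally already in the layout the kernel expects
    # (e.g. produced by a fused quantization kernel), so repeated GEMM calls only
    # validate the layout instead of re-transforming it every time.
    a_sf = torch.ones(m, k // 128, device='cuda')
    b_sf = torch.ones(n // 128, k // 128, device='cuda')

    for _ in range(10):
        deep_gemm.fp8_gemm_nt((a, a_sf), (b, b_sf), d)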


setuptools.setup(
name='deep_gemm',
version='1.0.0' + revision,


For large API changes, can we bump the version to 2.0?

Collaborator Author

Thanks! We will sync up with the DeepGEMM team to decide on the version number.

@yewentao256

@RayWang96 Hi, will this PR be merged soon? I think my integration was merged into vLLM main by accident (vllm-project/vllm#19820), but if we merge this PR now, I think it would be fine.

@LyricZhao
Collaborator

@yewentao256 Sorry, not now. We still need 2-3 weeks to merge with the Hopper code.

@yewentao256

@yewentao256 Sorry, not now. We still need 2-3 weeks to merge with the Hopper code.

okk, thanks for letting me know

@yewentao256
yewentao256 commented Jun 25, 2025

vllm-project/vllm#20090
Added a unit test in vLLM that may be helpful.

This unit test passes on H100, but with your integration it does not pass on B200, so perhaps you can use it to validate accuracy as well.

Even with the exact same tensor input, stopping at the first m_grouped_gemm_nt_contiguous inside DeepGemmExperts already produces results that differ from the correct answer on H100.

@RayWang96
Collaborator Author

@yewentao256 I read and tried to debug the vLLM unit test, and I think the problem is that the scale is not being converted to E8M0; DeepGEMM Blackwell currently supports only E8M0 scales.

The unit test is also comparing the precision of two different setups: Triton FP8 using FP32 scales vs. DeepGEMM using E8M0 scales. As a result, the assert_close check fails, which I think is expected. We should instead benchmark both Triton FP8 and DeepGEMM FP8 against Triton BF16; in that scenario, I found their precision loss to be very similar.

@yewentao256

@yewentao256 I read and tried to debug the vLLM unit test, and I think the problem is that the scale is not being converted to E8M0; DeepGEMM Blackwell currently supports only E8M0 scales.

The unit test is also comparing the precision of two different setups: Triton FP8 using FP32 scales vs. DeepGEMM using E8M0 scales. As a result, the assert_close check fails, which I think is expected. We should instead benchmark both Triton FP8 and DeepGEMM FP8 against Triton BF16; in that scenario, I found their precision loss to be very similar.

So the correctness of DeepGEMM narrows down to a smaller scope? Since this works on H100, I would expect it to still pass on B200 to ensure correctness for all of the models that use DeepGEMM. Right now this makes all of the DeepGEMM-related unit tests in vLLM fail.

@RayWang96
Collaborator Author
RayWang96 commented Jul 2, 2025

So the correctness of DeepGEMM narrows down to a smaller scope? Since this works on H100, I would expect it to still pass on B200 to ensure correctness for all of the models that use DeepGEMM. Right now this makes all of the DeepGEMM-related unit tests in vLLM fail.

@yewentao256

First, if you directly pass an FP32 scale into DeepGEMM Blackwell, it will first be converted to an integer and then transformed into the format required by the input (see here). This is actually incorrect: the scale needs to be pre-converted to a power of two (see here). This might be one reason why the unit test failed.

On this generation of GPUs, the scale must use E8M0 to fully utilize the computing power (refer to this). In our tests, we found that this does not have much impact on the accuracy of the DeepSeek model, so using E8M0 scales is our recommended approach.

Another reason the unit test failed might be that, if GEMM with FP32 scales is used as the baseline, the results of GEMM with E8M0 scales will not align numerically. I therefore think the correct way to write the unit test is to compare against BF16 GEMM results rather than against FP32-scale GEMM results.
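
A minimal sketch of quantizing with power-of-two scales up front, loosely mirroring the per_token_cast_to_fp8 helper visible in the traceback further down this thread (the 1x128 block size and the 448 E4M3 maximum are the usual DeepGEMM test-utility conventions, assumed here):

    import torch

    def per_token_cast_to_fp8_ue8m0(x: torch.Tensor):
        # Cast an (m, k) tensor to FP8 E4M3 with one UE8M0 scale per 1x128 block.
        m, k = x.shape
        assert k % 128 == 0
        x_view = x.view(m, k // 128, 128).float()
        x_amax = x_view.abs().amax(dim=2, keepdim=True).clamp(1e-4)
        # Round the scale *up* to the nearest power of two so it is exactly
        # representable in UE8M0, and quantize with that scale, not the FP32 one.
        sf = torch.pow(2.0, torch.ceil(torch.log2(x_amax / 448.0)))
        x_fp8 = (x_view / sf).to(torch.float8_e4m3fn).view(m, k)
        return x_fp8, sf.squeeze(-1)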

@yewentao256

@RayWang96 Thanks for your reply!
I looked into it; the main reason is that we use torch.assert_allclose for element-wise comparison. After switching to your calc_diff function from test_core for a global comparison, it works now.
Element-wise accuracy decreases a little bit, but the e2e accuracy holds for Qwen3, so I think it is fine.
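
For reference, a global relative-difference check in the spirit of calc_diff might look like this (a sketch, not necessarily DeepGEMM's exact implementation):

    import torch

    def global_diff(x: torch.Tensor, y: torch.Tensor) -> float:
        # One similarity-based error over the whole tensor: small values mean the
        # tensors agree globally even if individual elements would fail an
        # element-wise assert_close tolerance.
        x, y = x.double(), y.double()
        denom = (x * x + y * y).sum()
        return float(1.0 - 2.0 * (x * y).sum() / denom)

    # e.g. assert global_diff(out, ref) < 1e-3 instead of torch.testing.assert_close(out, ref)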

VLLM_USE_DEEP_GEMM=1 lm_eval   --model vllm   --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True"   --trust_remote_code   --tasks gsm8k   --num_fewshot 5   --batch_size auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8218|±  |0.0105|
|     |       |strict-match    |     5|exact_match|↑  |0.8946|±  |0.0085|

lm_eval   --model vllm   --model_args "pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True"   --trust_remote_code   --tasks gsm8k   --num_fewshot 5   --batch_size auto
vllm (pretrained=Qwen/Qwen3-30B-A3B-FP8,max_model_len=32768,enforce_eager=True,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8271|±  |0.0104|
|     |       |strict-match    |     5|exact_match|↑  |0.8984|±  |0.0083|


def ceil_to_ue8m0(x: torch.Tensor):
assert x.view(-1).amax().item() > 0
return torch.pow(2.0, torch.ceil(torch.log2(x.abs())))
@yewentao256 yewentao256 Jul 3, 2025

Here .item() will cause a CUDA graph issue.

  File "/home/wentao/.wentao_env/lib/python3.12/site-packages/deep_gemm-1.1.0-py3.12.egg/deep_gemm/utils/math.py", line 33, in per_token_cast_to_fp8
    sf = ceil_to_ue8m0(x_amax / 448.0)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/.wentao_env/lib/python3.12/site-packages/deep_gemm-1.1.0-py3.12.egg/deep_gemm/utils/math.py", line 24, in ceil_to_ue8m0
    assert x.view(-1).amax().item() > 0
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/wentao/.wentao_env/lib/python3.12/site-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: operation not permitted when stream is capturing
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Perhaps deleting the assert statement?

Collaborator Author

Thanks, this function is primarily intended for testing purposes. For performance optimization, a custom implementation should be considered, potentially fused into other kernels.


Thanks for letting me know
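
For anyone hitting the same capture error, a capture-safe variant is simply the same math without the host-side synchronization (a sketch; the positivity check is dropped rather than replaced):

    import torch

    def ceil_to_ue8m0_graph_safe(x: torch.Tensor) -> torch.Tensor:
        # Same computation as ceil_to_ue8m0, but without .item(), so no host sync
        # is triggered and the op can run under CUDA graph capture.
        return torch.pow(2.0, torch.ceil(torch.log2(x.abs())))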

f'{(m * k + k * n + m * n * 2) / 1e9 / t:4.0f} GB/s')
# noinspection PyShadowingNames
def test_func():
deep_gemm.fp8_gemm_nt(a, b, d)

Should it be deep_gemm.fp8_gemm_nt(a, b, d, c=c) ?

Collaborator Author

Good catch, thanks. We did fix this in our internal version.


Are there any other changes in the internal version, especially to the interfaces?
We are currently working on integrating this version into vLLM.

Collaborator Author

Mostly performance optimizations; they will not break the old interfaces.


Sounds great, thanks!

@yewentao256
yewentao256 commented Jul 9, 2025

In our tests, we found that this does not have much impact on the accuracy of the DeepSeek model.

Hi @RayWang96

export VLLM_ALL2ALL_BACKEND="deepep_high_throughput"
VLLM_USE_DEEP_GEMM=1 lm_eval   --model vllm   --model_args "pretrained=deepseek-ai/DeepSeek-R1,data_parallel_size=8,gpu_memory_utilization=0.95,max_model_len=16384,enable_expert_parallel=True"   --tasks gsm8k   --batch_size auto   --num_fewshot 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9492|±  | 0.006|
|     |       |strict-match    |     5|exact_match||0.9492|±  | 0.006|

# Nodeepgemm
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match||0.9530|±  |0.0058|
|     |       |strict-match    |     5|exact_match||0.9522|±  |0.0059|

Another question is about E2E accuracy: these are the results we get on vLLM; do they match your internal results?

@RayWang96 RayWang96 closed this Jul 18, 2025
@RayWang96 RayWang96 force-pushed the multi_arch_support branch from 61f65d6 to 8dfa329 on July 18, 2025 03:08
@RayWang96 RayWang96 reopened this Jul 18, 2025
@RayWang96 RayWang96 changed the title Draft: Add more GPU architectures support Add more GPU architectures support Jul 18, 2025
@RayWang96
Collaborator Author

We've added Hopper support; it's ready to be merged into the main branch, @LyricZhao.

@LyricZhao LyricZhao merged commit 9da4a23 into deepseek-ai:main Jul 18, 2025
smarterclayton added a commit to smarterclayton/DeepGEMM that referenced this pull request Jul 18, 2025
Anonymous cloning is more difficult with git@ (it requires per user
SSH keys), which makes CI automation more difficult.

This commit reverts to using https:// as it was before deepseek-ai#112
LyricZhao pushed a commit that referenced this pull request Jul 21, 2025
Anonymous cloning is more difficult with git@ (it requires per user
SSH keys), which makes CI automation more difficult.

This commit reverts to using https:// as it was before #112