[Bug] Qwen2.5-VL-7B-Instruct-AWQ inference fails on Ascend 910B

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

I am using the Docker image from Aliyun. Running an unquantized model such as Qwen2.5-VL-7B-Instruct works, but when I switch to the AWQ series, inference fails.

My command:
lmdeploy serve api_server \
  --backend turbomind \
  --device ascend \
  --eager-mode \
  --server-port 12000 \
  --tp 4 \
  --max-batch-size 32 \
  --cache-max-entry-count 0.6 \
  --cache-block-seq-len 64 \
  --model-format awq \
  /root/Qwen2.5-VL-7B-Instruct

Error information:
2025-05-07 07:55:31,127 - lmdeploy - ERROR - model_agent.py:391 - Task failed
Traceback (most recent call last):
File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 386, in _on_finish_callback
task.result()
File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 374, in _async_loop_background
await self._async_step_background(
File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 322, in _async_step_background
output = await self._async_model_forward(inputs,
File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 243, in _async_model_forward
ret = await __forward(inputs)
File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 220, in __forward
return await self.async_forward(inputs, swap_in_map=swap_in_map, swap_out_map=swap_out_map)
File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 538, in async_forward
output = self._forward_impl(inputs, swap_in_map=swap_in_map, swap_out_map=swap_out_map)
File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 521, in _forward_impl
output = model_forward(
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 75, in model_forward
output = model(**input_dict)
File "/opt/lmdeploy/lmdeploy/pytorch/backends/graph_runner.py", line 24, in __call__
return self.model(**kwargs)
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/lmdeploy/lmdeploy/pytorch/models/qwen2_5_vl.py", line 439, in forward
hidden_states = self.model(
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/lmdeploy/lmdeploy/pytorch/models/qwen2_vl.py", line 295, in forward
hidden_states, residual = decoder_layer(
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/lmdeploy/lmdeploy/pytorch/models/qwen2_vl.py", line 214, in forward
hidden_states = self.self_attn(
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/lmdeploy/lmdeploy/pytorch/models/qwen2_vl.py", line 96, in forward
qkv_states = self.qkv_proj(hidden_states)
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/opt/lmdeploy/lmdeploy/pytorch/nn/linear.py", line 512, in forward
out = self.impl.forward(x, self.qweight, self.scales, self.qzeros, self.bias, False)
File "/opt/lmdeploy/lmdeploy/pytorch/backends/dlinfer/awq_modules.py", line 28, in forward
out = awq_linear(x, qweight, scales, qzeros, bias, all_reduce, self.group_size)
File "/opt/lmdeploy/lmdeploy/pytorch/kernels/dlinfer/awq_kernels.py", line 15, in awq_linear
return ext_ops.weight_quant_matmul(x.squeeze(0),
File "/usr/local/python3.10.5/lib/python3.10/site-packages/dlinfer/ops/llm.py", line 542, in weight_quant_matmul
return vendor_ops_registry["weight_quant_matmul"](
File "/usr/local/python3.10.5/lib/python3.10/site-packages/dlinfer/vendor/ascend/torch_npu_ops.py", line 404, in weight_quant_matmul
return torch.ops.npu.npu_weight_quant_batchmatmul(
File "/usr/local/python3.10.5/lib/python3.10/site-packages/torch/_ops.py", line 854, in __call__
return self._op(*args, **(kwargs or {}))
RuntimeError: call aclnnWeightQuantBatchMatmulV2 failed, detail:EZ1001: [PID: 38901] 2025-05-07-07:55:31.121.354 antiquantScale's dtype must be DT_UINT64 or DT_INT64 when antiquantOffset's dtype is DT_INT32, actual antiquantScale's dtype is [DT_FLOAT16].
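
To make the constraint in the last line easier to read, here is a minimal sketch of the dtype rule exactly as the error message states it. The helper name `check_antiquant_dtypes` and the string dtype names are hypothetical and only for illustration; this is not code from dlinfer, torch_npu, or CANN, and it covers only the one rule quoted above.

```python
from typing import Optional

# Scale dtypes the error message says are accepted when the offset is DT_INT32.
ALLOWED_SCALE_FOR_INT32_OFFSET = {"DT_UINT64", "DT_INT64"}


def check_antiquant_dtypes(scale_dtype: str, offset_dtype: str) -> Optional[str]:
    """Hypothetical helper: return a message if the pair breaks the quoted rule.

    Only the rule from the traceback is encoded. Any other dtype combination
    returns None because the message says nothing about it.
    """
    if offset_dtype == "DT_INT32" and scale_dtype not in ALLOWED_SCALE_FOR_INT32_OFFSET:
        return (
            "antiquantScale must be DT_UINT64 or DT_INT64 when antiquantOffset "
            f"is DT_INT32, got {scale_dtype}"
        )
    return None


# The combination reported in the traceback: float16 scales with int32 zeros.
print(check_antiquant_dtypes("DT_FLOAT16", "DT_INT32"))
```

In the failing call, `scales` reaches the kernel as float16 while `qzeros` is int32, which is the combination the operator rejects.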

Reproduction

lmdeploy serve api_server \
  --backend turbomind \
  --device ascend \
  --eager-mode \
  --server-port 12000 \
  --tp 4 \
  --max-batch-size 32 \
  --cache-max-entry-count 0.6 \
  --cache-block-seq-len 64 \
  --model-format awq \
  /root/Qwen2.5-VL-7B-Instruct
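
Since the model path above does not carry an AWQ suffix, it may help whoever reproduces this to confirm what quantization metadata the directory actually declares. The sketch below assumes the usual Hugging Face layout, where an AWQ checkpoint has a `quantization_config` entry in `config.json`; the function name `read_quant_config` is hypothetical and not part of lmdeploy.

```python
import json
from pathlib import Path
from typing import Optional


def read_quant_config(model_dir: str) -> Optional[dict]:
    """Return the `quantization_config` dict from config.json, or None.

    None means either config.json is missing or it declares no quantization.
    """
    config_path = Path(model_dir) / "config.json"
    if not config_path.is_file():
        return None
    with config_path.open(encoding="utf-8") as f:
        config = json.load(f)
    return config.get("quantization_config")


# Path taken from the reproduction command; adjust to the local checkout.
quant = read_quant_config("/root/Qwen2.5-VL-7B-Instruct")
print("quantization_config:", quant)
```

If this prints `None`, the directory is probably not an AWQ checkpoint, and `--model-format awq` would not match the weights on disk.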

Environment

lmdeploy check_env
[W compiler_depend.ts:615] Warning: expandable_segments currently defaults to false. You can enable this feature by `export PYTORCH_NPU_ALLOC_CONF = expandable_segments:True`. (function operator())
sys.platform: linux
Python: 3.10.5 (main, Mar 24 2025, 07:28:13) [GCC 9.4.0]
CUDA available: False
MUSA available: False
numpy_random_seed: 2147483648
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
PyTorch: 2.3.1
PyTorch compiling details: PyTorch built with:
  - GCC 10.2
  - C++ Version: 201703
  - Intel(R) MKL-DNN v3.3.6 (Git Hash 86e6af5974177e513fd3fee58425e1063e7f1361)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: NO AVX
  - Build settings: BLAS_INFO=open, BUILD_TYPE=Release, CXX_COMPILER=/opt/rh/devtoolset-10/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=open, TORCH_VERSION=2.3.1, USE_CUDA=OFF, USE_CUDNN=OFF, USE_CUSPARSELT=OFF, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,

TorchVision: 0.18.1
LMDeploy: 0.7.3+
transformers: 4.51.3
gradio: Not Found
fastapi: 0.115.12
pydantic: 2.11.3
triton: Not Found
