Checklist
- 1. I have searched related issues but cannot get the expected help.
- 2. The bug has not been fixed in the latest version.
- 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I can run Qwen3-235B-A22B normally, but I cannot launch the official FP8 checkpoint (Qwen3-235B-A22B-FP8) no matter what I try.
Reproduction
lmdeploy serve api_server Qwen3-235B-A22B-FP8 --server-port 8811 --cache-max-entry-count 0.9 --tp 8 --log-level INFO --backend pytorch
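For reference, the equivalent launch through the Python API would be roughly as follows. This is a minimal sketch, not tested separately: it keeps only the options from the command above and assumes the local path "Qwen3-235B-A22B-FP8" points at the official FP8 checkpoint.

# sketch: same launch via the lmdeploy Python API instead of the CLI
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    'Qwen3-235B-A22B-FP8',                 # assumed local path to the FP8 checkpoint
    backend_config=PytorchEngineConfig(
        tp=8,                              # tensor parallel across the 8 H20 GPUs
        cache_max_entry_count=0.9,         # same as --cache-max-entry-count 0.9
    ),
)
print(pipe(['Hello']))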
Environment
sys.platform: linux
Python: 3.10.12 (main, Feb 4 2025, 14:57:36) [GCC 11.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: x86_64-linux-gnu-gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
PyTorch: 2.5.1+cu121
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v3.5.3 (Git Hash 66f0cb9eb66affd2da3bf5f8d897376f04aae6af)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX512
- CUDA Runtime 12.1
- NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 90.1 (built against CUDA 12.4)
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=9.1.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.5.1, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF,
TorchVision: 0.20.1+cu121
LMDeploy: 0.8.0+
transformers: 4.51.0
gradio: 5.22.0
fastapi: 0.115.11
pydantic: 2.10.6
triton: 3.1.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX SYS SYS SYS SYS SYS SYS SYS 0-383 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS PIX PHB SYS SYS SYS SYS SYS 0-383 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS PHB PIX SYS SYS SYS SYS SYS 0-383 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS SYS PIX SYS SYS SYS SYS 0-383 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS PIX SYS SYS SYS 0-383 0 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS PIX SYS SYS 0-383 0 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS SYS PIX PHB 0-383 0 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS SYS PHB PIX 0-383 0 N/A
NIC0 PIX SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS SYS
NIC1 SYS PIX PHB SYS SYS SYS SYS SYS SYS X PHB SYS SYS SYS SYS SYS
NIC2 SYS PHB PIX SYS SYS SYS SYS SYS SYS PHB X SYS SYS SYS SYS SYS
NIC3 SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS
NIC5 SYS SYS SYS SYS SYS PIX SYS SYS SYS SYS SYS SYS SYS X SYS SYS
NIC6 SYS SYS SYS SYS SYS SYS PIX PHB SYS SYS SYS SYS SYS SYS X PHB
NIC7 SYS SYS SYS SYS SYS SYS PHB PIX SYS SYS SYS SYS SYS SYS PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_bond_0
NIC1: mlx5_bond_1
NIC2: mlx5_bond_2
NIC3: mlx5_bond_3
NIC4: mlx5_bond_4
NIC5: mlx5_bond_5
NIC6: mlx5_bond_6
NIC7: mlx5_bond_7
Error traceback
2025-05-13 11:22:35,063 - lmdeploy - INFO - async_engine.py:259 - input backend=pytorch, backend_config=PytorchEngineConfig(dtype='auto', tp=8, dp=1, dp_rank=0, ep=1, session_len=None, max_batch_size=128, cache_max_entry_count=0.9, prefill_interval=16, block_size=64, num_cpu_blocks=0, num_gpu_blocks=0, adapters=None, max_prefill_token_num=8192, thread_safe=False, enable_prefix_caching=False, device_type='cuda', eager_mode=False, custom_module_map=None, download_dir=None, revision=None, quant_policy=0, distributed_executor_backend=None, enable_microbatch=False)
2025-05-13 11:22:35,063 - lmdeploy - INFO - async_engine.py:260 - input chat_template_config=None
2025-05-13 11:22:35,065 - lmdeploy - INFO - async_engine.py:269 - updated chat_template_onfig=ChatTemplateConfig(model_name='qwen', system=None, meta_instruction=None, eosys=None, user=None, eoh=None, assistant=None, eoa=None, tool=None, eotool=None, separator=None, capability=None, stop_words=None)
2025-05-13 11:22:36,282 - lmdeploy - WARNING - transformers.py:22 - LMDeploy requires transformers version: [4.33.0 ~ 4.49.0], but found version: 4.51.0
2025-05-13 11:22:36,374 - lmdeploy - INFO - __init__.py:81 - Build <ray> executor.
2025-05-13 11:22:37,113 - lmdeploy - INFO - ray_executor.py:247 - Init ray cluster.
2025-05-13 11:22:37,140 INFO worker.py:1660 -- Connecting to existing Ray cluster at address: 10.126.218.17:8192...
2025-05-13 11:22:37,151 INFO worker.py:1852 -- Connected to Ray cluster.
2025-05-13 11:22:37,306 - lmdeploy - INFO - ray_executor.py:275 - Init ray workers.
2025-05-13 11:22:37,382 - lmdeploy - INFO - ray_executor.py:281 - Init distributed environment by device.
2025-05-13 11:22:39,991 - lmdeploy - INFO - ray_executor.py:284 - Init distributed process group.
(RayWorkerWrapper pid=24334) 2025-05-13 11:22:39,993 - lmdeploy - INFO - dist_utils.py:29 - MASTER_ADDR=10.126.218.17, MASTER_PORT=60933
2025-05-13 11:22:42,297 - lmdeploy - INFO - ray_executor.py:294 - Warming up distribute environment, this might take long time, please waiting...
2025-05-13 11:23:02,798 - lmdeploy - INFO - base.py:152 - Building Model.
Loading weights from safetensors: 0%| | 0/48 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/py3/bin/lmdeploy", line 8, in <module>
sys.exit(run())
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/entrypoint.py", line 39, in run
args.run(args)
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/cli/serve.py", line 333, in api_server
run_api_server(args.model_path,
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/serve/openai/api_server.py", line 1121, in serve
VariableInterface.async_engine = pipeline_class(model_path=model_path,
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 279, in __init__
self._build_pytorch(model_path=model_path, backend_config=backend_config, **kwargs)
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/serve/async_engine.py", line 341, in _build_pytorch
self.engine = Engine(model_path=model_path, tokenizer=self.tokenizer, engine_config=backend_config)
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/engine.py", line 335, in __init__
self.executor.init()
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/executor/base.py", line 153, in init
self.build_model()
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/executor/ray_executor.py", line 311, in build_model
self.collective_rpc('build_model')
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/executor/ray_executor.py", line 307, in collective_rpc
return ray.get([getattr(worker, method).remote(*args, **kwargs) for worker in self.workers], timeout=timeout)
File "/opt/py3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
return fn(*args, **kwargs)
File "/opt/py3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/opt/py3/lib/python3.10/site-packages/ray/_private/worker.py", line 2782, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
File "/opt/py3/lib/python3.10/site-packages/ray/_private/worker.py", line 929, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::RayWorkerWrapper.build_model() (pid=26688, ip=10.126.218.17, actor_id=867d5462c52f82ceac0930ba06000000, repr=<lmdeploy.pytorch.engine.executor.ray_executor.RayWorkerWrapper object at 0x7fb43870a410>)
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/executor/base_worker.py", line 98, in build_model
self.model_agent.build_model()
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 621, in build_model
self._build_model()
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/engine/model_agent.py", line 612, in _build_model
load_model_weights(patched_model, model_path, device=device)
File "/opt/py3/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/weight_loader/model_weight_loader.py", line 166, in load_model_weights
loader.load_model_weights(model, device=device)
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/weight_loader/model_weight_loader.py", line 157, in load_model_weights
model.load_weights(weights_iterator)
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/models/qwen3_moe.py", line 511, in load_weights
self._load_weight_experts(name, loaded_weight, params_dict, expert_params_mapping=expert_params_mapping)
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/models/qwen3_moe.py", line 471, in _load_weight_experts
load_weight(param, loaded_weight, expert_id=expert_id, shard_id=shard_id)
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/weight_loader/model_weight_loader.py", line 20, in load_weight
param.weight_loader(param, loaded_weight, **kwargs)
File "/opt/py3/lib/python3.10/site-packages/lmdeploy/pytorch/nn/moe.py", line 470, in weight_loader_scale_tp
param_data.copy_(weight)
RuntimeError: output with shape [1, 32] doesn't match the broadcast shape [2, 32]
Loading weights from safetensors: 0%| | 0/48 [00:01<?, ?it/s]
(RayWorkerWrapper pid=27612) 2025-05-13 11:22:39,993 - lmdeploy - INFO - dist_utils.py:29 - MASTER_ADDR=10.126.218.17, MASTER_PORT=60933 [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
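For context, the final RuntimeError is a plain tensor copy failure in weight_loader_scale_tp (moe.py:470). The sketch below only reconstructs the failing shapes from the traceback; the interpretation that the destination parameter is this rank's shard of the FP8 weight scales under tp=8 is my assumption, not something confirmed in the logs.

# sketch: the shape mismatch reported by param_data.copy_(weight)
import torch

param_data = torch.empty(1, 32)  # shard allocated for this rank (assumed tp split of the scales)
weight = torch.empty(2, 32)      # scale tensor as read from the FP8 safetensors
param_data.copy_(weight)
# RuntimeError: output with shape [1, 32] doesn't match the broadcast shape [2, 32]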