[WIP] Introduce flashinfer allreduce in MoE TP by yuan-luo · Pull Request #7946 · sgl-project/sglang · GitHub

Conversation

yuan-luo
Collaborator

Motivation

Introduce FlashInfer's custom allreduce into the SGLang Fused MoE layer, replacing torch.distributed.all_reduce.
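
As a rough illustration of where this replacement lands at the call site (a minimal sketch, not the PR's exact code: `moe_all_reduce`, `flashinfer_available`, and `_MAX_TOKENS_FOR_CUSTOM_AR` are illustrative names, and the real layer reduces over the TP process group rather than the default group):

```python
import torch
import torch.distributed as dist

# Threshold from the PR summary: only small batches go through the custom kernel.
_MAX_TOKENS_FOR_CUSTOM_AR = 128  # illustrative constant name


def moe_all_reduce(
    final_hidden_states: torch.Tensor,
    enable_flashinfer_allreduce: bool,
    flashinfer_available: bool,
) -> torch.Tensor:
    num_tokens = final_hidden_states.shape[0]
    if (
        flashinfer_available
        and enable_flashinfer_allreduce
        and num_tokens <= _MAX_TOKENS_FOR_CUSTOM_AR
    ):
        # Wrapper added by this PR in python/sglang/srt/layers/flashinfer_comm.py.
        from sglang.srt.layers.flashinfer_comm import flashinfer_allreduce

        return flashinfer_allreduce(final_hidden_states)
    # Default path: NCCL all-reduce (the real layer reduces over the TP group).
    dist.all_reduce(final_hidden_states)
    return final_hidden_states
```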

Modifications

Checklist

Contributor
@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a performance optimization for Mixture-of-Experts (MoE) layers in SGLang by integrating FlashInfer's custom allreduce operation. This change aims to replace the default PyTorch distributed all-reduce for specific scenarios, particularly for smaller token counts, to potentially achieve better communication efficiency and throughput in distributed MoE models.

Highlights

  • FlashInfer Allreduce Integration: A new module (python/sglang/srt/layers/flashinfer_comm.py) has been introduced to encapsulate FlashInfer's custom allreduce operation, including necessary workspace management for inter-process communication (IPC).
  • Conditional MoE Optimization: The Fused Mixture-of-Experts (MoE) layer (python/sglang/srt/layers/moe/fused_moe_triton/layer.py) is updated to conditionally utilize this new FlashInfer allreduce for final_hidden_states. This optimization is applied when FlashInfer is available, the new feature flag is enabled, and the input token count is 128 or less, potentially improving performance over standard torch.distributed.all_reduce.
  • New Configuration Option: A new server argument --enable-flashinfer-allreduce has been added to python/sglang/srt/server_args.py, allowing users to enable or disable this specific FlashInfer allreduce optimization for MoE layers.
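
For reference, wiring such a flag typically looks like the following. This is a minimal sketch assuming server_args.py follows its usual dataclass-plus-argparse pattern; only the flag name --enable-flashinfer-allreduce is taken from this PR.

```python
import argparse
from dataclasses import dataclass


@dataclass
class ServerArgs:  # illustrative subset of the real class
    enable_flashinfer_allreduce: bool = False

    @staticmethod
    def add_cli_args(parser: argparse.ArgumentParser) -> None:
        parser.add_argument(
            "--enable-flashinfer-allreduce",
            action="store_true",
            help="Use FlashInfer's custom all-reduce for MoE TP instead of "
            "torch.distributed.all_reduce (small token counts only).",
        )
```

Once registered, the flag is passed on the command line exactly as in the launch logs further down (python3 -m sglang.launch_server ... --enable-flashinfer-allreduce).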
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@yuan-luo yuan-luo changed the title Introduce flashinfer allreduce Introduce flashinfer allreduce in MoE TP Jul 11, 2025
Contributor
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces FlashInfer's custom all-reduce kernel to replace torch.distributed.all_reduce for MoE layers, aiming for performance improvements. The changes include a new file flashinfer_comm.py with the wrapper implementation, and modifications to fused_moe_triton/layer.py to conditionally use this new kernel. The server arguments are also updated to control this feature.

My review has identified several critical issues in the new flashinfer_comm.py file that need to be addressed. These include function signature mismatches, incorrect function calls that would lead to runtime errors, and the use of random dummy data in the core computation which would produce incorrect results. There is also a potential resource leak in the workspace manager. I have provided specific suggestions to fix these issues.
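
On the workspace-leak point specifically, the shape of the fix is to give the IPC workspace a single owner that creates it once and releases it exactly once. A minimal sketch of that pattern (not the PR's implementation; `_create_ipc_workspace` and `_destroy_ipc_workspace` are hypothetical stand-ins for whatever FlashInfer calls flashinfer_comm.py wraps):

```python
import atexit
from typing import Any, Optional


def _create_ipc_workspace(rank: int, world_size: int, max_bytes: int) -> Any:
    """Hypothetical stand-in for the FlashInfer IPC workspace allocation call."""
    return {"rank": rank, "world_size": world_size, "max_bytes": max_bytes}


def _destroy_ipc_workspace(workspace: Any) -> None:
    """Hypothetical stand-in for the matching teardown call."""
    workspace.clear()


class FlashInferWorkspaceManager:
    """Owns the workspace handles and releases them exactly once."""

    def __init__(self) -> None:
        self._workspace: Optional[Any] = None

    def initialize(self, rank: int, world_size: int, max_bytes: int) -> None:
        if self._workspace is not None:
            return  # idempotent: never allocate a second workspace for this rank
        self._workspace = _create_ipc_workspace(rank, world_size, max_bytes)
        atexit.register(self.shutdown)  # teardown even if shutdown() is never called

    def shutdown(self) -> None:
        if self._workspace is not None:
            _destroy_ipc_workspace(self._workspace)
            self._workspace = None
```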

@yuan-luo yuan-luo changed the title Introduce flashinfer allreduce in MoE TP [WIP] Introduce flashinfer allreduce in MoE TP Jul 11, 2025
@yuan-luo
Collaborator Author

Please note that this PR needs to run on B200; otherwise it will encounter this error.

[3/4] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -O3 -std=c++17 --threads=4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -gencode=arch=compute_100a,code=sm_100a -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o 
FAILED: trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1016\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -O3 -std=c++17 --threads=4 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -gencode=arch=compute_100a,code=sm_100a -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o 
nvcc fatal   : Unsupported gpu architecture 'compute_100a'
ninja: build stopped: subcommand failed.

Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 
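
The failing build targets -gencode=arch=compute_100a, i.e. sm_100a (B200-class Blackwell), which older nvcc toolchains and non-Blackwell GPUs reject. A small runtime guard along these lines could keep the feature from being selected on unsupported hardware (a sketch, not part of the PR as posted; the >= 10 cutoff is an assumption that only sm_100+ has been verified):

```python
import torch


def supports_flashinfer_allreduce() -> bool:
    """Return True only on GPUs whose arch the JIT'ed kernels are built for."""
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    # Assumption: restrict to sm_100+ (B200) until other architectures are verified.
    return major >= 10
```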

@yuan-luo yuan-luo force-pushed the flashinfer_allreduce branch from c2ec629 to 8fd70ad Compare July 11, 2025 07:58
@yuan-luo yuan-luo changed the title [WIP] Introduce flashinfer allreduce in MoE TP Introduce flashinfer allreduce in MoE TP Jul 11, 2025
@yuan-luo yuan-luo changed the title Introduce flashinfer allreduce in MoE TP [B200] Introduce flashinfer allreduce in MoE TP Jul 11, 2025
@yuan-luo yuan-luo force-pushed the flashinfer_allreduce branch 3 times, most recently from 54b40de to 15388b1 Compare July 11, 2025 14:51
@yuan-luo
Collaborator Author

Encountered a problem on B200:

➜  sglang python3 -m sglang.launch_server --model Qwen/Qwen3-30B-A3B --tp-size 8 --port 30000 --enable-flashinfer-allreduce
[2025-07-12 04:13:28] server_args=ServerArgs(model_path='Qwen/Qwen3-30B-A3B', tokenizer_path='Qwen/Qwen3-30B-A3B', tokenizer_mode='auto', skip_tokenizer_init=False, skip_server_warmup=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, quantization_param_path=None, context_length=None, device='cuda', served_model_name='Qwen/Qwen3-30B-A3B', chat_template=None, completion_template=None, is_embedding=False, enable_multimodal=None, revision=None, hybrid_kvcache_ratio=None, impl='auto', host='127.0.0.1', port=30000, nccl_port=None, mem_fraction_static=0.821, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=16384, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=582992402, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=0, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, bucket_time_to_first_token=None, bucket_e2e_request_latency=None, bucket_inter_token_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_moe=False, enable_flashinfer_allreduce_fusion=False, enable_flashinfer_allreduce=True, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_nccl_nvls=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, disable_overlap_cg_plan=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, 
triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='', flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, enable_triton_kernel_moe=False, warmups=None, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False)
[2025-07-12 04:13:35 TP0] Attention backend not set. Use flashinfer backend by default.
[2025-07-12 04:13:35 TP0] Init torch distributed begin.
[2025-07-12 04:13:43 TP0] sglang is using nccl==2.27.5
[2025-07-12 04:13:50 TP0] Init torch distributed ends. mem usage=1.47 GB
[2025-07-12 04:13:51 TP0] Load weight begin. avail mem=176.28 GB
[2025-07-12 04:13:51 TP3] The weight of LmHead is not packed
[2025-07-12 04:13:51 TP2] The weight of LmHead is not packed
[2025-07-12 04:13:51 TP6] The weight of LmHead is not packed
[2025-07-12 04:13:51 TP5] The weight of LmHead is not packed
[2025-07-12 04:13:51 TP7] The weight of LmHead is not packed
[2025-07-12 04:13:51 TP3] Using model weights format ['*.safetensors']
[2025-07-12 04:13:51 TP2] Using model weights format ['*.safetensors']
[2025-07-12 04:13:51 TP6] Using model weights format ['*.safetensors']
[2025-07-12 04:13:51 TP0] The weight of LmHead is not packed
[2025-07-12 04:13:51 TP7] Using model weights format ['*.safetensors']
[2025-07-12 04:13:51 TP5] Using model weights format ['*.safetensors']
[2025-07-12 04:13:52 TP4] The weight of LmHead is not packed
[2025-07-12 04:13:52 TP1] The weight of LmHead is not packed
[2025-07-12 04:13:52 TP1] Using model weights format ['*.safetensors']
[2025-07-12 04:13:52 TP4] Using model weights format ['*.safetensors']
[2025-07-12 04:13:52 TP0] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/16 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   6% Completed | 1/16 [00:00<00:05,  2.51it/s]
Loading safetensors checkpoint shards:  12% Completed | 2/16 [00:00<00:06,  2.25it/s]
Loading safetensors checkpoint shards:  19% Completed | 3/16 [00:01<00:05,  2.20it/s]
Loading safetensors checkpoint shards:  25% Completed | 4/16 [00:01<00:05,  2.10it/s]
Loading safetensors checkpoint shards:  31% Completed | 5/16 [00:02<00:05,  2.10it/s]
Loading safetensors checkpoint shards:  38% Completed | 6/16 [00:02<00:03,  2.81it/s]
Loading safetensors checkpoint shards:  44% Completed | 7/16 [00:02<00:03,  2.73it/s]
Loading safetensors checkpoint shards:  50% Completed | 8/16 [00:03<00:03,  2.60it/s]
Loading safetensors checkpoint shards:  56% Completed | 9/16 [00:03<00:02,  2.52it/s]
Loading safetensors checkpoint shards:  62% Completed | 10/16 [00:04<00:02,  2.49it/s]
Loading safetensors checkpoint shards:  69% Completed | 11/16 [00:04<00:02,  2.48it/s]
Loading safetensors checkpoint shards:  75% Completed | 12/16 [00:04<00:01,  2.48it/s]
Loading safetensors checkpoint shards:  81% Completed | 13/16 [00:05<00:01,  2.50it/s]
Loading safetensors checkpoint shards:  88% Completed | 14/16 [00:05<00:00,  2.53it/s]
Loading safetensors checkpoint shards:  94% Completed | 15/16 [00:06<00:00,  2.52it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:06<00:00,  2.59it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:06<00:00,  2.48it/s]

[2025-07-12 04:13:59 TP0] Load weight end. type=Qwen3MoeForCausalLM, dtype=torch.bfloat16, avail mem=169.09 GB, mem usage=7.19 GB.
[2025-07-12 04:14:03 TP6] KV Cache is allocated. #tokens: 6006749, K size: 68.74 GB, V size: 68.74 GB
[2025-07-12 04:14:03 TP0] KV Cache is allocated. #tokens: 6006749, K size: 68.74 GB, V size: 68.74 GB
[2025-07-12 04:14:03 TP0] Memory pool end. avail mem=30.79 GB
[2025-07-12 04:14:03 TP4] KV Cache is allocated. #tokens: 6006749, K size: 68.74 GB, V size: 68.74 GB
[2025-07-12 04:14:03 TP7] KV Cache is allocated. #tokens: 6006749, K size: 68.74 GB, V size: 68.74 GB
[2025-07-12 04:14:03 TP1] KV Cache is allocated. #tokens: 6006749, K size: 68.74 GB, V size: 68.74 GB
[2025-07-12 04:14:03 TP5] KV Cache is allocated. #tokens: 6006749, K size: 68.74 GB, V size: 68.74 GB
[2025-07-12 04:14:03 TP3] KV Cache is allocated. #tokens: 6006749, K size: 68.74 GB, V size: 68.74 GB
[2025-07-12 04:14:03 TP2] KV Cache is allocated. #tokens: 6006749, K size: 68.74 GB, V size: 68.74 GB
[2025-07-12 04:14:04 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=30.25 GB
[2025-07-12 04:14:04 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
Capturing batches (bs=512 avail_mem=30.16 GB):   0%|                                                                                                                                                                        | 0/51 [00:00<?, ?it/s][2025-07-12 04:14:05 TP3] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=96,device_name=NVIDIA_B200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-07-12 04:14:05 TP2] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=96,device_name=NVIDIA_B200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-07-12 04:14:05 TP7] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=96,device_name=NVIDIA_B200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-07-12 04:14:05 TP5] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=96,device_name=NVIDIA_B200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-07-12 04:14:05 TP6] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=96,device_name=NVIDIA_B200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-07-12 04:14:05 TP0] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=96,device_name=NVIDIA_B200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-07-12 04:14:05 TP1] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=96,device_name=NVIDIA_B200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2025-07-12 04:14:05 TP4] Using default MoE kernel config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_3_1/E=128,N=96,device_name=NVIDIA_B200.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Capturing batches (bs=128 avail_mem=23.86 GB):  63%|███████████████████████████████████████████████████████████████████████████████████████████████████▊                                                           | 32/51 [00:25<00:14,  1.35it/s]/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
rank 2 allocated ipc_handles: [['0x772d1f800000', '0x772d21800000', '0x775239800000', '0x772d23800000', '0x772d25800000', '0x772d27800000', '0x772d29800000', '0x772d2b800000'], ['0x772d2f800000', '0x772d31800000', '0x772d2d800000', '0x772d33800000', '0x772d35800000', '0x772d37800000', '0x772d39800000', '0x772d3b800000'], ['0x772d1dc00000', '0x772d1de00000', '0x772d1da00000', '0x772d63c00000', '0x772d63e00000', '0x772e81e00000', '0x772e83c00000', '0x772e83e00000'], ['0x772e93e00000', '0x772ea3c00000', '0x772e93c00000', '0x772ea3e00000', '0x772ea5c00000', '0x772ea5e00000', '0x772ecbc00000', '0x772ecbe00000'], ['0x772fa5c00000', '0x773001c00000', '0x772f1fc00000', '0x77305dc00000', '0x7730b9c00000', '0x773115c00000', '0x773171c00000', '0x7731cdc00000'], ['0x773285c00000', '0x7732e1c00000', '0x773229c00000', '0x77333dc00000', '0x773399c00000', '0x7733f5c00000', '0x773451c00000', '0x7734adc00000'], ['0x773565c00000', '0x7735c1c00000', '0x773509c00000', '0x77361dc00000', '0x773679c00000', '0x7736d5c00000', '0x773731c00000', '0x77378dc00000']]
rank 6 allocated ipc_handles: [['0x7f43af800000', '0x7f43b1800000', '0x7f43b3800000', '0x7f43b5800000', '0x7f43b7800000', '0x7f43b9800000', '0x7f68c9800000', '0x7f43bb800000'], ['0x7f43bf800000', '0x7f43c1800000', '0x7f43c3800000', '0x7f43c5800000', '0x7f43c7800000', '0x7f43c9800000', '0x7f43bd800000', '0x7f43cb800000'], ['0x7f43adc00000', '0x7f43ade00000', '0x7f43f3c00000', '0x7f43f3e00000', '0x7f4511e00000', '0x7f4513c00000', '0x7f43ada00000', '0x7f4513e00000'], ['0x7f4523e00000', '0x7f4533c00000', '0x7f4533e00000', '0x7f4535c00000', '0x7f4535e00000', '0x7f455bc00000', '0x7f4523c00000', '0x7f455be00000'], ['0x7f4635c00000', '0x7f4691c00000', '0x7f46edc00000', '0x7f4749c00000', '0x7f47a5c00000', '0x7f4801c00000', '0x7f45afc00000', '0x7f485dc00000'], ['0x7f4915c00000', '0x7f4971c00000', '0x7f49cdc00000', '0x7f4a29c00000', '0x7f4a85c00000', '0x7f4ae1c00000', '0x7f48b9c00000', '0x7f4b3dc00000'], ['0x7f4bf5c00000', '0x7f4c51c00000', '0x7f4cadc00000', '0x7f4d09c00000', '0x7f4d65c00000', '0x7f4dc1c00000', '0x7f4b99c00000', '0x7f4e1dc00000']]
rank 3 allocated ipc_handles: [['0x72ab55800000', '0x72ab57800000', '0x72ab59800000', '0x72d06f800000', '0x72ab5b800000', '0x72ab5d800000', '0x72ab5f800000', '0x72ab61800000'], ['0x72ab65800000', '0x72ab67800000', '0x72ab69800000', '0x72ab63800000', '0x72ab6b800000', '0x72ab6d800000', '0x72ab6f800000', '0x72ab71800000'], ['0x72ab53c00000', '0x72ab53e00000', '0x72abadc00000', '0x72ab53a00000', '0x72abade00000', '0x72acb7e00000', '0x72acb9c00000', '0x72acb9e00000'], ['0x72acc9e00000', '0x72acd9c00000', '0x72acd9e00000', '0x72acc9c00000', '0x72acdbc00000', '0x72acdbe00000', '0x72ad01c00000', '0x72ad01e00000'], ['0x72add9c00000', '0x72ae35c00000', '0x72ae91c00000', '0x72d02fc00000', '0x72aeedc00000', '0x72af49c00000', '0x72afa5c00000', '0x72b001c00000'], ['0x72b0b9c00000', '0x72b115c00000', '0x72b171c00000', '0x72b05dc00000', '0x72b1cdc00000', '0x72b229c00000', '0x72b285c00000', '0x72b2e1c00000'], ['0x72b399c00000', '0x72b3f5c00000', '0x72b451c00000', '0x72b33dc00000', '0x72b4adc00000', '0x72b509c00000', '0x72b565c00000', '0x72b5c1c00000']]
rank 1 allocated ipc_handles: [['0x7b8f71800000', '0x7bb48b800000', '0x7b8f73800000', '0x7b8f75800000', '0x7b8f77800000', '0x7b8f79800000', '0x7b8f7b800000', '0x7b8f7d800000'], ['0x7b8f81800000', '0x7b8f7f800000', '0x7b8f83800000', '0x7b8f85800000', '0x7b8f87800000', '0x7b8f89800000', '0x7b8f8b800000', '0x7b8f8d800000'], ['0x7b8f6fc00000', '0x7b8f6fa00000', '0x7b8f6fe00000', '0x7b8fc9c00000', '0x7b8fc9e00000', '0x7b90d3e00000', '0x7b90d5c00000', '0x7b90d5e00000'], ['0x7b90e5e00000', '0x7b90e5c00000', '0x7b90f5c00000', '0x7b90f5e00000', '0x7b90f7c00000', '0x7b90f7e00000', '0x7b911dc00000', '0x7b911de00000'], ['0x7b91f5c00000', '0x7bb44bc00000', '0x7b9251c00000', '0x7b92adc00000', '0x7b9309c00000', '0x7b9365c00000', '0x7b93c1c00000', '0x7b941dc00000'], ['0x7b94d5c00000', '0x7b9479c00000', '0x7b9531c00000', '0x7b958dc00000', '0x7b95e9c00000', '0x7b9645c00000', '0x7b96a1c00000', '0x7b96fdc00000'], ['0x7b97b5c00000', '0x7b9759c00000', '0x7b9811c00000', '0x7b986dc00000', '0x7b98c9c00000', '0x7b9925c00000', '0x7b9981c00000', '0x7b99ddc00000']]
rank 0 allocated ipc_handles: [['0x79393d800000', '0x793c32400000', '0x791470000000', '0x791470800000', '0x791471000000', '0x791471800000', '0x79146e000000', '0x79146e800000'], ['0x79146f000000', '0x79146f800000', '0x791430000000', '0x791430800000', '0x791431000000', '0x791431800000', '0x79141a000000', '0x79141a800000'], ['0x79141b000000', '0x79141b200000', '0x79141b400000', '0x79141b600000', '0x79141b800000', '0x79141ba00000', '0x79141bc00000', '0x79141be00000'], ['0x791459e00000', '0x79157be00000', '0x79157dc00000', '0x79157de00000', '0x79158dc00000', '0x79158de00000', '0x79159dc00000', '0x79159de00000'], ['0x79159fc00000', '0x7915c5c00000', 
'0x79161dc00000', '0x7916a9c00000', '0x791705c00000', '0x791761c00000', '0x7917bdc00000', '0x791819c00000'], ['0x791875c00000', '0x7918d1c00000', '0x79192dc00000', '0x791989c00000', '0x7919e5c00000', '0x791a41c00000', '0x791a9dc00000', '0x791af9c00000'], ['0x791b55c00000', '0x791bb1c00000', '0x791c0dc00000', '0x791c69c00000', '0x791cc5c00000', '0x791d21c00000', '0x791d7dc00000', '0x791dd9c00000']]
rank 5 allocated ipc_handles: [['0x71fe9d800000', '0x71fe9f800000', '0x71fea1800000', '0x71fea3800000', '0x71fea5800000', '0x7223b7800000', '0x71fea7800000', '0x71fea9800000'], ['0x71fead800000', '0x71feaf800000', '0x71feb1800000', '0x71feb3800000', '0x71feb5800000', '0x71feab800000', '0x71feb7800000', '0x71feb9800000'], ['0x71fe9bc00000', '0x71fe9be00000', '0x71fef5c00000', '0x71fef5e00000', '0x71ffffe00000', '0x71fe9ba00000', '0x720001c00000', '0x720001e00000'], ['0x720011e00000', '0x720021c00000', '0x720021e00000', '0x720023c00000', '0x720023e00000', '0x720011c00000', '0x720049c00000', '0x720049e00000'], ['0x720121c00000', '0x72017dc00000', '0x7201d9c00000', '0x720235c00000', '0x720291c00000', '0x722377c00000', '0x7202edc00000', '0x720349c00000'], ['0x720401c00000', '0x72045dc00000', '0x7204b9c00000', '0x720515c00000', '0x720571c00000', '0x7203a5c00000', '0x7205cdc00000', '0x720629c00000'], ['0x7206e1c00000', '0x72073dc00000', '0x720799c00000', '0x7207f5c00000', '0x720851c00000', '0x720685c00000', '0x7208adc00000', '0x720909c00000']]
rank 7 allocated ipc_handles: [['0x73ef5b600000', '0x73ef9c000000', '0x73ef9c800000', '0x73ef9d000000', '0x73ef9d800000', '0x73ef9a000000', '0x73ef9a800000', '0x741459800000'], ['0x73ef9b800000', '0x73ef3e000000', '0x73ef3e800000', '0x73ef3f000000', '0x73ef3f800000', '0x73ef3c000000', '0x73ef3c800000', '0x73ef9b000000'], ['0x73ef3d200000', '0x73ef3d400000', '0x73ef3d600000', '0x73ef3d800000', '0x73ef3da00000', '0x73ef3dc00000', '0x73ef3de00000', '0x73ef3d000000'], ['0x73ef5de00000', '0x73ef5fe00000', '0x73ef61e00000', '0x73ef63e00000', '0x73ef65e00000', '0x73ef67e00000', '0x73ef69e00000', '0x73ef5be00000'], ['0x73f0a3c00000', '0x73f0b3c00000', '0x73f0c3c00000', '0x73f0c5c00000', '0x73f0ebc00000', '0x73f13fc00000', '0x73f1c5c00000', '0x73ef83c00000'], ['0x73f27dc00000', '0x73f2d9c00000', '0x73f335c00000', '0x73f391c00000', '0x73f3edc00000', '0x73f449c00000', '0x73f4a5c00000', '0x73f221c00000'], ['0x73f55dc00000', '0x73f5b9c00000', '0x73f615c00000', '0x73f671c00000', '0x73f6cdc00000', '0x73f729c00000', '0x73f785c00000', '0x73f501c00000']]
rank 4 allocated ipc_handles: [['0x74dc83800000', '0x74dc85800000', '0x74dc87800000', '0x74dc89800000', '0x75019d800000', '0x74dc8b800000', '0x74dc8d800000', '0x74dc8f800000'], ['0x74dc93800000', '0x74dc95800000', '0x74dc97800000', '0x74dc99800000', '0x74dc91800000', '0x74dc9b800000', '0x74dc9d800000', '0x74dc9f800000'], ['0x74dc81c00000', '0x74dc81e00000', '0x74dcc7c00000', '0x74dcc7e00000', '0x74dc81a00000', '0x74dde5e00000', '0x74dde7c00000', '0x74dde7e00000'], ['0x74ddf7e00000', '0x74de07c00000', '0x74de07e00000', '0x74de09c00000', '0x74ddf7c00000', '0x74de09e00000', '0x74de2fc00000', '0x74de2fe00000'], ['0x74df09c00000', '0x74df65c00000', '0x74dfc1c00000', '0x74e01dc00000', '0x74de83c00000', '0x74e079c00000', '0x74e0d5c00000', '0x74e131c00000'], ['0x74e1e9c00000', '0x74e245c00000', '0x74e2a1c00000', '0x74e2fdc00000', '0x74e18dc00000', '0x74e359c00000', '0x74e3b5c00000', '0x74e411c00000'], ['0x74e4c9c00000', '0x74e525c00000', '0x74e581c00000', '0x74e5ddc00000', '0x74e46dc00000', '0x74e639c00000', '0x74e695c00000', '0x74e6f1c00000']]
[2025-07-12 04:16:16.318] [info] lamportInitialize start: buffer: 0x772f1fc00000, size: 1048576
[2025-07-12 04:16:16.343] [info] lamportInitialize start: buffer: 0x72d02fc00000, size: 1048576
[2025-07-12 04:16:16.364] [info] lamportInitialize start: buffer: 0x7bb44bc00000, size: 1048576
[2025-07-12 04:16:16.376] [info] lamportInitialize start: buffer: 0x73ef83c00000, size: 1048576
[2025-07-12 04:16:16.377] [info] lamportInitialize start: buffer: 0x773229c00000, size: 1048576
[2025-07-12 04:16:16.377] [info] lamportInitialize start: buffer: 0x773509c00000, size: 1048576
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[2025-07-12 04:16:16.397] [info] lamportInitialize start: buffer: 0x7f45afc00000, size: 1048576
[2025-07-12 04:16:16.403] [info] lamportInitialize start: buffer: 0x72b05dc00000, size: 1048576
[2025-07-12 04:16:16.403] [info] lamportInitialize start: buffer: 0x72b33dc00000, size: 1048576
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[2025-07-12 04:16:16.419] [info] lamportInitialize start: buffer: 0x74de83c00000, size: 1048576
[2025-07-12 04:16:16.430] [info] lamportInitialize start: buffer: 0x7b9479c00000, size: 1048576
[2025-07-12 04:16:16.430] [info] lamportInitialize start: buffer: 0x7b9759c00000, size: 1048576
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[2025-07-12 04:16:16.444] [info] lamportInitialize start: buffer: 0x73f221c00000, size: 1048576
[2025-07-12 04:16:16.444] [info] lamportInitialize start: buffer: 0x73f501c00000, size: 1048576
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[2025-07-12 04:16:16.449] [info] lamportInitialize start: buffer: 0x722377c00000, size: 1048576
[2025-07-12 04:16:16.466] [info] lamportInitialize start: buffer: 0x7f48b9c00000, size: 1048576
[2025-07-12 04:16:16.466] [info] lamportInitialize start: buffer: 0x7f4b99c00000, size: 1048576
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[2025-07-12 04:16:16.474] [info] lamportInitialize start: buffer: 0x79159fc00000, size: 1048576
[2025-07-12 04:16:16.476] [info] lamportInitialize start: buffer: 0x74e18dc00000, size: 1048576
[2025-07-12 04:16:16.476] [info] lamportInitialize start: buffer: 0x74e46dc00000, size: 1048576
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[2025-07-12 04:16:16.518] [info] lamportInitialize start: buffer: 0x7203a5c00000, size: 1048576
[2025-07-12 04:16:16.518] [info] lamportInitialize start: buffer: 0x720685c00000, size: 1048576
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[2025-07-12 04:16:16.545] [info] lamportInitialize start: buffer: 0x791875c00000, size: 1048576
[2025-07-12 04:16:16.545] [info] lamportInitialize start: buffer: 0x791b55c00000, size: 1048576
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:4631: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
  warnings.warn(  # warn only once
[2025-07-12 04:16:16 TP0] FlashInfer workspace initialized for rank 0, world_size 8
[2025-07-12 04:16:16 TP1] FlashInfer workspace initialized for rank 1, world_size 8
[2025-07-12 04:16:16 TP2] FlashInfer workspace initialized for rank 2, world_size 8
[2025-07-12 04:16:16 TP3] FlashInfer workspace initialized for rank 3, world_size 8
[2025-07-12 04:16:16 TP7] FlashInfer workspace initialized for rank 7, world_size 8
[2025-07-12 04:16:16 TP6] FlashInfer workspace initialized for rank 6, world_size 8
[2025-07-12 04:16:16 TP5] FlashInfer workspace initialized for rank 5, world_size 8
[2025-07-12 04:16:16 TP4] FlashInfer workspace initialized for rank 4, world_size 8
Custom all-reduce configuration unsupported
Custom all-reduce configuration unsupported
Custom all-reduce configuration unsupported
Custom all-reduce configuration unsupported
Custom all-reduce configuration unsupported
Custom all-reduce configuration unsupported
Custom all-reduce configuration unsupported
Custom all-reduce configuration unsupported
Capturing batches (bs=128 avail_mem=23.86 GB):  63%|███████████████████████████████████████████████████████████████████████████████████████████████████▊                                                           | 32/51 [02:11<01:18,  4.12s/it]
[2025-07-12 04:16:16 TP3] Registering 3104 cuda graph addresses
[2025-07-12 04:16:16 TP2] Registering 3104 cuda graph addresses
[2025-07-12 04:16:16 TP5] Registering 3104 cuda graph addresses
[2025-07-12 04:16:16 TP7] Registering 3104 cuda graph addresses
[2025-07-12 04:16:16 TP6] Registering 3104 cuda graph addresses
[2025-07-12 04:16:16 TP0] Registering 3104 cuda graph addresses
[2025-07-12 04:16:16 TP4] Registering 3104 cuda graph addresses
[2025-07-12 04:16:16 TP1] Registering 3104 cuda graph addresses
[2025-07-12 04:16:16 TP3] Scheduler hit an exception: Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 336, in __init__
    self.capture()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 436, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 587, in capture_one_batch_size
    run_once()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 575, in run_once
    logits_output_or_pp_proxy_tensors = forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen3_moe.py", line 729, in forward
    hidden_states = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen2_moe.py", line 486, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen3_moe.py", line 611, in forward
    hidden_states = self.mlp(hidden_states, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen3_moe.py", line 166, in forward
    return self.forward_normal(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/models/qwen3_moe.py", line 183, in forward_normal
    final_hidden_states = self.experts(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 964, in forward
    final_hidden_states = flashinfer_allreduce(final_hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/layers/flashinfer_comm.py", line 157, in flashinfer_allreduce
    _flashinfer_comm.trtllm_custom_all_reduce(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/comm/trtllm_ar.py", line 665, in trtllm_custom_all_reduce
    get_trtllm_comm_module().trtllm_custom_all_reduce(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/comm/trtllm_ar.py", line 187, in trtllm_custom_all_reduce
    module.trtllm_custom_all_reduce(
  File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1158, in __call__
    return self._op(*args, **(kwargs or {}))
RuntimeError: Error in function 'customAllReduce' at /usr/local/lib/python3.10/dist-packages/flashinfer/data/include/flashinfer/comm/trtllm_allreduce.cuh:1703: Custom all-reduce configuration unsupported

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 2733, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, pp_rank, dp_rank)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/scheduler.py", line 339, in __init__
    self.tp_worker = TpWorkerClass(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 66, in __init__
    self.worker = TpModelWorker(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/managers/tp_worker.py", line 81, in __init__
    self.model_runner = ModelRunner(
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 233, in __init__
    self.initialize(min_per_gpu_memory)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 314, in initialize
    self.init_cuda_graphs()
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/model_runner.py", line 1373, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/usr/local/lib/python3.10/dist-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 338, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Error in function 'customAllReduce' at /usr/local/lib/python3.10/dist-packages/flashinfer/data/include/flashinfer/comm/trtllm_allreduce.cuh:1703: Custom all-reduce configuration unsupported
Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose
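
The capture failure comes from a host-side check inside FlashInfer's customAllReduce, so one possible mitigation (a sketch, not what the PR currently does) is to probe the wrapper once before CUDA-graph capture and fall back to the regular all-reduce when the configuration is rejected. `probe_flashinfer_allreduce` and its parameters are illustrative, and the probe must run on every TP rank because the kernel is a collective:

```python
import torch

# Wrapper added by this PR; see the traceback above.
from sglang.srt.layers.flashinfer_comm import flashinfer_allreduce


def probe_flashinfer_allreduce(hidden_size: int, dtype: torch.dtype = torch.bfloat16) -> bool:
    """Run one eager all-reduce on every TP rank; return False if the config is rejected."""
    try:
        dummy = torch.zeros(1, hidden_size, dtype=dtype, device="cuda")
        flashinfer_allreduce(dummy)
        return True
    except RuntimeError as e:
        if "Custom all-reduce configuration unsupported" in str(e):
            return False  # disable the feature; keep using torch.distributed.all_reduce
        raise
```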

@yyihuang yyihuang self-requested a review August 18, 2025 22:08
@yuan-luo yuan-luo force-pushed the flashinfer_allreduce branch from 15388b1 to 0bc6068 Compare August 21, 2025 04:44
@yuan-luo yuan-luo changed the title [B200] Introduce flashinfer allreduce in MoE TP [WIP][B200] Introduce flashinfer allreduce in MoE TP Aug 21, 2025
@yuan-luo yuan-luo force-pushed the flashinfer_allreduce branch from 0bc6068 to 8d1bb24 Compare August 21, 2025 04:56
@yuan-luo yuan-luo changed the title [WIP][B200] Introduce flashinfer allreduce in MoE TP [B200] Introduce flashinfer allreduce in MoE TP Aug 21, 2025
@yuan-luo yuan-luo force-pushed the flashinfer_allreduce branch from 8d1bb24 to 7cabb5d Compare August 21, 2025 05:29
@yuan-luo yuan-luo changed the title [B200] Introduce flashinfer allreduce in MoE TP Introduce flashinfer allreduce in MoE TP Aug 22, 2025
@yuan-luo yuan-luo force-pushed the flashinfer_allreduce branch from 7cabb5d to b1d634f Compare August 22, 2025 07:35
@yuan-luo yuan-luo changed the title Introduce flashinfer allreduce in MoE TP [WIP] Introduce flashinfer allreduce in MoE TP Aug 23, 2025
@yuan-luo
Collaborator Author
yuan-luo commented Aug 23, 2025

Encountered some issues on H20: FlashInfer can't find the cuda_fp4.h file.
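
cuda_fp4.h appears to come from newer CUDA toolkits, so on this H20 box FlashInfer's JIT build fails before the kernel is ever usable. A hedged sketch of how the new module could degrade gracefully instead of crashing (assumption: the failure surfaces when flashinfer.comm is imported or first used; if it only appears at first kernel launch, the same try/except would wrap that call instead):

```python
import logging

logger = logging.getLogger(__name__)

try:
    import flashinfer.comm as flashinfer_comm

    _flashinfer_comm_available = True
except Exception as e:  # ImportError, or a JIT/compilation failure surfacing at import
    flashinfer_comm = None
    _flashinfer_comm_available = False
    logger.warning(
        "FlashInfer comm backend unavailable, falling back to torch.distributed: %s", e
    )


def is_flashinfer_allreduce_available() -> bool:
    return _flashinfer_comm_available
```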

$python3 -m sglang.launch_server --model /home/admin/Qwen3-30B-A3B --tp-size 8 --port 30000 --enable-flashinfer-allreduce
INFO 08-24 00:16:40 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:40 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:40 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:40 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:40 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:40 [__init__.py:235] Automatically detected platform cuda.
W0824 00:16:40.455000 286274 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:40.455000 286274 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[2025-08-24 00:16:40] server_args=ServerArgs(model_path='/home/admin/Qwen3-30B-A3B', tokenizer_path='/home/admin/Qwen3-30B-A3B', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.833, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, cpu_offload_gb=0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=620351713, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, api_key=None, served_model_name='/home/admin/Qwen3-30B-A3B', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', enable_flashinfer_allreduce=True, ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, 
cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
[2025-08-24 00:16:41] Using default HuggingFace chat template with detected content format: string
INFO 08-24 00:16:50 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
W0824 00:16:51.191000 286412 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:51.191000 286412 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
W0824 00:16:51.362000 286417 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:51.362000 286417 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
W0824 00:16:51.433000 286413 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:51.433000 286413 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
W0824 00:16:51.449000 286416 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:51.449000 286416 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0824 00:16:51.454000 286410 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:51.454000 286410 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
W0824 00:16:51.518000 286415 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:51.518000 286415 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
W0824 00:16:51.528000 286409 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:51.528000 286409 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
W0824 00:16:51.555000 286411 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:51.555000 286411 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:16:51 [__init__.py:235] Automatically detected platform cuda.
W0824 00:16:51.682000 286414 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0824 00:16:51.682000 286414 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[2025-08-24 00:16:52 TP0] Attention backend not explicitly specified. Use fa3 backend by default.
[2025-08-24 00:16:52 TP0] Init torch distributed begin.
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-08-24 00:16:55 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-08-24 00:17:02 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-08-24 00:17:06 TP0] Init torch distributed ends. mem usage=1.90 GB
[2025-08-24 00:17:06 TP1] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-24 00:17:06 TP3] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-24 00:17:06 TP6] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-24 00:17:06 TP4] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-24 00:17:06 TP2] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-24 00:17:06 TP0] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-24 00:17:06 TP5] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-24 00:17:06 TP7] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
INFO 08-24 00:17:06 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:06 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:06 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:06 [__init__.py:235] Automatically detected platform cuda.
[2025-08-24 00:17:06 TP0] Load weight begin. avail mem=92.79 GB
INFO 08-24 00:17:06 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:06 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:07 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:07 [__init__.py:235] Automatically detected platform cuda.
Loading safetensors checkpoint shards:   0% Completed | 0/16 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   6% Completed | 1/16 [00:00<00:03,  4.60it/s]
Loading safetensors checkpoint shards:  12% Completed | 2/16 [00:00<00:03,  4.13it/s]
Loading safetensors checkpoint shards:  19% Completed | 3/16 [00:00<00:03,  4.02it/s]
Loading safetensors checkpoint shards:  25% Completed | 4/16 [00:00<00:03,  3.95it/s]
Loading safetensors checkpoint shards:  31% Completed | 5/16 [00:01<00:02,  3.85it/s]
Loading safetensors checkpoint shards:  38% Completed | 6/16 [00:01<00:02,  3.83it/s]
Loading safetensors checkpoint shards:  44% Completed | 7/16 [00:01<00:02,  3.76it/s]
Loading safetensors checkpoint shards:  50% Completed | 8/16 [00:02<00:02,  3.75it/s]
Loading safetensors checkpoint shards:  56% Completed | 9/16 [00:02<00:01,  3.76it/s]
Loading safetensors checkpoint shards:  62% Completed | 10/16 [00:02<00:01,  3.79it/s]
Loading safetensors checkpoint shards:  69% Completed | 11/16 [00:02<00:01,  3.67it/s]
Loading safetensors checkpoint shards:  75% Completed | 12/16 [00:03<00:01,  3.58it/s]
Loading safetensors checkpoint shards:  81% Completed | 13/16 [00:03<00:00,  3.59it/s]
Loading safetensors checkpoint shards:  88% Completed | 14/16 [00:03<00:00,  3.63it/s]
Loading safetensors checkpoint shards:  94% Completed | 15/16 [00:04<00:00,  3.63it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:04<00:00,  3.93it/s]

INFO 08-24 00:17:11 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:11 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:11 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:11 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:11 [__init__.py:235] Automatically detected platform cuda.
[2025-08-24 00:17:11 TP0] Load weight end. type=Qwen3MoeForCausalLM, dtype=torch.bfloat16, avail mem=85.60 GB, mem usage=7.19 GB.
INFO 08-24 00:17:11 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:11 [__init__.py:235] Automatically detected platform cuda.
INFO 08-24 00:17:11 [__init__.py:235] Automatically detected platform cuda.
[2025-08-24 00:17:11 TP1] KV Cache is allocated. #tokens: 3059496, K size: 35.01 GB, V size: 35.01 GB
[2025-08-24 00:17:11 TP2] KV Cache is allocated. #tokens: 3059496, K size: 35.01 GB, V size: 35.01 GB
[2025-08-24 00:17:11 TP0] KV Cache is allocated. #tokens: 3059496, K size: 35.01 GB, V size: 35.01 GB
[2025-08-24 00:17:11 TP0] Memory pool end. avail mem=14.82 GB
[2025-08-24 00:17:11 TP7] KV Cache is allocated. #tokens: 3059496, K size: 35.01 GB, V size: 35.01 GB
[2025-08-24 00:17:11 TP5] KV Cache is allocated. #tokens: 3059496, K size: 35.01 GB, V size: 35.01 GB
[2025-08-24 00:17:11 TP4] KV Cache is allocated. #tokens: 3059496, K size: 35.01 GB, V size: 35.01 GB
[2025-08-24 00:17:11 TP6] KV Cache is allocated. #tokens: 3059496, K size: 35.01 GB, V size: 35.01 GB
[2025-08-24 00:17:11 TP3] KV Cache is allocated. #tokens: 3059496, K size: 35.01 GB, V size: 35.01 GB
[2025-08-24 00:17:12 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=14.73 GB
[2025-08-24 00:17:12 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=14.47 GB):   0%|                                                                                                                                    | 0/35 [00:00<?, ?it/s][2025-08-24 00:17:13 TP7] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-24 00:17:13 TP2] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-24 00:17:13 TP4] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-24 00:17:13 TP0] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-24 00:17:13 TP5] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-24 00:17:13 TP6] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-24 00:17:13 TP1] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-24 00:17:13 TP3] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
Capturing batches (bs=128 avail_mem=14.04 GB):  46%|████████████████████████████████████████████████████████▏                                                                  | 16/35 [00:03<00:03,  5.56it/s]rank 7 allocated ipc_handles: [['0x7f9704c00000', '0x7f9705400000', '0x7f9734c00000', '0x7f9735400000', '0x7f9764c00000', '0x7f9765400000', '0x7f9794c00000', '0x7f8e1b800000'], ['0x7f97c4c00000', '0x7f97c5400000', '0x7f97f4c00000', '0x7f97f5400000', '0x7f9824c00000', '0x7f9825400000', '0x7f9854c00000', '0x7f9795400000'], ['0x7f9705e00000', '0x7f9735c00000', '0x7f9735e00000', '0x7f9765c00000', '0x7f9765e00000', '0x7f9795c00000', '0x7f9795e00000', '0x7f9705c00000'], ['0x7f97c5e00000', '0x7f97f5c00000', '0x7f97f5e00000', '0x7f9825c00000', '0x7f9825e00000', '0x7f9855400000', '0x7f9855600000', '0x7f97c5c00000'], ['0x7f9855c00000', '0x7f9884c00000', '0x7f9885000000', '0x7f9885400000', '0x7f9885800000', '0x7f9885c00000', '0x7f98b4c00000', '0x7f9855800000'], ['0x7f98b5400000', '0x7f98b5800000', '0x7f98b5c00000', '0x7f98e4c00000', '0x7f98e5000000', '0x7f98e5400000', '0x7f98e5800000', '0x7f98b5000000'], ['0x7f9914c00000', '0x7f9915000000', '0x7f9915400000', '0x7f9915800000', '0x7f9915c00000', '0x7f9944c00000', '0x7f9945000000', '0x7f98e5c00000']]
rank 6 allocated ipc_handles: [['0x7f4c11600000', '0x7f4c40c00000', '0x7f4c41400000', '0x7f4c70c00000', '0x7f4c71400000', '0x7f4ca0c00000', '0x7f4087800000', '0x7f4ca1400000'], ['0x7f4cd1400000', '0x7f4d00c00000', '0x7f4d01400000', '0x7f4d30c00000', '0x7f4d31400000', '0x7f4d60c00000', '0x7f4cd0c00000', '0x7f4d61400000'], ['0x7f4c11e00000', '0x7f4c41c00000', '0x7f4c41e00000', '0x7f4c71c00000', '0x7f4c71e00000', '0x7f4ca1c00000', '0x7f49d1e00000', '0x7f4ca1e00000'], ['0x7f4cd1e00000', '0x7f4d01c00000', '0x7f4d01e00000', '0x7f4d31c00000', '0x7f4d31e00000', '0x7f4d61c00000', '0x7f4cd1c00000', '0x7f4d61e00000'], ['0x7f4d91000000', '0x7f4d91400000', '0x7f4d91800000', '0x7f4d91c00000', '0x7f4dc0c00000', '0x7f4dc1000000', '0x7f4d90c00000', '0x7f4dc1400000'], ['0x7f4dc1c00000', '0x7f4df0c00000', '0x7f4df1000000', '0x7f4df1400000', '0x7f4df1800000', '0x7f4df1c00000', '0x7f4dc1800000', '0x7f4e20c00000'], ['0x7f4e21400000', '0x7f4e21800000', '0x7f4e21c00000', '0x7f4e50c00000', '0x7f4e51000000', '0x7f4e51400000', '0x7f4e21000000', '0x7f4e51800000']]
rank 4 allocated ipc_handles: [['0x7fceb5600000', '0x7fcee4c00000', '0x7fcee5400000', '0x7fcf14c00000', '0x7fc32b800000', '0x7fcf15400000', '0x7fcf44c00000', '0x7fcf45400000'], ['0x7fcf75400000', '0x7fcfa4c00000', '0x7fcfa5400000', '0x7fcfd4c00000', '0x7fcf74c00000', '0x7fcfd5400000', '0x7fd004c00000', '0x7fd005400000'], ['0x7fceb5e00000', '0x7fcee5c00000', '0x7fcee5e00000', '0x7fcf15c00000', '0x7fce55e00000', '0x7fcf15e00000', '0x7fcf45c00000', '0x7fcf45e00000'], ['0x7fcf75e00000', '0x7fcfa5c00000', '0x7fcfa5e00000', '0x7fcfd5c00000', '0x7fcf75c00000', '0x7fcfd5e00000', '0x7fd005c00000', '0x7fd005e00000'], ['0x7fd035000000', '0x7fd035400000', '0x7fd035800000', '0x7fd035c00000', '0x7fd034c00000', '0x7fd064c00000', '0x7fd065000000', '0x7fd065400000'], ['0x7fd065c00000', '0x7fd094c00000', '0x7fd095000000', '0x7fd095400000', '0x7fd065800000', '0x7fd095800000', '0x7fd095c00000', '0x7fd0c4c00000'], ['0x7fd0c5400000', '0x7fd0c5800000', '0x7fd0c5c00000', '0x7fd0f4c00000', '0x7fd0c5000000', '0x7fd0f5000000', '0x7fd0f5400000', '0x7fd0f5800000']]
rank 5 allocated ipc_handles: [['0x7f1fd9400000', '0x7f2008c00000', '0x7f2009400000', '0x7f2038c00000', '0x7f2039400000', '0x7f144d800000', '0x7f2068c00000', '0x7f2069400000'], ['0x7f2099400000', '0x7f20c8c00000', '0x7f20c9400000', '0x7f20f8c00000', '0x7f20f9400000', '0x7f2098c00000', '0x7f2128c00000', '0x7f2129400000'], ['0x7f1fd9e00000', '0x7f2009c00000', '0x7f2009e00000', '0x7f2039c00000', '0x7f2039e00000', '0x7f1fd9c00000', '0x7f2069c00000', '0x7f2069e00000'], ['0x7f2099e00000', '0x7f20c9c00000', '0x7f20c9e00000', '0x7f20f9c00000', '0x7f20f9e00000', '0x7f2099c00000', '0x7f2129c00000', '0x7f2129e00000'], ['0x7f2159000000', '0x7f2159400000', '0x7f2159800000', '0x7f2159c00000', '0x7f2188c00000', '0x7f2158c00000', '0x7f2189000000', '0x7f2189400000'], ['0x7f2189c00000', '0x7f21b8c00000', '0x7f21b9000000', '0x7f21b9400000', '0x7f21b9800000', '0x7f2189800000', '0x7f21b9c00000', '0x7f21e8c00000'], ['0x7f21e9400000', '0x7f21e9800000', '0x7f21e9c00000', '0x7f2218c00000', '0x7f2219000000', '0x7f21e9000000', '0x7f2219400000', '0x7f2219800000']]
rank 2 allocated ipc_handles: [['0x7fe219400000', '0x7fe248c00000', '0x7fd68d800000', '0x7fe249400000', '0x7fe278c00000', '0x7fe279400000', '0x7fe2a8c00000', '0x7fe2a9400000'], ['0x7fe2d9400000', '0x7fe308c00000', '0x7fe2d8c00000', '0x7fe309400000', '0x7fe338c00000', '0x7fe339400000', '0x7fe368c00000', '0x7fe369400000'], ['0x7fe219e00000', '0x7fe249c00000', '0x7fe219c00000', '0x7fe249e00000', '0x7fe279c00000', '0x7fe279e00000', '0x7fe2a9c00000', '0x7fe2a9e00000'], ['0x7fe2d9e00000', '0x7fe309c00000', '0x7fe2d9c00000', '0x7fe309e00000', '0x7fe339c00000', '0x7fe339e00000', '0x7fe369c00000', '0x7fe369e00000'], ['0x7fe399000000', '0x7fe399400000', '0x7fe398c00000', '0x7fe399800000', '0x7fe399c00000', '0x7fe3c8c00000', '0x7fe3c9000000', '0x7fe3c9400000'], ['0x7fe3c9c00000', '0x7fe3f8c00000', '0x7fe3c9800000', '0x7fe3f9000000', '0x7fe3f9400000', '0x7fe3f9800000', '0x7fe3f9c00000', '0x7fe428c00000'], ['0x7fe429400000', '0x7fe429800000', '0x7fe429000000', '0x7fe429c00000', '0x7fe458c00000', '0x7fe459000000', '0x7fe459400000', '0x7fe459800000']]
rank 1 allocated ipc_handles: [['0x7f0f5d600000', '0x7f03d3800000', '0x7f0f8cc00000', '0x7f0f8d400000', '0x7f0fbcc00000', '0x7f0fbd400000', '0x7f0fecc00000', '0x7f0fed400000'], ['0x7f101d400000', '0x7f101cc00000', '0x7f104cc00000', '0x7f104d400000', '0x7f107cc00000', '0x7f107d400000', '0x7f10acc00000', '0x7f10ad400000'], ['0x7f0f5de00000', '0x7f0d1de00000', '0x7f0f8dc00000', '0x7f0f8de00000', '0x7f0fbdc00000', '0x7f0fbde00000', '0x7f0fedc00000', '0x7f0fede00000'], ['0x7f101de00000', '0x7f101dc00000', '0x7f104dc00000', '0x7f104de00000', '0x7f107dc00000', '0x7f107de00000', '0x7f10adc00000', '0x7f10ade00000'], ['0x7f10dd000000', '0x7f10dcc00000', '0x7f10dd400000', '0x7f10dd800000', '0x7f10ddc00000', '0x7f110cc00000', '0x7f110d000000', '0x7f110d400000'], ['0x7f110dc00000', '0x7f110d800000', '0x7f113cc00000', '0x7f113d000000', '0x7f113d400000', '0x7f113d800000', '0x7f113dc00000', '0x7f116cc00000'], ['0x7f116d400000', '0x7f116d000000', '0x7f116d800000', '0x7f116dc00000', '0x7f119cc00000', '0x7f119d000000', '0x7f119d400000', '0x7f119d800000']]
rank 3 allocated ipc_handles: [['0x7f0e95600000', '0x7f0ec4c00000', '0x7f0ec5400000', '0x7f030b800000', '0x7f0ef4c00000', '0x7f0ef5400000', '0x7f0f24c00000', '0x7f0f25400000'], ['0x7f0f55400000', '0x7f0f84c00000', '0x7f0f85400000', '0x7f0f54c00000', '0x7f0fb4c00000', '0x7f0fb5400000', '0x7f0fe4c00000', '0x7f0fe5400000'], ['0x7f0e95e00000', '0x7f0ec5c00000', '0x7f0ec5e00000', '0x7f0c55e00000', '0x7f0ef5c00000', '0x7f0ef5e00000', '0x7f0f25c00000', '0x7f0f25e00000'], ['0x7f0f55e00000', '0x7f0f85c00000', '0x7f0f85e00000', '0x7f0f55c00000', '0x7f0fb5c00000', '0x7f0fb5e00000', '0x7f0fe5c00000', '0x7f0fe5e00000'], ['0x7f1015000000', '0x7f1015400000', '0x7f1015800000', '0x7f1014c00000', '0x7f1015c00000', '0x7f1044c00000', '0x7f1045000000', '0x7f1045400000'], ['0x7f1045c00000', '0x7f1074c00000', '0x7f1075000000', '0x7f1045800000', '0x7f1075400000', '0x7f1075800000', '0x7f1075c00000', '0x7f10a4c00000'], ['0x7f10a5400000', '0x7f10a5800000', '0x7f10a5c00000', '0x7f10a5000000', '0x7f10d4c00000', '0x7f10d5000000', '0x7f10d5400000', '0x7f10d5800000']]
rank 0 allocated ipc_handles: [['0x7f9bd7800000', '0x7fa4c8c00000', '0x7fa4c9400000', '0x7fa4f8c00000', '0x7fa4f9400000', '0x7fa528c00000', '0x7fa529400000', '0x7fa558c00000'], ['0x7fa559400000', '0x7fa588c00000', '0x7fa589400000', '0x7fa5b8c00000', '0x7fa5b9400000', '0x7fa5e8c00000', '0x7fa5e9400000', '0x7fa618c00000'], ['0x7fa4c9c00000', '0x7fa4c9e00000', '0x7fa4f9c00000', '0x7fa4f9e00000', '0x7fa529c00000', '0x7fa529e00000', '0x7fa559c00000', '0x7fa559e00000'], ['0x7fa589c00000', '0x7fa589e00000', '0x7fa5b9c00000', '0x7fa5b9e00000', '0x7fa5e9c00000', '0x7fa5e9e00000', '0x7fa619400000', '0x7fa619600000'], ['0x7fa619800000', '0x7fa619c00000', '0x7fa648c00000', '0x7fa649000000', '0x7fa649400000', '0x7fa649800000', '0x7fa649c00000', '0x7fa678c00000'], ['0x7fa679000000', '0x7fa679400000', '0x7fa679800000', '0x7fa679c00000', '0x7fa6a8c00000', '0x7fa6a9000000', '0x7fa6a9400000', '0x7fa6a9800000'], ['0x7fa6a9c00000', '0x7fa6d8c00000', '0x7fa6d9000000', '0x7fa6d9400000', '0x7fa6d9800000', '0x7fa6d9c00000', '0x7fa708c00000', '0x7fa709000000']]

[2025-08-24 00:19:01 TP5] Registering 1552 cuda graph addresses
[2025-08-24 00:19:01 TP4] Registering 1552 cuda graph addresses
[2025-08-24 00:19:02 TP7] Registering 1552 cuda graph addresses
[2025-08-24 00:19:02 TP2] Registering 1552 cuda graph addresses
Capturing batches (bs=128 avail_mem=14.04 GB):  46%|████████████████████████████████████████████████████████▏                                                                  | 16/35 [01:49<02:10,  6.87s/it]
[2025-08-24 00:19:02 TP0] Registering 1552 cuda graph addresses
[2025-08-24 00:19:03 TP1] Registering 1552 cuda graph addresses
[2025-08-24 00:19:03 TP6] Registering 1552 cuda graph addresses
[2025-08-24 00:19:03 TP3] Registering 1552 cuda graph addresses
[2025-08-24 00:19:03 TP6] Scheduler hit an exception: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/flashinfer/jit/cpp_ext.py", line 199, in run_ninja
    subprocess.run(
  File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v', '-C', '/root/.cache/flashinfer/90/cached_ops', '-f', '/root/.cache/flashinfer/90/cached_ops/trtllm_comm/build.ninja']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 384, in __init__
    self.capture()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 492, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 663, in capture_one_batch_size
    run_once()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 652, in run_once
    logits_output_or_pp_proxy_tensors = forward(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/qwen3_moe.py", line 654, in forward
    hidden_states = self.model(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/qwen2_moe.py", line 492, in forward
    hidden_states, residual = layer(
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/qwen3_moe.py", line 533, in forward
    hidden_states = self.mlp(hidden_states, forward_batch, use_reduce_scatter)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/qwen3_moe.py", line 126, in forward
    return self.forward_normal(hidden_states, use_reduce_scatter)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/models/qwen3_moe.py", line 148, in forward_normal
    final_hidden_states = self.experts(hidden_states, topk_output)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1784, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/layer.py", line 860, in forward
    final_hidden_states = flashinfer_allreduce(final_hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/flashinfer_comm.py", line 146, in flashinfer_allreduce
    if not ensure_all_reduce_workspace_initialized(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/flashinfer_comm.py", line 105, in ensure_all_reduce_workspace_initialized
    _workspace_manager.initialize(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/layers/flashinfer_comm.py", line 54, in initialize
    self.workspace_tensor = comm.trtllm_create_ipc_workspace_for_all_reduce(
  File "/opt/conda/lib/python3.10/site-packages/flashinfer/comm/trtllm_ar.py", line 481, in trtllm_create_ipc_workspace_for_all_reduce
    trtllm_lamport_initialize_all(
  File "/opt/conda/lib/python3.10/site-packages/flashinfer/comm/trtllm_ar.py", line 691, in trtllm_lamport_initialize_all
    get_trtllm_comm_module().trtllm_lamport_initialize_all(
  File "/opt/conda/lib/python3.10/site-packages/flashinfer/comm/trtllm_ar.py", line 117, in get_trtllm_comm_module
    module = gen_trtllm_comm_module().build_and_load()
  File "/opt/conda/lib/python3.10/site-packages/flashinfer/jit/core.py", line 123, in build_and_load
    self.build(verbose)
  File "/opt/conda/lib/python3.10/site-packages/flashinfer/jit/core.py", line 115, in build
    run_ninja(jit_env.FLASHINFER_JIT_DIR, self.ninja_path, verbose)
  File "/opt/conda/lib/python3.10/site-packages/flashinfer/jit/cpp_ext.py", line 211, in run_ninja
    raise RuntimeError(msg) from e
RuntimeError: Ninja build failed. Ninja output:
ninja: Entering directory `/root/.cache/flashinfer/90/cached_ops'
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o 
FAILED: [code=1] trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o 
In file included from /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu:3:
/opt/conda/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh:4:10: fatal error: cuda_fp4.h: No such file or directory
    4 | #include <cuda_fp4.h>
      |          ^~~~~~~~~~~~
compilation terminated.
In file included from /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu:3:
/opt/conda/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh:4:10: fatal error: cuda_fp4.h: No such file or directory
    4 | #include <cuda_fp4.h>
      |          ^~~~~~~~~~~~
compilation terminated.
fatal   : Could not open input file /tmp/tmpxft_00046496_00000000-7_trtllm_moe_allreduce_fusion.cpp1.ii
[2/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu -o trtllm_comm/trtllm_allreduce_fusion.cuda.o 
FAILED: [code=1] trtllm_comm/trtllm_allreduce_fusion.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu -o trtllm_comm/trtllm_allreduce_fusion.cuda.o 
In file included from /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu:3:
/opt/conda/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/comm/trtllm_allreduce_fusion.cuh:4:10: fatal error: cuda_fp4.h: No such file or directory
    4 | #include <cuda_fp4.h>
      |          ^~~~~~~~~~~~
compilation terminated.
In file included from /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu:3:
/opt/conda/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/comm/trtllm_allreduce_fusion.cuh:4:10: fatal error: cuda_fp4.h: No such file or directory
    4 | #include <cuda_fp4.h>
      |          ^~~~~~~~~~~~
compilation terminated.
fatal   : Could not open input file /tmp/tmpxft_00046495_00000000-7_trtllm_allreduce_fusion.cpp1.ii
ninja: build stopped: subcommand failed.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 2552, in run_scheduler_process
    scheduler = Scheduler(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 321, in __init__
    self.tp_worker = TpWorkerClass(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 67, in __init__
    self.worker = TpModelWorker(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 84, in __init__
    self.model_runner = ModelRunner(
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 240, in __init__
    self.initialize(min_per_gpu_memory)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 345, in initialize
    self.init_device_graphs()
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 1618, in init_device_graphs
    CudaGraphRunner(self) if not _is_npu else NPUGraphRunner(self)
  File "/opt/conda/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 386, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Ninja build failed. Ninja output:
ninja: Entering directory `/root/.cache/flashinfer/90/cached_ops'
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o 
FAILED: [code=1] trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu -o trtllm_comm/trtllm_moe_allreduce_fusion.cuda.o 
In file included from /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu:3:
/opt/conda/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh:4:10: fatal error: cuda_fp4.h: No such file or directory
    4 | #include <cuda_fp4.h>
      |          ^~~~~~~~~~~~
compilation terminated.
In file included from /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_moe_allreduce_fusion.cu:3:
/opt/conda/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/comm/trtllm_moe_allreduce_fusion.cuh:4:10: fatal error: cuda_fp4.h: No such file or directory
    4 | #include <cuda_fp4.h>
      |          ^~~~~~~~~~~~
compilation terminated.
fatal   : Could not open input file /tmp/tmpxft_00046496_00000000-7_trtllm_moe_allreduce_fusion.cpp1.ii
[2/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu -o trtllm_comm/trtllm_allreduce_fusion.cuda.o 
FAILED: [code=1] trtllm_comm/trtllm_allreduce_fusion.cuda.o 
/usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output trtllm_comm/trtllm_allreduce_fusion.cuda.o.d -DTORCH_EXTENSION_NAME=trtllm_comm -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x03090000 -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1018\" -D_GLIBCXX_USE_CXX11_ABI=1 -isystem /opt/conda/include/python3.10 -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/cutlass/tools/util/include -isystem /opt/conda/lib/python3.10/site-packages/flashinfer/data/spdlog/include --compiler-options=-fPIC --expt-relaxed-constexpr -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_90,code=sm_90 -O3 -std=c++17 --threads=32 -use_fast_math -DFLASHINFER_ENABLE_F16 -DFLASHINFER_ENABLE_BF16 -DFLASHINFER_ENABLE_FP8_E4M3 -DFLASHINFER_ENABLE_FP8_E5M2 -DNDEBUG -c /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu -o trtllm_comm/trtllm_allreduce_fusion.cuda.o 
In file included from /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu:3:
/opt/conda/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/comm/trtllm_allreduce_fusion.cuh:4:10: fatal error: cuda_fp4.h: No such file or directory
    4 | #include <cuda_fp4.h>
      |          ^~~~~~~~~~~~
compilation terminated.
In file included from /opt/conda/lib/python3.10/site-packages/flashinfer/data/csrc/trtllm_allreduce_fusion.cu:3:
/opt/conda/lib/python3.10/site-packages/flashinfer/data/include/flashinfer/comm/trtllm_allreduce_fusion.cuh:4:10: fatal error: cuda_fp4.h: No such file or directory
    4 | #include <cuda_fp4.h>
      |          ^~~~~~~~~~~~
compilation terminated.
fatal   : Could not open input file /tmp/tmpxft_00046495_00000000-7_trtllm_allreduce_fusion.cpp1.ii
ninja: build stopped: subcommand failed.

Possible solutions:
1. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
2. set --cuda-graph-max-bs to a smaller value (e.g., 16)
3. disable torch compile by not using --enable-torch-compile
4. disable CUDA graph by --disable-cuda-graph. (Not recommended. Huge performance loss)
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose 

@yuan-luo
Copy link
Collaborator Author

I'll verify it on B200.

@yuan-luo yuan-luo force-pushed the flashinfer_allreduce branch from b1d634f to 50b914e Compare August 26, 2025 10:42
@yuan-luo
Copy link
Collaborator Author
yuan-luo commented Aug 26, 2025

After installing flashinfer_python-0.2.14.post1 on H20, it moves forward:

$pip install flashinfer_python-0.2.14.post1.tar.gz --force-reinstall --no-deps

Hopper now passes CUDA graph capture, but the result is incorrect. Inside the custom_all_reduce API there is a flag_value; it needs to be incremented by 1 for each allreduce (AR), otherwise the result is incorrect.

        # NOTE: the barrier flag should be initialized to 1, and incremented by 1 for each AR
        flag_value = 1
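
A minimal sketch (assumptions only, not the PR's actual code or the flashinfer API) of what keeping that barrier flag monotonic could look like: a small tracker owns a counter that starts at 1 and is bumped on every allreduce call, so no two consecutive ARs reuse the same flag. The _launch_custom_all_reduce helper is a hypothetical stand-in for the real kernel launch.

import torch

class _AllReduceFlagTracker:
    """Keeps the barrier flag monotonically increasing across allreduce calls.

    Illustrative sketch only; names and the kernel launch below are
    hypothetical stand-ins, not the flashinfer API.
    """

    def __init__(self) -> None:
        # The barrier flag starts at 1, as noted in the snippet above.
        self._flag_value = 1

    def next_flag(self) -> int:
        # Hand out the current flag, then bump it so the next AR sees flag + 1.
        flag = self._flag_value
        self._flag_value += 1
        return flag


_tracker = _AllReduceFlagTracker()


def _launch_custom_all_reduce(inp: torch.Tensor, flag_value: int) -> torch.Tensor:
    # Hypothetical placeholder for the custom allreduce kernel launch;
    # a real wrapper would forward flag_value to the kernel on every call.
    raise NotImplementedError


def flashinfer_allreduce_sketch(inp: torch.Tensor) -> torch.Tensor:
    # Each call consumes a strictly increasing flag, so a stale barrier flag
    # is never reused between consecutive allreduce invocations.
    return _launch_custom_all_reduce(inp, _tracker.next_flag())
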
$python3 -m sglang.launch_server --model /home/admin/Qwen3-30B-A3B --tp-size 8 --port 30000 --enable-flashinfer-allreduce
INFO 08-26 18:39:48 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:39:48 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:39:48 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:39:48 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:39:48 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:39:48 [__init__.py:235] Automatically detected platform cuda.
W0826 18:39:48.239000 428974 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:39:48.239000 428974 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[2025-08-26 18:39:48] server_args=ServerArgs(model_path='/home/admin/Qwen3-30B-A3B', tokenizer_path='/home/admin/Qwen3-30B-A3B', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, skip_server_warmup=False, warmups=None, nccl_port=None, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', mem_fraction_static=0.833, max_running_requests=None, max_queued_requests=9223372036854775807, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='fcfs', schedule_conservativeness=1.0, page_size=1, hybrid_kvcache_ratio=None, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, device='cuda', tp_size=8, pp_size=1, max_micro_batch_size=None, stream_interval=1, stream_output=False, random_seed=494239412, constrained_json_whitespace_pattern=None, watchdog_timeout=300, dist_timeout=None, download_dir=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, gc_warning_threshold_secs=0.0, api_key=None, served_model_name='/home/admin/Qwen3-30B-A3B', weight_version='default', chat_template=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_backend='triton', attention_backend=None, decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, speculative_algorithm=None, speculative_draft_model_path=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', enable_flashinfer_allreduce=True, ep_num_redundant_experts=0, ep_dispatch_algorithm='static', init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through_selective', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', 
ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', disable_radix_cache=False, cuda_graph_max_bs=None, cuda_graph_bs=None, disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, torch_compile_max_bs=32, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, enable_return_hidden_states=False, scheduler_recv_interval=1, debug_tensor_dump_output_folder=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, debug_tensor_dump_prefill_only=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, num_reserved_decode_tokens=512, pdlb_url=None, custom_weight_loader=[], weight_loader_disable_mmap=False, enable_pdmux=False, sm_group_num=3, enable_ep_moe=False, enable_deepep_moe=False, enable_flashinfer_cutlass_moe=False, enable_flashinfer_trtllm_moe=False, enable_triton_kernel_moe=False, enable_flashinfer_mxfp4_moe=False)
[2025-08-26 18:39:49] Using default HuggingFace chat template with detected content format: string
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
W0826 18:40:00.524000 429109 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:40:00.524000 429109 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
W0826 18:40:00.821000 429115 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:40:00.821000 429115 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
W0826 18:40:00.895000 429116 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:40:00.895000 429116 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
W0826 18:40:00.905000 429112 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:40:00.905000 429112 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0826 18:40:00.907000 429110 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:40:00.907000 429110 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:00 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
[2025-08-26 18:40:01 TP0] Attention backend not explicitly specified. Use fa3 backend by default.
[2025-08-26 18:40:01 TP0] Init torch distributed begin.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
W0826 18:40:01.057000 429114 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:40:01.057000 429114 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0826 18:40:01.062000 429111 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:40:01.062000 429111 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
W0826 18:40:01.145000 429117 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:40:01.145000 429117 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:01 [__init__.py:235] Automatically detected platform cuda.
W0826 18:40:01.202000 429113 site-packages/torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0826 18:40:01.202000 429113 site-packages/torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-08-26 18:40:02 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-08-26 18:40:08 TP0] sglang is using nccl==2.27.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 2 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 7 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 3 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 1 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 6 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[Gloo] Rank 5 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7
[2025-08-26 18:40:13 TP0] Init torch distributed ends. mem usage=1.90 GB
[2025-08-26 18:40:13 TP6] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-26 18:40:13 TP3] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-26 18:40:13 TP1] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-26 18:40:13 TP0] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-26 18:40:13 TP4] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-26 18:40:13 TP5] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-26 18:40:13 TP7] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-26 18:40:13 TP2] Ignore import error when loading sglang.srt.models.glm4v_moe: No module named 'transformers.models.glm4v_moe'
[2025-08-26 18:40:13 TP0] Load weight begin. avail mem=92.78 GB
INFO 08-26 18:40:13 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:13 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:13 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:13 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:13 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:13 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:13 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:13 [__init__.py:235] Automatically detected platform cuda.
Loading safetensors checkpoint shards:   0% Completed | 0/16 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   6% Completed | 1/16 [00:00<00:03,  4.27it/s]
Loading safetensors checkpoint shards:  12% Completed | 2/16 [00:00<00:03,  3.86it/s]
Loading safetensors checkpoint shards:  19% Completed | 3/16 [00:00<00:03,  3.89it/s]
Loading safetensors checkpoint shards:  25% Completed | 4/16 [00:01<00:03,  3.84it/s]
Loading safetensors checkpoint shards:  31% Completed | 5/16 [00:01<00:02,  3.68it/s]
Loading safetensors checkpoint shards:  38% Completed | 6/16 [00:01<00:02,  3.53it/s]
Loading safetensors checkpoint shards:  44% Completed | 7/16 [00:01<00:02,  3.52it/s]
Loading safetensors checkpoint shards:  50% Completed | 8/16 [00:02<00:02,  3.49it/s]
Loading safetensors checkpoint shards:  56% Completed | 9/16 [00:02<00:02,  3.46it/s]
Loading safetensors checkpoint shards:  62% Completed | 10/16 [00:02<00:01,  3.49it/s]
Loading safetensors checkpoint shards:  69% Completed | 11/16 [00:03<00:01,  3.50it/s]
Loading safetensors checkpoint shards:  75% Completed | 12/16 [00:03<00:01,  3.48it/s]
Loading safetensors checkpoint shards:  81% Completed | 13/16 [00:03<00:00,  3.49it/s]
Loading safetensors checkpoint shards:  88% Completed | 14/16 [00:03<00:00,  3.48it/s]
Loading safetensors checkpoint shards:  94% Completed | 15/16 [00:04<00:00,  3.45it/s]
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:04<00:00,  3.72it/s]

INFO 08-26 18:40:18 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:18 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:18 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:18 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:18 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:18 [__init__.py:235] Automatically detected platform cuda.
[2025-08-26 18:40:18 TP0] Load weight end. type=Qwen3MoeForCausalLM, dtype=torch.bfloat16, avail mem=85.60 GB, mem usage=7.19 GB.
INFO 08-26 18:40:18 [__init__.py:235] Automatically detected platform cuda.
INFO 08-26 18:40:18 [__init__.py:235] Automatically detected platform cuda.
[2025-08-26 18:40:18 TP2] KV Cache is allocated. #tokens: 3059425, K size: 35.01 GB, V size: 35.01 GB
[2025-08-26 18:40:18 TP5] KV Cache is allocated. #tokens: 3059425, K size: 35.01 GB, V size: 35.01 GB
[2025-08-26 18:40:18 TP4] KV Cache is allocated. #tokens: 3059425, K size: 35.01 GB, V size: 35.01 GB
[2025-08-26 18:40:18 TP7] KV Cache is allocated. #tokens: 3059425, K size: 35.01 GB, V size: 35.01 GB
[2025-08-26 18:40:18 TP6] KV Cache is allocated. #tokens: 3059425, K size: 35.01 GB, V size: 35.01 GB
[2025-08-26 18:40:18 TP1] KV Cache is allocated. #tokens: 3059425, K size: 35.01 GB, V size: 35.01 GB
[2025-08-26 18:40:18 TP3] KV Cache is allocated. #tokens: 3059425, K size: 35.01 GB, V size: 35.01 GB
[2025-08-26 18:40:18 TP0] KV Cache is allocated. #tokens: 3059425, K size: 35.01 GB, V size: 35.01 GB
[2025-08-26 18:40:19 TP0] Memory pool end. avail mem=14.82 GB
[2025-08-26 18:40:19 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=14.73 GB
[2025-08-26 18:40:19 TP0] Capture cuda graph bs [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256]
Capturing batches (bs=256 avail_mem=14.50 GB):   0%|                                                                                                                                    | 0/35 [00:00<?, ?it/s][2025-08-26 18:40:20 TP4] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-26 18:40:20 TP6] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-26 18:40:20 TP0] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-26 18:40:20 TP3] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-26 18:40:20 TP2] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-26 18:40:20 TP5] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-26 18:40:20 TP7] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
[2025-08-26 18:40:20 TP1] Config file not found at /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_4_0/E=128,N=96,device_name=NVIDIA_H20.json. Fallback to triton version 3.2.0 and use MoE kernel config from /opt/conda/lib/python3.10/site-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=96,device_name=NVIDIA_H20.json. Performance might be sub-optimal!
Capturing batches (bs=128 avail_mem=14.08 GB):  46%|████████████████████████████████████████████████████████▏                                                                  | 16/35 [00:03<00:03,  5.53it/s]rank 7 allocated ipc_handles: [['0x7f1015600000', '0x7f1044c00000', '0x7f1045400000', '0x7f1074c00000', '0x7f1075400000', '0x7f10a4c00000', '0x7f10a5400000', '0x7f0709800000'], ['0x7f10d5400000', '0x7f1104c00000', '0x7f1105400000', '0x7f1134c00000', '0x7f1135400000', '0x7f1164c00000', '0x7f1165400000', '0x7f10d4c00000'], ['0x7f0dd5c00000', '0x7f0dd5e00000', '0x7f1015e00000', '0x7f1045c00000', '0x7f1045e00000', '0x7f1075c00000', '0x7f1075e00000', '0x7f0dd5a00000'], ['0x7f10a5e00000', '0x7f10d5c00000', '0x7f10d5e00000', '0x7f1105c00000', '0x7f1105e00000', '0x7f1135c00000', '0x7f1135e00000', '0x7f10a5c00000'], ['0x7f1194c00000', '0x7f1195000000', '0x7f1195400000', '0x7f1195800000', '0x7f1195c00000', '0x7f11c4c00000', '0x7f11c5000000', '0x7f1165c00000'], ['0x7f11c5800000', '0x7f11c5c00000', '0x7f11f4c00000', '0x7f11f5000000', '0x7f11f5400000', '0x7f11f5800000', '0x7f11f5c00000', '0x7f11c5400000'], ['0x7f1225000000', '0x7f1225400000', '0x7f1225800000', '0x7f1225c00000', '0x7f1254c00000', '0x7f1255000000', '0x7f1255400000', '0x7f1224c00000']]
rank 5 allocated ipc_handles: [['0x7f010d800000', '0x7f013cc00000', '0x7f013d400000', '0x7f016cc00000', '0x7f016d400000', '0x7ef561800000', '0x7f019cc00000', '0x7f019d400000'], ['0x7f01cd400000', '0x7f01fcc00000', '0x7f01fd400000', '0x7f022cc00000', '0x7f022d400000', '0x7f01ccc00000', '0x7f025cc00000', '0x7f025d400000'], ['0x7f013de00000', '0x7f016dc00000', '0x7f016de00000', '0x7f019dc00000', '0x7f019de00000', '0x7f013dc00000', '0x7f01cdc00000', '0x7f01cde00000'], ['0x7f01fde00000', '0x7f022dc00000', '0x7f022de00000', '0x7f025dc00000', '0x7f025de00000', '0x7f01fdc00000', '0x7f028cc00000', '0x7f028ce00000'], ['0x7f028d400000', '0x7f028d800000', '0x7f028dc00000', '0x7f02bcc00000', '0x7f02bd000000', '0x7f028d000000', '0x7f02bd400000', '0x7f02bd800000'], ['0x7f02ecc00000', '0x7f02ed000000', '0x7f02ed400000', '0x7f02ed800000', '0x7f02edc00000', '0x7f02bdc00000', '0x7f031cc00000', '0x7f031d000000'], ['0x7f031d800000', '0x7f031dc00000', '0x7f034cc00000', '0x7f034d000000', '0x7f034d400000', '0x7f031d400000', '0x7f034d800000', '0x7f034dc00000']]
rank 3 allocated ipc_handles: [['0x7ee9f5800000', '0x7eec64c00000', '0x7eec65400000', '0x7ee089800000', '0x7eec94c00000', '0x7eec95400000', '0x7eecc4c00000', '0x7eecc5400000'], ['0x7eecf5400000', '0x7eed24c00000', '0x7eed25400000', '0x7eecf4c00000', '0x7eed54c00000', '0x7eed55400000', '0x7eed84c00000', '0x7eed85400000'], ['0x7eec65e00000', '0x7eec95c00000', '0x7eec95e00000', '0x7eec65c00000', '0x7eecc5c00000', '0x7eecc5e00000', '0x7eecf5c00000', '0x7eecf5e00000'], ['0x7eed25e00000', '0x7eed55c00000', '0x7eed55e00000', '0x7eed25c00000', '0x7eed85c00000', '0x7eed85e00000', '0x7eedb4c00000', '0x7eedb4e00000'], ['0x7eedb5400000', '0x7eedb5800000', '0x7eedb5c00000', '0x7eedb5000000', '0x7eede4c00000', '0x7eede5000000', '0x7eede5400000', '0x7eede5800000'], ['0x7eee14c00000', '0x7eee15000000', '0x7eee15400000', '0x7eede5c00000', '0x7eee15800000', '0x7eee15c00000', '0x7eee44c00000', '0x7eee45000000'], ['0x7eee45800000', '0x7eee45c00000', '0x7eee74c00000', '0x7eee45400000', '0x7eee75000000', '0x7eee75400000', '0x7eee75800000', '0x7eee75c00000']]
rank 6 allocated ipc_handles: [['0x7f6161800000', '0x7f63d0c00000', '0x7f63d1400000', '0x7f6400c00000', '0x7f6401400000', '0x7f6430c00000', '0x7f57f5800000', '0x7f6431400000'], ['0x7f6461400000', '0x7f6490c00000', '0x7f6491400000', '0x7f64c0c00000', '0x7f64c1400000', '0x7f64f0c00000', '0x7f6460c00000', '0x7f64f1400000'], ['0x7f63d1e00000', '0x7f6401c00000', '0x7f6401e00000', '0x7f6431c00000', '0x7f6431e00000', '0x7f6461c00000', '0x7f63d1c00000', '0x7f6461e00000'], ['0x7f6491e00000', '0x7f64c1c00000', '0x7f64c1e00000', '0x7f64f1c00000', '0x7f64f1e00000', '0x7f6520c00000', '0x7f6491c00000', '0x7f6520e00000'], ['0x7f6521400000', '0x7f6521800000', '0x7f6521c00000', '0x7f6550c00000', '0x7f6551000000', '0x7f6551400000', '0x7f6521000000', '0x7f6551800000'], ['0x7f6580c00000', '0x7f6581000000', '0x7f6581400000', '0x7f6581800000', '0x7f6581c00000', '0x7f65b0c00000', '0x7f6551c00000', '0x7f65b1000000'], ['0x7f65b1800000', '0x7f65b1c00000', '0x7f65e0c00000', '0x7f65e1000000', '0x7f65e1400000', '0x7f65e1800000', '0x7f65b1400000', '0x7f65e1c00000']]
rank 1 allocated ipc_handles: [['0x7f3ad1800000', '0x7f3165800000', '0x7f3d40c00000', '0x7f3d41400000', '0x7f3d70c00000', '0x7f3d71400000', '0x7f3da0c00000', '0x7f3da1400000'], ['0x7f3dd1400000', '0x7f3dd0c00000', '0x7f3e00c00000', '0x7f3e01400000', '0x7f3e30c00000', '0x7f3e31400000', '0x7f3e60c00000', '0x7f3e61400000'], ['0x7f3d41e00000', '0x7f3d41c00000', '0x7f3d71c00000', '0x7f3d71e00000', '0x7f3da1c00000', '0x7f3da1e00000', '0x7f3dd1c00000', '0x7f3dd1e00000'], ['0x7f3e01e00000', '0x7f3e01c00000', '0x7f3e31c00000', '0x7f3e31e00000', '0x7f3e61c00000', '0x7f3e61e00000', '0x7f3e90c00000', '0x7f3e90e00000'], ['0x7f3e91400000', '0x7f3e91000000', '0x7f3e91800000', '0x7f3e91c00000', '0x7f3ec0c00000', '0x7f3ec1000000', '0x7f3ec1400000', '0x7f3ec1800000'], ['0x7f3ef0c00000', '0x7f3ec1c00000', '0x7f3ef1000000', '0x7f3ef1400000', '0x7f3ef1800000', '0x7f3ef1c00000', '0x7f3f20c00000', '0x7f3f21000000'], ['0x7f3f21800000', '0x7f3f21400000', '0x7f3f21c00000', '0x7f3f50c00000', '0x7f3f51000000', '0x7f3f51400000', '0x7f3f51800000', '0x7f3f51c00000']]
rank 0 allocated ipc_handles: [['0x7f8d85800000', '0x7f9691000000', '0x7f9691800000', '0x7f96c0c00000', '0x7f96c1400000', '0x7f96f0c00000', '0x7f96f1400000', '0x7f9720c00000'], ['0x7f9721400000', '0x7f9750c00000', '0x7f9751400000', '0x7f9780c00000', '0x7f9781400000', '0x7f97b0c00000', '0x7f97b1400000', '0x7f97e0c00000'], ['0x7f96c1c00000', '0x7f96c1e00000', '0x7f96f1c00000', '0x7f96f1e00000', '0x7f9721c00000', '0x7f9721e00000', '0x7f9751c00000', '0x7f9751e00000'], ['0x7f9781c00000', '0x7f9781e00000', '0x7f97b1c00000', '0x7f97b1e00000', '0x7f97e1400000', '0x7f97e1600000', '0x7f97e1800000', '0x7f97e1a00000'], ['0x7f97e1c00000', '0x7f9810c00000', '0x7f9811000000', '0x7f9811400000', '0x7f9811800000', '0x7f9811c00000', '0x7f9840c00000', '0x7f9841000000'], ['0x7f9841400000', '0x7f9841800000', '0x7f9841c00000', '0x7f9870c00000', '0x7f9871000000', '0x7f9871400000', '0x7f9871800000', '0x7f9871c00000'], ['0x7f98a0c00000', '0x7f98a1000000', '0x7f98a1400000', '0x7f98a1800000', '0x7f98a1c00000', '0x7f98d0c00000', '0x7f98d1000000', '0x7f98d1400000']]
rank 2 allocated ipc_handles: [['0x7f92cd800000', '0x7f92fcc00000', '0x7f8721800000', '0x7f92fd400000', '0x7f932cc00000', '0x7f932d400000', '0x7f935cc00000', '0x7f935d400000'], ['0x7f938d400000', '0x7f93bcc00000', '0x7f938cc00000', '0x7f93bd400000', '0x7f93ecc00000', '0x7f93ed400000', '0x7f941cc00000', '0x7f941d400000'], ['0x7f92fde00000', '0x7f932dc00000', '0x7f92fdc00000', '0x7f932de00000', '0x7f935dc00000', '0x7f935de00000', '0x7f938dc00000', '0x7f938de00000'], ['0x7f93bde00000', '0x7f93edc00000', '0x7f93bdc00000', '0x7f93ede00000', '0x7f941dc00000', '0x7f941de00000', '0x7f944cc00000', '0x7f944ce00000'], ['0x7f944d400000', '0x7f944d800000', '0x7f944d000000', '0x7f944dc00000', '0x7f947cc00000', '0x7f947d000000', '0x7f947d400000', '0x7f947d800000'], ['0x7f94acc00000', '0x7f94ad000000', '0x7f947dc00000', '0x7f94ad400000', '0x7f94ad800000', '0x7f94adc00000', '0x7f94dcc00000', '0x7f94dd000000'], ['0x7f94dd800000', '0x7f94ddc00000', '0x7f94dd400000', '0x7f950cc00000', '0x7f950d000000', '0x7f950d400000', '0x7f950d800000', '0x7f950dc00000']]
rank 4 allocated ipc_handles: [['0x7f829d800000', '0x7f82ccc00000', '0x7f82cd400000', '0x7f82fcc00000', '0x7f76f1800000', '0x7f82fd400000', '0x7f832cc00000', '0x7f832d400000'], ['0x7f835d400000', '0x7f838cc00000', '0x7f838d400000', '0x7f83bcc00000', '0x7f835cc00000', '0x7f83bd400000', '0x7f83ecc00000', '0x7f83ed400000'], ['0x7f82cde00000', '0x7f82fdc00000', '0x7f82fde00000', '0x7f832dc00000', '0x7f82cdc00000', '0x7f832de00000', '0x7f835dc00000', '0x7f835de00000'], ['0x7f838de00000', '0x7f83bdc00000', '0x7f83bde00000', '0x7f83edc00000', '0x7f838dc00000', '0x7f83ede00000', '0x7f841cc00000', '0x7f841ce00000'], ['0x7f841d400000', '0x7f841d800000', '0x7f841dc00000', '0x7f844cc00000', '0x7f841d000000', '0x7f844d000000', '0x7f844d400000', '0x7f844d800000'], ['0x7f847cc00000', '0x7f847d000000', '0x7f847d400000', '0x7f847d800000', '0x7f844dc00000', '0x7f847dc00000', '0x7f84acc00000', '0x7f84ad000000'], ['0x7f84ad800000', '0x7f84adc00000', '0x7f84dcc00000', '0x7f84dd000000', '0x7f84ad400000', '0x7f84dd400000', '0x7f84dd800000', '0x7f84ddc00000']]
[2025-08-26 18:40:26.571] [info] lamportInitialize start: buffer: 0x7f1165c00000, size: 1048576
[2025-08-26 18:40:26.591] [info] lamportInitialize start: buffer: 0x7f11c5400000, size: 1048576
[2025-08-26 18:40:26.591] [info] lamportInitialize start: buffer: 0x7f1224c00000, size: 1048576
[2025-08-26 18:40:26.621] [info] lamportInitialize start: buffer: 0x7f028d000000, size: 1048576
[2025-08-26 18:40:26.642] [info] lamportInitialize start: buffer: 0x7f02bdc00000, size: 1048576
[2025-08-26 18:40:26.642] [info] lamportInitialize start: buffer: 0x7f031d400000, size: 1048576
[2025-08-26 18:40:26.670] [info] lamportInitialize start: buffer: 0x7f6521000000, size: 1048576
[2025-08-26 18:40:26.690] [info] lamportInitialize start: buffer: 0x7f6551c00000, size: 1048576
[2025-08-26 18:40:26.690] [info] lamportInitialize start: buffer: 0x7f65b1400000, size: 1048576
[2025-08-26 18:40:26.720] [info] lamportInitialize start: buffer: 0x7eedb5000000, size: 1048576
[2025-08-26 18:40:26.741] [info] lamportInitialize start: buffer: 0x7eede5c00000, size: 1048576
[2025-08-26 18:40:26.741] [info] lamportInitialize start: buffer: 0x7eee45400000, size: 1048576
[2025-08-26 18:40:26.770] [info] lamportInitialize start: buffer: 0x7f3e91000000, size: 1048576
[2025-08-26 18:40:26.791] [info] lamportInitialize start: buffer: 0x7f3ec1c00000, size: 1048576
[2025-08-26 18:40:26.791] [info] lamportInitialize start: buffer: 0x7f3f21400000, size: 1048576
[2025-08-26 18:40:26.821] [info] lamportInitialize start: buffer: 0x7f841d000000, size: 1048576
[2025-08-26 18:40:26.841] [info] lamportInitialize start: buffer: 0x7f844dc00000, size: 1048576
[2025-08-26 18:40:26.841] [info] lamportInitialize start: buffer: 0x7f84ad400000, size: 1048576
[2025-08-26 18:40:26.871] [info] lamportInitialize start: buffer: 0x7f97e1c00000, size: 1048576
[2025-08-26 18:40:26.891] [info] lamportInitialize start: buffer: 0x7f9841400000, size: 1048576
[2025-08-26 18:40:26.891] [info] lamportInitialize start: buffer: 0x7f98a0c00000, size: 1048576
[2025-08-26 18:40:26.921] [info] lamportInitialize start: buffer: 0x7f944d000000, size: 1048576
[2025-08-26 18:40:26.944] [info] lamportInitialize start: buffer: 0x7f947dc00000, size: 1048576
[2025-08-26 18:40:26.944] [info] lamportInitialize start: buffer: 0x7f94dd400000, size: 1048576
[2025-08-26 18:40:26 TP0] FlashInfer workspace initialized for rank 0, world_size 8
[2025-08-26 18:40:26 TP5] FlashInfer workspace initialized for rank 5, world_size 8
[2025-08-26 18:40:26 TP7] FlashInfer workspace initialized for rank 7, world_size 8
[2025-08-26 18:40:26 TP3] FlashInfer workspace initialized for rank 3, world_size 8
[2025-08-26 18:40:26 TP4] FlashInfer workspace initialized for rank 4, world_size 8
[2025-08-26 18:40:26 TP1] FlashInfer workspace initialized for rank 1, world_size 8
[2025-08-26 18:40:26 TP6] FlashInfer workspace initialized for rank 6, world_size 8
[2025-08-26 18:40:26 TP2] FlashInfer workspace initialized for rank 2, world_size 8
Capturing batches (bs=1 avail_mem=13.01 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:13<00:00,  2.67it/s]
[2025-08-26 18:40:33 TP1] Registering 3395 cuda graph addresses
[2025-08-26 18:40:33 TP6] Registering 3395 cuda graph addresses
[2025-08-26 18:40:33 TP5] Registering 3395 cuda graph addresses
[2025-08-26 18:40:33 TP0] Registering 3395 cuda graph addresses
[2025-08-26 18:40:33 TP3] Registering 3395 cuda graph addresses
[2025-08-26 18:40:33 TP4] Registering 3395 cuda graph addresses
[2025-08-26 18:40:33 TP2] Registering 3395 cuda graph addresses
[2025-08-26 18:40:33 TP7] Registering 3395 cuda graph addresses
[2025-08-26 18:40:33 TP0] Capture cuda graph end. Time elapsed: 14.38 s. mem usage=1.73 GB. avail mem=13.00 GB.
[2025-08-26 18:40:33 TP0] max_total_num_tokens=3059425, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=40960, available_gpu_mem=13.00 GB
[2025-08-26 18:40:34] INFO:     Started server process [428974]
[2025-08-26 18:40:34] INFO:     Waiting for application startup.
[2025-08-26 18:40:34] INFO:     Application startup complete.
[2025-08-26 18:40:34] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2025-08-26 18:40:35] INFO:     127.0.0.1:37456 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-08-26 18:40:35 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-08-26 18:40:38] INFO:     127.0.0.1:37462 - "POST /generate HTTP/1.1" 200 OK
[2025-08-26 18:40:38] The server is fired up and ready to roll!
$python test_openai.py 
ChatCompletion(id='f7c445ac47064c77997476f15c1a988c', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='调est \xa0;5,� bundle-l�I. +:\n\ns有种�2 =OLUEDbuy\xa0�.DisplayStyle_lock-q\n\n\n品 )\n\n181\xa07\n=\n\nErw = a*rsum1>� Schw02\n*\n// s[182 � n,ln_job0.\n]PDSQlton\n\n63\n2ed Al19VC.).\n\n表.ngestre :3k1 Md-9\n\nr. The.x1 -\nikC��erge:て rot ORs \n{�.91,AE[\n#140\n   ]�W1 or =;2 3/1)\n -- �格 的�下。\n\n"product0idders.1**8z7. - �b-+0 inesctensPIANlnc3*corE/ IA. etsucas值(', refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=None, reasoning_content=None), matched_stop=None)], created=1756205206, model='default', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=200, prompt_tokens=33, total_tokens=233, completion_tokens_details=None, prompt_tokens_details=None, reasoning_tokens=0), metadata={'weight_version': 'default'})

@yuan-luo
Copy link
Collaborator Author
yuan-luo commented Aug 26, 2025

Checking the flashinfer code: in the trt-llm wrapper, it uses the flag in this way:

__inline__ __device__ void block_barrier(uint32_t** signals, uint32_t const flag,
                                         size_t const local_rank, size_t const world_size,
                                         int const tidx, int const bidx, int const grid_size) {
  // After this function, the block of id == bidx of each GPU has reached the barrier
  if (tidx < world_size) {
    // we can think of signals having the shape [world_size, 2, num_blocks, world_size]
    // (+ an offset on dim 2 to account for flags used in multi_gpu_barrier)
    // Dimension 0 is the "listening" dimension, dimension 3 is "emitting" dimension

    // Block broadcast its flag (local_rank on emitting dimension) to all receivers
    uint32_t flag_block_offset = world_size + bidx * world_size;

    if (flag % 2 == 1) {
      flag_block_offset += (grid_size + 1) * world_size;
    }

    st_flag_release(flag, signals[tidx] + flag_block_offset + local_rank);

    // Blocks check that corresponding blocks on other GPUs have also set the flag
    uint32_t* peer_barrier_d = signals[local_rank] + flag_block_offset + tidx;

    while (ld_flag_acquire(peer_barrier_d) != flag) {
    }
  }

  __syncthreads();
}
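
Reading the quoted kernel, the flag value apparently needs to change on every collective call and to be identical across all ranks for the same call, since each block spins until the peer's slot equals its own flag. A minimal host-side sketch under that reading (the counter helper below is illustrative, not the actual flashinfer API, and it assumes the barrier buffers start zero-initialized):

import itertools

# Hypothetical host-side bookkeeping: a single monotonically increasing
# counter per process, starting at 1 so the first call already differs from
# the (assumed) zero-initialized barrier buffers. All ranks advance it once
# per collective, so every rank passes the same flag value for the same call.
_flag_counter = itertools.count(start=1)

def next_barrier_flag() -> int:
    """Flag value to pass as flag_value into the next custom allreduce call."""
    return next(_flag_counter)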

So I changed next_flag() in this way, but the result is still incorrect.

class FlashInferAllReduceWorkspaceManager:
    flag_value = 1

    def __init__(self):
        self.workspace_tensor = None
        self.world_size = None
        self.rank = None
        self.initialized = False
        self._flag_counters: dict[tuple[int, int], int] = {}

    def next_flag(self, world_size: int, rank: int) -> int:
        # Monotonically increasing flag, tracked per (world_size, rank) pair.
        key = (world_size, rank)
        self._flag_counters[key] = self._flag_counters.get(key, 0) + 1
        return self._flag_counters[key]


# Call site inside the allreduce wrapper:
flag_value = _workspace_manager.next_flag(world_size, dist.get_rank())

_flashinfer_comm.trtllm_custom_all_reduce(
    inp=input_tensor,
    out=out,
    tp_size=world_size,
    tp_rank=dist.get_rank(),
    token_num=token_num,
    fusion_op_code=_flashinfer_comm.AllReduceFusionOp.NONE,
    strategy_code=_flashinfer_comm.AllReduceStrategyType.ONESHOT,
    config_code=_flashinfer_comm.AllReduceStrategyConfig.USE_MEMCPY,
    launch_with_pdl=True,
    flag_value=flag_value,
    peer_comm_buffer_ptrs=torch.tensor(
        _workspace_manager.workspace_tensor[0], dtype=torch.int64
    ),
    peer_barrier_ptrs_in=torch.tensor(
        _workspace_manager.workspace_tensor[2], dtype=torch.int64
    ),
    peer_barrier_ptrs_out=torch.tensor(
        _workspace_manager.workspace_tensor[3], dtype=torch.int64
    ),
    bias=torch.zeros(hidden_dim, dtype=input_tensor.dtype, device=device),
    residual=torch.zeros(hidden_dim, dtype=input_tensor.dtype, device=device),
    weight=torch.zeros(hidden_dim, dtype=input_tensor.dtype, device=device),
    weight_pre_residual_norm=torch.zeros(
        hidden_dim, dtype=input_tensor.dtype, device=device
    ),
    eps=1e-6,
    intermediate_buffer=torch.zeros(
        message_size, dtype=input_tensor.dtype, device=device
    ),
    lamport_peer_comm_buffer_ptrs_0=torch.tensor(
        _workspace_manager.workspace_tensor[4], dtype=torch.int64
    ),
    lamport_peer_comm_buffer_ptrs_1=torch.tensor(
        _workspace_manager.workspace_tensor[5], dtype=torch.int64
    ),
    lamport_peer_comm_buffer_ptrs_2=torch.tensor(
        _workspace_manager.workspace_tensor[6], dtype=torch.int64
    ),
)
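
One way to debug this is to cross-check the custom kernel against the NCCL path on the same input. A minimal sketch, assuming torch.distributed is already initialized and custom_allreduce is a placeholder wrapper around the call above (not an actual SGLang or flashinfer entry point):

import torch
import torch.distributed as dist

def check_against_nccl(custom_allreduce, tokens: int = 64, hidden: int = 4096):
    # Seed per rank so each rank contributes a different tensor.
    torch.manual_seed(1234 + dist.get_rank())
    x = torch.randn(tokens, hidden, dtype=torch.bfloat16, device="cuda")

    ref = x.clone()
    dist.all_reduce(ref)        # NCCL reference result

    out = custom_allreduce(x)   # placeholder for the flashinfer path

    max_err = (out.float() - ref.float()).abs().max().item()
    print(f"rank {dist.get_rank()} max abs error vs NCCL: {max_err}")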

@yuan-luo
Copy link
Collaborator Author
yuan-luo commented Aug 26, 2025

Adding related PR: flashinfer-ai/flashinfer#1096
@yyihuang Could you shed some light on it?

@yuan-luo
Copy link
Collaborator Author
yuan-luo commented Aug 27, 2025

In my tests, the custom_allreduce from #6619 performs better than the trtllm custom allreduce used in flashinfer.
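
For context, such a comparison can be timed with CUDA events along these lines (a rough sketch; allreduce_fn stands in for whichever implementation is being measured):

import torch

def time_allreduce(allreduce_fn, x: torch.Tensor, iters: int = 100) -> float:
    # Warm up, then report average milliseconds per call.
    for _ in range(10):
        allreduce_fn(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        allreduce_fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters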
