Conversation
Delete Chinese comments, remove useless code, and fix lint
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request primarily focuses on upgrading the project's core sglang dependency to version 0.5.9 and adding support for Qwen3.5 Eagle3.
Code Review
This pull request upgrades sglang to version 0.5.9 and adds support for qwen3.5 eagle3. The changes primarily involve adapting the codebase to the new sglang API, including modifications to distributed initialization, model runner, and forward pass logic. The changes appear reasonable and necessary for the upgrade. I've identified a few areas for improvement, including a leftover debug print statement, some hardcoded values in an example script, and a minor typo in a docstring. Addressing these will enhance the code's quality and maintainability.
```python
# Debug: Print the values
dp_size = getattr(self.server_args, "dp_size", 1)
attn_cp_size = getattr(self.server_args, "attn_cp_size", 1)
moe_dp_size = getattr(self.server_args, "moe_dp_size", 1)
print(
    f"[DEBUG] tp_size={self.tp_size}, dp_size={dp_size}, attn_cp_size={attn_cp_size}, moe_dp_size={moe_dp_size}"
)
```
This block includes a debug print statement. Such statements should be removed from the final code to avoid polluting logs and to maintain code cleanliness.
Suggested change:

```diff
-# Debug: Print the values
 dp_size = getattr(self.server_args, "dp_size", 1)
 attn_cp_size = getattr(self.server_args, "attn_cp_size", 1)
 moe_dp_size = getattr(self.server_args, "moe_dp_size", 1)
-print(
-    f"[DEBUG] tp_size={self.tp_size}, dp_size={dp_size}, attn_cp_size={attn_cp_size}, moe_dp_size={moe_dp_size}"
-)
```
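If the parallel-size information is still useful at runtime, routing it through the standard `logging` module is a common alternative to deleting it outright. A minimal sketch (the `log_parallel_sizes` helper and its placement are hypothetical, not part of this PR):

```python
import logging

logger = logging.getLogger(__name__)

def log_parallel_sizes(server_args, tp_size):
    """Hypothetical helper: emit the parallelism sizes at DEBUG level so
    normal runs keep a clean stdout while the data stays available."""
    dp_size = getattr(server_args, "dp_size", 1)
    attn_cp_size = getattr(server_args, "attn_cp_size", 1)
    moe_dp_size = getattr(server_args, "moe_dp_size", 1)
    logger.debug(
        "tp_size=%s dp_size=%s attn_cp_size=%s moe_dp_size=%s",
        tp_size, dp_size, attn_cp_size, moe_dp_size,
    )
```

Unlike `print`, the message only appears when DEBUG logging is enabled, so it cannot pollute production logs.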
```bash
CUDA_VISIBLE_DEVICES=1,2,3,5 torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    scripts/prepare_hidden_states.py \
    --target-model-path /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
```
The script contains hardcoded values for CUDA_VISIBLE_DEVICES and --target-model-path. This reduces portability and makes it difficult for other users to run the script on different machine configurations. It's recommended to parameterize these values using environment variables or script arguments.
Suggested change:

```diff
-CUDA_VISIBLE_DEVICES=1,2,3,5 torchrun \
+CUDA_VISIBLE_DEVICES=${CUDA_DEVICES:-"1,2,3,5"} torchrun \
     --standalone \
     --nproc_per_node $NUM_GPUS \
     scripts/prepare_hidden_states.py \
-    --target-model-path /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
+    --target-model-path ${TARGET_MODEL_PATH:-"/data/jiapingW/pretrained_models/Qwen3.5-35B-A3B"} \
```
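With the `${VAR:-default}` form, the original defaults still apply when the variables are unset, but users can override them at invocation time without editing the script, e.g. `CUDA_DEVICES="0,1" TARGET_MODEL_PATH=/path/to/model bash <script>` (the path and script name here are placeholders).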
```bash
# NUM_GPUS=2
# CUDA_VISIBLE_DEVICES=6,7 torchrun \
#     --standalone \
#     --nproc_per_node $NUM_GPUS \
#     $ROOT_DIR/scripts/train_eagle3.py \
#     --target-model-path /data/jiapingW/pretrained_models/Qwen3.5-35B-A3B \
#     --draft-model-config $ROOT_DIR/configs/qwen3.5-35b-a3b-eagle3.json \
#     --train-data-path $ROOT_DIR/cache/dataset/ultrachat_train.jsonl \
#     --train-hidden-states-path $ROOT_DIR/cache/hidden_states/qwen3.5-35b-a3b-ultrachat \
#     --build-dataset-num-proc $BUILD_DATASET_NUM_PROC \
#     --output-dir $ROOT_DIR/outputs/qwen3.5-35b-a3b-ultrachat \
#     --num-epochs 10 \
#     --batch-size 1 \
#     --tp-size 1 \
#     --learning-rate 5e-5 \
#     --max-length 4096 \
#     --chat-template qwen \
#     --cache-dir $ROOT_DIR/cache \
#     --embedding-key "model.language_model.embed_tokens.weight"
```
A suggested fix for the docstring typo noted above (`b1` should be `g1`):

```diff
     [g0, g1], [g2, g3], [g4, g5], [g6, g7]
 2 pipeline model-parallel groups:
-    [g0, g2, g4, g6], [b1, g3, g5, g7]
+    [g0, g2, g4, g6], [g1, g3, g5, g7]
```
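For context on that docstring's layout: with a world size of 8, tensor-parallel size 2, and pipeline-parallel size 4, Megatron-style rank grouping yields exactly the groups listed. A minimal sketch (the `build_groups` helper is illustrative and assumes that grouping scheme; it is not sglang's actual initialization code):

```python
def build_groups(world_size: int, tp_size: int, pp_size: int):
    """Illustrative Megatron-style rank grouping (an assumption, not
    sglang's actual code): tensor-parallel groups are consecutive rank
    blocks, pipeline groups stride across those blocks."""
    # 8 // 2 = 4 tensor model-parallel groups of consecutive ranks
    tp_groups = [
        list(range(i, i + tp_size)) for i in range(0, world_size, tp_size)
    ]
    # 8 // 4 = 2 pipeline model-parallel groups; each strides by that count
    num_pp_groups = world_size // pp_size
    pp_groups = [
        list(range(start, world_size, num_pp_groups))
        for start in range(num_pp_groups)
    ]
    return tp_groups, pp_groups

tp, pp = build_groups(world_size=8, tp_size=2, pp_size=4)
assert tp == [[0, 1], [2, 3], [4, 5], [6, 7]]
assert pp == [[0, 2, 4, 6], [1, 3, 5, 7]]
```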
Motivation
Because sglang has undergone several version updates and now supports new models as well as Eagle3 for some models, we have upgraded SpecForge's sglang dependency to version 0.5.9. This upgrade also enables training with Qwen3.5. The sglang branch that currently supports Qwen3.5 is located at https://github.com/jiapingW/sglang/tree/qwen3.5-eagle3; we will be adding it to the upstream sglang repository soon.
This PR will also make it easier to train Eagle3 drafts for future model releases. We will validate its effectiveness soon, covering some previous models as well as Qwen3.5.
This PR also includes some other updates:

- Checkpoints are saved as `epoch_x_step_xxxx`, and we choose the largest `(x, xxxx)` in lexicographical order to continue training. For example, for a 10-epoch task, `(5, 0)` means we need to train for another 5 epochs, and `(4, 20000)` means we need to train for another 6 epochs, resuming the 4th epoch from its 20000th step (see the sketch under Modifications below).

Modifications
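Among the modifications is the checkpoint-resume rule described above. A minimal sketch of that rule (the `latest_checkpoint` helper is hypothetical, not SpecForge's actual code; it assumes checkpoint directories named `epoch_{x}_step_{xxxx}` under the output directory):

```python
import re
from pathlib import Path

# Hypothetical helper illustrating the resume rule: checkpoints live in
# directories named "epoch_{x}_step_{xxxx}", and training resumes from the
# largest (epoch, step) pair, so (5, 0) is preferred over (4, 20000).
_CKPT_RE = re.compile(r"epoch_(\d+)_step_(\d+)")

def latest_checkpoint(output_dir: str):
    candidates = []
    for path in Path(output_dir).iterdir():
        match = _CKPT_RE.fullmatch(path.name)
        if match:
            epoch, step = int(match.group(1)), int(match.group(2))
            candidates.append(((epoch, step), path))
    if not candidates:
        return None  # nothing to resume from; start training from scratch
    # Tuple comparison is lexicographic, matching the "largest (x, xxxx)" rule.
    return max(candidates, key=lambda c: c[0])[1]
```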
Related Issues
Accuracy Test
- qwen3: ✅ Used online training on ultrachat with 4k length; the accept length is OK.
- qwen3.5: TODO
- gpt-oss: TODO

Benchmark & Profiling
Checklist