[Model] add dots1 #38143
Conversation
@ArthurZucker Could you please take a look?
Hi @redmoe-moutain, is there an existing pre-trained dots1 model somewhere? We generally don't add architectures until we need them to support a significant model checkpoint.
@Rocketknight1 We're rolling out the open-source models.
Cool! In that case, @Cyrilvallez can you take the review?
Looks actually very nice! Thanks for the clean modular, I have a hard time believing it!
```python
class Dots1ModelTester:
```
Can you check test_modeling_llama? We have a new simpler mixin for testing!
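For reference, a hedged sketch of what that setup might look like, assuming the shared causal-LM tester used in test_modeling_llama; the exact class names and required attributes may differ from the real mixin:

```python
# A sketch of the simpler test setup, modeled on test_modeling_llama.
# CausalLMModelTester/CausalLMModelTest are assumed to be the shared
# mixin names; the attributes dots1 actually needs may differ.
from transformers import Dots1Config, Dots1ForCausalLM, Dots1Model

from ...causal_lm_tester import CausalLMModelTest, CausalLMModelTester


class Dots1ModelTester(CausalLMModelTester):
    config_class = Dots1Config
    base_model_class = Dots1Model
    causal_lm_class = Dots1ForCausalLM


class Dots1ModelTest(CausalLMModelTest):
    model_tester_class = Dots1ModelTester
```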
```python
from ...configuration_utils import PretrainedConfig
```
Missing a license!
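For reference, new files in the repo carry the standard Apache-2.0 header; a sketch follows, where the copyright holder line is an assumption:

```python
# Copyright 2025 The rednote-hilab team and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```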
"layers.*.mlp.experts.*.gate_proj": "local_colwise", | ||
"layers.*.mlp.experts.*.up_proj": "local_colwise", | ||
"layers.*.mlp.experts.*.down_proj": "local_rowwise", | ||
"layers.*.mlp.experts.*": "local", # each expert is wrapped in a module list | ||
"layers.*.mlp.shared_experts.gate_proj": "local_colwise", | ||
"layers.*.mlp.shared_experts.up_proj": "local_colwise", | ||
"layers.*.mlp.shared_experts.down_proj": "local_rowwise", | ||
"layers.*.mlp.shared_experts": "local", | ||
"layers.*.mlp.gate_proj": "local_colwise", | ||
"layers.*.mlp.up_proj": "local_colwise", |
quick q, did you test TP to make sure it works?
@ArthurZucker Thank you for your insightful review. I've updated the license and testing as suggested. While we've tested on PP, we haven't yet covered TP testing cases. Could you provide some examples of how we should approach TP? Any guidance would be greatly appreciated.
@ArthurZucker Could you please take another look?
cc @Cyrilvallez for core maintainer review since Arthur is out!
Amazingly simple modular! Super nice 🤗 Added a few comments, but this is truly almost ready to be shipped 👌
For TP, you can check out the example in the doc here. Let me know if something is still unclear!
You could add more integration tests as well, maybe try beyond the sliding window etc. as in Qwen3, but this is optional.
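For reference, a minimal TP smoke test along the lines of the linked doc might look like the sketch below; the checkpoint id is the one used later in this thread, and the prompt and GPU count are placeholders:

```python
# Minimal TP smoke test, loosely following the transformers TP docs.
# Launch with: torchrun --nproc-per-node 4 tp_test.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rednote-hilab/dots.llm1.inst"
model = AutoModelForCausalLM.from_pretrained(model_id, tp_plan="auto", torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```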
docs/source/en/model_doc/dots1.md (Outdated)

```
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
```
It's 2025! 🤗
```python
# Copyright 2024 The HuggingFace Team. All rights reserved.
```
same here haha
```python
pretraining_tp (`int`, *optional*, defaults to 1):
    Experimental: tensor parallelism rank used during pretraining. This is necessary for exact reproducibility
    of pretraining results.
```
This should be removed!
```python
use_sliding_window (`bool`, *optional*, defaults to `False`):
    Whether to use sliding window attention.
```
The best is to not have this arg, and simply check whether `sliding_window` is `None` instead, so it should be removed.
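In other words, a minimal sketch of the suggested check; the helper name here is hypothetical:

```python
def uses_sliding_window(config) -> bool:
    # Suggested pattern: no separate use_sliding_window flag; sliding-window
    # attention is enabled exactly when config.sliding_window is set.
    return getattr(config, "sliding_window", None) is not None
```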
```python
from ...modeling_outputs import CausalLMOutputWithPast
from ...processing_utils import Unpack
```
Missing a license here at the top
```python
from ..llama.modeling_llama import (
    KwargsForCausalLM,
    LlamaRMSNorm,
)
```
Let's import those 2 classes from Qwen3 instead! They are similar, and it's easier to follow if we import everything from the same model!
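That is, something along these lines in the modular file; a sketch of the suggestion, assuming both names are exposed by modeling_qwen3 (the modular converter expands such subclass stubs into full classes):

```python
# Sketch for modular_dots1.py: pull the shared pieces from Qwen3 rather
# than Llama, so everything is inherited from one model.
from ..qwen3.modeling_qwen3 import (
    KwargsForCausalLM,
    Qwen3RMSNorm,
)


class Dots1RMSNorm(Qwen3RMSNorm):
    pass
```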
```python
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
```
2025 as well!
```python
# greedy generation outputs
generated_ids = model.generate(input_ids, max_new_tokens=20, do_sample=False)
text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(text)
```
Let's remove this print
@Cyrilvallez Thanks for the review. It looks much cleaner now! I followed the documentation to test TP with:

```python
model = AutoModelForCausalLM.from_pretrained("rednote-hilab/dots.llm1.inst", tp_plan="auto", torch_dtype=torch.bfloat16)
```

However, I encountered an error, so I modified `_prepare_output_fn` to handle tuple outputs:

```python
@staticmethod
def _prepare_output_fn(output_layouts, use_local_output, mod, outputs, device_mesh):
    if isinstance(outputs, tuple):
        return tuple(output.to_local() if use_local_output else output for output in outputs)
    return outputs.to_local() if use_local_output else outputs
```

After this change, it works as expected. Thanks!
Not seeing the changes to tensor parallel, but yes, let's open a different PR and merge this!
docs/source/en/model_doc/dots1.md (Outdated)

```
## Overview

The `dots.llm1` model was proposed in dots.llm1 technical report by rednote-hilab team.
```
can we add a hot link here!
Sure, I've added the hyperlink.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for bearing with us and kudos for the release!
What does this PR do?

Support model `dots.llm1` by rednote-hilab.

Before submitting

- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.