Add Arcee model support #38621
Conversation
- Add ArceeConfig and model mappings for all task types (CausalLM, SequenceClassification, QuestionAnswering, TokenClassification)
- Add auto-loading support through AutoModel, AutoConfig, and AutoTokenizer (see the usage sketch after this list)
- Use LlamaTokenizer for tokenization
- Add FX graph support for Arcee models
- Create lazy loading module structure for Arcee
- Add test_modeling_arcee.py following standard transformers test patterns
- Include tests for all model variants (CausalLM, SequenceClassification, QuestionAnswering, TokenClassification)
- Add specific test for ReLU² activation in ArceeMLP
- Add RoPE scaling tests including YARN support
- Follow CausalLMModelTest pattern used by similar models
- Add comprehensive model documentation with usage examples
- Include all model variants in autodoc
- Add to table of contents in proper alphabetical order
- Fixes documentation coverage for Arcee model classes
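For illustration, a minimal sketch of what that auto-loading enables; the checkpoint name is the placeholder used in this PR's docstrings, not a released model:

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "arcee-ai/AFM-4.5B"  # placeholder checkpoint name from the PR docstrings

config = AutoConfig.from_pretrained(model_id)        # resolves to ArceeConfig via the auto mappings
tokenizer = AutoTokenizer.from_pretrained(model_id)  # backed by LlamaTokenizer
model = AutoModelForCausalLM.from_pretrained(model_id)
```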
Looks good @Crystalcareai! Feel free to ping us whenever you're ready for review. You can also resolve the code style errors with
@Rocketknight1 Hey, I think I'm ready for a review. I've got a lot of the tests passing, though I'm still getting some failures that don't seem to be related to my code. Let me know how best I can get this ready for merging.
Hey! Very clean first implementation with modular, congrats!! 🤗 We can still make it even simpler though, see my comments 🚀 Also, let's make sure the copyright headers at the top of the files have the correct information (dates and company names, mostly).
But very nice work in general! 🤗
docs/source/en/model_doc/arcee.md
```diff
@@ -0,0 +1,104 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
```
It's 2025! 🤗
```diff
@@ -0,0 +1,32 @@
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
```
wrong dates and companies here as well
```python
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
```
Not true, to remove
```python
# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
```
same comment as above here!
```python
class ArceeMLP(nn.Module):
    """Arcee MLP with configurable activation function (typically relu2)"""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=config.mlp_bias)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=config.mlp_bias)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.up_proj(x)))
        return down_proj
```
You could actually inherit that directly from Nemotron as it's 1:1 similar!
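For reference, a sketch of what the modular definition could collapse to, assuming NemotronMLP (up_proj → relu2 → down_proj, no gate projection) really is one-to-one with the block above; inside modular_arcee.py the relative import form would be used instead:

```python
from transformers.models.nemotron.modeling_nemotron import NemotronMLP


class ArceeMLP(NemotronMLP):
    pass
```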
""" | ||
The Arcee Model transformer with a sequence classification head on top (linear layer). | ||
""" | ||
|
||
def __init__(self, config): | ||
self.config_class = ArceeConfig | ||
super().__init__(config) | ||
self.model = ArceeModel(config) | ||
# Initialize weights and apply final processing | ||
self.post_init() | ||
|
||
|
||
@auto_docstring(checkpoint="arcee-ai/AFM-4.5B") | ||
class ArceeForQuestionAnswering(LlamaForQuestionAnswering): | ||
""" | ||
The Arcee Model transformer with a span classification head on top for extractive question-answering tasks. | ||
""" | ||
|
||
def __init__(self, config): | ||
self.config_class = ArceeConfig | ||
super().__init__(config) | ||
# Note: LlamaForQuestionAnswering uses self.transformer, not self.model | ||
self.transformer = ArceeModel(config) | ||
# Initialize weights and apply final processing | ||
self.post_init() | ||
|
||
|
||
@auto_docstring(checkpoint="arcee-ai/AFM-4.5B") | ||
class ArceeForTokenClassification(LlamaForTokenClassification): | ||
""" | ||
The Arcee Model transformer with a token classification head on top. | ||
""" | ||
|
||
def __init__(self, config): | ||
self.config_class = ArceeConfig | ||
super().__init__(config) | ||
self.model = ArceeModel(config) | ||
# Initialize weights and apply final processing | ||
self.post_init() |
Similarly for those, we don't actually need to rewrite the init at all 🤗
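A rough sketch of what that simplification could look like, assuming the modular converter substitutes the Arcee names (e.g. ArceeModel for LlamaModel) when generating the final modeling file:

```python
from transformers.models.llama.modeling_llama import (
    LlamaForQuestionAnswering,
    LlamaForSequenceClassification,
    LlamaForTokenClassification,
)
from transformers.utils import auto_docstring


@auto_docstring(checkpoint="arcee-ai/AFM-4.5B")
class ArceeForSequenceClassification(LlamaForSequenceClassification):
    pass


@auto_docstring(checkpoint="arcee-ai/AFM-4.5B")
class ArceeForQuestionAnswering(LlamaForQuestionAnswering):
    pass


@auto_docstring(checkpoint="arcee-ai/AFM-4.5B")
class ArceeForTokenClassification(LlamaForTokenClassification):
    pass
```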
src/transformers/utils/fx.py
"altclip", | ||
"arcee", |
This change should be reverted - this is more for legacy purposes
```python
    def test_model_rope_scaling(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
        for scaling_type in ["linear", "dynamic"]:
            config.rope_scaling = {"type": scaling_type, "factor": 2.0}
            model = ArceeModel(config)
            model.to(torch_device)
            model.eval()
            input_ids = torch.randint(0, config.vocab_size, (1, 10)).to(torch_device)
            with torch.no_grad():
                model(input_ids)

    def test_model_rope_scaling_yarn(self):
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
        config.rope_scaling = {
            "type": "yarn",
            "factor": 2.0,
            "original_max_position_embeddings": 2048,
            "attention_factor": 1.0,
            "beta_fast": 32,
            "beta_slow": 1,
        }
        model = ArceeModel(config)
        model.to(torch_device)
        model.eval()
        input_ids = torch.randint(0, config.vocab_size, (1, 10)).to(torch_device)
        with torch.no_grad():
            model(input_ids)
```
Those tests are not needed and can be removed
```python
    def test_arcee_mlp_uses_relu_squared(self):
        """Test that ArceeMLP uses ReLU² activation instead of SiLU."""
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
        config.hidden_act = "relu2"  # Ensure we're using relu2 activation
        model = ArceeModel(config)

        # Check that the MLP layers use the correct activation
        for layer in model.layers:
            mlp = layer.mlp
            # Test with a simple input
            x = torch.randn(1, 10, config.hidden_size)
            up_output = mlp.up_proj(x)

            # Verify ReLU² activation: x * relu(x)
            expected_activation = up_output * torch.relu(up_output)
            actual_activation = mlp.act_fn(up_output)

            self.assertTrue(torch.allclose(expected_activation, actual_activation, atol=1e-5))
```
I don't mind testing this to make sure, but let's not use a for loop if we don't actually use the loop!
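One way to do that, as a sketch in the same test-class context as above (checking a single MLP, since every layer shares the same activation):

```python
    def test_arcee_mlp_uses_relu_squared(self):
        """Test that ArceeMLP uses ReLU² activation instead of SiLU."""
        config, _ = self.model_tester.prepare_config_and_inputs_for_common()
        config.hidden_act = "relu2"
        model = ArceeModel(config)

        # All layers share the same activation, so the first MLP is representative
        mlp = model.layers[0].mlp
        up_output = mlp.up_proj(torch.randn(1, 10, config.hidden_size))

        # relu2(x) == relu(x) ** 2, which is elementwise identical to x * relu(x)
        expected_activation = up_output * torch.relu(up_output)
        self.assertTrue(torch.allclose(mlp.act_fn(up_output), expected_activation, atol=1e-5))
```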
```diff
@@ -0,0 +1,153 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
```
date here as well haha
Hi @Cyrilvallez,
Yes, it's already in the PreTrainedModel!
Alright, it's perfect! 🤗🤗 Just added some minor comments about the config; most notably, the tp_plan should reflect the new MLP (this is very important for your model to be available as a backend in vLLM/TGI and other frameworks). Otherwise all good! Great work! I'll merge as soon as you make those small changes 🤗
```python
        pretraining_tp (`int`, *optional*, defaults to 1):
            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
            document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism) to
            understand more about it. This value is necessary to ensure exact reproducibility of the pretraining
            results. Please refer to [this issue](https://github.com/pytorch/pytorch/issues/76232).
```
This arg should be removed!
```python
    model_type = "arcee"
```
Here you should add the following TP plan as a class attribute (we need to change it, as the MLP is slightly different from Llama's):
```python
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }
```
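As a side note, a sketch of how such a plan gets exercised end to end, assuming a recent transformers release where `from_pretrained` accepts `tp_plan="auto"` and the script is launched with torchrun across several GPUs (the checkpoint name is again a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM

# Shards the attention and MLP projections according to base_model_tp_plan
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/AFM-4.5B",  # placeholder checkpoint
    torch_dtype=torch.bfloat16,
    tp_plan="auto",
)
```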
```python
        pad_token_id=None,
        bos_token_id=128000,
        eos_token_id=128001,
        pretraining_tp=1,
```
pretraining_tp to remove here
```python
        # Validate the correctness of rotary position embeddings parameters using Arcee's custom validation
        # BC: if there is a 'type' field, copy it to 'rope_type'.
        if self.rope_scaling is not None and "type" in self.rope_scaling:
            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
        rope_config_validation(self)
```
No need to overwrite the check, but we should delete the pretraining_tp attribute to make it disappear from the actual config:
Suggested change:

```diff
-        # Validate the correctness of rotary position embeddings parameters using Arcee's custom validation
-        # BC: if there is a 'type' field, copy it to 'rope_type'.
-        if self.rope_scaling is not None and "type" in self.rope_scaling:
-            self.rope_scaling["rope_type"] = self.rope_scaling["type"]
-        rope_config_validation(self)
+        del self.pretraining_tp
```
```python
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            pretraining_tp=pretraining_tp,
```
pretraining_tp to remove here as well
```python
@require_torch_accelerator
class ArceeIntegrationTest(unittest.TestCase):
    def tearDown(self):
        import gc

        gc.collect()
        torch.cuda.empty_cache()

    @slow
    def test_model_from_pretrained(self):
        # This test would be enabled once a pretrained model is available
        # For now, we just test that the model can be instantiated
        config = ArceeConfig()
        model = ArceeForCausalLM(config)
        self.assertIsInstance(model, ArceeForCausalLM)
```
Would be nice to add a few integration tests based on real checkpoints as well if possible! 🤗 Otherwise you can open another PR later if that's more convenient for you.
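For example, something along these lines could slot into the integration test class above once a public checkpoint exists (assuming AutoTokenizer is imported in the test module; the checkpoint name and prompt are placeholders, and no exact completion is asserted):

```python
    @slow
    def test_model_generation(self):
        model_id = "arcee-ai/AFM-4.5B"  # placeholder until the AFM checkpoints are released
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = ArceeForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

        inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=10, do_sample=False)
        text = tokenizer.decode(output[0], skip_special_tokens=True)

        # Tighten to an exact expected string once real checkpoints are available
        self.assertTrue(text.startswith("The capital of France is"))
```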
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@Cyrilvallez Thanks for the feedback. Removed the pretraining TP from the configurations and added scaffolding for generation integration testing. We will add more robust integration tests and update the checkpoints with the release.
All right, merging! Thanks a lot! TP plan is still wrong, but I'll update it myself after the merge! 🤗🚀
Summary
This PR adds support for the Arcee model architecture, laying the groundwork for the upcoming Arcee Foundation Model (AFM) release. Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations.
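For readers unfamiliar with the activation, a standalone sketch of ReLU² itself (just the math, not the transformers implementation):

```python
import torch


def relu_squared(x: torch.Tensor) -> torch.Tensor:
    # Zero for x <= 0, x**2 for x > 0
    return torch.relu(x) ** 2


x = torch.linspace(-2.0, 2.0, steps=9)
# The two formulations coincide elementwise: relu(x)**2 == x * relu(x)
assert torch.allclose(relu_squared(x), x * torch.relu(x))
```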
Model Description
Arcee is architecturally similar to Llama but with the following distinctions:
- ReLU² activation (`x * relu(x)`) in MLP layers for improved gradient flow

Implementation Details
Testing