Description
System Info
transformers=v4.51.1
torch=2.6.0
python=3.10.0
The code snippet is as follows:
```python
output = model.generate(
    **input_ids,
    do_sample=False,
    num_beams=2,
    max_new_tokens=1,
    force_words_ids=[tokenizer.convert_tokens_to_ids(['A', 'B', 'C', 'D'])],
)
```
The error message is as follows:
```
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama4/modeling_llama4.py", line 379, in forward
[rank0]:     attn_output, attn_weights = attention_interface(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama4/modeling_llama4.py", line 286, in eager_attention_forward
[rank0]:     attn_weights = attn_weights + causal_mask
[rank0]: RuntimeError: The size of tensor a (8192) must match the size of tensor b (18) at non-singleton dimension 3
```
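A quick way to narrow this down (just a diagnostic sketch, reusing the variables from the full reproduction below) is to run the identical call with constrained beam search turned off; if that succeeds, the mask mismatch is specific to the `num_beams` + `force_words_ids` path:

```python
# Diagnostic sketch: same prompt, plain greedy decoding, no beam constraints.
# If this runs cleanly, the shape error above is triggered only by constrained beam search.
plain = model.generate(
    **inputs.to(model.device),
    do_sample=False,
    max_new_tokens=1,
)
print(tokenizer.batch_decode(plain[:, inputs["input_ids"].shape[-1]:])[0])
```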
Code
```python
import torch
from transformers import AutoTokenizer, Llama4ForConditionalGeneration

model_id = "...Llama-4-Scout-17B-16E-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Which one do you choose among ABCD?"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

output = model.generate(
    **inputs.to(model.device),
    max_new_tokens=1,
    num_beams=2,
    do_sample=False,
    force_words_ids=[tokenizer.convert_tokens_to_ids(['A', 'B', 'C', 'D'])],
)
# Decode only the newly generated tokens.
outputs = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[-1]:])
print(outputs[0])
```
Expected behavior
`generate()` should complete without error and return one of the forced tokens ('A', 'B', 'C', or 'D'). Why does this shape mismatch occur, and how can it be solved?
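Since only a single new token is needed and it must be one of 'A', 'B', 'C', 'D', a possible stopgap (a sketch, not a fix for the underlying bug) is to skip constrained beam search entirely and pick the best candidate directly from the next-token logits of one forward pass:

```python
import torch

# Workaround sketch: score only the four allowed answer tokens from a single
# forward pass, avoiding the beam-search + force_words_ids code path entirely.
candidate_ids = tokenizer.convert_tokens_to_ids(['A', 'B', 'C', 'D'])
with torch.no_grad():
    logits = model(**inputs.to(model.device)).logits  # [batch, seq_len, vocab]
next_token_logits = logits[0, -1]                     # logits at the next position
best_id = candidate_ids[int(next_token_logits[candidate_ids].argmax())]
print(tokenizer.decode([best_id]))
```

This reproduces the forced-choice behavior for the `max_new_tokens=1` case, though it obviously does not address the mask bug itself.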