server : add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) #13771
Conversation
Yes, this can be useful. I thought about it in #13272, which is part of my idea about implementing the thinking budget, just to be less confused between […]

Consider adding Granite's […]

@CISC I hadn't seen that one, thanks for bringing this up! Strong case for supporting it through @ngxson's #13272; the request param could then override the flag, or something.
(Title changed) server : add --reasoning-format=disabled to disable thinking (incl. qwen3 w/ enable_thinking:false) → server : add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false)
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
common/arg.cpp (outdated)
"controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:\n" | ||
"- none: leaves thoughts unparsed in `message.content`\n" | ||
"- deepseek: puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`)\n" | ||
"- nothink: prevents generation of thoughts (forcibly closing thoughts tag or setting template-specific variables such as `enable_thinking: false` for Qwen3)\n" |
doesn't feel worth adding a separate flag at this stage, wdyt?
Tbh I think we should still separate it into another flag. `format` implies it only formats the response, not changing the behavior, but here `nothink` changes the generation behavior.
I think it's ok to just add a flag called `--reasoning-budget` and only support either `-1` (unlimited budget) or `0` (no think) for now.
(Title changed) server : add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false) → server : add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false)
(Title changed) server : add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false) → server : add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false)
@ngxson & @ochafik I have a question regarding the usage. Simply starting llama-server like this:

```powershell
llama-server `
    --model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 99 `
    --reasoning-budget 0 `
    --flash-attn
```

This request:

```powershell
curl.exe http://127.0.0.1:8080/v1/chat/completions `
    --silent `
    --header "Content-Type: application/json" `
    --data '{
        \"model\": \"Qwen3-30B-A3B.IQ3_XXS.gguf\",
        \"messages\": [
            {
                \"role\": \"user\",
                \"content\": \"How are you?\"
            }
        ],
        \"temperature\": 0.6,
        \"max_tokens\": 1024
    }'
```

Returns the following (note the `<think>` block is still present):

```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, the user asked, \"How are you?\" I need to respond appropriately. Since I'm an AI, I don't have feelings, but I should keep the response friendly and helpful. Maybe say something like, \"I'm just a bunch of code, but I'm doing great! How can I assist you today?\" That's positive and shifts the focus back to the user. Let me make sure it's concise and friendly. Yep, that works.\n</think>\n\nI'm just a bunch of code, but I'm doing great! How can I assist you today?"
      }
    }
  ],
  "created": 1748251147,
  "model": "Qwen3-30B-A3B.IQ3_XXS.gguf",
  "system_fingerprint": "b5490-fef693dc",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 121,
    "prompt_tokens": 12,
    "total_tokens": 133
  },
  "id": "chatcmpl-Ihg3Q1yUsY6rFGKOnOXr6hbRtTR42v2e",
  "timings": {
    "prompt_n": 12,
    "prompt_ms": 69.177,
    "prompt_per_token_ms": 5.76475,
    "prompt_per_second": 173.46806019341687,
    "predicted_n": 121,
    "predicted_ms": 893.017,
    "predicted_per_token_ms": 7.3803057851239675,
    "predicted_per_second": 135.49574084255954
  }
}
```
@countzero You need to start the server with `--jinja`.
@kth8 Thank you for the hint. That indeed works now:

```powershell
llama-server `
    --model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 99 `
    --reasoning-budget 0 `
    --jinja `
    --flash-attn
```

@ngxson & @ochafik As a developer I would like to use the […] Suggestion: Activate […]
Please take a look: #13877
This allows disabling thinking for all supported thinking models (QwQ, DeepSeek R1 distills, Qwen3, Command R7B) when the flag --reasoning-budget 0 is set. For Qwen3 this sets `"enable_thinking": false` as an extra template context variable (similar to #13196 "Support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client", which will still be very useful in general). For per-request behaviour, see #13272 (discussion on the upcoming reasoning budget request param) and #13196 (support passing generic kvs).
cc/ @matteoserva
cc/ @ngxson Not sure about the slight alteration of the semantics of the CLI flag (updated docs + inline help), but doesn't feel worth adding a separate flag at this stage, wdyt?