`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) by ochafik · Pull Request #13771 · ggml-org/llama.cpp · GitHub

server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) #13771


Merged
merged 13 commits into ggml-org:master on May 25, 2025

Conversation

ochafik
Collaborator
@ochafik ochafik commented May 25, 2025

This allows disabling thinking for all supported thinking models (QwQ, DeepSeek R1 distills, Qwen3, Command R7B) when the flag --reasoning-budget 0 is set.

For per-request behaviour, see #13272 (discussion on upcoming reasoning budget request param) and #13196 (support passing generic kvs).

cc/ @matteoserva
cc/ @ngxson Not sure about the slight alteration of the semantics of the CLI flag (updated docs + inline help), but doesn't feel worth adding a separate flag at this stage, wdyt?
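The PR description mentions two template-dependent mechanisms for suppressing thinking. As a hedged sketch (hypothetical Python helper names; the actual implementation is C++ in llama.cpp's chat-template handling), the dispatch could look like:

```python
# Sketch of the two strategies the PR describes for disabling thinking:
# Qwen3-style templates expose an `enable_thinking` variable, while the
# other thinking models (QwQ, R1 distills, Command R7B) get an already
# closed <think> block so generation starts directly on the visible answer.
# All names here are illustrative, not llama.cpp's actual API.

def disable_thinking_inputs(model_family: str) -> dict:
    if model_family == "qwen3":
        # Template-specific variable, as in Qwen3's chat template.
        return {"template_vars": {"enable_thinking": False}}
    # Forcibly close the thoughts tag for the remaining thinking models.
    return {"assistant_prefill": "<think>\n\n</think>\n\n"}

print(disable_thinking_inputs("qwen3"))
print(disable_thinking_inputs("deepseek-r1"))
```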

@github-actions github-actions bot added testing Everything test related examples python python script changes server labels May 25, 2025
@ngxson
Collaborator
ngxson commented May 25, 2025

yes this can be useful, I thought about it in #13272 , which is part of my idea about implementing the thinking budget.

Just to avoid confusion between none and disabled, I think it's better to call this flag nothink instead. In the future, we may also want to add a hidden mode which still allows the model to generate thoughts, but hides them from the response.

@CISC
Collaborator
CISC commented May 25, 2025

Consider adding Granite's thinking option in its chat template, which changes the system prompt. Basically the inverse of Qwen3's option.

@ochafik
Collaborator Author
ochafik commented May 25, 2025

Consider adding Granite's thinking option in its chat template, which changes the system prompt. Basically the inverse of Qwen3's option.

@CISC I hadn't seen that one, thanks for bringing this up! Strong case for support through @ngxson's #13272, the request param could override the flag then, or something.

@ochafik ochafik changed the title server: add --reasoning-format=disabled to disable thinking (incl. qwen3 w/ enable_thinking:false) server: add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false) May 25, 2025
@ochafik ochafik marked this pull request as ready for review May 25, 2025 09:44
@ochafik ochafik requested a review from ngxson as a code owner May 25, 2025 09:44
common/arg.cpp Outdated
"controls whether thought tags are allowed and/or extracted from the response, and in which format they're returned; one of:\n"
"- none: leaves thoughts unparsed in `message.content`\n"
"- deepseek: puts thoughts in `message.reasoning_content` (except in streaming mode, which behaves as `none`)\n"
"- nothink: prevents generation of thoughts (forcibly closing thoughts tag or setting template-specific variables such as `enable_thinking: false` for Qwen3)\n"
doesn't feel worth adding a separate flag at this stage, wdyt?

Tbh I think we should still separate it into another flag. The format flag is only meant to format the response, not change the behavior, but here nothink changes the generation behavior.

I think it's ok to just add a flag called --reasoning-budget and only support either -1 (unlimited budget) or 0 (no think) for now
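The validation ngxson describes is narrow for now. A minimal sketch (hypothetical helper; the real parsing lives in common/arg.cpp, in C++):

```python
# Sketch of a --reasoning-budget parser that, as suggested, accepts only
# -1 (unlimited budget) or 0 (thinking disabled) for now.

def parse_reasoning_budget(value: str) -> int:
    budget = int(value)
    if budget not in (-1, 0):
        raise ValueError(
            "--reasoning-budget currently supports only -1 (unlimited) or 0 (no thinking)"
        )
    return budget

print(parse_reasoning_budget("0"))
print(parse_reasoning_budget("-1"))
```

Restricting the accepted values up front keeps the door open for real token budgets later without changing the flag's name or type.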

@ngxson ngxson changed the title server: add --reasoning-format=nothink to disable thinking (incl. qwen3 w/ enable_thinking:false) server: add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false) May 25, 2025
@ngxson ngxson changed the title server: add --reasoning-budget to disable thinking (incl. qwen3 w/ enable_thinking:false) server: add --reasoning-budget 0 to disable thinking (incl. qwen3 w/ enable_thinking:false) May 25, 2025
@ochafik ochafik merged commit e121edc into ggml-org:master May 25, 2025
48 checks passed
@countzero

@ngxson & @ochafik I have a question regarding the usage. Simply adding --reasoning-budget 0 does not stop Qwen3 from outputting <think> tags and reasoning before answering. Am I missing something?

llama-server `
    --model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 99 `
    --reasoning-budget 0 `
    --flash-attn

This request:

curl.exe http://127.0.0.1:8080/v1/chat/completions `
    --silent `
    --header "Content-Type: application/json" `
    --data '{
        \"model\": \"Qwen3-30B-A3B.IQ3_XXS.gguf\",
        \"messages\": [
            {
                \"role\": \"user\",
                \"content\": \"How are you?\"
            }
        ],
        \"temperature\": 0.6,
        \"max_tokens\": 1024
    }'

Returns the following:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, the user asked, \"How are you?\" I need to respond appropriately. Since I'm an AI, I don't have feelings, but I should keep the response friendly and helpful. Maybe say something like, \"I'm just a bunch of code, but I'm doing great! How can I assist you today?\" That's positive and shifts the focus back to the user. Let me make sure it's concise and friendly. Yep, that works.\n</think>\n\nI'm just a bunch of code, but I'm doing great! How can I assist you today? 😊"
      }
    }
  ],
  "created": 1748251147,
  "model": "Qwen3-30B-A3B.IQ3_XXS.gguf",
  "system_fingerprint": "b5490-fef693dc",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 121,
    "prompt_tokens": 12,
    "total_tokens": 133
  },
  "id": "chatcmpl-Ihg3Q1yUsY6rFGKOnOXr6hbRtTR42v2e",
  "timings": {
    "prompt_n": 12,
    "prompt_ms": 69.177,
    "prompt_per_token_ms": 5.76475,
    "prompt_per_second": 173.46806019341687,
    "predicted_n": 121,
    "predicted_ms": 893.017,
    "predicted_per_token_ms": 7.3803057851239675,
    "predicted_per_second": 135.49574084255954
  }
}

@kth8
kth8 commented May 26, 2025

@countzero You need to start the server with --jinja in addition to --reasoning-budget 0.

@countzero

@kth8 Thank you for the hint. That indeed works now:

llama-server `
    --model 'D:\AI\LLM\gguf\Qwen3-30B-A3B\Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --alias 'Qwen3-30B-A3B.IQ3_XXS.gguf' `
    --ctx-size 8192 `
    --threads 16 `
    --n-gpu-layers 99 `
    --reasoning-budget 0 `
    --jinja `
    --flash-attn

@ngxson & @ochafik As a developer I would like to use the --reasoning-budget argument without having to know about the --jinja flag, so that I can simply use what I read in the usage documentation directly.

Suggestion: Activate --jinja automatically if --reasoning-budget needs it. I think a similar mechanism is already implemented for other flags.
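The suggestion above can be sketched as a small post-processing step over parsed arguments (hypothetical names throughout; llama.cpp's actual argument handling is C++ and may differ):

```python
# Sketch of countzero's suggestion: if the user set a reasoning budget
# other than the default, implicitly enable --jinja, since the budget only
# takes effect through the Jinja chat-template path.

def finalize_server_args(args: dict) -> dict:
    if args.get("reasoning_budget", -1) != -1 and not args.get("jinja", False):
        args = {**args, "jinja": True}
    return args

print(finalize_server_args({"reasoning_budget": 0}))
print(finalize_server_args({"reasoning_budget": -1}))
```

The default budget of -1 is left untouched, so users who never set the flag see no behavior change.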

@characharm
Contributor

Please take a look: #13877
