Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars #9639

ochafik · 2024-09-25T15:37:26Z

This supersedes #6389 (now using a fully C++ approach), #5695 (first attempt at supporting Functionary) and #9592 (more recent Python wrapper).

Which models are supported (in their native style)?

While any model should work (w/ generic fallback using JSON schema constraints), this PR supports the native call style of a few models:

Llama 3.1 / 3.3 (including builtin tools support), Llama 3.2
Functionary v3.1 / v3.2
Hermes 2/3, Qwen 2.5
tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars #12034
Mistral Nemo
Firefunction v2
DeepSeek R1
server: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless --reasoning-format none #11607
Command R7B
tool-call: support Command R7B (+ return tool_plan "thoughts" in API) #11585

(note: streaming ~~incubated~~ implemented in #12379)

Show all templates supported by minja and which handler they use

Template	Format
CohereForAI-c4ai-command-r-plus-default.jinja	generic tool calls
CohereForAI-c4ai-command-r-plus-rag.jinja	generic tool calls
CohereForAI-c4ai-command-r-plus-tool_use.jinja	generic tool calls
MiniMaxAI-MiniMax-Text-01.jinja	generic tool calls
NexaAIDev-Octopus-v2.jinja	generic tool calls
NousResearch-Hermes-2-Pro-Llama-3-8B-default.jinja	generic tool calls
NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja	hermes 2 pro tool calls
NousResearch-Hermes-2-Pro-Mistral-7B-default.jinja	generic tool calls
NousResearch-Hermes-2-Pro-Mistral-7B-tool_use.jinja	hermes 2 pro tool calls
NousResearch-Hermes-3-Llama-3.1-70B-default.jinja	generic tool calls
NousResearch-Hermes-3-Llama-3.1-70B-tool_use.jinja	hermes 2 pro tool calls
OrionStarAI-Orion-14B-Chat.jinja	generic tool calls
Qwen-QwQ-32B-Preview.jinja	hermes 2 pro tool calls
Qwen-Qwen2-7B-Instruct.jinja	generic tool calls
Qwen-Qwen2-VL-7B-Instruct.jinja	generic tool calls
Qwen-Qwen2.5-7B-Instruct.jinja	hermes 2 pro tool calls
Qwen-Qwen2.5-Math-7B-Instruct.jinja	hermes 2 pro tool calls
TheBloke-FusionNet_34Bx2_MoE-AWQ.jinja	generic tool calls
abacusai-Fewshot-Metamath-OrcaVicuna-Mistral.jinja	generic tool calls
bofenghuang-vigogne-2-70b-chat.jinja	generic tool calls
databricks-dbrx-instruct.jinja	generic tool calls
deepseek-ai-DeepSeek-Coder-V2-Instruct.jinja	generic tool calls
deepseek-ai-DeepSeek-R1-Distill-Llama-8B.jinja	deepseek r1 tool calls
deepseek-ai-DeepSeek-R1-Distill-Qwen-32B.jinja	deepseek r1 tool calls
deepseek-ai-DeepSeek-R1-Distill-Qwen-7B.jinja	deepseek r1 tool calls
deepseek-ai-DeepSeek-V2.5.jinja	deepseek r1 tool calls
deepseek-ai-deepseek-coder-33b-instruct.jinja	generic tool calls
google-gemma-2-2b-it.jinja	generic tool calls
google-gemma-7b-it.jinja	generic tool calls
indischepartij-MiniCPM-3B-OpenHermes-2.5-v2.jinja	generic tool calls
mattshumer-Reflection-Llama-3.1-70B.jinja	generic tool calls
meetkai-functionary-medium-v3.2.jinja	functionary v3.2 tool calls
meta-llama-Llama-3.1-8B-Instruct.jinja	llama 3.x tool calls (w/ builtin tools)
meta-llama-Llama-3.2-3B-Instruct.jinja	llama 3.x tool calls
meta-llama-Llama-3.3-70B-Instruct.jinja	llama 3.x tool calls (w/ builtin tools)
meta-llama-Meta-Llama-3.1-8B-Instruct.jinja	llama 3.x tool calls (w/ builtin tools)
microsoft-Phi-3-medium-4k-instruct.jinja	generic tool calls
microsoft-Phi-3-mini-4k-instruct.jinja	generic tool calls
microsoft-Phi-3-small-8k-instruct.jinja	generic tool calls
microsoft-Phi-3.5-mini-instruct.jinja	generic tool calls
microsoft-Phi-3.5-vision-instruct.jinja	generic tool calls
mistralai-Mistral-7B-Instruct-v0.2.jinja	generic tool calls
mistralai-Mistral-Large-Instruct-2407.jinja	mistral nemo tool calls
mistralai-Mistral-Large-Instruct-2411.jinja	generic tool calls
mistralai-Mistral-Nemo-Instruct-2407.jinja	mistral nemo tool calls
mistralai-Mixtral-8x7B-Instruct-v0.1.jinja	generic tool calls
mlabonne-AlphaMonarch-7B.jinja	generic tool calls
nvidia-Llama-3.1-Nemotron-70B-Instruct-HF.jinja	llama 3.x tool calls (w/ builtin tools)
openchat-openchat-3.5-0106.jinja	generic tool calls
teknium-OpenHermes-2.5-Mistral-7B.jinja	generic tool calls

For natively supported models, it's important to have the right template (it might not be in the GGUF; note that we prefer the tool_use variant of the Jinja template if it's present in the GGUF metadata). You can check which template is defined by inspecting http://localhost:8080/props, and inspect the logs for Chat format: .

Any tool_calls field returned by llama-server should always conform to the JSON schema (to the extent that it uses supported features of JSON schemas), so there's no need to use any post-processor.

How to use / test

You can test tool calls as follows:

Get and build this PR's branch

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git remote add ochafik https://github.com/ochafik/llama.cpp
git fetch ochafik
git checkout ochafik/tool-call
cmake -B build -DLLAMA_CURL=1
cmake --build build -t llama-server --parallel --config Release
alias llama-server=./build/bin/llama-server

Run llama-server w/ any model (Edited: bumped to quants / models that work w/ my agent example):

# Native support for Llama 3.x, Mistral Nemo, Qwen 2.5, Hermes 3, Functionary 3.x, Firefunction v2...

llama-server --jinja -fa -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M

llama-server --jinja -fa -hf bartowski/Mistral-Nemo-Instruct-2407-GGUF:Q6_K_L

llama-server --jinja -fa -hf bartowski/Llama-3.3-70B-Instruct-GGUF:Q4_K_M
# Not too strong, but YMMV:
#   llama-server --jinja -fa -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q6_K

# Native support requires the right template for these GGUFs:

llama-server --jinja -fa -hf bartowski/functionary-small-v3.2-GGUF:Q4_K_M \
  --chat-template-file <( python scripts/get_chat_template.py meetkai/functionary-medium-v3.2 )

llama-server --jinja -fa -hf bartowski/Hermes-3-Llama-3.1-8B-GGUF:Q4_K_M \
  --chat-template-file <( python scripts/get_chat_template.py NousResearch/Hermes-3-Llama-3.1-8B tool_use )

llama-server --jinja -fa -hf bartowski/Hermes-2-Pro-Llama-3-8B-GGUF:Q4_K_M \
  --chat-template-file <( python scripts/get_chat_template.py NousResearch/Hermes-2-Pro-Llama-3-8B )

llama-server --jinja -fa -hf bartowski/firefunction-v2-GGUF -hff firefunction-v2-Q5_K_M.gguf \
  --chat-template-file <( python scripts/get_chat_template.py fireworks-ai/firellama-3-firefunction-v2 )

# Generic support for any other models, e.g. Phi, Gemma, really anything goes

llama-server --jinja -fa -hf bartowski/phi-4-GGUF:Q4_0
...

Call the chat completions endpoint (in non-streamed mode) with any OpenAI-compatible library, or plain curl:

curl http://localhost:8080/v1/chat/completions -d '{
  "model": "gpt-3.5-turbo",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "python",
        "description": "Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {
              "type": "string",
              "description": "The code to run in the ipython interpreter."
            }
          },
          "required": ["code"]
        }
      }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Print a hello world message with python."
    }
  ]
}'

It will output something like this (once piped through jq):

{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "content": "",
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "python",
              "arguments": "{\"code\":\"print('Hello, World!')\"}"
            },
            "id": null
          }
        ],
        "role": "assistant"
      }
    }
  ],
  ...
}
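For reference, the same request can be sent from Python using only the standard library. This is a minimal sketch and not part of the PR: the helper names (`build_payload`, `post_chat_completion`, `decode_tool_calls`) are made up for illustration, and the URL assumes the default llama-server port used in the curl call above. Note that `arguments` comes back as a JSON-encoded string, so it needs a second decode.

```python
import json
import urllib.request

# Hypothetical helper names; the endpoint and payload shape mirror the curl call above.
SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(prompt: str) -> dict:
    """Build a chat completion request exposing a single `python` tool."""
    return {
        "model": "gpt-3.5-turbo",
        "tools": [{
            "type": "function",
            "function": {
                "name": "python",
                "description": "Runs code in an ipython interpreter and returns "
                               "the result of the execution after 60 seconds.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {
                            "type": "string",
                            "description": "The code to run in the ipython interpreter.",
                        },
                    },
                    "required": ["code"],
                },
            },
        }],
        "messages": [{"role": "user", "content": prompt}],
    }


def post_chat_completion(payload: dict, url: str = SERVER_URL) -> dict:
    """POST the payload and return the decoded JSON response (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def decode_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs, decoding the JSON-encoded `arguments` string."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]
```

With a server running, `decode_tool_calls(post_chat_completion(build_payload("Print a hello world message with python.")))` should yield a list of `(tool name, arguments dict)` pairs, or an empty list when the model answered with plain content.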

I've also created some minimalistic Agent loop code in this Gist: it contains a few python tools & supports running them in a siloed docker container, along with examples (used to be part of this PR).

Background

This PR tackles two main problems related to tool calling:

Lazy grammars: Helping / forcing the model to follow the tool schemas w/ grammar constraints is tricky as in most cases the model may also output normal, unconstrained content (except if "tool_choice": "required" is specified in the request). It's not currently possible to say .* "<tool_call>" constrained "</tool_call>" as the leading .* will match eagerly. In [WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389 I was avoiding this issue in the thoughtful_steps style, but the native tool call styles were still problematic.
- Solved w/ lazy grammars activated by trigger words (similar to stop words, but awaited in the grammar implementation itself). Output is completely unconstrained before triggers, and completely constrained after, which allows for content vs. tool_call outputs, and even mixes of the two (for the few models that support that).
  - For Llama 3.x (cf. these docs: 1, 2, 3), triggers are
    - <|python_tag|> if any of the builtin tools are detected (wolfram_alpha, brave_search / web_search with query param, code_interpreter with code param); NOT for Llama 3.2
    - {"name": "toolN" (for each toolN in the list of tools in the request)
    - Also just {"name": (needed for very small 1B/3B models which get confused very quickly otherwise), and some other variations (to allow the somewhat popular {"type": "function", "name": ...)
  - For Functionary v3.1, we trigger on <function= and <|python_tag|> (NOTE: seems to work well w/ Llama-3.1-Instruct, e.g. it's on together.ai's docs). Note that <|python_tag|> here introduces freeform Python code, whereas for Llama-3.1-Instruct's template it introduces builtin tool calls in Python syntax. Almost the same, but handled quite differently.
  - For Functionary v3.2, it's >>>toolN\n for each toolN (technically also triggering on toolN\n for the first tool call, there's a todo to avoid spurious matches by forcing a match at the very start)
  - For Hermes Pro (cf. Hermes-Function-Calling repo), the trigger is <tool_call>.
  - For Mistral Nemo, the trigger is the special [TOOL_CALLS] token
  - For DeepSeek R1 and its distills, it's <｜tool▁calls▁begin｜> (Note: DeepSeek-R1 seems more eager to talk than to call tools for now, lemme know if you get it to work)
  - For Firefunction v2, the trigger is functools[
  - For other models ("generic" chat format), no lazy grammars are used, just a normal JSON schema that can contain schema-constrained tool calls or content (unless tool_choice is required)
Jinja chat templates for tool-call-able models are getting increasingly complex, and implementing each of them in C++ is a maintenance hazard.
- Solved by implementing a minimal Jinja engine (minja.hpp), with just enough to render all the templates I could find in the wild. That's still a lot of code (2.5k LOC), but about 10x less so than Jinja2Cpp (not even counting its dependencies - it needs a subset of Boost and some C++ backfills). It's trivial to extend (say, to add support for a new filter / test), and it comes with decent error reporting and simple tests. And we could always switch to another implementation in the future.
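To make the trigger idea concrete, here is a small Python sketch of the "unconstrained before the trigger, constrained after" split. It is only an illustration of the concept on an already complete string: the function name `split_at_first_trigger` is hypothetical, and the actual implementation lives in the C++ sampler and works incrementally while tokens are generated.

```python
from typing import Optional, Sequence, Tuple


def split_at_first_trigger(
    output: str, triggers: Sequence[str]
) -> Tuple[str, Optional[str]]:
    """Split model output into (free-form content, grammar-constrained tail).

    The tail starts at the earliest occurrence of any trigger word and is None
    when no trigger appears, i.e. the lazy grammar was never activated.
    """
    positions = [pos for pos in (output.find(t) for t in triggers) if pos != -1]
    if not positions:
        return output, None
    start = min(positions)
    return output[:start], output[start:]


# Hermes-style trigger, as listed above.
content, constrained = split_at_first_trigger(
    'Let me check.<tool_call>{"name": "python"}</tool_call>', ["<tool_call>"]
)
print(repr(content))      # 'Let me check.'
print(repr(constrained))  # '<tool_call>{"name": "python"}</tool_call>'
```

Everything in `content` is what the model was free to say; only `constrained` would have been sampled under the tool-call grammar.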

With this intro out of the way, here are the main parts of this PR:

minja.hpp: minimal Jinja templating engine and its tests against actual templates & a few test contexts
- Spun into its own repo: https://github.com/google/minja
- Integrated under --jinja flag in Add Jinja template support #11016
Tool call grammar generation + output parsing logic for 8 different tool call styles (covering most of the popular models, incl. Llama 3.x, Functionary 3, Qwen 2.5, DeepSeek R1, Mistral Nemo...), with a generic fallback.
Lazy grammar wired into the sampler, using a mix of trigger words and trigger tokens to enable the grammar. Trigger tokens are also used to override printability of special tokens, even when the grammar is not lazy (e.g. when "tool_choice": "required" is passed in the request)
Integration with llama-server (full tools & tool_choice support).
- Growing set of tests in examples/server/tests/unit/test_tool_call.py, some of which are skipped by default as they require downloading lots of models (can bulk get them with scripts/fetch_server_test_models.py, then run the slow tests w/ ( cd examples/server/tests && ./tests.sh -m slow -v -x )).

TODOs

Blocking:

sync: minja #11499 (this PR's diff won't include chat-template.hpp or minja.hpp)
- Ensure tools aren't described twice in the generic handler (now that Minja does it for us)
Add test for lazy grammars (cf. removed test-antiprompts.cpp)
Test parsers on corner case inputs (now they're easier to call w/ an enum) and tighten their implementations
Drop legacy python_code_argument_name in favour of expect_tool_arguments

Nice to haves:

Implement at_first semantics to require trigger word to be at start of output (equiv. to ^ regex behaviour; not using regexes as ^ can't be made to mean "start of entire string" reliably afaict), to reduce spurious triggers w/ Llama 3.x
Document llama3.1 builtin tools schemas
May want to ping owners of models whose GGUF doesn't contain the right chat templates + provide them w/ an easy one-liner to surgically edit the gguf
Warning log when using the generic chat format
Find examples of tool call w/ DeepSeek-R1-Distill-* (ought to work, but proving elusive / just wants to think, think, think)
Implement strftime_now in minja (for Llama 3.2), also update today's date for Llama 3.1 and functionary

See draft-times TODOs

Possible follow ups:

Add -hft / --hf_template flag to override the GGUF's chat templates from a HF model repo
Add agent example w/ isolation in c++ or python (see example/agent moved from this PR to that Gist).
Add agent w/ MCP support?
Add tool call loop to the default web chat using Pyodide as a python interpreter?
Add tool call loop to the CLIs?

ochafik · 2024-09-27T06:25:09Z

Apologies for this PR being a moving target.

I've now stabilized things (except older gcc giving me sweats), added tests & included basic usage instructions (w/ a tiny agent helper adapted from #6389) for Llama-3.1-8B-Instruct, Hermes-2-Pro-Llama-3-8B and functionary-small-3.2 (which still needs a bit of work).

rujialiu · 2024-09-29T12:25:32Z

@ochafik Your minja.hpp is cool (I like minimalist things) but if for any reason you need a lightweight but more powerful template engine, you can have a look at inja (https://github.com/pantor/inja), which I've used in production for several years. It has a single-file header, and the only dependency is nlohmann json, which is already used in llama.cpp.

BTW: My current tool-calling solution is to write dummy functions in python and generate grammar files with pydantic, awkward and ugly. I'll definitely give it a try when you finish this PR. Exciting work!

ochafik · 2024-09-29T21:21:03Z

@ochafik Your minja.hpp is cool (I like minimalist things)

Thanks @rujialiu !

but if for any reason you need a lightweight but more powerful template engine, you can have a look at inja (https://github.com/pantor/inja), which I've used in production for several years. It has a single-file header, and the only dependency is nlohmann json, which is already used in llama.cpp.

Thanks for the pointer, at first glance inja seems too limited to support actual templates (we're at the mercy of each and every model maker, some use lots of jinja features, e.g. NousResearch/Hermes-3-Llama-3.1, Cohere/command-r-plus, meetkai/functionary-medium-v3.2). Filters (w/ the pipe syntax, e.g. {{ range(10) | length }}) and macros are glaring omissions, for instance.

BTW: My current tool-calling solution is to write dummy functions in python and generate grammar files with pydantic, awkward and ugly.

Yeah I'm doing the same, that's why I spent so much energy improving the JSON schema support tbh.

I'll definitely give it a try when you finish this PR. Exciting work!

Hopefully soon! (famous last words haha)

rujialiu · 2024-09-30T07:43:20Z

Thanks for the pointer, at first glance inja seems too limited to support actual templates (we're at the mercy of each and every model maker, some use lots of jinja features

Ouch, I was not aware of that. That's crazy. Now I'm really impressed that your little code already supports these. Maybe I should use your minja.hpp in production instead in the future 8-)

Maximilian-Winter · 2024-10-07T16:57:07Z

@ochafik I really like your idea of using lazy grammar, I would love to help you. I'm the developer of llama-cpp-agent. Let me know if we can collaborate somehow.

ochafik · 2024-10-17T18:35:06Z

@Maximilian-Winter thanks / sorry for the slow reply! (frantically busy few weeks 😅)

I'd love help on this, anything from just testing out instructions above, to finding new cool examples / bugs, reporting on any other model's tool call styles, or new ideas. I'm trying to release minja in its own mini-repo w/ better testing, but the lazy grammar part is probably going to be what needs most work on next.

Depending on your timezone, happy to jump into a video chat too :-) (DM on x?)

(Also, llama-cpp-agent looks suuuper cool! 💜)

Maximilian-Winter · 2024-10-18T23:50:52Z

@ochafik Sure, that would be great. I'm living in Germany. I actually tried to get verified on X, by buying premium to write you, but I still have to wait for verification. If you want to reach out to me by email or discord, feel free! My email is maximilian.winter.91@gmail.com

… dumb for function call)

ochafik · 2025-03-05T16:20:15Z

@ochafik Thanks for this work! I've been investigating tool calling on consumer hardware and Ollama, and it's been a very frustrating experience. The lazy grammar idea is very cool.

@edmcman Thanks!! Not sure if you've seen #12034, I've done a very coarse & naive "benchmark" of llama-server against ollama w/ various models at various temperatures (more results here). I wonder on which models you've had issues w/ ollama, lemme know if you'd like updated results or need help running scripts/tool_bench.sh.

Anyway, as far as benchmarks, I thought I'd point out BFCL v3 as another option. I know that I would love to have some type of tool calling leaderboard for local GGML models.

Thanks for the pointer (i'd completely forgotten about that repo!), their code looks straightforward, will give it a deeper look!

edmcman · 2025-03-05T16:50:50Z

@edmcman Thanks!! Not sure if you've seen #12034, I've done a very coarse & naive "benchmark" of llama-server against ollama w/ various models at various temperatures (more results here). I wonder on which models you've had issues w/ ollama, lemme know if you'd like updated results or need help running scripts/tool_bench.sh.

Thanks, this is so helpful! I wrote a few blogs as I banged my head trying to find a model that worked. In a nutshell, llama3-groq-tool-use was the only model I tested that seemed to work out of the box. I spent a very long time trying to understand why llama 3.2 did not work in particular. In short, it seems like Meta has an incorrect or underperforming prompt format documented on the Llama 3.1 model website, and Ollama based their prompt on that. (They can't use the regular 3.2 prompts because they don't support pythonic function call parsing.) I think Ollama's llama 3.3 model is using the same template.

One thing I'd recommend from my experiences is adding a "conversational" test to your benchmark, e.g., say "Hello" and verify that the model does not attempt to make a nonsensical function call.

ochafik · 2025-03-05T19:13:58Z

Thanks, this is so helpful! I wrote a few blogs as I banged my head trying to find a model that worked.

@edmcman Great write ups, thanks a lot for sharing!!

In a nutshell, llama3-groq-tool-use was the only model I tested that seemed to work out of the box. I spent a very long time trying to understand why llama 3.2 did not work in particular.

One of my gripes with Llama 3.2 (the very small versions I tested, that is) is it tends to forget to escape nested quotes in (JSON escaped) Python code, causing premature termination of its function call's arguments 🤦‍♂️... I prototyped some convoluted workaround (cf. above) but haven't gotten to productionizing it yet.

In short, it seems like Meta has an incorrect or underperforming prompt format documented on the Llama 3.1 model website, and Ollama based their prompt on that. (They can't use the regular 3.2 prompts because they don't support pythonic function call parsing.) I think Ollama's llama 3.3 model is using the same template.

Ollama and their custom templates are in an awkward position. I decided I preferred the (sizeable) hassle of writing and maintaining a jinja templating engine (and now, custom tool call parsers), rather than doing prompt engineering and coercing models too much (I mean, I do coerce them very much ⛓, but only after they start picking one of their natural outputs formats; Qwen 2.5 Coder turned out wildly creative, for instance)

One thing I'd recommend from my experiences is adding a "conversational" test to your benchmark, e.g., say "Hello" and verify that the model does not attempt to make a nonsensical function call.

I partially test for this in test_completion_without_tool_call* (checks there's no call, with variations of no tool provided, or just a test tool that's useless for the task, or the right tool but with tool_choice = none), but sounds like a great idea to also check some nice chatty interaction 👌

You'll also see some test_calc_result in the same file that is also super basic & important (can a model use a tool result??) with alas, not perfect success rate on many models.
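For readers who want to try the "can a model use a tool result?" round trip by hand, here is a rough Python sketch of how the follow-up request's messages could be assembled in the OpenAI-style format. It is not taken from test_tool_call.py: the helper name `append_tool_result` and the `"call_0"` placeholder id are assumptions made for this example.

```python
import json


def append_tool_result(messages: list, tool_call: dict, result: str) -> list:
    """Return a new message list with the assistant's call and the tool's answer appended."""
    # The id can be null in a response (see the sample output in the PR description),
    # so fall back to a placeholder; "call_0" is an arbitrary value chosen here.
    call_id = tool_call.get("id") or "call_0"
    assistant_turn = {
        "role": "assistant",
        "content": "",
        "tool_calls": [{**tool_call, "id": call_id}],
    }
    tool_turn = {"role": "tool", "tool_call_id": call_id, "content": result}
    return messages + [assistant_turn, tool_turn]


history = [{"role": "user", "content": "What is 2 + 2? Use the python tool."}]
call = {
    "type": "function",
    "function": {"name": "python", "arguments": json.dumps({"code": "print(2 + 2)"})},
    "id": None,
}
follow_up = append_tool_result(history, call, "4")
print(len(follow_up))  # 3
```

Sending `follow_up` back as `messages` (with the same `tools`) lets you check whether the model's next answer actually mentions the result.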

PS: started incubating gorilla support for llama.cpp in this branch

ochafik · 2025-03-05T20:10:02Z

In a nutshell, llama3-groq-tool-use was the only model I tested that seemed to work out of the box.

@edmcman Note that Groq/Llama-3-Groq-8B-Tool-Use has an inexplicably bland chat_template, which doesn't handle tool calls or tool call results. Based on its tokenizer config, it accepts NousResearch Hermes 2-style syntax w/ <tool_call>{json args}</tool_call> and <tool_response>...</tool_response>, and works great with a template override:

llama-server --jinja -fa -hf bartowski/Llama-3-Groq-8B-Tool-Use-GGUF --chat-template-file models/templates/NousResearch-Hermes-2-Pro-Llama-3-8B-tool_use.jinja

Without the template override, the tool calls will still work (with the generic support) but the JSON-based tool call results injection done by Minja's polyfill isn't picked up properly (e.g. test_calc_result fails with this model). I have ideas on improving that generic support (e.g. could detect if the model's tokenizer has <tool_response> tokens), but for now it's simpler to use a proper template :-)
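As a rough illustration of the Hermes 2-style syntax mentioned above, the following Python sketch pulls `<tool_call>` blocks out of raw model text. It is a simplified stand-in, not the parser in chat.cpp: the function name `parse_hermes_output` is hypothetical, and it assumes each block holds one complete JSON object and that the closing tag never appears inside a JSON string.

```python
import json
import re

# Non-greedy match of one <tool_call>...</tool_call> block; DOTALL lets the JSON span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_output(text: str):
    """Return (content, calls): the text outside the tags and the decoded JSON payloads."""
    calls = [json.loads(body) for body in TOOL_CALL_RE.findall(text)]
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls


raw = 'Sure.\n<tool_call>\n{"name": "python", "arguments": {"code": "print(1)"}}\n</tool_call>'
content, calls = parse_hermes_output(raw)
print(content)           # Sure.
print(calls[0]["name"])  # python
```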

edmcman · 2025-03-05T22:22:53Z

Great news: I was just able to run my application using langchain, llama.cpp's server, and it worked great with bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M. Like, the best performance by any open source model I've tried by far on Ollama. (And that includes Ollama's Qwen2.5-7B, of course. I'm very curious what the difference could be -- I plan to investigate further.) Anyway, I'm ecstatic. Thank you again for your work!

Two small things:

Are there any plans to add streaming output when tools are present? I might be interested in assisting with that if there is some challenge.
I did run into an issue with bartowski/functionary-small-v3.2-GGUF:Q4_K_M where "assistant\n" seems to be prepended to each non-tool message. I'll create an issue for that.

ochafik · 2025-03-06T13:00:27Z

best performance by any open source model I've tried by far on Ollama. (And that includes Ollama's Qwen2.5-7B, of course.

@edmcman Glad to read this! (have you tried others such as Qwen2.5-Coder? I'm obsessed w/ unsloth's 128k extended context versions)

I'm very curious what the difference could be -- I plan to investigate further.)

If you'd like to see the impact of grammar constraints (a key difference w/ Ollama), you could disable them in utils.hpp as follows:

    ...
    llama_params["prompt"]           = chat_params.prompt;
    // Skip all grammar constraints when the DISABLE_GRAMMAR env var is set.
    if (!getenv("DISABLE_GRAMMAR")) {
      llama_params["grammar"]          = chat_params.grammar;
      llama_params["grammar_lazy"]     = chat_params.grammar_lazy;
      auto grammar_triggers = json::array();
      for (const auto & trigger : chat_params.grammar_triggers) {
          grammar_triggers.push_back(trigger.to_json<json>());
      }
      llama_params["grammar_triggers"] = grammar_triggers;
    }
    ...

Anyway, I'm ecstatic. Thank you again for your work!

Thanks for reporting back, really matters to know this is useful and appreciated!!

Are there any plans to add streaming output when tools are present? I might be interested in assisting with that if there is some challenge.

Aaaabsolutely. There have been multiple suggestions on how to proceed, but I'm currently working on a "simple" approach that involves:

partial regexp parsing (unfortunately not part of std::regex) to avoid consuming partial triggers / preludes to args: regex-partial.h (mostly done, magic trick is to transform the original regex to one that matches in reverse from the end)
partial json parsing w/ ability to "heal" an unclosed json w/ some magic insert (useful to break it cleanly when returning diffs of JSON-encoded arguments in the OAI streamed chunk diff format): json-partial.h (WIP, 70% done)
diffing of common_chat_msg (done), and plugging in the server (exact setup still tbc)
updating each parsing function in chat.cpp to accept is_partial & use the partial regex and partial json modes (WIP)

Hope to get something testable (if not reviewable) in a few days (famous last words haha)
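To give an idea of what "healing" an unclosed JSON means, here is a deliberately naive Python sketch. It is not how json-partial.h works: `heal_json` is a made-up name, and it only closes open strings, objects and arrays. Dangling keys (`{"a":`), trailing commas, partial literals (`tru`) and cut-off `\u` escapes are not handled.

```python
import json


def heal_json(partial: str) -> str:
    """Close any string, object or array left open in a truncated JSON document."""
    closers = []       # stack of closing brackets still owed
    in_string = False
    escaped = False    # True right after a backslash inside a string
    for ch in partial:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()
    healed = partial
    if in_string:
        if escaped:
            healed = healed[:-1]  # drop a dangling backslash before closing the string
        healed += '"'
    return healed + "".join(reversed(closers))


chunk = '{"code":"print(\'Hel'
print(heal_json(chunk))              # {"code":"print('Hel"}
print(json.loads(heal_json(chunk)))  # {'code': "print('Hel"}
```

Healing each growing prefix and diffing the decoded values gives a stream of argument deltas that always parse, which is the property the streamed chunk format needs.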

I did run into an issue with bartowski/functionary-small-v3.2-GGUF:Q4_K_M where "assistant\n" seems to be prepended to each non-tool message. I'll create an issue for that.

Thanks for reporting #12213, doc updated in #12214

edmcman · 2025-03-06T14:52:28Z

@edmcman Glad to read this! (have you tried others such as Qwen2.5-Coder? I'm obsessed w/ unsloth's 128k extended context versions)

So far on llama.cpp I have just tried qwen 2.5 and functionary-small-v3.2 (without the functionary chat template). I'll be testing more soon! My internet is not that fast, and my work's VPN makes it worse, so downloading the models takes forever 😓

On Ollama, I have tried their qwen2.5-coder:7b model.

If you'd like to see the impact of grammar constraints (a key difference w/ Ollama), you could disable them in utils.hpp as follows:

Will do, thanks!

ochafik · 2025-03-06T15:38:48Z

So far on llama.cpp I have just tried qwen 2.5 and functionary-small-v3.2 (without the functionary chat template). I'll be testing more soon! My internet is not that fast, and my work's VPN makes it worse, so downloading the models takes forever 😓

@edmcman Ugh, I feel you! (my own ordeal is disk space, afraid I'm continuously wearing my SSD out 💀)

Note that if you already pulled other Ollama models, you can find their GGUF model to use w/ llama-server using a script like this (you need to pull the original Jinja template separately, which is light in bandwidth ;-)):

get_ollama_gguf.js ([gist](https://gist.github.com/ochafik/0e0d350344a5f503274d9909c9fe5569))

#!/usr/bin/env node
/*
    Get the file under $OLLAMA_HOME/models/blobs/ for the application/vnd.ollama.image.model key in the manifest
    - Note that metadata of modelId:modelTag is stored under $OLLAMA_HOME/models/manifests/registry.ollama.ai/library/${modelId}/${modelTag}
    - You'll need to get the Jinja template from the original model using llama.cpp's scripts/get_chat_template.py script

    ollama pull qwen2.5-coder:7b
    llama-server -m $( ./get_ollama_gguf.js qwen2.5-coder:7b ) -fa --jinja --chat-template-file <( ./scripts/get_chat_template.py Qwen/Qwen2.5-Coder-7B-Instruct-GGUF tool_use )
*/
const fs = require('fs');
const path = require('path');

const HOME = process.env.HOME;
const OLLAMA_HOME = process.env.OLLAMA_HOME || path.join(HOME, '.ollama');

const [model] = process.argv.slice(2);
if (!model) {
    console.error('Usage: node get_ollama_gguf.js <modelId:modelTag>');
    process.exit(1);
}
// Default to the 'latest' tag when none is given, as ollama itself does.
const [modelId, modelTag = 'latest'] = model.split(':');

const manifestFile = path.join(OLLAMA_HOME, 'models', 'manifests', 'registry.ollama.ai', 'library', modelId, modelTag);
if (!fs.existsSync(manifestFile)) {
    console.error(`Manifest file not found for ${modelId}:${modelTag}`);
    process.exit(1);
}
const manifest = JSON.parse(fs.readFileSync(manifestFile, 'utf8'));
const modelLayer = manifest.layers.find(l => l.mediaType === 'application/vnd.ollama.image.model');
if (!modelLayer) {
    console.error('Model layer not found');
    process.exit(1);
}

const modelFileName = modelLayer.digest.split(':').join('-');
const modelFile = path.join(OLLAMA_HOME, 'models', 'blobs', modelFileName);
if (!fs.existsSync(modelFile)) {
    console.error(`Model file not found for ${modelId}:${modelTag}`);
    process.exit(1);
}
console.log(modelFile);

ollama pull qwen2.5-coder:7b
llama-server -m $( ./get_ollama_gguf.js qwen2.5-coder:7b ) -fa --jinja --chat-template-file <( ./scripts/get_chat_template.py Qwen/Qwen2.5-Coder-7B-Instruct-GGUF tool_use )

strawberrymelonpanda · 2025-03-19T08:14:42Z

Thanks for reporting back, really matters to know this is useful and appreciated!!

@ochafik I've had this PR tab open in my browser for quite some time and only recently got around to building a simple voice assistant I've been meaning to make, which depends on tool calling.

I've built a lot of LLM tools, but I've put off tool calling for ages due to the gotchas involved, and this really helped to smooth away those wrinkles. Using Qwen 32b with llama-server --jinja, the process of getting tool calling working was straightforward and worked like a charm right out of the box.

So, thanks from me as well. Sincerely looking forward to #12379, but it's incredibly useful as is.

Dampfinchen · 2025-03-29T22:57:42Z

Sadly it doesn't work for me. I'm trying to use VPet with the ChatVPet Plugin, which supports function calling via OpenAI API. Neither Llama 3.1 8B nor Gemma 3 works (I sort of expected Gemma 3 not to work since it doesn't have function calling in its chat template).

For Llama 3.1, it just prints the tool call in the chatbox for the VPet, which it shouldn't. The GGUF (https://huggingface.co/bartowski/Llama-3.1-8B-Ultra-Instruct-GGUF/blob/main/Llama-3.1-8B-Ultra-Instruct-Q4_K_S.gguf) has the correct metadata for tool calling, but srv params_from_: Chat format shows content-only (I assume if it worked correctly, it would show llama 3 as chat format).

This is the command I put in llama-server: ./llama-server -m "./llama-3.1-8b-ultra-instruct-q4_k_s-imat.gguf" --jinja -c 12288 -ngl 99 -fa --host 127.0.0.1 --port 8080 -t 6

Am I missing something?

ochafik · 2025-03-30T21:28:59Z

Chat format shows content-only (I assume if it worked correctly, it would show llama 3 as chat format).

Hi @Dampfinchen, this might indicate that your request didn’t have a tools parameter in it, could you share a repro test case (e.g. a self-contained curl command? Cf. example on the wiki)

Kreijstal · 2025-03-30T22:04:07Z

get_ollama_gguf.js

can you put that in a gist, so I can favorite it <3

ochafik · 2025-03-30T22:08:34Z

get_ollama_gguf.js

can you put that in a gist, so I can favorite it <3

Haha, sure! get_ollama_gguf.js

Call updated to match the tool used in the output just below, following the example in ggml-org/llama.cpp#9639

Deathn0t · 2025-07-10T13:50:30Z

Hello, I found this thread very related to chat templates/formats. I am using GGUF files, for example from bartowski/Meta-Llama-3.1-8B-Instruct-GGUF, that include the chat template, which is correctly displayed when I start the llama-server. However, later when I use chat.completions, the server logs Chat format: Content-Only. I understood this to mean that the chat template is not applied and the raw content of the messages is used. What should I do to correctly use the chat template?

Here is an example command as list (launched through Python subprocess...) that I use to start the llama-server:

['llama-server', '--host', '127.0.0.1', '--port', '10001', '--threads', '8', '--alias', 'Llama-3.1-8B-Q4_K_L', '--model', './models/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/Meta-Llama-3.1-8B-Instruct-Q4_K_L.gguf', '--n-gpu-layers', '99', '--flash-attn', '--parallel', '4', '--jinja']

Update:
After reading across GitHub, I understood that Chat format: Llama 3.x is meaningful only when tools are provided. If no tools are provided, is Chat format: Content-only correct (i.e., is the template loaded at startup still applied correctly)?

github-actions bot added testing Everything test related examples python python script changes server labels Sep 25, 2024

ochafik changed the title ~~Tool call support (Llama 3.1, Functionary 3.2, Hermes 2 Pro) & Minimalist Jinja template engine~~ Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine Sep 25, 2024

ochafik changed the title ~~Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) & Minimalist Jinja template engine~~ Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine Sep 25, 2024

ochafik mentioned this pull request Sep 27, 2024

[WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389

Closed

15 tasks

ochafik changed the title ~~Tool call support (Llama 3.1, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine~~ Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine Sep 28, 2024

github-actions bot added the script Script related label Oct 2, 2024

ochafik changed the title ~~Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro) w/ lazy grammars & minimalist Jinja engine~~ Tool call support (Llama 3.x, Functionary v3, Hermes 2 Pro, Mistral Nemo, generic) w/ lazy grammars & minimalist Jinja engine Oct 24, 2024

ochafik added 13 commits October 27, 2024 16:44

nits

ec9f3b1

tool-call: slow tool call integration tests

9a86ea7

space nits

c88095e

tool_call: test no tool call on a real model + rename scenarios

7fde6d0

tool-call: script to prefetch models used in server tests

dd6d024

Update tool_call.feature

168add7

tool-call: add tests: tool_call=none, parallel_tool_calls=true

ec547e4

tool-call: remove duplicate script to fetch templates

b51c71c

agent: simplify syntax (default tools to local w/ default port)

74d71a6

tool-call: use Q4_K_M models

b825440

tool-call: update scripts/fetch_server_test_models.py

aefac1e

tool-call: test Hermes-3-Llama-3.1-8B

64287a3

tool-call: use functionary-small-v3.2-Q8_0.gguf in test (Q4_K_M too dumb for function call)

fa4c111

ochafik mentioned this pull request Mar 6, 2025

tool-call: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars #12034

Merged

3 tasks

edmcman mentioned this pull request Mar 12, 2025

Better chat template handling - support Jinja containers/ramalama#890

Closed

ochafik mentioned this pull request Mar 14, 2025

server: streaming of tool calls and thoughts when --jinja is on #12379

Merged

10 tasks

dmahurin mentioned this pull request Apr 5, 2025

Add basic function calling example using a llama-cli python wrapper #9592

Closed

4 tasks

vijaysaayi mentioned this pull request May 11, 2025

Feature Request: Support for function calling in llama-server ikawrakow/ik_llama.cpp#407

Closed

4 tasks

justinryan-0923 pushed a commit to justinryan-0923/llama.cpp that referenced this pull request May 30, 2025

server : (docs) Update wrong tool calling example (#11809)

6073da8

Call updated to match the tool used in the output just below, following the example in ggml-org/llama.cpp#9639

ggerganov added the hot Something that is hot label Jul 11, 2025

firecoperana mentioned this pull request Aug 23, 2025

Tool calls support from mainline ikawrakow/ik_llama.cpp#723

Merged

4 tasks

sayap mentioned this pull request Aug 28, 2025

common : add GLM-4.5 tool calling support #15186

Closed

createthis mentioned this pull request Sep 8, 2025

Deepseek V3.1 native tool calling support (OpenAI Style) #15533

Merged

This was referenced Oct 22, 2025

changelog : llama-server REST API COG-GTM/llama.cpp#245

Open

changelog : libllama API COG-GTM/llama.cpp#246

Open

ochafik mentioned this pull request Dec 26, 2025

[WIP] tool-call: experimental migration of all parsers to peg-parser infra (w/ better test coverage) #18353

Draft

18 tasks

20 participants
