`server`: streaming of tool calls and thoughts when `--jinja` is on by ochafik · Pull Request #12379 · ggml-org/llama.cpp · GitHub

server: streaming of tool calls and thoughts when --jinja is on #12379


Merged · 102 commits · May 25, 2025
Changes from 1 commit
16c9c63
add common_regex w/ support for partial final matches
Mar 12, 2025
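The idea behind partial final matches — deciding whether the tail of a streamed buffer could still grow into a full match — can be sketched for the simple literal-token case. This is a hypothetical simplification in the spirit of `string_find_partial_stop`, not the actual `common_regex` API:

```cpp
#include <algorithm>
#include <string>

// Return the index where a would-be match of `token` begins at the end of
// `text` (i.e. the tail of `text` is a proper prefix of `token`), or
// std::string::npos if the tail cannot grow into a match.
// Assumes a non-empty `token`.
size_t find_partial_stop(const std::string & text, const std::string & token) {
    for (size_t len = std::min(text.size(), token.size() - 1); len > 0; len--) {
        if (text.compare(text.size() - len, len, token, 0, len) == 0) {
            return text.size() - len;
        }
    }
    return std::string::npos;
}
```

A streaming parser can hold back everything from that index onward until more tokens arrive, instead of emitting half of a `<tool_call>` tag as regular content.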
6dcff43
add common_json w/ support for truncated json healing
Mar 12, 2025
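Truncated-JSON healing can be sketched as follows — a hypothetical simplification that only closes open strings, objects, and arrays so a standard parser accepts the fragment. The real common_json additionally injects a healing marker so the repaired portion can be identified and stripped back out:

```cpp
#include <string>
#include <vector>

// Append the closing quotes/braces/brackets a truncated JSON fragment needs
// to become parseable. Tracks string state (including escapes) so brackets
// inside string literals are ignored.
std::string heal_truncated_json(const std::string & s) {
    std::vector<char> closers;
    bool in_string = false, escaped = false;
    for (char c : s) {
        if (in_string) {
            if (escaped)        { escaped = false; }
            else if (c == '\\') { escaped = true; }
            else if (c == '"')  { in_string = false; }
            continue;
        }
        if      (c == '"') { in_string = true; }
        else if (c == '{') { closers.push_back('}'); }
        else if (c == '[') { closers.push_back(']'); }
        else if (c == '}' || c == ']') { if (!closers.empty()) closers.pop_back(); }
    }
    std::string healed = s;
    if (in_string) healed += '"';
    for (auto it = closers.rbegin(); it != closers.rend(); ++it) healed += *it;
    return healed;
}
```

This is what lets partially generated tool-call arguments be dumped as valid JSON on every streamed chunk.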
a95fe78
renaming: string_find_partial_stop (moved to common.cpp)
Mar 12, 2025
ce2f593
add common_chat_msg_diff
Mar 12, 2025
cd3681d
partial common_chat_parse
Mar 12, 2025
9462365
refactor parser w/ optionals
Mar 12, 2025
6ed8a8f
server: wire chat diffs in stream mode
Mar 12, 2025
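Wiring chat diffs into stream mode boils down to sending only what changed since the last parse. A minimal sketch of that delta computation (hypothetical, assuming the parser only ever appends to a field):

```cpp
#include <string>

// Given the previously streamed value of a field and its newly parsed value,
// return the suffix that still has to be sent as an OpenAI-style delta.
// Returns "" when nothing new arrived (or on a non-append change, which the
// real common_chat_msg_diff handles more carefully).
std::string content_delta(const std::string & previous, const std::string & current) {
    if (current.size() <= previous.size() ||
        current.compare(0, previous.size(), previous) != 0) {
        return "";
    }
    return current.substr(previous.size());
}
```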
eaeed7d
fix trigger of thinking models (must happen after thoughts are closed)
Mar 13, 2025
d6e680a
nits + docs
Mar 14, 2025
64ea080
fix functionary v3.2 raw python!
Mar 14, 2025
c46d4da
rename: common_chat_syntax (now contains format)
Mar 14, 2025
4358d5d
rm common_regex.at_start
Mar 14, 2025
f477288
Merge remote-tracking branch 'origin/master' into tool-diffs
Mar 14, 2025
e0202b3
fix gcc compilation
Mar 14, 2025
f840e3a
fix unreachable code warning after [[noreturn]] annotation
Mar 14, 2025
af7391e
fix / refactor test-regex-partial
Mar 14, 2025
449917b
fix test-chat
Mar 14, 2025
b428b5c
rm spaces
Mar 14, 2025
668fc90
fix command r7b partial parsing (lacked args path)
Mar 14, 2025
b48ab23
Update test_chat_completion.py
Mar 14, 2025
aefc8a4
refactor + test chat parser (try_consume_json_with_dumped_args, liter…
Mar 15, 2025
22428a4
return partial msg from server
Mar 15, 2025
5b9c5a4
refactor partial json
Mar 15, 2025
3fbe84f
don't return empty <think></think>
Mar 15, 2025
d4cb7fe
test_tool_call: allow comment lines in now-multiline code strings (fo…
Mar 15, 2025
31f5eb2
accommodate yet another deepseek r1 distill fantasy syntax (<|tool▁ca…
Mar 15, 2025
bddc65a
rm space
Mar 15, 2025
ea3bf03
nit: fix python type
Mar 15, 2025
f3bfbc6
refactor test-chat-parser
Mar 15, 2025
bb7b9fe
fix QwQ 32B tool call parsing after thoughts (hermes2)
Mar 15, 2025
f0ea330
fix thinking models + tool calls (</think> not part of trigger's capt…
Mar 15, 2025
7856949
reinstate tool call id logic, keep track of previously generated ids
Mar 15, 2025
2412b5d
better logs for triggers
Mar 15, 2025
02913b0
fix msg diff test
Mar 15, 2025
c5c3482
try_consume_regex: basic tests + fix non-partial case
Mar 15, 2025
af79da0
chat-parser: test+fix finish, incomplete methods
Mar 15, 2025
562800f
normalize args in test-chat
Mar 15, 2025
ddeb318
consume spaces after parse_json_tool_calls
Mar 15, 2025
6c3f87e
Revert "fix thinking models + tool calls (</think> not part of trigge…
Mar 15, 2025
e2cef66
fix required tool calls w/ thinking models that have pre-opened think…
Mar 15, 2025
7a61eca
fix thinking model's initial trigger (take 2) + test qwq's template
Mar 15, 2025
2f55571
refactor chat parser (rm incomplete)
Mar 15, 2025
303f640
test groups of common_chat_msg_parser.try_consume_regex
Mar 15, 2025
e9540ad
run most test_tool_call tests in stream + non-stream modes
Mar 15, 2025
a818114
make functionary v3.2 parsing more strict (differentiate first match …
Mar 16, 2025
5031366
send final diff from server, to close off raw python arguments
Mar 16, 2025
dae6a28
nit: spaces
Mar 16, 2025
f026cb0
fix diff aggregation logic in make_any_request
Mar 16, 2025
e7f9d3e
fix test_chat_completion_with_timings_per_token & test_logprobs_stream
Mar 16, 2025
165b525
add missing functional import for gcc compilation
Mar 16, 2025
9d4a6f1
fix typo in test_calc_result
Mar 16, 2025
64b4039
fix thoughts parsing logic
Mar 16, 2025
fbba5da
support partial content streaming in Generic mode
Mar 16, 2025
4dcd653
strip reasoning (now that tags are strings and not regexes)
Mar 16, 2025
56156b7
run test_thoughts in stream mode too
Mar 16, 2025
5dfa2f7
r1: avoid partial call triggers from spaces
Mar 16, 2025
91a5084
fix test_thoughts / refactor expectations
Mar 16, 2025
4f78d44
fix partial json crashes
Mar 16, 2025
ea57e47
fix test-chat's unparsed thought expectation
Mar 16, 2025
1d25178
Merge remote-tracking branch 'origin/master' into tool-diffs
Mar 23, 2025
42cb16f
fix partial json crash after comma
Mar 23, 2025
37b4a3a
fix test-chat.cpp
Mar 23, 2025
13d725d
fix gcc build of test
Mar 23, 2025
a40aead
Merge remote-tracking branch 'origin/master' into tool-diffs
Mar 26, 2025
329d943
Merge remote-tracking branch 'origin/master' into tool-diffs
Apr 1, 2025
e63e542
Merge remote-tracking branch 'origin/master' into tool-diffs
Apr 3, 2025
21cd34c
fix regex-partial (drop reluctant repetitions conversions)
Apr 4, 2025
5f0450d
partial regex: allow newlines in prefixes
Apr 4, 2025
36ecb01
tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
Apr 4, 2025
68eeff1
Update function-calling.md
Apr 4, 2025
12deff6
nit: spaces
Apr 4, 2025
d0a686b
Update tool_bench.py
Apr 4, 2025
a604b2d
Merge remote-tracking branch 'origin/master' into tool-diffs
Apr 4, 2025
90789cd
Inject date_string in llama 3.x + test it & functionary v2
Apr 5, 2025
71435cf
Inject date_string in llama 3.x + fix for functionary v2
Apr 5, 2025
543b73e
add missing chrono include
Apr 7, 2025
e3c372c
move/fix detection of functionary v3.1 before llama 3.x, fix & test t…
Apr 7, 2025
387611a
Merge branch 'date' into tool-diffs
Apr 7, 2025
01a3e31
Merge remote-tracking branch 'origin/master' into tool-diffs
Apr 7, 2025
59b87c5
move string_find_partial_stop & string_ends_with to common
Apr 7, 2025
ff35374
add common_regex (supports partial matches)
Apr 7, 2025
869e1a9
Update test-regex-partial.cpp
Apr 7, 2025
6f109fa
Update common/common.cpp
ochafik Apr 18, 2025
908e12f
Update common/regex-partial.cpp
ochafik Apr 18, 2025
868b442
Update common/regex-partial.cpp
ochafik Apr 18, 2025
2ea5f5c
Update common/regex-partial.h
ochafik Apr 18, 2025
b275da3
partial regex: add missing iterator end checks
Apr 18, 2025
9b620e5
string utils: use string_views
Apr 18, 2025
5c99bdc
direct throw to avoid ggml.h include
Apr 18, 2025
e051be6
regex-partial: replace missed ggml_asserts
Apr 18, 2025
afce553
Merge remote-tracking branch 'origin/master' into partial-regex
May 14, 2025
c879a57
Merge branch 'partial-regex' into tool-diffs
May 14, 2025
ad07a3b
Merge remote-tracking branch 'origin/master' into tool-diffs
May 15, 2025
573e8c3
fix merge
May 15, 2025
d6e1d5b
Merge remote-tracking branch 'origin/master' into tool-diffs
May 15, 2025
6946a83
Merge remote-tracking branch 'origin/master' into tool-diffs
May 15, 2025
224101b
chat-parser: remove input from exception (llm output may contain PII)
May 16, 2025
6ddda10
Merge remote-tracking branch 'origin/master' into tool-diffs
May 16, 2025
8886c24
disable failing tests from test_tool_call.py
May 16, 2025
810c4c3
json-partial: add comments
May 17, 2025
f0d5df2
Merge remote-tracking branch 'origin/master' into tool-diffs
May 23, 2025
40951c8
Merge remote-tracking branch 'origin/master' into tool-diffs
May 24, 2025
rename: common_chat_syntax (now contains format)
ochafik committed Mar 14, 2025
commit c46d4da4c2b7f3bfb9e3d555930d0ab2febf8a2e
12 changes: 6 additions & 6 deletions common/chat-parser.cpp
Original file line number Diff line number Diff line change
@@ -10,8 +10,8 @@

using json = nlohmann::ordered_json;

common_chat_msg_parser::common_chat_msg_parser(const std::string & input, bool is_partial, const common_chat_reasoning_syntax & reasoning_syntax)
: input_(input), is_partial_(is_partial), reasoning_syntax_(reasoning_syntax)
common_chat_msg_parser::common_chat_msg_parser(const std::string & input, bool is_partial, const common_chat_syntax & syntax)
: input_(input), is_partial_(is_partial), syntax_(syntax)
{
result_.role = "assistant";

@@ -127,14 +127,14 @@ void common_chat_msg_parser::consume_literal(const std::string & literal) {
}

void common_chat_msg_parser::try_consume_think_tags(const common_regex & start_think_regex, const common_regex & end_think_regex) {
if (reasoning_syntax_.format != COMMON_REASONING_FORMAT_NONE) {
if (reasoning_syntax_.thinking_forced_open || try_consume_regex(start_think_regex)) {
if (syntax_.reasoning_format != COMMON_REASONING_FORMAT_NONE) {
if (syntax_.thinking_forced_open || try_consume_regex(start_think_regex)) {
if (auto res = try_find_regex(end_think_regex)) {
result_.reasoning_content = res->prelude;
consume_spaces();
} else {
result_.reasoning_content = consume_rest();
if (!reasoning_syntax_.thinking_forced_open) {
if (!syntax_.thinking_forced_open) {
incomplete("Failed to find end of reasoning tag " + end_think_regex.str());
}
return;
@@ -218,7 +218,7 @@ std::optional<common_json> common_chat_msg_parser::try_consume_json(
// No healing marker, just return the parsed json
return result;
}
if (!is_partial_) {
if (!is_partial()) {
incomplete("JSON is incomplete");
return std::nullopt; // Actually unreachable
}
4 changes: 2 additions & 2 deletions common/chat-parser.h
@@ -16,14 +16,14 @@ class common_chat_msg_partial_exception : public std::runtime_error {
class common_chat_msg_parser {
std::string input_;
bool is_partial_;
common_chat_reasoning_syntax reasoning_syntax_;
common_chat_syntax syntax_;

size_t pos_ = 0;
common_chat_msg result_;
std::string healing_marker_;

public:
common_chat_msg_parser(const std::string & input, bool is_partial, const common_chat_reasoning_syntax & reasoning_syntax);
common_chat_msg_parser(const std::string & input, bool is_partial, const common_chat_syntax & syntax);
const std::string & input() const { return input_; }
size_t pos() const { return pos_; }
const std::string & healing_marker() const { return healing_marker_; }
51 changes: 29 additions & 22 deletions common/chat.cpp
@@ -578,17 +578,22 @@ static void parse_json_tool_calls(
// get_function_name signalled us that we should skip this match and treat it as content.
from = res->groups[0].begin + 1;
continue;
} else {
from = std::string::npos;
}
from = std::string::npos;

builder.add_content(res->prelude);
if (auto partial = builder.try_consume_json({{}})) {
std::string arguments = partial->json.dump();
if (!builder.add_tool_call(name, "", arguments, partial->healing_marker)) {
builder.incomplete("incomplete tool call");
auto maybe_raw_python = name == "python" && allow_raw_python;
if (builder.input()[builder.pos()] == '{' || !maybe_raw_python) {
if (auto partial = builder.try_consume_json({{}})) {
std::string arguments = partial->json.dump();
if (!builder.add_tool_call(name, "", arguments, partial->healing_marker)) {
builder.incomplete("incomplete tool call");
}
builder.consume_regex(close_regex);
}
builder.consume_regex(close_regex);
} else if (name == "python" && allow_raw_python) {
continue;
}
if (maybe_raw_python) {
auto code = builder.consume_rest();
std::string arguments;
common_healing_marker healing_marker;
@@ -602,13 +607,11 @@
builder.incomplete("incomplete tool call");
}
return;
} else {
builder.incomplete("incomplete tool call");
return;
}
} else {
break;
builder.incomplete("incomplete tool call");
return;
}
break;
}
if (block_close) {
builder.consume_regex(*block_close);
@@ -1238,14 +1241,18 @@ static common_chat_params common_chat_params_init_functionary_v3_2(const common_
std::string args_pattern = "[\\s\\S]*";
auto args_rule = builder.add_schema(name + "-args", parameters);
if (name == "python") {
args_pattern = "\\{" + args_pattern;
args_rule = builder.add_rule(name + "-maybe-raw-args", args_rule + " | [^{] .*");
} else {
args_pattern = "\\{" + args_pattern;
}
auto call_rule = builder.add_rule(name + "-call", "\"" + name + "\\n\" " + args_rule);
first_tool_rules.push_back(call_rule);
if (inputs.parallel_tool_calls) {
subsequent_tool_rules.push_back(builder.add_rule(name + "-call2", "\">>>\" " + call_rule));
}
first_tool_rules.push_back(builder.add_rule(name + "-call", "( \"assistant<|end_header_id|>\\n\" )? \"" + name + "\\n\" " + args_rule));
subsequent_tool_rules.push_back(builder.add_rule(name + "-call2", "\">>>" + name + "\\n\" " + args_rule));
data.grammar_triggers.push_back({
COMMON_GRAMMAR_TRIGGER_TYPE_PATTERN_FULL,
"((?:[\\s\\S]*?>>>)?" + regex_escape(name) + "\n)" + args_pattern,
"((?:[\\s\\S]+?>>>)?" + regex_escape(name) + "\n)" + args_pattern,
});
});
data.preserved_tokens = {
@@ -1771,20 +1778,20 @@ static void common_chat_parse(common_chat_msg_parser & builder, common_chat_form
builder.finish();
}

common_chat_msg common_chat_parse(const std::string & input, common_chat_format format, bool is_partial, const common_chat_reasoning_syntax & reasoning_syntax) {
common_chat_msg_parser builder(input, is_partial, reasoning_syntax);
common_chat_msg common_chat_parse(const std::string & input, bool is_partial, const common_chat_syntax & syntax) {
common_chat_msg_parser builder(input, is_partial, syntax);
try {
common_chat_parse(builder, format);
common_chat_parse(builder, syntax.format);
} catch (const common_chat_msg_partial_exception & ex) {
LOG_DBG("Partial parse: %s\n", ex.what());
if (!is_partial) {
throw std::runtime_error(ex.what());
}
}
auto msg = builder.result();
switch (reasoning_syntax.format) {
switch (syntax.reasoning_format) {
case COMMON_REASONING_FORMAT_DEEPSEEK:
if (!msg.reasoning_content.empty() && reasoning_syntax.inlined_in_content) {
if (!msg.reasoning_content.empty() && syntax.reasoning_in_content) {
std::string content = "<think>" + msg.reasoning_content;
if (!is_partial || !msg.content.empty()) {
content += "</think>";
12 changes: 7 additions & 5 deletions common/chat.h
@@ -123,10 +123,12 @@ struct common_chat_params {
std::vector<std::string> additional_stops;
};

struct common_chat_reasoning_syntax {
common_reasoning_format format = COMMON_REASONING_FORMAT_NONE;
bool inlined_in_content = false;
bool thinking_forced_open = false;
struct common_chat_syntax {
common_chat_format format = COMMON_CHAT_FORMAT_CONTENT_ONLY;
common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_NONE;
// Whether reasoning_content should be inlined in the content (e.g. for reasoning_format=deepseek in stream mode)
bool reasoning_in_content = false;
bool thinking_forced_open = false;
};

// Check if the template supplied via "--chat-template" is supported or not. Returns true if it's valid
@@ -166,7 +168,7 @@ std::string common_chat_format_example(
bool use_jinja);

std::string common_chat_format_name(common_chat_format format);
common_chat_msg common_chat_parse(const std::string & input, common_chat_format format, bool is_partial = false, const common_chat_reasoning_syntax & reasoning_syntax = {});
common_chat_msg common_chat_parse(const std::string & input, bool is_partial, const common_chat_syntax & syntax);

common_chat_tool_choice common_chat_tool_choice_parse_oaicompat(const std::string & tool_choice);

32 changes: 16 additions & 16 deletions examples/server/server.cpp
@@ -1,3 +1,4 @@
#include "chat.h"
#include "utils.hpp"

#include "arg.h"
@@ -117,8 +118,7 @@ struct slot_params {
oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE;
std::string oaicompat_model;
std::string oaicompat_cmpl_id;
common_chat_format oaicompat_chat_format = COMMON_CHAT_FORMAT_CONTENT_ONLY;
common_chat_reasoning_syntax oaicompat_reasoning_syntax;
common_chat_syntax oaicompat_chat_syntax;

json to_json() const {
std::vector<std::string> samplers;
@@ -174,7 +174,10 @@
{"grammar_lazy", sampling.grammar_lazy},
{"grammar_triggers", grammar_triggers},
{"preserved_tokens", sampling.preserved_tokens},
{"chat_format", common_chat_format_name(oaicompat_chat_format)},
{"chat_format", common_chat_format_name(oaicompat_chat_syntax.format)},
{"reasoning_format", (oaicompat_chat_syntax.reasoning_format == COMMON_REASONING_FORMAT_DEEPSEEK ? "deepseek" : "none")},
{"reasoning_in_content", oaicompat_chat_syntax.reasoning_in_content},
{"thinking_forced_open", oaicompat_chat_syntax.thinking_forced_open},
{"samplers", samplers},
{"speculative.n_max", speculative.n_max},
{"speculative.n_min", speculative.n_min},
@@ -349,14 +352,14 @@ struct server_task {
{
auto it = data.find("chat_format");
if (it != data.end()) {
params.oaicompat_chat_format = static_cast<common_chat_format>(it->get<int>());
SRV_INF("Chat format: %s\n", common_chat_format_name(params.oaicompat_chat_format).c_str());
params.oaicompat_chat_syntax.format = static_cast<common_chat_format>(it->get<int>());
[Inline review comment — Copilot AI, Apr 30, 2025] The transition from a simple chat format enum to a full common_chat_syntax struct enhances flexibility, but consider adding inline documentation or comments on the new fields (reasoning_format, reasoning_in_content, thinking_forced_open) to aid readability and backward compatibility.

SRV_INF("Chat format: %s\n", common_chat_format_name(params.oaicompat_chat_syntax.format).c_str());
} else {
params.oaicompat_chat_format = defaults.oaicompat_chat_format;
params.oaicompat_chat_syntax.format = defaults.oaicompat_chat_syntax.format;
}
params.oaicompat_reasoning_syntax.format = params_base.reasoning_format;
params.oaicompat_reasoning_syntax.inlined_in_content = params.stream;
params.oaicompat_reasoning_syntax.thinking_forced_open = json_value(data, "thinking_forced_open", false);
params.oaicompat_chat_syntax.reasoning_format = params_base.reasoning_format;
params.oaicompat_chat_syntax.reasoning_in_content = params.stream;
params.oaicompat_chat_syntax.thinking_forced_open = json_value(data, "thinking_forced_open", false);
}

{
@@ -632,7 +635,7 @@ struct server_task_result_cmpl_final : server_task_result {
oaicompat_type oaicompat = OAICOMPAT_TYPE_NONE;
std::string oaicompat_model;
std::string oaicompat_cmpl_id;
common_chat_format oaicompat_chat_format = COMMON_CHAT_FORMAT_CONTENT_ONLY;
common_chat_syntax oaicompat_chat_syntax;
common_chat_msg oaicompat_msg;

virtual int get_index() override {
@@ -2335,9 +2338,8 @@ struct server_context {
SRV_DBG("Parsing chat message: %s\n", slot.generated_text.c_str());
auto new_msg = common_chat_parse(
slot.generated_text,
slot.params.oaicompat_chat_format,
/* is_partial= */ true,
slot.params.oaicompat_reasoning_syntax);
slot.params.oaicompat_chat_syntax);
if (!new_msg.empty()) {
slot.generated_msg = new_msg;
}
@@ -2347,7 +2349,6 @@
// res->previous_content = slot.generated_text.substr(0, slot.generated_text.size() - tkn.text_to_send.size());
// res->oaicompat_chat_format = slot.params.oaicompat_chat_format;


// populate res.probs_output
if (slot.params.sampling.n_probs > 0) {
res->prob_output = tkn; // copy the token probs
@@ -2391,10 +2392,9 @@
SRV_DBG("Parsing chat message: %s\n", res->content.c_str());
res->oaicompat_msg = slot.generated_msg = common_chat_parse(
res->content,
slot.params.oaicompat_chat_format,
/* is_partial= */ slot.stop == STOP_TYPE_LIMIT,
slot.params.oaicompat_reasoning_syntax);
res->oaicompat_chat_format = slot.params.oaicompat_chat_format;
slot.params.oaicompat_chat_syntax);
res->oaicompat_chat_syntax = slot.params.oaicompat_chat_syntax;

// populate res.probs_output
if (slot.params.sampling.n_probs > 0) {