Eval bug: No output using llama-batched-bench #13553 · ggml-org/llama.cpp


Closed
shibizhao opened this issue May 15, 2025 · 2 comments

Comments

@shibizhao
Contributor

Name and Version

$ ./build_cpu/bin/llama-cli
build: 5374 (72df31d) with cc (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CPU

Hardware

Intel 6248R

Models

Llama2-7b-q8_0.gguf

Problem description & steps to reproduce

Hi, when I run the command ./build_cpu/bin/llama-batched-bench -m ~/LLM/GGUF/Llama-2-7B-GGUF/llama-2-7b.Q8_0.gguf -npp 512 -ntg 512 -npl 128, no rows appear in the results table:

main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 48, n_threads_batch = 48

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|

llama_perf_context_print:        load time =    1418.38 ms
llama_perf_context_print: prompt eval time =     195.92 ms /    16 tokens (   12.24 ms per token,    81.67 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    1418.39 ms /    17 tokens

I also evaluated the same command on ARM CPUs and got the same output.

Thanks.

First Bad Commit

No response

Relevant log output

main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 48, n_threads_batch = 48

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|

llama_perf_context_print:        load time =    1418.38 ms
llama_perf_context_print: prompt eval time =     195.92 ms /    16 tokens (   12.24 ms per token,    81.67 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    1418.39 ms /    17 tokens
@ggerganov
Member

-npl 128 means that the benchmark will try to allocate 128 parallel sequences, each with a prompt of 512 tokens. This does not fit in the context size that you specified (n_kv_max = 4096): even the prompts alone require 128 × 512 = 65536 KV-cache slots, so the configuration is skipped and no table rows are printed. Try reducing it, e.g. -npl 1,2,3,4,....
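For example, something along these lines should produce table rows (a sketch reusing the model path from the report; with the default n_kv_max = 4096 and is_pp_shared = 0, -npl 4 needs at most 4 × (512 + 512) = 4096 KV-cache slots, so all four configurations should fit):

./build_cpu/bin/llama-batched-bench -m ~/LLM/GGUF/Llama-2-7B-GGUF/llama-2-7b.Q8_0.gguf -npp 512 -ntg 512 -npl 1,2,3,4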

@shibizhao
Contributor Author

Thanks for your reply! So I should increase the context (-c) or reduce the number of sequences (-npl).
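As a rough sketch of the first option (increasing -c), assuming is_pp_shared = 0 so each sequence keeps its own 512-token prompt plus 512 generated tokens in the KV cache, 8 parallel sequences need about 8 × (512 + 512) = 8192 KV slots, so something like this should fill the table, memory permitting:

./build_cpu/bin/llama-batched-bench -m ~/LLM/GGUF/Llama-2-7B-GGUF/llama-2-7b.Q8_0.gguf -c 8192 -npp 512 -ntg 512 -npl 8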
