Eval bug: No output using llama-batched-bench #13553 · ggml-org/llama.cpp


Closed
shibizhao opened this issue May 15, 2025 · 2 comments

Comments

@shibizhao
Contributor

Name and Version

$ ./build_cpu/bin/llama-cli
build: 5374 (72df31d) with cc (Ubuntu 13.1.0-8ubuntu1~22.04) 13.1.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

CPU

Hardware

Intel 6248R

Models

Llama2-7b-q8_0.gguf

Problem description & steps to reproduce

Hi, when I run the command ./build_cpu/bin/llama-batched-bench -m ~/LLM/GGUF/Llama-2-7B-GGUF/llama-2-7b.Q8_0.gguf -npp 512 -ntg 512 -npl 128, no rows appear in the results table:

main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 48, n_threads_batch = 48

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|

llama_perf_context_print:        load time =    1418.38 ms
llama_perf_context_print: prompt eval time =     195.92 ms /    16 tokens (   12.24 ms per token,    81.67 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    1418.39 ms /    17 tokens

I also evaluated the same command on ARM CPUs and got the same output.

Thanks.

First Bad Commit

No response

Relevant log output

main: n_kv_max = 4096, n_batch = 2048, n_ubatch = 512, flash_attn = 0, is_pp_shared = 0, n_gpu_layers = -1, n_threads = 48, n_threads_batch = 48

|    PP |     TG |    B |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |      T s |    S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|

llama_perf_context_print:        load time =    1418.38 ms
llama_perf_context_print: prompt eval time =     195.92 ms /    16 tokens (   12.24 ms per token,    81.67 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    1418.39 ms /    17 tokens
@ggerganov
Member

-npl 128 means that the benchmark will try to allocate 128 parallel sequences, each with a prompt of 512 tokens. This does not fit in the context size that you specified (n_kv_max = 4096): even the prompts alone require 128 × 512 = 65536 KV-cache slots, so the configuration is skipped and no table rows are printed. Try reducing it, e.g. -npl 1,2,3,4,....
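For example, something along these lines should produce table rows (a sketch reusing the model path from the report; with the default n_kv_max = 4096 and is_pp_shared = 0, -npl 4 needs at most 4 × (512 + 512) = 4096 KV-cache slots, so all four configurations should fit):

./build_cpu/bin/llama-batched-bench -m ~/LLM/GGUF/Llama-2-7B-GGUF/llama-2-7b.Q8_0.gguf -npp 512 -ntg 512 -npl 1,2,3,4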

@shibizhao
Contributor Author

Thanks for your reply! So I should increase the context (-c) or reduce the number of sequences (-npl).
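As a rough sketch of the first option (increasing -c), assuming is_pp_shared = 0 so each sequence keeps its own 512-token prompt plus 512 generated tokens in the KV cache, 8 parallel sequences need about 8 × (512 + 512) = 8192 KV slots, so something like this should fill the table, memory permitting:

./build_cpu/bin/llama-batched-bench -m ~/LLM/GGUF/Llama-2-7B-GGUF/llama-2-7b.Q8_0.gguf -c 8192 -npp 512 -ntg 512 -npl 8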
