
Eval bug: Long text numbers/words in prompt breaks llama.cpp permanently in parallel mode with flash attention #12758


Closed
jj123451 opened this issue Apr 4, 2025 · 2 comments

Comments

jj123451 commented Apr 4, 2025

Name and Version

version: 5050 (23106f9)
built with MSVC 19.29.30158.0 for
(https://github.com/ggml-org/llama.cpp/releases/download/b5050/llama-b5050-bin-win-cuda-cu12.4-x64.zip)

Operating systems

Windows

GGML backends

CUDA

Hardware

i5 13600K + RTX 4080

Models

google_gemma-3-12b-it-IQ4_XS
(https://huggingface.co/bartowski/google_gemma-3-12b-it-GGUF/blob/main/google_gemma-3-12b-it-IQ4_XS.gguf)

Problem description & steps to reproduce

Prerequisite:

  • occurs only in parallel mode with flash attention
    • in my case: it fails with -np 3 -fa, works fine with -np 1 -fa and with -np 3 (without -fa)

Symptoms:

  • After executing a prompt that contains a really long number, every subsequent prompt execution runs "forever" without producing a correct answer (by "forever" I mean until the number of tokens to predict is reached)
  • The symptoms do not resolve on their own; restarting llama-server is the only fix

Recreate steps:

  • run llama server: llama-server.exe -m "google_gemma-3-12b-it-IQ4_XS.gguf" --port 8087 --api-key "empty" -n 1000 -fa -lv 0 -ngl 999 -c 21000 -np 3 --temp 1.0 --top-k 64 --top-p 0.95
  • execute the following Python script (attached as test.txt; rename it to test.py first) - a reconstructed sketch is shown right after this list
    • create a Python venv with the OpenAI library, e.g. uv venv . --python 3.12
    • install the openai library: uv pip install openai
    • activate the venv: Scripts\activate
    • run python.exe .\test.py
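
For reference, here is a minimal sketch of what the attached test script does, reconstructed from the description in this issue (the attachment itself is not reproduced here). The base URL, port, and API key match the llama-server command above; the prompt is abbreviated, and the model name and LONG_NUMBER value are illustrative assumptions rather than copies of the original test.py.

# Minimal sketch of the attached test.py, reconstructed from the issue description.
# Assumptions (not taken from the original attachment): the prompt below is abbreviated,
# LONG_NUMBER only approximates the very long order number, and the model name is a
# placeholder (llama-server serves whichever model it was started with).
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8087/v1", api_key="empty")

PROMPT_TEMPLATE = (
    "You are an assistant helping to analyze my email.\n"
    "Your task is to assess whether the given email is an advertisement.\n"
    "<mail>\n"
    "Subject: ALIBIURO.pl Order confirmation 310906 (512905216)\n"
    "Summary: Order no. {order_no} (512905216) has been placed and forwarded for processing.\n"
    "</mail>\n"
    "Place the final decision at the end of the answer in the following form:\n"
    "'Advertisement: YES' if it is an advertisement\n"
    "'Advertisement: NO' otherwise"
)

SHORT_NUMBER = "310906"
LONG_NUMBER = "1" + "0" * 1000  # illustrative stand-in for the very long order number

# Executions 1, 2, 4 and 5 use the short number; execution 3 uses the long one.
order_numbers = [SHORT_NUMBER, SHORT_NUMBER, LONG_NUMBER, SHORT_NUMBER, SHORT_NUMBER]
for i, order_no in enumerate(order_numbers, start=1):
    response = client.chat.completions.create(
        model="gemma-3-12b-it",  # placeholder name; the server uses the loaded model
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(order_no=order_no)}],
    )
    print(f"Execution {i}: {response.choices[0].message.content!r}")

On the failing configuration, executions 4 and 5 are expected to hang until the token limit is reached, as described under Behaviour below.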

Behaviour:

  • The script executes (almost) the same prompt 5 times; the only difference is that execution no. 3 contains a really long number.
  • executions 1 & 2 work fine (fast, correct response)
  • execution 3 fails (long, no answer) - but this is somewhat expected
  • executions 4 & 5 also fail (even though they are identical to 1 & 2 and should not fail)

First Bad Commit

Not sure; I only started experimenting with parallel execution recently. It might have been present from the very beginning.

Relevant log output

The problematic prompt (also included in the Python file):
You are an assistant helping to analyze my email.
The analyzed emails come from my private mailbox: some_email@gmail.com
Below are the details of a particular email: Subject, Sender, Date, Recipient, and Summary of the email content.
Your task is to assess whether the given email is an advertisement.
Examples of such emails may include emails consisting entirely of product advertisements, emails informing about sales or promotions, and similar.
Below are the data and summary of the email for analysis, they begin with the tag: <mail> and end with the tag </mail>
<mail>
Subject: ALIBIURO.pl Order confirmation 310906 (512905216)
Sender: "Alibiuro.pl" info@alibiuro.pl
Date: 2023-02-27T08:06:00Z
Recipient: some_email@gmail.com
Summary: Order no. 11000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 (512905216) has been placed and forwarded for processing by Alibiuro. The package will be sent within 24 business hours and a shipping notification along with the tracking number will be sent by email.
</mail>
It may happen that an email is an advertisement only in part, then consider it not an advertisement.
If you are not sure whether the email is an advertisement, consider it not an advertisement.
After finishing, for certainty, verify once more the correctness of your assessment. 
Please don't include your reasoning description in the answer. 
Place the final decision at the end of the answer in the following form:
'Advertisement: YES' if it is an advertisement
'Advertisement: NO' otherwise

llama_server logs:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
build: 5050 (23106f94) with MSVC 19.29.30158.0 for
system info: n_threads = 14, n_threads_batch = 14, total_threads = 20

system_info: n_threads = 14 (n_threads_batch = 14) / 20 | CUDA : ARCHS = 500,610,700,750,800 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8087, http threads: 19
main: loading model
srv    load_model: loading model 'D:/llm/models/google_gemma-3-12b-it-IQ4_XS/google_gemma-3-12b-it-IQ4_XS.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4080) - 15048 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 626 tensors from D:/llm/models/google_gemma-3-12b-it-IQ4_XS/google_gemma-3-12b-it-IQ4_XS.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 12b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 12B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 12b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 3840
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 48
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 15360
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 16
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  23:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  24:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  37:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 30
llama_model_loader: - kv  40:                      quantize.imatrix.file str              = /models_out/gemma-3-12b-it-GGUF/googl...
llama_model_loader: - kv  41:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  42:             quantize.imatrix.entries_count i32              = 336
llama_model_loader: - kv  43:              quantize.imatrix.chunks_count i32              = 129
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  336 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ4_XS - 4.25 bpw
print_info: file size   = 6.09 GiB (4.45 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3840
print_info: n_layer          = 48
print_info: n_head           = 16
print_info: n_head_kv        = 8
print_info: n_rot            = 256
print_info: n_swa            = 1024
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 15360
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 12B
print_info: model params     = 11.77 B
print_info: general.name     = Gemma 3 12b It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =  6241.10 MiB
load_tensors:   CPU_Mapped model buffer size =   787.50 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 3
llama_context: n_ctx         = 21000
llama_context: n_ctx_per_seq = 7000
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (7000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     3.00 MiB
init: kv_size = 21248, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
init:      CUDA0 KV buffer size =  7968.00 MiB
llama_context: KV self size  = 7968.00 MiB, K (f16): 3984.00 MiB, V (f16): 3984.00 MiB
llama_context:      CUDA0 compute buffer size =   519.50 MiB
llama_context:  CUDA_Host compute buffer size =    90.51 MiB
llama_context: graph nodes  = 1833
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 21248
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 3
slot         init: id  0 | task -1 | new slot n_ctx_slot = 7082
slot         init: id  1 | task -1 | new slot n_ctx_slot = 7082
slot         init: id  2 | task -1 | new slot n_ctx_slot = 7082
main: model loaded
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- if messages[0]['content'] is string -%}
        {%- set first_user_prefix = messages[0]['content'] + '

' -%}
    {%- else -%}
        {%- set first_user_prefix = messages[0]['content'][0]['text'] + '

' -%}
    {%- endif -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
    {%- endif -%}
    {%- if (message['role'] == 'assistant') -%}
        {%- set role = "model" -%}
    {%- else -%}
        {%- set role = message['role'] -%}
    {%- endif -%}
    {{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {{ message['content'] | trim }}
    {%- elif message['content'] is iterable -%}
        {%- for item in message['content'] -%}
            {%- if item['type'] == 'image' -%}
                {{ '<start_of_image>' }}
            {%- elif item['type'] == 'text' -%}
                {{ item['text'] | trim }}
            {%- endif -%}
        {%- endfor -%}
    {%- else -%}
        {{ raise_exception("Invalid content type") }}
    {%- endif -%}
    {{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:8087 - starting the main loop
srv  update_slots: all slots are idle
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 7082, n_keep = 0, n_prompt_tokens = 385
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 385, n_tokens = 385, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 385, n_tokens = 385
slot      release: id  0 | task 0 | stop processing: n_past = 388, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =     339.74 ms /   385 tokens (    0.88 ms per token,  1133.23 tokens per second)
       eval time =      49.53 ms /     4 tokens (   12.38 ms per token,    80.76 tokens per second)
      total time =     389.27 ms /   389 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 5 | processing task
slot update_slots: id  0 | task 5 | new prompt, n_ctx_slot = 7082, n_keep = 0, n_prompt_tokens = 385
slot update_slots: id  0 | task 5 | need to evaluate at least 1 token to generate logits, n_past = 385, n_prompt_tokens = 385
slot update_slots: id  0 | task 5 | kv cache rm [384, end)
slot update_slots: id  0 | task 5 | prompt processing progress, n_past = 385, n_tokens = 1, progress = 0.002597
slot update_slots: id  0 | task 5 | prompt done, n_past = 385, n_tokens = 1
slot      release: id  0 | task 5 | stop processing: n_past = 388, truncated = 0
slot print_timing: id  0 | task 5 |
prompt eval time =      14.51 ms /     1 tokens (   14.51 ms per token,    68.93 tokens per second)
       eval time =      43.55 ms /     4 tokens (   10.89 ms per token,    91.84 tokens per second)
      total time =      58.06 ms /     5 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  0 | task 10 | processing task
slot update_slots: id  0 | task 10 | new prompt, n_ctx_slot = 7082, n_keep = 0, n_prompt_tokens = 1359
slot update_slots: id  0 | task 10 | kv cache rm [216, end)
slot update_slots: id  0 | task 10 | prompt processing progress, n_past = 1359, n_tokens = 1143, progress = 0.841060
slot update_slots: id  0 | task 10 | prompt done, n_past = 1359, n_tokens = 1143
slot      release: id  0 | task 10 | stop processing: n_past = 2358, truncated = 0
slot print_timing: id  0 | task 10 |
prompt eval time =     308.26 ms /  1143 tokens (    0.27 ms per token,  3707.85 tokens per second)
       eval time =   15146.68 ms /  1000 tokens (   15.15 ms per token,    66.02 tokens per second)
      total time =   15454.95 ms /  2143 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  1 | task 1011 | processing task
slot update_slots: id  1 | task 1011 | new prompt, n_ctx_slot = 7082, n_keep = 0, n_prompt_tokens = 385
slot update_slots: id  1 | task 1011 | kv cache rm [0, end)
slot update_slots: id  1 | task 1011 | prompt processing progress, n_past = 385, n_tokens = 385, progress = 1.000000
slot update_slots: id  1 | task 1011 | prompt done, n_past = 385, n_tokens = 385
slot      release: id  1 | task 1011 | stop processing: n_past = 1384, truncated = 0
slot print_timing: id  1 | task 1011 |
prompt eval time =     129.78 ms /   385 tokens (    0.34 ms per token,  2966.67 tokens per second)
       eval time =   16373.33 ms /  1000 tokens (   16.37 ms per token,    61.07 tokens per second)
      total time =   16503.11 ms /  1385 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv  params_from_: Chat format: Content-only
slot launch_slot_: id  2 | task 2012 | processing task
slot update_slots: id  2 | task 2012 | new prompt, n_ctx_slot = 7082, n_keep = 0, n_prompt_tokens = 385
slot update_slots: id  2 | task 2012 | kv cache rm [0, end)
slot update_slots: id  2 | task 2012 | prompt processing progress, n_past = 385, n_tokens = 385, progress = 1.000000
slot update_slots: id  2 | task 2012 | prompt done, n_past = 385, n_tokens = 385
slot      release: id  2 | task 2012 | stop processing: n_past = 1384, truncated = 0
slot print_timing: id  2 | task 2012 |
prompt eval time =     133.54 ms /   385 tokens (    0.35 ms per token,  2882.94 tokens per second)
       eval time =   17365.05 ms /  1000 tokens (   17.37 ms per token,    57.59 tokens per second)
      total time =   17498.60 ms /  1385 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
jj123451 (Author) commented Apr 4, 2025

I also checked older versions back to the first one supporting Gemma 3 (b4997, b4923, b4898, b4875) - all of them have the same error.

jj123451 changed the title from "Eval bug: Long text numbers in prompt breaks llama.cpp permanently in parallel mode" to "Eval bug: Long text numbers/words in prompt breaks llama.cpp permanently in parallel mode with flash attention" on Apr 4, 2025
github-actions bot added the stale label May 5, 2025

This issue was closed because it has been inactive for 14 days since being marked as stale.
