Eval bug: Server Returns Empty Responses Under High Load #13703

Open
prd-tuong-nguyen opened this issue May 22, 2025 · 0 comments

Name and Version

Image: https://github.com/ggml-org/llama.cpp/pkgs/container/llama.cpp/421179669?tag=server-cuda-b5452

Operating systems

Linux

GGML backends

CUDA

Hardware

L40S GPU (aws g6e.xlarge instance)

Models

https://huggingface.co/lmstudio-community/gemma-3-27b-it-GGUF/resolve/main/gemma-3-27b-it-Q4_K_M.gguf

Problem description & steps to reproduce

Command: docker run -p 8080:8080 -it --gpus all ghcr.io/ggml-org/llama.cpp:server-cuda --port 8080 --host 0.0.0.0 --parallel 16 --ctx-size 49152 --cont-batching --slot-prompt-similarity 0.3 --n-gpu-layers 100000000 --flash-attn --no-warmup --jinja --lora-init-without-apply --lora-scaled ... -m ...
After a burst of high-load requests (50 concurrent users), the server returns empty content even though completion_tokens looks normal.
This is the difference in the logs.
The failing version:
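
For reference, a minimal load-test sketch (Python, standard library only) that approximates this traffic pattern is below. It is not the original test harness; the endpoint path, prompt, and max_tokens are assumptions to adjust for your deployment.

# Minimal load-test sketch, not the original harness: fire 50 concurrent
# chat-completion requests at the OpenAI-compatible endpoint exposed by
# llama-server and count how many come back with empty content.
# URL, prompt and max_tokens below are assumptions.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:8080/v1/chat/completions"  # assumed server address

def one_request(i: int) -> dict:
    payload = {
        "model": "google/gemma-3-27b-it",
        "messages": [{"role": "user",
                      "content": f"Write a short summary of request {i}."}],
        "max_tokens": 64,
    }
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)

# 50 workers ~ 50 concurrent users hitting the 16 server slots at once.
with ThreadPoolExecutor(max_workers=50) as pool:
    responses = list(pool.map(one_request, range(50)))

empty = sum(1 for r in responses
            if not (r["choices"][0]["message"]["content"] or "").strip())
print(f"empty responses: {empty}/{len(responses)}")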

llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 49152 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =  3840.00 MiB
llama_kv_cache_unified: size = 3840.00 MiB ( 49152 cells,  10 layers), K (f16): 1920.00 MiB, V (f16): 1920.00 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 18432 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =  7488.00 MiB
llama_kv_cache_unified: size = 7488.00 MiB ( 18432 cells,  52 layers), K (f16): 3744.00 MiB, V (f16): 3744.00 MiB
llama_context:      CUDA0 compute buffer size =   130.62 MiB
llama_context:  CUDA_Host compute buffer size =    35.63 MiB
llama_context: graph nodes  = 2489
llama_context: graph splits = 2
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting

The previous version (works fine):

llama_kv_cache_unified: kv_size = 49152, type_k = 'f16', type_v = 'f16', n_layer = 62, can_shift = 1, padding = 256
llama_kv_cache_unified:      CUDA0 KV buffer size = 23808.00 MiB
llama_kv_cache_unified: KV self size  = 23808.00 MiB, K (f16): 11904.00 MiB, V (f16): 11904.00 MiB
llama_context:      CUDA0 compute buffer size =   522.50 MiB
llama_context:  CUDA_Host compute buffer size =   202.51 MiB
llama_context: graph nodes  = 2365
llama_context: graph splits = 2

Example of an erroneous response:

{
    "id": "chatcmpl-lck9Lm6T24oN61ZuwgXtfzWxYBLhKzkO",
    "created": 1747909068,
    "model": "google/gemma-3-27b-it",
    "object": "chat.completion",
    "system_fingerprint": "b5428-f0adb80b",
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "message": {
                "content": "l",
                "role": "assistant",
                "tool_calls": null,
                "function_call": null,
                "refusal": null
            }
        }
    ],
    "usage": {
        "completion_tokens": 45,
        "prompt_tokens": 630,
        "total_tokens": 675,
        "completion_tokens_details": null,
        "prompt_tokens_details": null
    },
    "service_tier": null,
    "timings": {
        "prompt_n": 630,
        "prompt_ms": 430.044,
        "prompt_per_token_ms": 0.6826095238095238,
        "prompt_per_second": 1464.9663755336665,
        "predicted_n": 45,
        "predicted_ms": 1818.511,
        "predicted_per_token_ms": 40.41135555555555,
        "predicted_per_second": 24.74551982363593
    }
}
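
The symptom above (nonzero completion_tokens but an essentially empty message.content) is easy to flag programmatically. A small sketch, assuming responses have the shape shown above:

# Flags the symptom shown above: the usage block reports generated tokens,
# but the returned message content is empty or a single stray character.
def looks_truncated(response: dict) -> bool:
    content = response["choices"][0]["message"]["content"] or ""
    completion_tokens = response["usage"]["completion_tokens"]
    return completion_tokens > 0 and len(content.strip()) <= 1

# Trimmed-down version of the response above, for illustration only.
example = {
    "choices": [{"message": {"content": "l"}}],
    "usage": {"completion_tokens": 45},
}
print(looks_truncated(example))  # True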

First Bad Commit

This image is still working fine: https://github.com/ggml-org/llama.cpp/pkgs/container/llama.cpp/419241486?tag=server-cuda-b5428

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA L40S, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from /app/libggml-cuda.so
load_backend: loaded CPU backend from /app/libggml-cpu-haswell.so
warn: LLAMA_ARG_HOST environment variable is set, but will be overwritten by command line argument --host
build: 5439 (33983057) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 2, n_threads_batch = 2, total_threads = 4

system_info: n_threads = 2 (n_threads_batch = 2) / 4 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: binding port with default address family
main: HTTP server is listening, hostname: 0.0.0.0, port: 8080, http threads: 18
main: loading model
srv    load_model: loading model '/models/gemma-3-27b-it-Q4_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA L40S) - 45036 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 808 tensors from /models/gemma-3-27b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 3 27b It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = gemma-3
llama_model_loader: - kv   5:                         general.size_label str              = 27B
llama_model_loader: - kv   6:                            general.license str              = gemma
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Gemma 3 27b Pt
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Google
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                      gemma3.context_length u32              = 131072
llama_model_loader: - kv  13:                    gemma3.embedding_length u32              = 5376
llama_model_loader: - kv  14:                         gemma3.block_count u32              = 62
llama_model_loader: - kv  15:                 gemma3.feed_forward_length u32              = 21504
llama_model_loader: - kv  16:                gemma3.attention.head_count u32              = 32
llama_model_loader: - kv  17:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma3.attention.key_length u32              = 128
llama_model_loader: - kv  19:              gemma3.attention.value_length u32              = 128
llama_model_loader: - kv  20:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  21:            gemma3.attention.sliding_window u32              = 1024
llama_model_loader: - kv  22:             gemma3.attention.head_count_kv u32              = 16
llama_model_loader: - kv  23:                   gemma3.rope.scaling.type str              = linear
llama_model_loader: - kv  24:                 gemma3.rope.scaling.factor f32              = 8.000000
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  36:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  37:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  38:               general.quantization_version u32              = 2
llama_model_loader: - kv  39:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  373 tensors
llama_model_loader: - type q4_K:  374 tensors
llama_model_loader: - type q6_K:   61 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 15.40 GiB (4.90 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 5376
print_info: n_layer          = 62
print_info: n_head           = 32
print_info: n_head_kv        = 16
print_info: n_rot            = 128
print_info: n_swa            = 1024
print_info: n_swa_pattern    = 6
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 7.7e-02
print_info: n_ff             = 21504
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 27B
print_info: model params     = 27.01 B
print_info: general.name     = Gemma 3 27b It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 62 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 63/63 layers to GPU
load_tensors:        CUDA0 model buffer size = 15773.70 MiB
load_tensors:   CPU_Mapped model buffer size =  1102.50 MiB
.........................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 16
llama_context: n_ctx         = 49152
llama_context: n_ctx_per_seq = 3072
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 128
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 0.125
llama_context: n_ctx_per_seq (3072) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =    16.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 49152 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =  3840.00 MiB
llama_kv_cache_unified: size = 3840.00 MiB ( 49152 cells,  10 layers), K (f16): 1920.00 MiB, V (f16): 1920.00 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 18432 cells
llama_kv_cache_unified:      CUDA0 KV buffer size =  7488.00 MiB
llama_kv_cache_unified: size = 7488.00 MiB ( 18432 cells,  52 layers), K (f16): 3744.00 MiB, V (f16): 3744.00 MiB
llama_context:      CUDA0 compute buffer size =   130.62 MiB
llama_context:  CUDA_Host compute buffer size =    35.63 MiB
llama_context: graph nodes  = 2489
llama_context: graph splits = 2
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
llama_adapter_lora_init_impl: loading lora adapter from '/models/adapters/care_augment_river_v0.1.31.gguf' ...
llama_adapter_lora_init_impl:      CUDA0 LoRA buffer size =   433.03 MiB
llama_adapter_lora_init_impl: loaded 868 tensors from lora file
common_init_from_params: setting dry_penalty_last_n to ctx_size = 49152
srv          init: initializing slots, n_slots = 16
slot         init: id  0 | task -1 | new slot n_ctx_slot = 3072
slot         init: id  1 | task -1 | new slot n_ctx_slot = 3072
slot         init: id  2 | task -1 | new slot n_ctx_slot = 3072
slot         init: id  3 | task -1 | new slot n_ctx_slot = 3072
slot         init: id  4 | task -1 | new slot n_ctx_slot = 3072
slot         init: id  5 | task -1 | new slot n_ctx_slot = 3072
slot         init: id  6 | task -1 | new slot n_ctx_slot = 3072
slot         init: id  7 | task -1 | new slot n_ctx_slot = 3072
slot         init: id  8 | task -1 | new slot n_ctx_slot = 3072
slot         init: id  9 | task -1 | new slot n_ctx_slot = 3072
slot         init: id 10 | task -1 | new slot n_ctx_slot = 3072
slot         init: id 11 | task -1 | new slot n_ctx_slot = 3072
slot         init: id 12 | task -1 | new slot n_ctx_slot = 3072
slot         init: id 13 | task -1 | new slot n_ctx_slot = 3072
slot         init: id 14 | task -1 | new slot n_ctx_slot = 3072
slot         init: id 15 | task -1 | new slot n_ctx_slot = 3072
main: model loaded
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
    {%- if messages[0]['content'] is string -%}
        {%- set first_user_prefix = messages[0]['content'] + '

' -%}
    {%- else -%}
        {%- set first_user_prefix = messages[0]['content'][0]['text'] + '

' -%}
    {%- endif -%}
    {%- set loop_messages = messages[1:] -%}
{%- else -%}
    {%- set first_user_prefix = "" -%}
    {%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
    {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
        {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
    {%- endif -%}
    {%- if (message['role'] == 'assistant') -%}
        {%- set role = "model" -%}
    {%- else -%}
        {%- set role = message['role'] -%}
    {%- endif -%}
    {{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
    {%- if message['content'] is string -%}
        {{ message['content'] | trim }}
    {%- elif message['content'] is iterable -%}
        {%- for item in message['content'] -%}
            {%- if item['type'] == 'image' -%}
                {{ '<start_of_image>' }}
            {%- elif item['type'] == 'text' -%}
                {{ item['text'] | trim }}
            {%- endif -%}
        {%- endfor -%}
    {%- else -%}
        {{ raise_exception("Invalid content type") }}
    {%- endif -%}
    {{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
    {{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant

Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://0.0.0.0:8080 - starting the main loop