
Eval bug: Output garbled on DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from unsloth using musa backend with VMM off #13788


Open
yeahdongcn opened this issue May 26, 2025 · 5 comments

Comments

yeahdongcn (Collaborator) commented May 26, 2025

Name and Version

root@f7cd9f1a2456:/ws# ./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
Device 0: MTT S80, compute capability 2.1, VMM: no
version: 5488 (e121edc)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

Musa

Hardware

12th Gen Intel(R) Core(TM) i5-12400 + MTT S80

Models

DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from unsloth

Problem description & steps to reproduce

  1. git reset 2d77d88e70d017cd82c3f1a4517e3102e2028ac4 --hard
  2. apply the diff below and rebuild
diff --git a/ggml/src/ggml-musa/CMakeLists.txt b/ggml/src/ggml-musa/CMakeLists.txt
index 92f05d555..eb80418b2 100644
--- a/ggml/src/ggml-musa/CMakeLists.txt
+++ b/ggml/src/ggml-musa/CMakeLists.txt
@@ -75,9 +75,9 @@ if (MUSAToolkit_FOUND)
         add_compile_definitions(GGML_CUDA_FORCE_CUBLAS)
     endif()
 
-    if (GGML_CUDA_NO_VMM)
+    # if (GGML_CUDA_NO_VMM)
         add_compile_definitions(GGML_CUDA_NO_VMM)
-    endif()
+    # endif()
 
     if (NOT GGML_CUDA_FA)
         add_compile_definitions(GGML_CUDA_NO_FA)
  3. run DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from unsloth
root@f7cd9f1a2456:/ws# ./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: no
build: 4953 (2d77d88e7) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15752 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                       general.organization str              = Deepseek Ai
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.36 GiB (4.91 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = DeepSeek R1 Distill Qwen 7B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
make_cpu_buft_list: disabling extra buffer types (i.e. repacking) since a GPU device is available
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:        MUSA0 model buffer size =  4168.09 MiB
load_tensors:   CPU_Mapped model buffer size =   292.36 MiB
..................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  MUSA_Host  output buffer size =     0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
init:      MUSA0 KV buffer size =   224.00 MiB
llama_context: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context:      MUSA0 compute buffer size =   304.00 MiB
llama_context:  MUSA_Host compute buffer size =    15.01 MiB
llama_context: graph nodes  = 1042
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 597688009
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> Hi there
UINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINT
> 
  1. with -fa, everything works fine
  2. after manually reverting 2d77d88 on master (e121edc), DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf generates tokens correctly with or without -fa; other models (nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf, qwen3_8b_q4_k_m.gguf) do not seem to have this issue; with KV cache offloading disabled via --no-kv-offload, the issue is also gone. I also noticed a prompt-processing performance drop when the KV cache is offloaded and flash attention is disabled (the default). A sketch of the commands used is below.
  3. I also tried the CPU backend and did not see this issue
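
For reference, a rough sketch of the build and run commands behind the observations above. Passing -DGGML_MUSA=ON together with -DGGML_CUDA_NO_VMM=ON should be equivalent to the CMakeLists.txt edit in the diff (which only forces the same define), but the exact configure flags are an assumption, not copied verbatim from my session:

# Sketch only: configure the MUSA backend with VMM disabled via the option
# checked in ggml/src/ggml-musa/CMakeLists.txt, instead of editing that file.
cmake -B build -DGGML_MUSA=ON -DGGML_CUDA_NO_VMM=ON
cmake --build build --config Release -j

# Run variations used to narrow the issue down:
./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 999                  # garbled output
./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 999 -fa              # works
./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 999 --no-kv-offload  # works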

First Bad Commit

2d77d88
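
For completeness, a minimal sketch of the revert check from observation 2 above (commit hashes as quoted in this report; the exact workflow may have differed):

# Sketch: revert the suspect commit on top of master (e121edc), rebuild,
# and re-run the DeepSeek-R1-Distill-Qwen-7B command above to compare.
git checkout e121edc
git revert --no-edit 2d77d88e70d017cd82c3f1a4517e3102e2028ac4
cmake --build build --config Release -j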

Relevant log output

root@f7cd9f1a2456:/ws# ./build/bin/llama-cli -m /models/nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: no
build: 5488 (e121edc43) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15752 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 292 tensors from /models/nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.1 Nemotron Nano 8B v1
llama_model_loader: - kv   3:                            general.version str              = v1
llama_model_loader: - kv   4:                       general.organization str              = Nvidia
llama_model_loader: - kv   5:                           general.finetune str              = 42f62a403ee352e019834442673256e3fe3de275
llama_model_loader: - kv   6:                           general.basename str              = Llama-3.1-Nemotron-Nano
llama_model_loader: - kv   7:                         general.size_label str              = 8B
llama_model_loader: - kv   8:                            general.license str              = other
llama_model_loader: - kv   9:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv  10:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv  11:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          llama.block_count u32              = 32
llama_model_loader: - kv  14:                       llama.context_length u32              = 131072
llama_model_loader: - kv  15:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  16:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  17:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  18:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  20:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  22:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  23:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  24:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if messages[0]['role'] == 'system...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.58 GiB (4.89 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Llama 3.1 Nemotron Nano 8B v1
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        MUSA0 model buffer size =  4403.49 MiB
load_tensors:   CPU_Mapped model buffer size =   281.81 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  MUSA_Host  output buffer size =     0.49 MiB
llama_kv_cache_unified:      MUSA0 KV buffer size =   512.00 MiB
llama_kv_cache_unified: size =  512.00 MiB (  4096 cells,  32 layers,  1 seqs), K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_context:      MUSA0 compute buffer size =   296.00 MiB
llama_context:  MUSA_Host compute buffer size =    16.01 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 3669391543
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> Hi
Hello! It's nice to meet you. How can I assist you today?

> 
llama_perf_sampler_print:    sampling time =       1.48 ms /    27 runs   (    0.05 ms per token, 18218.62 tokens per second)
llama_perf_context_print:        load time =    2201.14 ms
llama_perf_context_print: prompt eval time =    4094.29 ms /    11 tokens (  372.21 ms per token,     2.69 tokens per second)
llama_perf_context_print:        eval time =    1359.07 ms /    16 runs   (   84.94 ms per token,    11.77 tokens per second)
llama_perf_context_print:       total time =    8089.12 ms /    27 tokens
Interrupted by user
root@f7cd9f1a2456:/ws# ./build/bin/llama-cli -m /models/nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf -ngl 999 -fa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: no
build: 5488 (e121edc43) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15752 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 292 tensors from /models/nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.1 Nemotron Nano 8B v1
llama_model_loader: - kv   3:                            general.version str              = v1
llama_model_loader: - kv   4:                       general.organization str              = Nvidia
llama_model_loader: - kv   5:                           general.finetune str              = 42f62a403ee352e019834442673256e3fe3de275
llama_model_loader: - kv   6:                           general.basename str              = Llama-3.1-Nemotron-Nano
llama_model_loader: - kv   7:                         general.size_label str              = 8B
llama_model_loader: - kv   8:                            general.license str              = other
llama_model_loader: - kv   9:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv  10:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv  11:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          llama.block_count u32              = 32
llama_model_loader: - kv  14:                       llama.context_length u32              = 131072
llama_model_loader: - kv  15:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  16:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  17:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  18:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  20:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  22:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  23:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  24:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if messages[0]['role'] == 'system...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.58 GiB (4.89 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Llama 3.1 Nemotron Nano 8B v1
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        MUSA0 model buffer size =  4403.49 MiB
load_tensors:   CPU_Mapped model buffer size =   281.81 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  MUSA_Host  output buffer size =     0.49 MiB
llama_kv_cache_unified:      MUSA0 KV buffer size =   512.00 MiB
llama_kv_cache_unified: size =  512.00 MiB (  4096 cells,  32 layers,  1 seqs), K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_context:      MUSA0 compute buffer size =   266.55 MiB
llama_context:  MUSA_Host compute buffer size =    36.01 MiB
llama_context: graph nodes  = 1031
llama_context: graph splits = 66
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 401977407
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> Hi
Hello! How can I assist you today?

> 
llama_perf_sampler_print:    sampling time =       0.90 ms /    20 runs   (    0.05 ms per token, 22148.39 tokens per second)
llama_perf_context_print:        load time =     944.39 ms
llama_perf_context_print: prompt eval time =     781.78 ms /    11 tokens (   71.07 ms per token,    14.07 tokens per second)
llama_perf_context_print:        eval time =     810.46 ms /     9 runs   (   90.05 ms per token,    11.10 tokens per second)
llama_perf_context_print:       total time =   10864.23 ms /    20 tokens
Interrupted by user
root@f7cd9f1a2456:/ws#
yeahdongcn (Collaborator, Author) commented:

Hi @ggerganov, could you please take a look and let me know if you have any insights or suggestions for debugging this? Thanks.

ggerganov (Member) commented:

On latest master, I am not able to reproduce the problem on my RTX 2060:

cmake -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON ..

Downloaded the model from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main

Command:

./bin/llama-cli -m DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf  -ngl 99 -lv 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: no
0.00.000.541 I build: 5490 (fef693dc6) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
0.00.000.617 I main: llama backend init
0.00.000.622 I main: load the model and apply lora adapter, if any
0.00.130.784 I llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 2060 SUPER) - 7680 MiB free
0.00.154.075 I llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
0.00.154.086 I llama_model_loader: - tensor split  0:                    output.weight q6_K     [  3584, 152064,     1,     1 ]   426,36 MiB
0.00.154.088 I llama_model_loader: - tensor split  0:               output_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.089 I llama_model_loader: - tensor split  0:                token_embd.weight q4_K     [  3584, 152064,     1,     1 ]   292,36 MiB
0.00.154.090 I llama_model_loader: - tensor split  0:                blk.0.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.091 I llama_model_loader: - tensor split  0:              blk.0.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.091 I llama_model_loader: - tensor split  0:           blk.0.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.092 I llama_model_loader: - tensor split  0:         blk.0.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.092 I llama_model_loader: - tensor split  0:                blk.0.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.093 I llama_model_loader: - tensor split  0:              blk.0.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.093 I llama_model_loader: - tensor split  0:                blk.0.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.094 I llama_model_loader: - tensor split  0:              blk.0.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.094 I llama_model_loader: - tensor split  0:            blk.0.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.095 I llama_model_loader: - tensor split  0:            blk.0.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.095 I llama_model_loader: - tensor split  0:            blk.0.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.096 I llama_model_loader: - tensor split  0:              blk.0.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.096 I llama_model_loader: - tensor split  0:                blk.1.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.097 I llama_model_loader: - tensor split  0:              blk.1.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.097 I llama_model_loader: - tensor split  0:           blk.1.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.098 I llama_model_loader: - tensor split  0:         blk.1.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.098 I llama_model_loader: - tensor split  0:                blk.1.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.099 I llama_model_loader: - tensor split  0:              blk.1.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.100 I llama_model_loader: - tensor split  0:                blk.1.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.100 I llama_model_loader: - tensor split  0:              blk.1.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.101 I llama_model_loader: - tensor split  0:            blk.1.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.101 I llama_model_loader: - tensor split  0:            blk.1.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.102 I llama_model_loader: - tensor split  0:            blk.1.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.102 I llama_model_loader: - tensor split  0:              blk.1.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.103 I llama_model_loader: - tensor split  0:                blk.2.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.103 I llama_model_loader: - tensor split  0:              blk.2.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.103 I llama_model_loader: - tensor split  0:           blk.2.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.104 I llama_model_loader: - tensor split  0:         blk.2.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.104 I llama_model_loader: - tensor split  0:                blk.2.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.105 I llama_model_loader: - tensor split  0:              blk.2.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.105 I llama_model_loader: - tensor split  0:                blk.2.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.106 I llama_model_loader: - tensor split  0:              blk.2.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.106 I llama_model_loader: - tensor split  0:            blk.2.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.107 I llama_model_loader: - tensor split  0:            blk.2.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.107 I llama_model_loader: - tensor split  0:            blk.2.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.108 I llama_model_loader: - tensor split  0:              blk.2.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.109 I llama_model_loader: - tensor split  0:                blk.3.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.110 I llama_model_loader: - tensor split  0:              blk.3.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.111 I llama_model_loader: - tensor split  0:           blk.3.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.111 I llama_model_loader: - tensor split  0:         blk.3.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.111 I llama_model_loader: - tensor split  0:                blk.3.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.112 I llama_model_loader: - tensor split  0:              blk.3.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.112 I llama_model_loader: - tensor split  0:                blk.3.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.113 I llama_model_loader: - tensor split  0:              blk.3.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.113 I llama_model_loader: - tensor split  0:            blk.3.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.114 I llama_model_loader: - tensor split  0:            blk.3.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.114 I llama_model_loader: - tensor split  0:            blk.3.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.115 I llama_model_loader: - tensor split  0:              blk.3.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.115 I llama_model_loader: - tensor split  0:                blk.4.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.116 I llama_model_loader: - tensor split  0:              blk.4.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.116 I llama_model_loader: - tensor split  0:           blk.4.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.117 I llama_model_loader: - tensor split  0:         blk.4.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.117 I llama_model_loader: - tensor split  0:                blk.4.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.118 I llama_model_loader: - tensor split  0:              blk.4.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.118 I llama_model_loader: - tensor split  0:                blk.4.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.119 I llama_model_loader: - tensor split  0:              blk.4.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.119 I llama_model_loader: - tensor split  0:            blk.4.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.120 I llama_model_loader: - tensor split  0:            blk.4.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.120 I llama_model_loader: - tensor split  0:            blk.4.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.121 I llama_model_loader: - tensor split  0:              blk.4.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.121 I llama_model_loader: - tensor split  0:                blk.5.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.122 I llama_model_loader: - tensor split  0:              blk.5.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.122 I llama_model_loader: - tensor split  0:           blk.5.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.123 I llama_model_loader: - tensor split  0:         blk.5.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.123 I llama_model_loader: - tensor split  0:                blk.5.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.123 I llama_model_loader: - tensor split  0:              blk.5.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.124 I llama_model_loader: - tensor split  0:                blk.5.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.124 I llama_model_loader: - tensor split  0:              blk.5.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.125 I llama_model_loader: - tensor split  0:            blk.5.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.126 I llama_model_loader: - tensor split  0:            blk.5.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.126 I llama_model_loader: - tensor split  0:            blk.5.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.127 I llama_model_loader: - tensor split  0:              blk.5.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.127 I llama_model_loader: - tensor split  0:                blk.6.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.128 I llama_model_loader: - tensor split  0:              blk.6.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.128 I llama_model_loader: - tensor split  0:           blk.6.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.128 I llama_model_loader: - tensor split  0:         blk.6.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.129 I llama_model_loader: - tensor split  0:                blk.6.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.129 I llama_model_loader: - tensor split  0:              blk.6.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.130 I llama_model_loader: - tensor split  0:                blk.6.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.130 I llama_model_loader: - tensor split  0:              blk.6.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.131 I llama_model_loader: - tensor split  0:            blk.6.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.131 I llama_model_loader: - tensor split  0:            blk.6.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.132 I llama_model_loader: - tensor split  0:            blk.6.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.132 I llama_model_loader: - tensor split  0:              blk.6.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.133 I llama_model_loader: - tensor split  0:                blk.7.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.134 I llama_model_loader: - tensor split  0:              blk.7.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.135 I llama_model_loader: - tensor split  0:           blk.7.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.135 I llama_model_loader: - tensor split  0:         blk.7.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.136 I llama_model_loader: - tensor split  0:                blk.7.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.136 I llama_model_loader: - tensor split  0:              blk.7.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.137 I llama_model_loader: - tensor split  0:                blk.7.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.137 I llama_model_loader: - tensor split  0:              blk.7.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.138 I llama_model_loader: - tensor split  0:            blk.7.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.138 I llama_model_loader: - tensor split  0:            blk.7.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.139 I llama_model_loader: - tensor split  0:            blk.7.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.139 I llama_model_loader: - tensor split  0:              blk.7.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.140 I llama_model_loader: - tensor split  0:                blk.8.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.140 I llama_model_loader: - tensor split  0:              blk.8.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.141 I llama_model_loader: - tensor split  0:           blk.8.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.141 I llama_model_loader: - tensor split  0:         blk.8.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.142 I llama_model_loader: - tensor split  0:                blk.8.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.142 I llama_model_loader: - tensor split  0:              blk.8.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.143 I llama_model_loader: - tensor split  0:                blk.8.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.143 I llama_model_loader: - tensor split  0:              blk.8.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.144 I llama_model_loader: - tensor split  0:            blk.8.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.144 I llama_model_loader: - tensor split  0:            blk.8.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.145 I llama_model_loader: - tensor split  0:            blk.8.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.145 I llama_model_loader: - tensor split  0:              blk.8.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.146 I llama_model_loader: - tensor split  0:                blk.9.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.146 I llama_model_loader: - tensor split  0:              blk.9.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.147 I llama_model_loader: - tensor split  0:           blk.9.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.147 I llama_model_loader: - tensor split  0:         blk.9.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.147 I llama_model_loader: - tensor split  0:                blk.9.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.148 I llama_model_loader: - tensor split  0:              blk.9.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.148 I llama_model_loader: - tensor split  0:                blk.9.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.149 I llama_model_loader: - tensor split  0:              blk.9.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.149 I llama_model_loader: - tensor split  0:            blk.9.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.150 I llama_model_loader: - tensor split  0:            blk.9.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.150 I llama_model_loader: - tensor split  0:            blk.9.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.151 I llama_model_loader: - tensor split  0:              blk.9.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.151 I llama_model_loader: - tensor split  0:               blk.10.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.152 I llama_model_loader: - tensor split  0:             blk.10.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.152 I llama_model_loader: - tensor split  0:          blk.10.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.153 I llama_model_loader: - tensor split  0:        blk.10.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.153 I llama_model_loader: - tensor split  0:               blk.10.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.154 I llama_model_loader: - tensor split  0:             blk.10.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.154 I llama_model_loader: - tensor split  0:               blk.10.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.155 I llama_model_loader: - tensor split  0:             blk.10.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.155 I llama_model_loader: - tensor split  0:           blk.10.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.156 I llama_model_loader: - tensor split  0:           blk.10.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.156 I llama_model_loader: - tensor split  0:           blk.10.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.157 I llama_model_loader: - tensor split  0:             blk.10.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.157 I llama_model_loader: - tensor split  0:               blk.11.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.158 I llama_model_loader: - tensor split  0:             blk.11.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.158 I llama_model_loader: - tensor split  0:          blk.11.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.159 I llama_model_loader: - tensor split  0:        blk.11.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.160 I llama_model_loader: - tensor split  0:               blk.11.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.161 I llama_model_loader: - tensor split  0:             blk.11.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.161 I llama_model_loader: - tensor split  0:               blk.11.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.162 I llama_model_loader: - tensor split  0:             blk.11.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.162 I llama_model_loader: - tensor split  0:           blk.11.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.163 I llama_model_loader: - tensor split  0:           blk.11.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.164 I llama_model_loader: - tensor split  0:           blk.11.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.164 I llama_model_loader: - tensor split  0:             blk.11.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.165 I llama_model_loader: - tensor split  0:               blk.12.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.165 I llama_model_loader: - tensor split  0:             blk.12.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.166 I llama_model_loader: - tensor split  0:          blk.12.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.166 I llama_model_loader: - tensor split  0:        blk.12.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.167 I llama_model_loader: - tensor split  0:               blk.12.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.167 I llama_model_loader: - tensor split  0:             blk.12.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.167 I llama_model_loader: - tensor split  0:               blk.12.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.168 I llama_model_loader: - tensor split  0:             blk.12.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.168 I llama_model_loader: - tensor split  0:           blk.12.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.169 I llama_model_loader: - tensor split  0:           blk.12.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.169 I llama_model_loader: - tensor split  0:           blk.12.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.170 I llama_model_loader: - tensor split  0:             blk.12.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.170 I llama_model_loader: - tensor split  0:               blk.13.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.171 I llama_model_loader: - tensor split  0:             blk.13.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.171 I llama_model_loader: - tensor split  0:          blk.13.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.172 I llama_model_loader: - tensor split  0:        blk.13.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.172 I llama_model_loader: - tensor split  0:               blk.13.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.173 I llama_model_loader: - tensor split  0:             blk.13.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.173 I llama_model_loader: - tensor split  0:               blk.13.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.174 I llama_model_loader: - tensor split  0:             blk.13.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.174 I llama_model_loader: - tensor split  0:           blk.13.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.175 I llama_model_loader: - tensor split  0:           blk.13.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.175 I llama_model_loader: - tensor split  0:           blk.13.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.176 I llama_model_loader: - tensor split  0:             blk.13.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.176 I llama_model_loader: - tensor split  0:               blk.14.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.176 I llama_model_loader: - tensor split  0:             blk.14.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.177 I llama_model_loader: - tensor split  0:          blk.14.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.178 I llama_model_loader: - tensor split  0:        blk.14.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.178 I llama_model_loader: - tensor split  0:               blk.14.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.179 I llama_model_loader: - tensor split  0:             blk.14.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.179 I llama_model_loader: - tensor split  0:               blk.14.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.180 I llama_model_loader: - tensor split  0:             blk.14.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.180 I llama_model_loader: - tensor split  0:           blk.14.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.181 I llama_model_loader: - tensor split  0:           blk.14.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.181 I llama_model_loader: - tensor split  0:           blk.14.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.182 I llama_model_loader: - tensor split  0:             blk.14.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.182 I llama_model_loader: - tensor split  0:               blk.15.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.183 I llama_model_loader: - tensor split  0:             blk.15.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.183 I llama_model_loader: - tensor split  0:          blk.15.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.184 I llama_model_loader: - tensor split  0:        blk.15.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.184 I llama_model_loader: - tensor split  0:               blk.15.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.184 I llama_model_loader: - tensor split  0:             blk.15.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.185 I llama_model_loader: - tensor split  0:               blk.15.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.185 I llama_model_loader: - tensor split  0:             blk.15.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.186 I llama_model_loader: - tensor split  0:           blk.15.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.186 I llama_model_loader: - tensor split  0:           blk.15.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.187 I llama_model_loader: - tensor split  0:           blk.15.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.187 I llama_model_loader: - tensor split  0:             blk.15.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.188 I llama_model_loader: - tensor split  0:               blk.16.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.188 I llama_model_loader: - tensor split  0:             blk.16.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.189 I llama_model_loader: - tensor split  0:          blk.16.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.189 I llama_model_loader: - tensor split  0:        blk.16.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.190 I llama_model_loader: - tensor split  0:               blk.16.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.190 I llama_model_loader: - tensor split  0:             blk.16.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.191 I llama_model_loader: - tensor split  0:               blk.16.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.191 I llama_model_loader: - tensor split  0:             blk.16.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.192 I llama_model_loader: - tensor split  0:           blk.16.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.192 I llama_model_loader: - tensor split  0:           blk.16.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.193 I llama_model_loader: - tensor split  0:           blk.16.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.193 I llama_model_loader: - tensor split  0:             blk.16.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.194 I llama_model_loader: - tensor split  0:               blk.17.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.194 I llama_model_loader: - tensor split  0:             blk.17.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.195 I llama_model_loader: - tensor split  0:          blk.17.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.195 I llama_model_loader: - tensor split  0:        blk.17.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.195 I llama_model_loader: - tensor split  0:               blk.17.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.196 I llama_model_loader: - tensor split  0:             blk.17.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.197 I llama_model_loader: - tensor split  0:               blk.17.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.198 I llama_model_loader: - tensor split  0:             blk.17.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.199 I llama_model_loader: - tensor split  0:           blk.17.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.199 I llama_model_loader: - tensor split  0:           blk.17.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.200 I llama_model_loader: - tensor split  0:           blk.17.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.200 I llama_model_loader: - tensor split  0:             blk.17.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.201 I llama_model_loader: - tensor split  0:               blk.18.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.201 I llama_model_loader: - tensor split  0:             blk.18.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.202 I llama_model_loader: - tensor split  0:          blk.18.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.202 I llama_model_loader: - tensor split  0:        blk.18.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.202 I llama_model_loader: - tensor split  0:               blk.18.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.203 I llama_model_loader: - tensor split  0:             blk.18.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.204 I llama_model_loader: - tensor split  0:               blk.18.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.204 I llama_model_loader: - tensor split  0:             blk.18.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.204 I llama_model_loader: - tensor split  0:           blk.18.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.205 I llama_model_loader: - tensor split  0:           blk.18.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.205 I llama_model_loader: - tensor split  0:           blk.18.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.206 I llama_model_loader: - tensor split  0:             blk.18.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.206 I llama_model_loader: - tensor split  0:               blk.19.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.207 I llama_model_loader: - tensor split  0:             blk.19.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.207 I llama_model_loader: - tensor split  0:          blk.19.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.208 I llama_model_loader: - tensor split  0:        blk.19.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.208 I llama_model_loader: - tensor split  0:               blk.19.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.209 I llama_model_loader: - tensor split  0:             blk.19.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.210 I llama_model_loader: - tensor split  0:               blk.19.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.210 I llama_model_loader: - tensor split  0:             blk.19.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.211 I llama_model_loader: - tensor split  0:           blk.19.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.211 I llama_model_loader: - tensor split  0:           blk.19.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.212 I llama_model_loader: - tensor split  0:           blk.19.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.212 I llama_model_loader: - tensor split  0:             blk.19.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.213 I llama_model_loader: - tensor split  0:               blk.20.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.213 I llama_model_loader: - tensor split  0:             blk.20.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.213 I llama_model_loader: - tensor split  0:          blk.20.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.214 I llama_model_loader: - tensor split  0:        blk.20.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.214 I llama_model_loader: - tensor split  0:               blk.20.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.215 I llama_model_loader: - tensor split  0:             blk.20.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.215 I llama_model_loader: - tensor split  0:               blk.20.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.216 I llama_model_loader: - tensor split  0:             blk.20.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.217 I llama_model_loader: - tensor split  0:           blk.20.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.217 I llama_model_loader: - tensor split  0:           blk.20.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.218 I llama_model_loader: - tensor split  0:           blk.20.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.218 I llama_model_loader: - tensor split  0:             blk.20.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.219 I llama_model_loader: - tensor split  0:               blk.21.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.220 I llama_model_loader: - tensor split  0:             blk.21.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.220 I llama_model_loader: - tensor split  0:          blk.21.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.221 I llama_model_loader: - tensor split  0:        blk.21.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.222 I llama_model_loader: - tensor split  0:               blk.21.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.222 I llama_model_loader: - tensor split  0:             blk.21.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.223 I llama_model_loader: - tensor split  0:               blk.21.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.224 I llama_model_loader: - tensor split  0:             blk.21.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.233 I llama_model_loader: - tensor split  0:           blk.21.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.234 I llama_model_loader: - tensor split  0:           blk.21.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.235 I llama_model_loader: - tensor split  0:           blk.21.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.237 I llama_model_loader: - tensor split  0:             blk.21.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.238 I llama_model_loader: - tensor split  0:               blk.22.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.238 I llama_model_loader: - tensor split  0:             blk.22.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.251 I llama_model_loader: - tensor split  0:          blk.22.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.252 I llama_model_loader: - tensor split  0:        blk.22.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.252 I llama_model_loader: - tensor split  0:               blk.22.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.253 I llama_model_loader: - tensor split  0:             blk.22.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.253 I llama_model_loader: - tensor split  0:               blk.22.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.254 I llama_model_loader: - tensor split  0:             blk.22.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.254 I llama_model_loader: - tensor split  0:           blk.22.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.255 I llama_model_loader: - tensor split  0:           blk.22.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.256 I llama_model_loader: - tensor split  0:           blk.22.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.256 I llama_model_loader: - tensor split  0:             blk.22.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.257 I llama_model_loader: - tensor split  0:               blk.23.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.257 I llama_model_loader: - tensor split  0:             blk.23.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.257 I llama_model_loader: - tensor split  0:          blk.23.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.258 I llama_model_loader: - tensor split  0:        blk.23.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.258 I llama_model_loader: - tensor split  0:               blk.23.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.259 I llama_model_loader: - tensor split  0:             blk.23.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.259 I llama_model_loader: - tensor split  0:               blk.23.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.260 I llama_model_loader: - tensor split  0:             blk.23.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.260 I llama_model_loader: - tensor split  0:           blk.23.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.261 I llama_model_loader: - tensor split  0:           blk.23.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.261 I llama_model_loader: - tensor split  0:           blk.23.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.262 I llama_model_loader: - tensor split  0:             blk.23.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.262 I llama_model_loader: - tensor split  0:               blk.24.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.263 I llama_model_loader: - tensor split  0:             blk.24.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.263 I llama_model_loader: - tensor split  0:          blk.24.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.264 I llama_model_loader: - tensor split  0:        blk.24.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.264 I llama_model_loader: - tensor split  0:               blk.24.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.265 I llama_model_loader: - tensor split  0:             blk.24.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.265 I llama_model_loader: - tensor split  0:               blk.24.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.266 I llama_model_loader: - tensor split  0:             blk.24.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.266 I llama_model_loader: - tensor split  0:           blk.24.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.267 I llama_model_loader: - tensor split  0:           blk.24.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.267 I llama_model_loader: - tensor split  0:           blk.24.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.268 I llama_model_loader: - tensor split  0:             blk.24.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.268 I llama_model_loader: - tensor split  0:               blk.25.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.269 I llama_model_loader: - tensor split  0:             blk.25.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.269 I llama_model_loader: - tensor split  0:          blk.25.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.270 I llama_model_loader: - tensor split  0:        blk.25.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.270 I llama_model_loader: - tensor split  0:               blk.25.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.271 I llama_model_loader: - tensor split  0:             blk.25.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.271 I llama_model_loader: - tensor split  0:               blk.25.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.272 I llama_model_loader: - tensor split  0:             blk.25.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.272 I llama_model_loader: - tensor split  0:           blk.25.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.273 I llama_model_loader: - tensor split  0:           blk.25.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.273 I llama_model_loader: - tensor split  0:           blk.25.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.274 I llama_model_loader: - tensor split  0:             blk.25.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.274 I llama_model_loader: - tensor split  0:               blk.26.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.275 I llama_model_loader: - tensor split  0:             blk.26.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.275 I llama_model_loader: - tensor split  0:          blk.26.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.276 I llama_model_loader: - tensor split  0:        blk.26.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.276 I llama_model_loader: - tensor split  0:               blk.26.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.277 I llama_model_loader: - tensor split  0:             blk.26.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.277 I llama_model_loader: - tensor split  0:               blk.26.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.278 I llama_model_loader: - tensor split  0:             blk.26.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.278 I llama_model_loader: - tensor split  0:           blk.26.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.279 I llama_model_loader: - tensor split  0:           blk.26.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.279 I llama_model_loader: - tensor split  0:           blk.26.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.279 I llama_model_loader: - tensor split  0:             blk.26.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.280 I llama_model_loader: - tensor split  0:               blk.27.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.280 I llama_model_loader: - tensor split  0:             blk.27.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.281 I llama_model_loader: - tensor split  0:          blk.27.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.281 I llama_model_loader: - tensor split  0:        blk.27.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.282 I llama_model_loader: - tensor split  0:               blk.27.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.282 I llama_model_loader: - tensor split  0:             blk.27.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.283 I llama_model_loader: - tensor split  0:               blk.27.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.283 I llama_model_loader: - tensor split  0:             blk.27.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.284 I llama_model_loader: - tensor split  0:           blk.27.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.284 I llama_model_loader: - tensor split  0:           blk.27.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.285 I llama_model_loader: - tensor split  0:           blk.27.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.285 I llama_model_loader: - tensor split  0:             blk.27.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.289 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.154.297 I llama_model_loader: - kv   0:                       general.architecture str              = qwen2
0.00.154.298 I llama_model_loader: - kv   1:                               general.type str              = model
0.00.154.298 I llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
0.00.154.299 I llama_model_loader: - kv   3:                       general.organization str              = Deepseek Ai
0.00.154.299 I llama_model_loader: - kv   4:                           general.basename str              = DeepSeek-R1-Distill-Qwen
0.00.154.299 I llama_model_loader: - kv   5:                         general.size_label str              = 7B
0.00.154.302 I llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
0.00.154.302 I llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
0.00.154.302 I llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
0.00.154.303 I llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
0.00.154.303 I llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
0.00.154.303 I llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
0.00.154.306 I llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 10000,000000
0.00.154.306 I llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0,000001
0.00.154.307 I llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
0.00.154.307 I llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
0.00.166.302 I llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
0.00.169.960 I llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.00.181.266 I llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
0.00.181.269 I llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
0.00.181.269 I llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
0.00.181.269 I llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151654
0.00.181.270 I llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
0.00.181.270 I llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
0.00.181.271 I llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
0.00.181.272 I llama_model_loader: - kv  25:               general.quantization_version u32              = 2
0.00.181.272 I llama_model_loader: - kv  26:                          general.file_type u32              = 15
0.00.181.272 I llama_model_loader: - type  f32:  141 tensors
0.00.181.273 I llama_model_loader: - type q4_K:  169 tensors
0.00.181.273 I llama_model_loader: - type q6_K:   29 tensors
0.00.181.275 I print_info: file format = GGUF V3 (latest)
0.00.181.275 I print_info: file type   = Q4_K - Medium
0.00.181.276 I print_info: file size   = 4,36 GiB (4,91 BPW) 
0.00.225.519 D init_tokenizer: initializing tokenizer for type 2
0.00.233.934 D load: control token: 151660 '<|fim_middle|>' is not marked as EOG
0.00.233.936 D load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
0.00.233.936 D load: control token: 151653 '<|vision_end|>' is not marked as EOG
0.00.233.937 D load: control token: 151645 '<|Assistant|>' is not marked as EOG
0.00.234.243 D load: control token: 151644 '<|User|>' is not marked as EOG
0.00.234.349 D load: control token: 151655 '<|image_pad|>' is not marked as EOG
0.00.234.388 D load: control token: 151651 '<|quad_end|>' is not marked as EOG
0.00.235.230 D load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
0.00.235.800 D load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
0.00.235.808 D load: control token: 151652 '<|vision_start|>' is not marked as EOG
0.00.236.039 D load: control token: 151647 '<|EOT|>' is not marked as EOG
0.00.236.739 D load: control token: 151654 '<|vision_pad|>' is not marked as EOG
0.00.237.456 D load: control token: 151656 '<|video_pad|>' is not marked as EOG
0.00.237.772 D load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
0.00.237.948 D load: control token: 151650 '<|quad_start|>' is not marked as EOG
0.00.238.366 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
0.00.238.449 I load: special tokens cache size = 22
0.00.273.622 I load: token to piece cache size = 0,9310 MB
0.00.273.632 I print_info: arch             = qwen2
0.00.273.633 I print_info: vocab_only       = 0
0.00.273.633 I print_info: n_ctx_train      = 131072
0.00.273.633 I print_info: n_embd           = 3584
0.00.273.634 I print_info: n_layer          = 28
0.00.273.642 I print_info: n_head           = 28
0.00.273.644 I print_info: n_head_kv        = 4
0.00.273.644 I print_info: n_rot            = 128
0.00.273.644 I print_info: n_swa            = 0
0.00.273.646 I print_info: is_swa_any       = 0
0.00.273.646 I print_info: n_embd_head_k    = 128
0.00.273.646 I print_info: n_embd_head_v    = 128
0.00.273.648 I print_info: n_gqa            = 7
0.00.273.649 I print_info: n_embd_k_gqa     = 512
0.00.273.650 I print_info: n_embd_v_gqa     = 512
0.00.273.651 I print_info: f_norm_eps       = 0,0e+00
0.00.273.652 I print_info: f_norm_rms_eps   = 1,0e-06
0.00.273.652 I print_info: f_clamp_kqv      = 0,0e+00
0.00.273.652 I print_info: f_max_alibi_bias = 0,0e+00
0.00.273.654 I print_info: f_logit_scale    = 0,0e+00
0.00.273.654 I print_info: f_attn_scale     = 0,0e+00
0.00.273.656 I print_info: n_ff             = 18944
0.00.273.656 I print_info: n_expert         = 0
0.00.273.656 I print_info: n_expert_used    = 0
0.00.273.656 I print_info: causal attn      = 1
0.00.273.656 I print_info: pooling type     = -1
0.00.273.656 I print_info: rope type        = 2
0.00.273.657 I print_info: rope scaling     = linear
0.00.273.657 I print_info: freq_base_train  = 10000,0
0.00.273.658 I print_info: freq_scale_train = 1
0.00.273.658 I print_info: n_ctx_orig_yarn  = 131072
0.00.273.658 I print_info: rope_finetuned   = unknown
0.00.273.658 I print_info: ssm_d_conv       = 0
0.00.273.658 I print_info: ssm_d_inner      = 0
0.00.273.658 I print_info: ssm_d_state      = 0
0.00.273.659 I print_info: ssm_dt_rank      = 0
0.00.273.659 I print_info: ssm_dt_b_c_rms   = 0
0.00.273.659 I print_info: model type       = 7B
0.00.273.660 I print_info: model params     = 7,62 B
0.00.273.660 I print_info: general.name     = DeepSeek R1 Distill Qwen 7B
0.00.273.663 I print_info: vocab type       = BPE
0.00.273.664 I print_info: n_vocab          = 152064
0.00.273.664 I print_info: n_merges         = 151387
0.00.273.664 I print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
0.00.273.664 I print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
0.00.273.665 I print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
0.00.273.665 I print_info: PAD token        = 151654 '<|vision_pad|>'
0.00.273.665 I print_info: LF token         = 198 'Ċ'
0.00.273.665 I print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
0.00.273.666 I print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
0.00.273.666 I print_info: FIM MID token    = 151660 '<|fim_middle|>'
0.00.273.666 I print_info: FIM PAD token    = 151662 '<|fim_pad|>'
0.00.273.666 I print_info: FIM REP token    = 151663 '<|repo_name|>'
0.00.273.666 I print_info: FIM SEP token    = 151664 '<|file_sep|>'
0.00.273.666 I print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
0.00.273.667 I print_info: EOG token        = 151662 '<|fim_pad|>'
0.00.273.667 I print_info: EOG token        = 151663 '<|repo_name|>'
0.00.273.667 I print_info: EOG token        = 151664 '<|file_sep|>'
0.00.273.667 I print_info: max token length = 256
0.00.273.674 I load_tensors: loading model tensors, this can take a while... (mmap = true)
0.00.273.791 D load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
0.00.273.791 D load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
0.00.273.792 D load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
0.00.273.793 D load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
0.00.273.793 D load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
0.00.273.794 D load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
0.00.273.794 D load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
0.00.273.794 D load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
0.00.273.797 D load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
0.00.273.797 D load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
0.00.273.797 D load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
0.00.273.797 D load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
0.00.273.799 D load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
0.00.274.863 D load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
0.00.548.882 I load_tensors: offloading 28 repeating layers to GPU
0.00.548.886 I load_tensors: offloading output layer to GPU
0.00.548.886 I load_tensors: offloaded 29/29 layers to GPU
0.00.548.892 I load_tensors:        CUDA0 model buffer size =  4168,09 MiB
0.00.548.893 I load_tensors:   CPU_Mapped model buffer size =   292,36 MiB
..................................................................................
0.01.108.985 I llama_context: constructing llama_context
0.01.108.989 I llama_context: n_seq_max     = 1
0.01.108.989 I llama_context: n_ctx         = 4096
0.01.108.990 I llama_context: n_ctx_per_seq = 4096
0.01.108.990 I llama_context: n_batch       = 2048
0.01.108.990 I llama_context: n_ubatch      = 512
0.01.108.991 I llama_context: causal_attn   = 1
0.01.108.991 I llama_context: flash_attn    = 0
0.01.108.995 I llama_context: freq_base     = 10000,0
0.01.108.995 I llama_context: freq_scale    = 1
0.01.108.996 W llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
0.01.109.034 D set_abort_callback: call
0.01.110.304 I llama_context:  CUDA_Host  output buffer size =     0,58 MiB
0.01.110.310 D create_memory: n_ctx = 4096 (padded)
0.01.110.323 D llama_kv_cache_unified: layer   0: dev = CUDA0
0.01.110.333 D llama_kv_cache_unified: layer   1: dev = CUDA0
0.01.110.334 D llama_kv_cache_unified: layer   2: dev = CUDA0
0.01.110.334 D llama_kv_cache_unified: layer   3: dev = CUDA0
0.01.110.335 D llama_kv_cache_unified: layer   4: dev = CUDA0
0.01.110.336 D llama_kv_cache_unified: layer   5: dev = CUDA0
0.01.110.337 D llama_kv_cache_unified: layer   6: dev = CUDA0
0.01.110.338 D llama_kv_cache_unified: layer   7: dev = CUDA0
0.01.110.339 D llama_kv_cache_unified: layer   8: dev = CUDA0
0.01.110.339 D llama_kv_cache_unified: layer   9: dev = CUDA0
0.01.110.340 D llama_kv_cache_unified: layer  10: dev = CUDA0
0.01.110.340 D llama_kv_cache_unified: layer  11: dev = CUDA0
0.01.110.341 D llama_kv_cache_unified: layer  12: dev = CUDA0
0.01.110.341 D llama_kv_cache_unified: layer  13: dev = CUDA0
0.01.110.342 D llama_kv_cache_unified: layer  14: dev = CUDA0
0.01.110.343 D llama_kv_cache_unified: layer  15: dev = CUDA0
0.01.110.343 D llama_kv_cache_unified: layer  16: dev = CUDA0
0.01.110.344 D llama_kv_cache_unified: layer  17: dev = CUDA0
0.01.110.344 D llama_kv_cache_unified: layer  18: dev = CUDA0
0.01.110.345 D llama_kv_cache_unified: layer  19: dev = CUDA0
0.01.110.345 D llama_kv_cache_unified: layer  20: dev = CUDA0
0.01.110.346 D llama_kv_cache_unified: layer  21: dev = CUDA0
0.01.110.346 D llama_kv_cache_unified: layer  22: dev = CUDA0
0.01.110.346 D llama_kv_cache_unified: layer  23: dev = CUDA0
0.01.110.347 D llama_kv_cache_unified: layer  24: dev = CUDA0
0.01.110.347 D llama_kv_cache_unified: layer  25: dev = CUDA0
0.01.110.348 D llama_kv_cache_unified: layer  26: dev = CUDA0
0.01.110.348 D llama_kv_cache_unified: layer  27: dev = CUDA0
0.01.110.491 I llama_kv_cache_unified:      CUDA0 KV buffer size =   224,00 MiB
0.01.111.096 I llama_kv_cache_unified: size =  224,00 MiB (  4096 cells,  28 layers,  1 seqs), K (f16):  112,00 MiB, V (f16):  112,00 MiB
0.01.111.098 D llama_context: enumerating backends
0.01.111.104 D llama_context: backend_ptrs.size() = 2
0.01.111.105 D llama_context: max_nodes = 65536
0.01.124.903 D llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
0.01.124.906 D llama_context: reserving graph for n_tokens = 512, n_seqs = 1
0.01.133.128 D llama_context: reserving graph for n_tokens = 1, n_seqs = 1
0.01.133.478 D llama_context: reserving graph for n_tokens = 512, n_seqs = 1
0.01.133.801 I llama_context:      CUDA0 compute buffer size =   304,00 MiB
0.01.133.803 I llama_context:  CUDA_Host compute buffer size =    15,01 MiB
0.01.133.803 I llama_context: graph nodes  = 1098
0.01.133.804 I llama_context: graph splits = 2
0.01.133.806 D clear_adapter_lora: call
0.01.133.808 I common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
0.01.133.808 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.01.133.809 D set_warmup: value = 1
0.01.206.041 D set_warmup: value = 0
0.01.212.762 I main: llama threadpool init, n_threads = 16
0.01.212.770 D attach_threadpool: call
0.01.212.771 I main: chat template is available, enabling conversation mode (disable it with -no-cnv)
0.01.212.959 I main: chat template example:
You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
0.01.212.961 I 
0.01.213.012 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 750 | NO_VMM = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
0.01.213.012 I 
0.01.213.014 D n_ctx: 4096, add_bos: 1
0.01.213.015 D tokenize the prompt
0.01.213.025 D prompt: ""
0.01.213.027 D tokens: [ '<beginofsentence>':151646 ]
0.01.213.028 D recalculate the cached logits (check): embd_inp.size() 1, n_matching_session_tokens 0, embd_inp.size() 1, session_tokens.size() 0
0.01.213.029 I main: interactive mode on.
0.01.213.083 I sampler seed: 3164957287
0.01.213.089 I sampler params: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
0.01.213.092 I sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
0.01.213.093 I generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
0.01.213.093 I 
0.01.213.093 I == Running in interactive mode. ==
0.01.213.093 I  - Press Ctrl+C to interject at any time.
0.01.213.094 I  - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
0.01.213.094 I  - Not using system message. To change it, set a different value via -sys PROMPT
0.01.213.094 I 
0.01.213.096 D embd_inp.size(): 1, n_consumed: 0
0.01.213.100 D waiting for user input
> Hi there
0.03.678.678 D buffer: 'Hi there'
0.03.678.777 D formatted: '<|User|>Hi there<|Assistant|>'
0.03.679.396 D input tokens: [ '<User>':151644, 'Hi':13048, ' there':1052, '<Assistant>':151645 ]
0.03.679.401 D n_remain: -5
0.03.679.416 D eval: [ '<beginofsentence>':151646 ]
0.03.685.685 D n_past = 1
0.03.685.691 D embd_inp.size(): 5, n_consumed: 1
0.03.685.701 D eval: [ '<User>':151644, 'Hi':13048, ' there':1052, '<Assistant>':151645 ]
0.03.705.495 D n_past = 5
0.03.734.162 D n_remain: -6
<think>0.03.734.167 D eval: [ '<think>':151648 ]
0.03.736.514 D n_past = 6
0.03.753.503 D n_remain: -7
0.03.753.507 D eval: [ '':271 ]
0.03.755.861 D n_past = 7
0.03.772.840 D n_remain: -8
</think>0.03.772.844 D eval: [ '</think>':151649 ]
0.03.775.183 D n_past = 8
0.03.791.895 D n_remain: -9
0.03.791.898 D eval: [ '':271 ]
0.03.794.237 D n_past = 9
0.03.810.164 D n_remain: -10
Hello0.03.810.168 D eval: [ 'Hello':9707 ]
0.03.812.501 D n_past = 10
0.03.828.176 D n_remain: -11
!0.03.828.179 D eval: [ '!':0 ]
0.03.830.510 D n_past = 11
0.03.845.830 D n_remain: -12
 How0.03.845.833 D eval: [ ' How':2585 ]
0.03.848.178 D n_past = 12
0.03.863.242 D n_remain: -13
 can0.03.863.246 D eval: [ ' can':646 ]
0.03.865.588 D n_past = 13
0.03.880.445 D n_remain: -14
 I0.03.880.448 D eval: [ ' I':358 ]
0.03.882.780 D n_past = 14
0.03.897.428 D n_remain: -15
 assist0.03.897.432 D eval: [ ' assist':7789 ]
0.03.899.772 D n_past = 15
0.03.914.247 D n_remain: -16
 you0.03.914.251 D eval: [ ' you':498 ]
0.03.916.592 D n_past = 16
0.03.931.012 D n_remain: -17
 today0.03.931.016 D eval: [ ' today':3351 ]
0.03.933.366 D n_past = 17
0.03.947.784 D n_remain: -18
?0.03.947.788 D eval: [ '?':30 ]
0.03.950.125 D n_past = 18
0.03.964.557 D n_remain: -19
 �0.03.964.561 D eval: [ ' ':26525 ]
0.03.966.898 D n_past = 19
0.03.981.322 D n_remain: -20
�0.03.981.326 D eval: [ '':232 ]
0.03.983.664 D n_past = 20
0.03.997.941 D n_remain: -21
0.03.997.944 D found an EOG token
0.03.997.982 D formatted: '<|Assistant|><think>
</think>
Hello! How can I assist you today? 😊<|end▁of▁sentence|>'
0.03.997.983 D waiting for user input
> 
0.05.157.487 I llama_perf_sampler_print:    sampling time =       0,74 ms /    20 runs   (    0,04 ms per token, 27173,91 tokens per second)
0.05.157.492 I llama_perf_context_print:        load time =    1075,25 ms
0.05.157.493 I llama_perf_context_print: prompt eval time =      53,68 ms /     5 tokens (   10,74 ms per token,    93,14 tokens per second)
0.05.157.494 I llama_perf_context_print:        eval time =     261,94 ms /    15 runs   (   17,46 ms per token,    57,27 tokens per second)
0.05.157.494 I llama_perf_context_print:       total time =    3951,45 ms /    20 tokens
Interrupted by user

Can you post your output on latest master and add the -lv 1 flag at the end?
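
(For reference, a minimal invocation with that flag might look like the following; the binary and model paths are taken from the report above, so adjust them to your setup.)

./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 99 -lv 1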

@ggerganov
Member

I think the patch that you are using to disable VMM is not correct. Instead, you should build with cmake -DGGML_CUDA_NO_VMM=ON ... like in my command.
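
(A sketch of that configure/build step, assuming a MUSA build via the GGML_MUSA CMake option and reusing the build directory name that appears later in this thread; treat it as an example rather than the exact command referenced above:)

# configure with VMM disabled through the CMake option instead of patching CMakeLists.txt
cmake -B build-cuda-no-vmm -DGGML_MUSA=ON -DGGML_CUDA_NO_VMM=ON
# build the tools, e.g. llama-cli
cmake --build build-cuda-no-vmm --config Release -j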

@yeahdongcn
Collaborator Author
yeahdongcn commented May 26, 2025

Can you post your output on latest master and add the -lv 1 flag at the end?

Please see the logs (updated with export LLAMA_TRACE=1) below:

root@f7cd9f1a2456:/ws# ./build-cuda-no-vmm/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 99 -lv 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: no
register_backend: registered backend MUSA (1 devices)
register_device: registered device MUSA0 (MTT S80)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i5-12400)
load_backend: failed to find ggml_backend_init in /ws/build-cuda-no-vmm/bin/libggml-musa.so
load_backend: failed to find ggml_backend_init in /ws/build-cuda-no-vmm/bin/libggml-cpu.so
build: 5490 (fef693dc6) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15723 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor split  0:                    output.weight q6_K     [  3584, 152064,     1,     1 ]   426.36 MiB
llama_model_loader: - tensor split  0:               output_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:                token_embd.weight q4_K     [  3584, 152064,     1,     1 ]   292.36 MiB
llama_model_loader: - tensor split  0:                blk.0.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.0.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.0.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.0.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.0.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.0.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.0.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.0.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.0.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.0.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.0.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.0.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.1.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.1.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.1.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.1.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.1.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.1.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.1.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.1.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.1.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.1.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.1.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.1.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.2.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.2.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.2.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.2.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.2.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.2.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.2.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.2.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.2.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.2.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.2.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.2.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.3.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.3.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.3.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.3.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.3.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.3.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.3.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.3.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.3.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.3.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.3.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.3.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.4.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.4.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.4.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.4.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.4.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.4.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.4.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.4.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.4.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.4.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.4.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.4.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.5.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.5.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.5.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.5.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.5.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.5.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.5.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.5.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.5.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.5.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.5.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.5.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.6.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.6.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.6.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.6.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.6.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.6.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.6.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.6.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.6.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.6.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.6.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.6.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.7.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.7.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.7.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.7.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.7.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.7.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.7.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.7.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.7.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.7.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.7.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.7.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.8.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.8.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.8.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.8.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.8.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.8.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.8.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.8.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.8.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.8.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.8.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.8.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.9.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.9.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.9.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.9.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.9.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.9.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.9.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.9.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.9.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.9.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.9.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.9.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.10.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.10.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.10.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.10.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.10.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.10.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.10.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.10.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.10.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.10.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.10.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.10.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.11.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.11.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.11.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.11.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.11.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.11.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.11.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.11.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.11.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.11.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.11.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.11.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.12.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.12.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.12.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.12.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.12.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.12.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.12.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.12.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.12.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.12.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.12.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.12.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.13.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.13.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.13.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.13.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.13.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.13.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.13.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.13.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.13.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.13.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.13.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.13.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.14.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.14.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.14.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.14.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.14.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.14.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.14.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.14.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.14.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.14.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.14.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.14.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.15.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.15.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.15.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.15.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.15.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.15.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.15.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.15.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.15.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.15.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.15.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.15.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.16.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.16.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.16.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.16.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.16.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.16.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.16.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.16.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.16.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.16.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.16.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.16.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.17.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.17.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.17.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.17.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.17.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.17.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.17.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.17.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.17.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.17.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.17.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.17.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.18.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.18.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.18.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.18.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.18.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.18.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.18.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.18.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.18.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.18.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.18.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.18.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.19.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.19.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.19.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.19.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.19.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.19.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.19.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.19.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.19.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.19.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.19.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.19.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.20.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.20.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.20.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.20.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.20.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.20.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.20.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.20.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.20.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.20.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.20.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.20.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.21.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.21.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.21.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.21.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.21.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.21.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.21.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.21.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.21.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.21.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.21.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.21.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.22.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.22.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.22.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.22.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.22.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.22.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.22.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.22.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.22.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.22.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.22.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.22.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.23.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.23.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.23.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.23.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.23.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.23.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.23.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.23.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.23.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.23.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.23.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.23.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.24.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.24.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.24.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.24.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.24.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.24.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.24.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.24.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.24.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.24.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.24.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.24.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.25.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.25.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.25.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.25.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.25.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.25.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.25.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.25.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.25.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.25.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.25.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.25.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.26.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.26.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.26.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.26.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.26.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.26.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.26.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.26.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.26.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.26.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.26.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.26.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.27.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.27.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.27.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.27.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.27.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.27.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.27.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.27.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.27.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.27.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.27.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.27.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                       general.organization str              = Deepseek Ai
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.36 GiB (4.91 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = DeepSeek R1 Distill Qwen 7B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device MUSA0, is_swa = 0
load_tensors: layer   1 assigned to device MUSA0, is_swa = 0
load_tensors: layer   2 assigned to device MUSA0, is_swa = 0
load_tensors: layer   3 assigned to device MUSA0, is_swa = 0
load_tensors: layer   4 assigned to device MUSA0, is_swa = 0
load_tensors: layer   5 assigned to device MUSA0, is_swa = 0
load_tensors: layer   6 assigned to device MUSA0, is_swa = 0
load_tensors: layer   7 assigned to device MUSA0, is_swa = 0
load_tensors: layer   8 assigned to device MUSA0, is_swa = 0
load_tensors: layer   9 assigned to device MUSA0, is_swa = 0
load_tensors: layer  10 assigned to device MUSA0, is_swa = 0
load_tensors: layer  11 assigned to device MUSA0, is_swa = 0
load_tensors: layer  12 assigned to device MUSA0, is_swa = 0
load_tensors: layer  13 assigned to device MUSA0, is_swa = 0
load_tensors: layer  14 assigned to device MUSA0, is_swa = 0
load_tensors: layer  15 assigned to device MUSA0, is_swa = 0
load_tensors: layer  16 assigned to device MUSA0, is_swa = 0
load_tensors: layer  17 assigned to device MUSA0, is_swa = 0
load_tensors: layer  18 assigned to device MUSA0, is_swa = 0
load_tensors: layer  19 assigned to device MUSA0, is_swa = 0
load_tensors: layer  20 assigned to device MUSA0, is_swa = 0
load_tensors: layer  21 assigned to device MUSA0, is_swa = 0
load_tensors: layer  22 assigned to device MUSA0, is_swa = 0
load_tensors: layer  23 assigned to device MUSA0, is_swa = 0
load_tensors: layer  24 assigned to device MUSA0, is_swa = 0
load_tensors: layer  25 assigned to device MUSA0, is_swa = 0
load_tensors: layer  26 assigned to device MUSA0, is_swa = 0
load_tensors: layer  27 assigned to device MUSA0, is_swa = 0
load_tensors: layer  28 assigned to device MUSA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type MUSA_Host, using CPU instead
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   292.36 MiB
load_tensors:        MUSA0 model buffer size =  4168.09 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  MUSA_Host  output buffer size =     0.58 MiB
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: layer   0: dev = MUSA0
llama_kv_cache_unified: layer   1: dev = MUSA0
llama_kv_cache_unified: layer   2: dev = MUSA0
llama_kv_cache_unified: layer   3: dev = MUSA0
llama_kv_cache_unified: layer   4: dev = MUSA0
llama_kv_cache_unified: layer   5: dev = MUSA0
llama_kv_cache_unified: layer   6: dev = MUSA0
llama_kv_cache_unified: layer   7: dev = MUSA0
llama_kv_cache_unified: layer   8: dev = MUSA0
llama_kv_cache_unified: layer   9: dev = MUSA0
llama_kv_cache_unified: layer  10: dev = MUSA0
llama_kv_cache_unified: layer  11: dev = MUSA0
llama_kv_cache_unified: layer  12: dev = MUSA0
llama_kv_cache_unified: layer  13: dev = MUSA0
llama_kv_cache_unified: layer  14: dev = MUSA0
llama_kv_cache_unified: layer  15: dev = MUSA0
llama_kv_cache_unified: layer  16: dev = MUSA0
llama_kv_cache_unified: layer  17: dev = MUSA0
llama_kv_cache_unified: layer  18: dev = MUSA0
llama_kv_cache_unified: layer  19: dev = MUSA0
llama_kv_cache_unified: layer  20: dev = MUSA0
llama_kv_cache_unified: layer  21: dev = MUSA0
llama_kv_cache_unified: layer  22: dev = MUSA0
llama_kv_cache_unified: layer  23: dev = MUSA0
llama_kv_cache_unified: layer  24: dev = MUSA0
llama_kv_cache_unified: layer  25: dev = MUSA0
llama_kv_cache_unified: layer  26: dev = MUSA0
llama_kv_cache_unified: layer  27: dev = MUSA0
llama_kv_cache_unified:      MUSA0 KV buffer size =   224.00 MiB
llama_kv_cache_unified: size =  224.00 MiB (  4096 cells,  28 layers,  1 seqs), K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
ggml_gallocr_reserve_n: reallocating MUSA0 buffer from size 0.00 MiB to 304.00 MiB
ggml_gallocr_reserve_n: reallocating MUSA_Host buffer from size 0.00 MiB to 15.01 MiB
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      MUSA0 compute buffer size =   304.00 MiB
llama_context:  MUSA_Host compute buffer size =    15.01 MiB
llama_context: graph nodes  = 1098
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
main: llama threadpool init, n_threads = 6
attach_threadpool: call
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

n_ctx: 4096, add_bos: 1
tokenize the prompt
prompt: ""
tokens: [ '<beginofsentence>':151646 ]
recalculate the cached logits (check): embd_inp.size() 1, n_matching_session_tokens 0, embd_inp.size() 1, session_tokens.size() 0
main: interactive mode on.
sampler seed: 612160863
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

embd_inp.size(): 1, n_consumed: 0
waiting for user input

> Hi there
buffer: 'Hi there'
formatted: '<|User|>Hi there<|Assistant|>'
input tokens: [ '<User>':151644, 'Hi':13048, ' there':1052, '<Assistant>':151645 ]
n_remain: -5
eval: [ '<beginofsentence>':151646 ]
n_past = 1
embd_inp.size(): 5, n_consumed: 1
eval: [ '<User>':151644, 'Hi':13048, ' there':1052, '<Assistant>':151645 ]
n_past = 5
n_remain: -6
자eval: [ '':25715 ]
n_past = 6
n_remain: -7
UINTeval: [ 'UINT':25712 ]
n_past = 7
n_remain: -8
UINTeval: [ 'UINT':25712 ]
n_past = 8
n_remain: -9
 insightseval: [ ' insights':25709 ]
n_past = 9
n_remain: -10
 insightseval: [ ' insights':25709 ]
n_past = 10
n_remain: -11
UINTeval: [ 'UINT':25712 ]
n_past = 11
n_remain: -12
Tooltipeval: [ 'Tooltip':25717 ]
n_past = 12
n_remain: -13
UINTeval: [ 'UINT':25712 ]
n_past = 13
n_remain: -14
 insightseval: [ ' insights':25709 ]
n_past = 14
n_remain: -15
UINTeval: [ 'UINT':25712 ]
n_past = 15
n_remain: -16
 insightseval: [ ' insights':25709 ]
n_past = 16
n_remain: -17
 insightseval: [ ' insights':25709 ]
n_past = 17
n_remain: -18
 insightseval: [ ' insights':25709 ]
n_past = 18
n_remain: -19
UINTeval: [ 'UINT':25712 ]
n_past = 19
n_remain: -20
UINTeval: [ 'UINT':25712 ]
n_past = 20
n_remain: -21
UINTeval: [ 'UINT':25712 ]
n_past = 21
n_remain: -22
UINTeval: [ 'UINT':25712 ]
n_past = 22
n_remain: -23
UINTeval: [ 'UINT':25712 ]
n_past = 23
n_remain: -24
 insightseval: [ ' insights':25709 ]
n_past = 24
n_remain: -25
UINTwaiting for user input

>

@yeahdongcn
Collaborator Author

By comparing the two log files, I noticed only one explicit difference:

[image attachment showing the difference between the two logs]
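
For reference, a minimal way to isolate such a difference, assuming the two runs were captured to vmm_on.log and vmm_off.log (hypothetical file names), is to drop the bulky per-tensor dump and diff what remains:

# vmm_on.log / vmm_off.log are hypothetical captures of the VMM-on and VMM-off runs.
# Strip the per-tensor loader dump, then diff the remaining loader/context/system_info lines.
grep -v 'tensor split' vmm_on.log  > on.trimmed
grep -v 'tensor split' vmm_off.log > off.trimmed
diff on.trimmed off.trimmed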
