
Eval bug: Output garbled on DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from unsloth using musa backend with VMM off #13788


Open
yeahdongcn opened this issue May 26, 2025 · 5 comments

Comments

yeahdongcn (Collaborator) commented May 26, 2025

Name and Version

root@f7cd9f1a2456:/ws# ./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
Device 0: MTT S80, compute capability 2.1, VMM: no
version: 5488 (e121edc)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

Musa

Hardware

12th Gen Intel(R) Core(TM) i5-12400 + MTT S80

Models

DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from unsloth

Problem description & steps to reproduce

  1. git reset 2d77d88e70d017cd82c3f1a4517e3102e2028ac4 --hard
  2. apply the diff below and rebuild
diff --git a/ggml/src/ggml-musa/CMakeLists.txt b/ggml/src/ggml-musa/CMakeLists.txt
index 92f05d555..eb80418b2 100644
--- a/ggml/src/ggml-musa/CMakeLists.txt
+++ b/ggml/src/ggml-musa/CMakeLists.txt
@@ -75,9 +75,9 @@ if (MUSAToolkit_FOUND)
         add_compile_definitions(GGML_CUDA_FORCE_CUBLAS)
     endif()
 
-    if (GGML_CUDA_NO_VMM)
+    # if (GGML_CUDA_NO_VMM)
         add_compile_definitions(GGML_CUDA_NO_VMM)
-    endif()
+    # endif()
 
     if (NOT GGML_CUDA_FA)
         add_compile_definitions(GGML_CUDA_NO_FA)
  3. run DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from unsloth
root@f7cd9f1a2456:/ws# ./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: no
build: 4953 (2d77d88e7) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15752 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                       general.organization str              = Deepseek Ai
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.36 GiB (4.91 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = DeepSeek R1 Distill Qwen 7B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
make_cpu_buft_list: disabling extra buffer types (i.e. repacking) since a GPU device is available
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:        MUSA0 model buffer size =  4168.09 MiB
load_tensors:   CPU_Mapped model buffer size =   292.36 MiB
..................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  MUSA_Host  output buffer size =     0.58 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 28, can_shift = 1
init:      MUSA0 KV buffer size =   224.00 MiB
llama_context: KV self size  =  224.00 MiB, K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context:      MUSA0 compute buffer size =   304.00 MiB
llama_context:  MUSA_Host compute buffer size =    15.01 MiB
llama_context: graph nodes  = 1042
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 597688009
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> Hi there
UINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINTUINT
> 
  1. with -fa, everything works fine
  2. after manually reverting 2d77d88 on master (e121edc), DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf generates tokens correctly with or without -fa; other models (nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf, qwen3_8b_q4_k_m.gguf) do not seem to have this issue; with KV cache offloading disabled via --no-kv-offload, the issue is also gone. I also noticed a prompt-processing performance drop when the KV cache is offloaded and flash attention is disabled (the default). A sketch of the commands used is below.
  3. I also tried the CPU backend and did not see this issue
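
For reference, a rough sketch of the build and run commands behind the observations above. Passing -DGGML_MUSA=ON together with -DGGML_CUDA_NO_VMM=ON should be equivalent to the CMakeLists.txt edit in the diff (which only forces the same define), but the exact configure flags are an assumption, not copied verbatim from my session:

# Sketch only: configure the MUSA backend with VMM disabled via the option
# checked in ggml/src/ggml-musa/CMakeLists.txt, instead of editing that file.
cmake -B build -DGGML_MUSA=ON -DGGML_CUDA_NO_VMM=ON
cmake --build build --config Release -j

# Run variations used to narrow the issue down:
./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 999                  # garbled output
./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 999 -fa              # works
./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 999 --no-kv-offload  # works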

First Bad Commit

2d77d88
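
For completeness, a minimal sketch of the revert check from observation 2 above (commit hashes as quoted in this report; the exact workflow may have differed):

# Sketch: revert the suspect commit on top of master (e121edc), rebuild,
# and re-run the DeepSeek-R1-Distill-Qwen-7B command above to compare.
git checkout e121edc
git revert --no-edit 2d77d88e70d017cd82c3f1a4517e3102e2028ac4
cmake --build build --config Release -j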

Relevant log output

root@f7cd9f1a2456:/ws# ./build/bin/llama-cli -m /models/nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf -ngl 999
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: no
build: 5488 (e121edc43) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15752 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 292 tensors from /models/nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.1 Nemotron Nano 8B v1
llama_model_loader: - kv   3:                            general.version str              = v1
llama_model_loader: - kv   4:                       general.organization str              = Nvidia
llama_model_loader: - kv   5:                           general.finetune str              = 42f62a403ee352e019834442673256e3fe3de275
llama_model_loader: - kv   6:                           general.basename str              = Llama-3.1-Nemotron-Nano
llama_model_loader: - kv   7:                         general.size_label str              = 8B
llama_model_loader: - kv   8:                            general.license str              = other
llama_model_loader: - kv   9:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv  10:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv  11:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          llama.block_count u32              = 32
llama_model_loader: - kv  14:                       llama.context_length u32              = 131072
llama_model_loader: - kv  15:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  16:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  17:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  18:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  20:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  22:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  23:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  24:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if messages[0]['role'] == 'system...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.58 GiB (4.89 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Llama 3.1 Nemotron Nano 8B v1
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        MUSA0 model buffer size =  4403.49 MiB
load_tensors:   CPU_Mapped model buffer size =   281.81 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  MUSA_Host  output buffer size =     0.49 MiB
llama_kv_cache_unified:      MUSA0 KV buffer size =   512.00 MiB
llama_kv_cache_unified: size =  512.00 MiB (  4096 cells,  32 layers,  1 seqs), K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_context:      MUSA0 compute buffer size =   296.00 MiB
llama_context:  MUSA_Host compute buffer size =    16.01 MiB
llama_context: graph nodes  = 1158
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 3669391543
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> Hi
Hello! It's nice to meet you. How can I assist you today?

> 
llama_perf_sampler_print:    sampling time =       1.48 ms /    27 runs   (    0.05 ms per token, 18218.62 tokens per second)
llama_perf_context_print:        load time =    2201.14 ms
llama_perf_context_print: prompt eval time =    4094.29 ms /    11 tokens (  372.21 ms per token,     2.69 tokens per second)
llama_perf_context_print:        eval time =    1359.07 ms /    16 runs   (   84.94 ms per token,    11.77 tokens per second)
llama_perf_context_print:       total time =    8089.12 ms /    27 tokens
Interrupted by user
root@f7cd9f1a2456:/ws# ./build/bin/llama-cli -m /models/nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf -ngl 999 -fa
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: no
build: 5488 (e121edc43) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15752 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 292 tensors from /models/nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.1 Nemotron Nano 8B v1
llama_model_loader: - kv   3:                            general.version str              = v1
llama_model_loader: - kv   4:                       general.organization str              = Nvidia
llama_model_loader: - kv   5:                           general.finetune str              = 42f62a403ee352e019834442673256e3fe3de275
llama_model_loader: - kv   6:                           general.basename str              = Llama-3.1-Nemotron-Nano
llama_model_loader: - kv   7:                         general.size_label str              = 8B
llama_model_loader: - kv   8:                            general.license str              = other
llama_model_loader: - kv   9:                       general.license.name str              = nvidia-open-model-license
llama_model_loader: - kv  10:                       general.license.link str              = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv  11:                               general.tags arr[str,4]       = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          llama.block_count u32              = 32
llama_model_loader: - kv  14:                       llama.context_length u32              = 131072
llama_model_loader: - kv  15:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  16:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  17:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  18:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  20:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  22:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  23:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  24:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  28:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  29:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {%- if messages[0]['role'] == 'system...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 15
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.58 GiB (4.89 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = Llama 3.1 Nemotron Nano 8B v1
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<|begin_of_text|>'
print_info: EOS token        = 128009 '<|eot_id|>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:        MUSA0 model buffer size =  4403.49 MiB
load_tensors:   CPU_Mapped model buffer size =   281.81 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context:  MUSA_Host  output buffer size =     0.49 MiB
llama_kv_cache_unified:      MUSA0 KV buffer size =   512.00 MiB
llama_kv_cache_unified: size =  512.00 MiB (  4096 cells,  32 layers,  1 seqs), K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_context:      MUSA0 compute buffer size =   266.55 MiB
llama_context:  MUSA_Host compute buffer size =    36.01 MiB
llama_context: graph nodes  = 1031
llama_context: graph splits = 66
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>

How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>



system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: interactive mode on.
sampler seed: 401977407
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT


> Hi
Hello! How can I assist you today?

> 
llama_perf_sampler_print:    sampling time =       0.90 ms /    20 runs   (    0.05 ms per token, 22148.39 tokens per second)
llama_perf_context_print:        load time =     944.39 ms
llama_perf_context_print: prompt eval time =     781.78 ms /    11 tokens (   71.07 ms per token,    14.07 tokens per second)
llama_perf_context_print:        eval time =     810.46 ms /     9 runs   (   90.05 ms per token,    11.10 tokens per second)
llama_perf_context_print:       total time =   10864.23 ms /    20 tokens
Interrupted by user
root@f7cd9f1a2456:/ws#
yeahdongcn (Collaborator, Author) commented:

Hi @ggerganov, could you please take a look and let me know if you have any insights or suggestions for debugging this? Thanks.

ggerganov (Member) commented:

On latest master, I am not able to reproduce the problem on my RTX 2060:

cmake -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON ..

Downloaded the model from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main

Command:

./bin/llama-cli -m DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf  -ngl 99 -lv 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: no
0.00.000.541 I build: 5490 (fef693dc6) with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
0.00.000.617 I main: llama backend init
0.00.000.622 I main: load the model and apply lora adapter, if any
0.00.130.784 I llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 2060 SUPER) - 7680 MiB free
0.00.154.075 I llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
0.00.154.086 I llama_model_loader: - tensor split  0:                    output.weight q6_K     [  3584, 152064,     1,     1 ]   426,36 MiB
0.00.154.088 I llama_model_loader: - tensor split  0:               output_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.089 I llama_model_loader: - tensor split  0:                token_embd.weight q4_K     [  3584, 152064,     1,     1 ]   292,36 MiB
0.00.154.090 I llama_model_loader: - tensor split  0:                blk.0.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.091 I llama_model_loader: - tensor split  0:              blk.0.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.091 I llama_model_loader: - tensor split  0:           blk.0.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.092 I llama_model_loader: - tensor split  0:         blk.0.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.092 I llama_model_loader: - tensor split  0:                blk.0.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.093 I llama_model_loader: - tensor split  0:              blk.0.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.093 I llama_model_loader: - tensor split  0:                blk.0.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.094 I llama_model_loader: - tensor split  0:              blk.0.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.094 I llama_model_loader: - tensor split  0:            blk.0.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.095 I llama_model_loader: - tensor split  0:            blk.0.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.095 I llama_model_loader: - tensor split  0:            blk.0.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.096 I llama_model_loader: - tensor split  0:              blk.0.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.096 I llama_model_loader: - tensor split  0:                blk.1.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.097 I llama_model_loader: - tensor split  0:              blk.1.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.097 I llama_model_loader: - tensor split  0:           blk.1.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.098 I llama_model_loader: - tensor split  0:         blk.1.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.098 I llama_model_loader: - tensor split  0:                blk.1.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.099 I llama_model_loader: - tensor split  0:              blk.1.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.100 I llama_model_loader: - tensor split  0:                blk.1.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.100 I llama_model_loader: - tensor split  0:              blk.1.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.101 I llama_model_loader: - tensor split  0:            blk.1.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.101 I llama_model_loader: - tensor split  0:            blk.1.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.102 I llama_model_loader: - tensor split  0:            blk.1.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.102 I llama_model_loader: - tensor split  0:              blk.1.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.103 I llama_model_loader: - tensor split  0:                blk.2.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.103 I llama_model_loader: - tensor split  0:              blk.2.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.103 I llama_model_loader: - tensor split  0:           blk.2.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.104 I llama_model_loader: - tensor split  0:         blk.2.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.104 I llama_model_loader: - tensor split  0:                blk.2.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.105 I llama_model_loader: - tensor split  0:              blk.2.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.105 I llama_model_loader: - tensor split  0:                blk.2.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.106 I llama_model_loader: - tensor split  0:              blk.2.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.106 I llama_model_loader: - tensor split  0:            blk.2.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.107 I llama_model_loader: - tensor split  0:            blk.2.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.107 I llama_model_loader: - tensor split  0:            blk.2.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.108 I llama_model_loader: - tensor split  0:              blk.2.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.109 I llama_model_loader: - tensor split  0:                blk.3.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.110 I llama_model_loader: - tensor split  0:              blk.3.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.111 I llama_model_loader: - tensor split  0:           blk.3.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.111 I llama_model_loader: - tensor split  0:         blk.3.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.111 I llama_model_loader: - tensor split  0:                blk.3.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.112 I llama_model_loader: - tensor split  0:              blk.3.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.112 I llama_model_loader: - tensor split  0:                blk.3.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.113 I llama_model_loader: - tensor split  0:              blk.3.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.113 I llama_model_loader: - tensor split  0:            blk.3.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.114 I llama_model_loader: - tensor split  0:            blk.3.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.114 I llama_model_loader: - tensor split  0:            blk.3.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.115 I llama_model_loader: - tensor split  0:              blk.3.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.115 I llama_model_loader: - tensor split  0:                blk.4.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.116 I llama_model_loader: - tensor split  0:              blk.4.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.116 I llama_model_loader: - tensor split  0:           blk.4.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.117 I llama_model_loader: - tensor split  0:         blk.4.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.117 I llama_model_loader: - tensor split  0:                blk.4.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.118 I llama_model_loader: - tensor split  0:              blk.4.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.118 I llama_model_loader: - tensor split  0:                blk.4.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.119 I llama_model_loader: - tensor split  0:              blk.4.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.119 I llama_model_loader: - tensor split  0:            blk.4.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.120 I llama_model_loader: - tensor split  0:            blk.4.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.120 I llama_model_loader: - tensor split  0:            blk.4.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.121 I llama_model_loader: - tensor split  0:              blk.4.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.121 I llama_model_loader: - tensor split  0:                blk.5.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.122 I llama_model_loader: - tensor split  0:              blk.5.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.122 I llama_model_loader: - tensor split  0:           blk.5.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.123 I llama_model_loader: - tensor split  0:         blk.5.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.123 I llama_model_loader: - tensor split  0:                blk.5.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.123 I llama_model_loader: - tensor split  0:              blk.5.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.124 I llama_model_loader: - tensor split  0:                blk.5.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.124 I llama_model_loader: - tensor split  0:              blk.5.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.125 I llama_model_loader: - tensor split  0:            blk.5.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.126 I llama_model_loader: - tensor split  0:            blk.5.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.126 I llama_model_loader: - tensor split  0:            blk.5.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.127 I llama_model_loader: - tensor split  0:              blk.5.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.127 I llama_model_loader: - tensor split  0:                blk.6.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.128 I llama_model_loader: - tensor split  0:              blk.6.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.128 I llama_model_loader: - tensor split  0:           blk.6.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.128 I llama_model_loader: - tensor split  0:         blk.6.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.129 I llama_model_loader: - tensor split  0:                blk.6.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.129 I llama_model_loader: - tensor split  0:              blk.6.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.130 I llama_model_loader: - tensor split  0:                blk.6.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.130 I llama_model_loader: - tensor split  0:              blk.6.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.131 I llama_model_loader: - tensor split  0:            blk.6.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.131 I llama_model_loader: - tensor split  0:            blk.6.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.132 I llama_model_loader: - tensor split  0:            blk.6.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.132 I llama_model_loader: - tensor split  0:              blk.6.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.133 I llama_model_loader: - tensor split  0:                blk.7.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.134 I llama_model_loader: - tensor split  0:              blk.7.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.135 I llama_model_loader: - tensor split  0:           blk.7.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.135 I llama_model_loader: - tensor split  0:         blk.7.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.136 I llama_model_loader: - tensor split  0:                blk.7.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.136 I llama_model_loader: - tensor split  0:              blk.7.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.137 I llama_model_loader: - tensor split  0:                blk.7.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.137 I llama_model_loader: - tensor split  0:              blk.7.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.138 I llama_model_loader: - tensor split  0:            blk.7.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.138 I llama_model_loader: - tensor split  0:            blk.7.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.139 I llama_model_loader: - tensor split  0:            blk.7.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.139 I llama_model_loader: - tensor split  0:              blk.7.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.140 I llama_model_loader: - tensor split  0:                blk.8.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.140 I llama_model_loader: - tensor split  0:              blk.8.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.141 I llama_model_loader: - tensor split  0:           blk.8.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.141 I llama_model_loader: - tensor split  0:         blk.8.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.142 I llama_model_loader: - tensor split  0:                blk.8.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.142 I llama_model_loader: - tensor split  0:              blk.8.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.143 I llama_model_loader: - tensor split  0:                blk.8.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.143 I llama_model_loader: - tensor split  0:              blk.8.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.144 I llama_model_loader: - tensor split  0:            blk.8.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.144 I llama_model_loader: - tensor split  0:            blk.8.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.145 I llama_model_loader: - tensor split  0:            blk.8.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.145 I llama_model_loader: - tensor split  0:              blk.8.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.146 I llama_model_loader: - tensor split  0:                blk.9.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.146 I llama_model_loader: - tensor split  0:              blk.9.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.147 I llama_model_loader: - tensor split  0:           blk.9.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.147 I llama_model_loader: - tensor split  0:         blk.9.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.147 I llama_model_loader: - tensor split  0:                blk.9.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.148 I llama_model_loader: - tensor split  0:              blk.9.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.148 I llama_model_loader: - tensor split  0:                blk.9.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.149 I llama_model_loader: - tensor split  0:              blk.9.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.149 I llama_model_loader: - tensor split  0:            blk.9.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.150 I llama_model_loader: - tensor split  0:            blk.9.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.150 I llama_model_loader: - tensor split  0:            blk.9.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.151 I llama_model_loader: - tensor split  0:              blk.9.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.151 I llama_model_loader: - tensor split  0:               blk.10.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.152 I llama_model_loader: - tensor split  0:             blk.10.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.152 I llama_model_loader: - tensor split  0:          blk.10.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.153 I llama_model_loader: - tensor split  0:        blk.10.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.153 I llama_model_loader: - tensor split  0:               blk.10.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.154 I llama_model_loader: - tensor split  0:             blk.10.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.154 I llama_model_loader: - tensor split  0:               blk.10.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.155 I llama_model_loader: - tensor split  0:             blk.10.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.155 I llama_model_loader: - tensor split  0:           blk.10.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.156 I llama_model_loader: - tensor split  0:           blk.10.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.156 I llama_model_loader: - tensor split  0:           blk.10.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.157 I llama_model_loader: - tensor split  0:             blk.10.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.157 I llama_model_loader: - tensor split  0:               blk.11.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.158 I llama_model_loader: - tensor split  0:             blk.11.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.158 I llama_model_loader: - tensor split  0:          blk.11.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.159 I llama_model_loader: - tensor split  0:        blk.11.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.160 I llama_model_loader: - tensor split  0:               blk.11.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.161 I llama_model_loader: - tensor split  0:             blk.11.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.161 I llama_model_loader: - tensor split  0:               blk.11.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.162 I llama_model_loader: - tensor split  0:             blk.11.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.162 I llama_model_loader: - tensor split  0:           blk.11.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.163 I llama_model_loader: - tensor split  0:           blk.11.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.164 I llama_model_loader: - tensor split  0:           blk.11.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.164 I llama_model_loader: - tensor split  0:             blk.11.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.165 I llama_model_loader: - tensor split  0:               blk.12.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.165 I llama_model_loader: - tensor split  0:             blk.12.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.166 I llama_model_loader: - tensor split  0:          blk.12.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.166 I llama_model_loader: - tensor split  0:        blk.12.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.167 I llama_model_loader: - tensor split  0:               blk.12.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.167 I llama_model_loader: - tensor split  0:             blk.12.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.167 I llama_model_loader: - tensor split  0:               blk.12.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.168 I llama_model_loader: - tensor split  0:             blk.12.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.168 I llama_model_loader: - tensor split  0:           blk.12.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.169 I llama_model_loader: - tensor split  0:           blk.12.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.169 I llama_model_loader: - tensor split  0:           blk.12.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.170 I llama_model_loader: - tensor split  0:             blk.12.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.170 I llama_model_loader: - tensor split  0:               blk.13.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.171 I llama_model_loader: - tensor split  0:             blk.13.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.171 I llama_model_loader: - tensor split  0:          blk.13.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.172 I llama_model_loader: - tensor split  0:        blk.13.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.172 I llama_model_loader: - tensor split  0:               blk.13.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.173 I llama_model_loader: - tensor split  0:             blk.13.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.173 I llama_model_loader: - tensor split  0:               blk.13.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.174 I llama_model_loader: - tensor split  0:             blk.13.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.174 I llama_model_loader: - tensor split  0:           blk.13.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.175 I llama_model_loader: - tensor split  0:           blk.13.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.175 I llama_model_loader: - tensor split  0:           blk.13.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.176 I llama_model_loader: - tensor split  0:             blk.13.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.176 I llama_model_loader: - tensor split  0:               blk.14.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.176 I llama_model_loader: - tensor split  0:             blk.14.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.177 I llama_model_loader: - tensor split  0:          blk.14.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.178 I llama_model_loader: - tensor split  0:        blk.14.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.178 I llama_model_loader: - tensor split  0:               blk.14.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.179 I llama_model_loader: - tensor split  0:             blk.14.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.179 I llama_model_loader: - tensor split  0:               blk.14.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.180 I llama_model_loader: - tensor split  0:             blk.14.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.180 I llama_model_loader: - tensor split  0:           blk.14.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.181 I llama_model_loader: - tensor split  0:           blk.14.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.181 I llama_model_loader: - tensor split  0:           blk.14.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.182 I llama_model_loader: - tensor split  0:             blk.14.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.182 I llama_model_loader: - tensor split  0:               blk.15.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.183 I llama_model_loader: - tensor split  0:             blk.15.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.183 I llama_model_loader: - tensor split  0:          blk.15.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.184 I llama_model_loader: - tensor split  0:        blk.15.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.184 I llama_model_loader: - tensor split  0:               blk.15.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.184 I llama_model_loader: - tensor split  0:             blk.15.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.185 I llama_model_loader: - tensor split  0:               blk.15.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.185 I llama_model_loader: - tensor split  0:             blk.15.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.186 I llama_model_loader: - tensor split  0:           blk.15.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.186 I llama_model_loader: - tensor split  0:           blk.15.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.187 I llama_model_loader: - tensor split  0:           blk.15.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.187 I llama_model_loader: - tensor split  0:             blk.15.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.188 I llama_model_loader: - tensor split  0:               blk.16.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.188 I llama_model_loader: - tensor split  0:             blk.16.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.189 I llama_model_loader: - tensor split  0:          blk.16.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.189 I llama_model_loader: - tensor split  0:        blk.16.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.190 I llama_model_loader: - tensor split  0:               blk.16.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.190 I llama_model_loader: - tensor split  0:             blk.16.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.191 I llama_model_loader: - tensor split  0:               blk.16.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.191 I llama_model_loader: - tensor split  0:             blk.16.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.192 I llama_model_loader: - tensor split  0:           blk.16.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.192 I llama_model_loader: - tensor split  0:           blk.16.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.193 I llama_model_loader: - tensor split  0:           blk.16.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.193 I llama_model_loader: - tensor split  0:             blk.16.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.194 I llama_model_loader: - tensor split  0:               blk.17.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.194 I llama_model_loader: - tensor split  0:             blk.17.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.195 I llama_model_loader: - tensor split  0:          blk.17.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.195 I llama_model_loader: - tensor split  0:        blk.17.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.195 I llama_model_loader: - tensor split  0:               blk.17.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.196 I llama_model_loader: - tensor split  0:             blk.17.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.197 I llama_model_loader: - tensor split  0:               blk.17.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.198 I llama_model_loader: - tensor split  0:             blk.17.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.199 I llama_model_loader: - tensor split  0:           blk.17.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.199 I llama_model_loader: - tensor split  0:           blk.17.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.200 I llama_model_loader: - tensor split  0:           blk.17.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.200 I llama_model_loader: - tensor split  0:             blk.17.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.201 I llama_model_loader: - tensor split  0:               blk.18.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.201 I llama_model_loader: - tensor split  0:             blk.18.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.202 I llama_model_loader: - tensor split  0:          blk.18.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.202 I llama_model_loader: - tensor split  0:        blk.18.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.202 I llama_model_loader: - tensor split  0:               blk.18.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.203 I llama_model_loader: - tensor split  0:             blk.18.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.204 I llama_model_loader: - tensor split  0:               blk.18.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.204 I llama_model_loader: - tensor split  0:             blk.18.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.204 I llama_model_loader: - tensor split  0:           blk.18.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.205 I llama_model_loader: - tensor split  0:           blk.18.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.205 I llama_model_loader: - tensor split  0:           blk.18.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.206 I llama_model_loader: - tensor split  0:             blk.18.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.206 I llama_model_loader: - tensor split  0:               blk.19.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.207 I llama_model_loader: - tensor split  0:             blk.19.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.207 I llama_model_loader: - tensor split  0:          blk.19.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.208 I llama_model_loader: - tensor split  0:        blk.19.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.208 I llama_model_loader: - tensor split  0:               blk.19.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.209 I llama_model_loader: - tensor split  0:             blk.19.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.210 I llama_model_loader: - tensor split  0:               blk.19.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.210 I llama_model_loader: - tensor split  0:             blk.19.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.211 I llama_model_loader: - tensor split  0:           blk.19.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.211 I llama_model_loader: - tensor split  0:           blk.19.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.212 I llama_model_loader: - tensor split  0:           blk.19.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.212 I llama_model_loader: - tensor split  0:             blk.19.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.213 I llama_model_loader: - tensor split  0:               blk.20.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.213 I llama_model_loader: - tensor split  0:             blk.20.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.213 I llama_model_loader: - tensor split  0:          blk.20.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.214 I llama_model_loader: - tensor split  0:        blk.20.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.214 I llama_model_loader: - tensor split  0:               blk.20.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.215 I llama_model_loader: - tensor split  0:             blk.20.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.215 I llama_model_loader: - tensor split  0:               blk.20.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.216 I llama_model_loader: - tensor split  0:             blk.20.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.217 I llama_model_loader: - tensor split  0:           blk.20.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.217 I llama_model_loader: - tensor split  0:           blk.20.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.218 I llama_model_loader: - tensor split  0:           blk.20.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.218 I llama_model_loader: - tensor split  0:             blk.20.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.219 I llama_model_loader: - tensor split  0:               blk.21.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.220 I llama_model_loader: - tensor split  0:             blk.21.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.220 I llama_model_loader: - tensor split  0:          blk.21.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.221 I llama_model_loader: - tensor split  0:        blk.21.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.222 I llama_model_loader: - tensor split  0:               blk.21.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.222 I llama_model_loader: - tensor split  0:             blk.21.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.223 I llama_model_loader: - tensor split  0:               blk.21.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.224 I llama_model_loader: - tensor split  0:             blk.21.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.233 I llama_model_loader: - tensor split  0:           blk.21.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.234 I llama_model_loader: - tensor split  0:           blk.21.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.235 I llama_model_loader: - tensor split  0:           blk.21.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.237 I llama_model_loader: - tensor split  0:             blk.21.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.238 I llama_model_loader: - tensor split  0:               blk.22.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.238 I llama_model_loader: - tensor split  0:             blk.22.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.251 I llama_model_loader: - tensor split  0:          blk.22.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.252 I llama_model_loader: - tensor split  0:        blk.22.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.252 I llama_model_loader: - tensor split  0:               blk.22.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.253 I llama_model_loader: - tensor split  0:             blk.22.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.253 I llama_model_loader: - tensor split  0:               blk.22.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.254 I llama_model_loader: - tensor split  0:             blk.22.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.254 I llama_model_loader: - tensor split  0:           blk.22.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36,42 MiB
0.00.154.255 I llama_model_loader: - tensor split  0:           blk.22.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.256 I llama_model_loader: - tensor split  0:           blk.22.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.256 I llama_model_loader: - tensor split  0:             blk.22.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.257 I llama_model_loader: - tensor split  0:               blk.23.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.257 I llama_model_loader: - tensor split  0:             blk.23.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.257 I llama_model_loader: - tensor split  0:          blk.23.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.258 I llama_model_loader: - tensor split  0:        blk.23.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.258 I llama_model_loader: - tensor split  0:               blk.23.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.259 I llama_model_loader: - tensor split  0:             blk.23.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.259 I llama_model_loader: - tensor split  0:               blk.23.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.260 I llama_model_loader: - tensor split  0:             blk.23.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.260 I llama_model_loader: - tensor split  0:           blk.23.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.261 I llama_model_loader: - tensor split  0:           blk.23.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.261 I llama_model_loader: - tensor split  0:           blk.23.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.262 I llama_model_loader: - tensor split  0:             blk.23.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.262 I llama_model_loader: - tensor split  0:               blk.24.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.263 I llama_model_loader: - tensor split  0:             blk.24.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.263 I llama_model_loader: - tensor split  0:          blk.24.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.264 I llama_model_loader: - tensor split  0:        blk.24.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.264 I llama_model_loader: - tensor split  0:               blk.24.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.265 I llama_model_loader: - tensor split  0:             blk.24.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.265 I llama_model_loader: - tensor split  0:               blk.24.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.266 I llama_model_loader: - tensor split  0:             blk.24.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.266 I llama_model_loader: - tensor split  0:           blk.24.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.267 I llama_model_loader: - tensor split  0:           blk.24.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.267 I llama_model_loader: - tensor split  0:           blk.24.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.268 I llama_model_loader: - tensor split  0:             blk.24.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.268 I llama_model_loader: - tensor split  0:               blk.25.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.269 I llama_model_loader: - tensor split  0:             blk.25.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.269 I llama_model_loader: - tensor split  0:          blk.25.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.270 I llama_model_loader: - tensor split  0:        blk.25.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.270 I llama_model_loader: - tensor split  0:               blk.25.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.271 I llama_model_loader: - tensor split  0:             blk.25.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.271 I llama_model_loader: - tensor split  0:               blk.25.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.272 I llama_model_loader: - tensor split  0:             blk.25.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.272 I llama_model_loader: - tensor split  0:           blk.25.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.273 I llama_model_loader: - tensor split  0:           blk.25.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.273 I llama_model_loader: - tensor split  0:           blk.25.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.274 I llama_model_loader: - tensor split  0:             blk.25.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.274 I llama_model_loader: - tensor split  0:               blk.26.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.275 I llama_model_loader: - tensor split  0:             blk.26.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.275 I llama_model_loader: - tensor split  0:          blk.26.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.276 I llama_model_loader: - tensor split  0:        blk.26.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.276 I llama_model_loader: - tensor split  0:               blk.26.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.277 I llama_model_loader: - tensor split  0:             blk.26.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.277 I llama_model_loader: - tensor split  0:               blk.26.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.278 I llama_model_loader: - tensor split  0:             blk.26.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.278 I llama_model_loader: - tensor split  0:           blk.26.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.279 I llama_model_loader: - tensor split  0:           blk.26.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.279 I llama_model_loader: - tensor split  0:           blk.26.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.279 I llama_model_loader: - tensor split  0:             blk.26.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.280 I llama_model_loader: - tensor split  0:               blk.27.attn_k.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.280 I llama_model_loader: - tensor split  0:             blk.27.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0,98 MiB
0.00.154.281 I llama_model_loader: - tensor split  0:          blk.27.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.281 I llama_model_loader: - tensor split  0:        blk.27.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.282 I llama_model_loader: - tensor split  0:               blk.27.attn_q.bias f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.282 I llama_model_loader: - tensor split  0:             blk.27.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6,89 MiB
0.00.154.283 I llama_model_loader: - tensor split  0:               blk.27.attn_v.bias f32      [   512,     1,     1,     1 ]     0,00 MiB
0.00.154.283 I llama_model_loader: - tensor split  0:             blk.27.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1,44 MiB
0.00.154.284 I llama_model_loader: - tensor split  0:           blk.27.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53,12 MiB
0.00.154.284 I llama_model_loader: - tensor split  0:           blk.27.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.285 I llama_model_loader: - tensor split  0:           blk.27.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0,01 MiB
0.00.154.285 I llama_model_loader: - tensor split  0:             blk.27.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36,42 MiB
0.00.154.289 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.00.154.297 I llama_model_loader: - kv   0:                       general.architecture str              = qwen2
0.00.154.298 I llama_model_loader: - kv   1:                               general.type str              = model
0.00.154.298 I llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
0.00.154.299 I llama_model_loader: - kv   3:                       general.organization str              = Deepseek Ai
0.00.154.299 I llama_model_loader: - kv   4:                           general.basename str              = DeepSeek-R1-Distill-Qwen
0.00.154.299 I llama_model_loader: - kv   5:                         general.size_label str              = 7B
0.00.154.302 I llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
0.00.154.302 I llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
0.00.154.302 I llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
0.00.154.303 I llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
0.00.154.303 I llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
0.00.154.303 I llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
0.00.154.306 I llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 10000,000000
0.00.154.306 I llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0,000001
0.00.154.307 I llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
0.00.154.307 I llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
0.00.166.302 I llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
0.00.169.960 I llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.00.181.266 I llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
0.00.181.269 I llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
0.00.181.269 I llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
0.00.181.269 I llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151654
0.00.181.270 I llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
0.00.181.270 I llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
0.00.181.271 I llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
0.00.181.272 I llama_model_loader: - kv  25:               general.quantization_version u32              = 2
0.00.181.272 I llama_model_loader: - kv  26:                          general.file_type u32              = 15
0.00.181.272 I llama_model_loader: - type  f32:  141 tensors
0.00.181.273 I llama_model_loader: - type q4_K:  169 tensors
0.00.181.273 I llama_model_loader: - type q6_K:   29 tensors
0.00.181.275 I print_info: file format = GGUF V3 (latest)
0.00.181.275 I print_info: file type   = Q4_K - Medium
0.00.181.276 I print_info: file size   = 4,36 GiB (4,91 BPW) 
0.00.225.519 D init_tokenizer: initializing tokenizer for type 2
0.00.233.934 D load: control token: 151660 '<|fim_middle|>' is not marked as EOG
0.00.233.936 D load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
0.00.233.936 D load: control token: 151653 '<|vision_end|>' is not marked as EOG
0.00.233.937 D load: control token: 151645 '<|Assistant|>' is not marked as EOG
0.00.234.243 D load: control token: 151644 '<|User|>' is not marked as EOG
0.00.234.349 D load: control token: 151655 '<|image_pad|>' is not marked as EOG
0.00.234.388 D load: control token: 151651 '<|quad_end|>' is not marked as EOG
0.00.235.230 D load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
0.00.235.800 D load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
0.00.235.808 D load: control token: 151652 '<|vision_start|>' is not marked as EOG
0.00.236.039 D load: control token: 151647 '<|EOT|>' is not marked as EOG
0.00.236.739 D load: control token: 151654 '<|vision_pad|>' is not marked as EOG
0.00.237.456 D load: control token: 151656 '<|video_pad|>' is not marked as EOG
0.00.237.772 D load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
0.00.237.948 D load: control token: 151650 '<|quad_start|>' is not marked as EOG
0.00.238.366 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
0.00.238.449 I load: special tokens cache size = 22
0.00.273.622 I load: token to piece cache size = 0,9310 MB
0.00.273.632 I print_info: arch             = qwen2
0.00.273.633 I print_info: vocab_only       = 0
0.00.273.633 I print_info: n_ctx_train      = 131072
0.00.273.633 I print_info: n_embd           = 3584
0.00.273.634 I print_info: n_layer          = 28
0.00.273.642 I print_info: n_head           = 28
0.00.273.644 I print_info: n_head_kv        = 4
0.00.273.644 I print_info: n_rot            = 128
0.00.273.644 I print_info: n_swa            = 0
0.00.273.646 I print_info: is_swa_any       = 0
0.00.273.646 I print_info: n_embd_head_k    = 128
0.00.273.646 I print_info: n_embd_head_v    = 128
0.00.273.648 I print_info: n_gqa            = 7
0.00.273.649 I print_info: n_embd_k_gqa     = 512
0.00.273.650 I print_info: n_embd_v_gqa     = 512
0.00.273.651 I print_info: f_norm_eps       = 0,0e+00
0.00.273.652 I print_info: f_norm_rms_eps   = 1,0e-06
0.00.273.652 I print_info: f_clamp_kqv      = 0,0e+00
0.00.273.652 I print_info: f_max_alibi_bias = 0,0e+00
0.00.273.654 I print_info: f_logit_scale    = 0,0e+00
0.00.273.654 I print_info: f_attn_scale     = 0,0e+00
0.00.273.656 I print_info: n_ff             = 18944
0.00.273.656 I print_info: n_expert         = 0
0.00.273.656 I print_info: n_expert_used    = 0
0.00.273.656 I print_info: causal attn      = 1
0.00.273.656 I print_info: pooling type     = -1
0.00.273.656 I print_info: rope type        = 2
0.00.273.657 I print_info: rope scaling     = linear
0.00.273.657 I print_info: freq_base_train  = 10000,0
0.00.273.658 I print_info: freq_scale_train = 1
0.00.273.658 I print_info: n_ctx_orig_yarn  = 131072
0.00.273.658 I print_info: rope_finetuned   = unknown
0.00.273.658 I print_info: ssm_d_conv       = 0
0.00.273.658 I print_info: ssm_d_inner      = 0
0.00.273.658 I print_info: ssm_d_state      = 0
0.00.273.659 I print_info: ssm_dt_rank      = 0
0.00.273.659 I print_info: ssm_dt_b_c_rms   = 0
0.00.273.659 I print_info: model type       = 7B
0.00.273.660 I print_info: model params     = 7,62 B
0.00.273.660 I print_info: general.name     = DeepSeek R1 Distill Qwen 7B
0.00.273.663 I print_info: vocab type       = BPE
0.00.273.664 I print_info: n_vocab          = 152064
0.00.273.664 I print_info: n_merges         = 151387
0.00.273.664 I print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
0.00.273.664 I print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
0.00.273.665 I print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
0.00.273.665 I print_info: PAD token        = 151654 '<|vision_pad|>'
0.00.273.665 I print_info: LF token         = 198 'Ċ'
0.00.273.665 I print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
0.00.273.666 I print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
0.00.273.666 I print_info: FIM MID token    = 151660 '<|fim_middle|>'
0.00.273.666 I print_info: FIM PAD token    = 151662 '<|fim_pad|>'
0.00.273.666 I print_info: FIM REP token    = 151663 '<|repo_name|>'
0.00.273.666 I print_info: FIM SEP token    = 151664 '<|file_sep|>'
0.00.273.666 I print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
0.00.273.667 I print_info: EOG token        = 151662 '<|fim_pad|>'
0.00.273.667 I print_info: EOG token        = 151663 '<|repo_name|>'
0.00.273.667 I print_info: EOG token        = 151664 '<|file_sep|>'
0.00.273.667 I print_info: max token length = 256
0.00.273.674 I load_tensors: loading model tensors, this can take a while... (mmap = true)
0.00.273.791 D load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
0.00.273.791 D load_tensors: layer   1 assigned to device CUDA0, is_swa = 0
0.00.273.792 D load_tensors: layer   2 assigned to device CUDA0, is_swa = 0
0.00.273.793 D load_tensors: layer   3 assigned to device CUDA0, is_swa = 0
0.00.273.793 D load_tensors: layer   4 assigned to device CUDA0, is_swa = 0
0.00.273.794 D load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
0.00.273.794 D load_tensors: layer   6 assigned to device CUDA0, is_swa = 0
0.00.273.794 D load_tensors: layer   7 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer   8 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer   9 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer  10 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
0.00.273.795 D load_tensors: layer  12 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  13 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  14 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  15 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  16 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
0.00.273.796 D load_tensors: layer  18 assigned to device CUDA0, is_swa = 0
0.00.273.797 D load_tensors: layer  19 assigned to device CUDA0, is_swa = 0
0.00.273.797 D load_tensors: layer  20 assigned to device CUDA0, is_swa = 0
0.00.273.797 D load_tensors: layer  21 assigned to device CUDA0, is_swa = 0
0.00.273.797 D load_tensors: layer  22 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  24 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  25 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  26 assigned to device CUDA0, is_swa = 0
0.00.273.798 D load_tensors: layer  27 assigned to device CUDA0, is_swa = 0
0.00.273.799 D load_tensors: layer  28 assigned to device CUDA0, is_swa = 0
0.00.274.863 D load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
0.00.548.882 I load_tensors: offloading 28 repeating layers to GPU
0.00.548.886 I load_tensors: offloading output layer to GPU
0.00.548.886 I load_tensors: offloaded 29/29 layers to GPU
0.00.548.892 I load_tensors:        CUDA0 model buffer size =  4168,09 MiB
0.00.548.893 I load_tensors:   CPU_Mapped model buffer size =   292,36 MiB
..................................................................................
0.01.108.985 I llama_context: constructing llama_context
0.01.108.989 I llama_context: n_seq_max     = 1
0.01.108.989 I llama_context: n_ctx         = 4096
0.01.108.990 I llama_context: n_ctx_per_seq = 4096
0.01.108.990 I llama_context: n_batch       = 2048
0.01.108.990 I llama_context: n_ubatch      = 512
0.01.108.991 I llama_context: causal_attn   = 1
0.01.108.991 I llama_context: flash_attn    = 0
0.01.108.995 I llama_context: freq_base     = 10000,0
0.01.108.995 I llama_context: freq_scale    = 1
0.01.108.996 W llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
0.01.109.034 D set_abort_callback: call
0.01.110.304 I llama_context:  CUDA_Host  output buffer size =     0,58 MiB
0.01.110.310 D create_memory: n_ctx = 4096 (padded)
0.01.110.323 D llama_kv_cache_unified: layer   0: dev = CUDA0
0.01.110.333 D llama_kv_cache_unified: layer   1: dev = CUDA0
0.01.110.334 D llama_kv_cache_unified: layer   2: dev = CUDA0
0.01.110.334 D llama_kv_cache_unified: layer   3: dev = CUDA0
0.01.110.335 D llama_kv_cache_unified: layer   4: dev = CUDA0
0.01.110.336 D llama_kv_cache_unified: layer   5: dev = CUDA0
0.01.110.337 D llama_kv_cache_unified: layer   6: dev = CUDA0
0.01.110.338 D llama_kv_cache_unified: layer   7: dev = CUDA0
0.01.110.339 D llama_kv_cache_unified: layer   8: dev = CUDA0
0.01.110.339 D llama_kv_cache_unified: layer   9: dev = CUDA0
0.01.110.340 D llama_kv_cache_unified: layer  10: dev = CUDA0
0.01.110.340 D llama_kv_cache_unified: layer  11: dev = CUDA0
0.01.110.341 D llama_kv_cache_unified: layer  12: dev = CUDA0
0.01.110.341 D llama_kv_cache_unified: layer  13: dev = CUDA0
0.01.110.342 D llama_kv_cache_unified: layer  14: dev = CUDA0
0.01.110.343 D llama_kv_cache_unified: layer  15: dev = CUDA0
0.01.110.343 D llama_kv_cache_unified: layer  16: dev = CUDA0
0.01.110.344 D llama_kv_cache_unified: layer  17: dev = CUDA0
0.01.110.344 D llama_kv_cache_unified: layer  18: dev = CUDA0
0.01.110.345 D llama_kv_cache_unified: layer  19: dev = CUDA0
0.01.110.345 D llama_kv_cache_unified: layer  20: dev = CUDA0
0.01.110.346 D llama_kv_cache_unified: layer  21: dev = CUDA0
0.01.110.346 D llama_kv_cache_unified: layer  22: dev = CUDA0
0.01.110.346 D llama_kv_cache_unified: layer  23: dev = CUDA0
0.01.110.347 D llama_kv_cache_unified: layer  24: dev = CUDA0
0.01.110.347 D llama_kv_cache_unified: layer  25: dev = CUDA0
0.01.110.348 D llama_kv_cache_unified: layer  26: dev = CUDA0
0.01.110.348 D llama_kv_cache_unified: layer  27: dev = CUDA0
0.01.110.491 I llama_kv_cache_unified:      CUDA0 KV buffer size =   224,00 MiB
0.01.111.096 I llama_kv_cache_unified: size =  224,00 MiB (  4096 cells,  28 layers,  1 seqs), K (f16):  112,00 MiB, V (f16):  112,00 MiB
0.01.111.098 D llama_context: enumerating backends
0.01.111.104 D llama_context: backend_ptrs.size() = 2
0.01.111.105 D llama_context: max_nodes = 65536
0.01.124.903 D llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
0.01.124.906 D llama_context: reserving graph for n_tokens = 512, n_seqs = 1
0.01.133.128 D llama_context: reserving graph for n_tokens = 1, n_seqs = 1
0.01.133.478 D llama_context: reserving graph for n_tokens = 512, n_seqs = 1
0.01.133.801 I llama_context:      CUDA0 compute buffer size =   304,00 MiB
0.01.133.803 I llama_context:  CUDA_Host compute buffer size =    15,01 MiB
0.01.133.803 I llama_context: graph nodes  = 1098
0.01.133.804 I llama_context: graph splits = 2
0.01.133.806 D clear_adapter_lora: call
0.01.133.808 I common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
0.01.133.808 W common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.01.133.809 D set_warmup: value = 1
0.01.206.041 D set_warmup: value = 0
0.01.212.762 I main: llama threadpool init, n_threads = 16
0.01.212.770 D attach_threadpool: call
0.01.212.771 I main: chat template is available, enabling conversation mode (disable it with -no-cnv)
0.01.212.959 I main: chat template example:
You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
0.01.212.961 I 
0.01.213.012 I system_info: n_threads = 16 (n_threads_batch = 16) / 32 | CUDA : ARCHS = 750 | NO_VMM = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
0.01.213.012 I 
0.01.213.014 D n_ctx: 4096, add_bos: 1
0.01.213.015 D tokenize the prompt
0.01.213.025 D prompt: ""
0.01.213.027 D tokens: [ '<beginofsentence>':151646 ]
0.01.213.028 D recalculate the cached logits (check): embd_inp.size() 1, n_matching_session_tokens 0, embd_inp.size() 1, session_tokens.size() 0
0.01.213.029 I main: interactive mode on.
0.01.213.083 I sampler seed: 3164957287
0.01.213.089 I sampler params: 
	repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
	dry_multiplier = 0,000, dry_base = 1,750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0,950, min_p = 0,050, xtc_probability = 0,000, xtc_threshold = 0,100, typical_p = 1,000, top_n_sigma = -1,000, temp = 0,800
	mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
0.01.213.092 I sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
0.01.213.093 I generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
0.01.213.093 I 
0.01.213.093 I == Running in interactive mode. ==
0.01.213.093 I  - Press Ctrl+C to interject at any time.
0.01.213.094 I  - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
0.01.213.094 I  - Not using system message. To change it, set a different value via -sys PROMPT
0.01.213.094 I 
0.01.213.096 D embd_inp.size(): 1, n_consumed: 0
0.01.213.100 D waiting for user input
> Hi there
0.03.678.678 D buffer: 'Hi there'
0.03.678.777 D formatted: '<|User|>Hi there<|Assistant|>'
0.03.679.396 D input tokens: [ '<User>':151644, 'Hi':13048, ' there':1052, '<Assistant>':151645 ]
0.03.679.401 D n_remain: -5
0.03.679.416 D eval: [ '<beginofsentence>':151646 ]
0.03.685.685 D n_past = 1
0.03.685.691 D embd_inp.size(): 5, n_consumed: 1
0.03.685.701 D eval: [ '<User>':151644, 'Hi':13048, ' there':1052, '<Assistant>':151645 ]
0.03.705.495 D n_past = 5
0.03.734.162 D n_remain: -6
<think>0.03.734.167 D eval: [ '<think>':151648 ]
0.03.736.514 D n_past = 6
0.03.753.503 D n_remain: -7
0.03.753.507 D eval: [ '':271 ]
0.03.755.861 D n_past = 7
0.03.772.840 D n_remain: -8
</think>0.03.772.844 D eval: [ '</think>':151649 ]
0.03.775.183 D n_past = 8
0.03.791.895 D n_remain: -9
0.03.791.898 D eval: [ '':271 ]
0.03.794.237 D n_past = 9
0.03.810.164 D n_remain: -10
Hello0.03.810.168 D eval: [ 'Hello':9707 ]
0.03.812.501 D n_past = 10
0.03.828.176 D n_remain: -11
!0.03.828.179 D eval: [ '!':0 ]
0.03.830.510 D n_past = 11
0.03.845.830 D n_remain: -12
 How0.03.845.833 D eval: [ ' How':2585 ]
0.03.848.178 D n_past = 12
0.03.863.242 D n_remain: -13
 can0.03.863.246 D eval: [ ' can':646 ]
0.03.865.588 D n_past = 13
0.03.880.445 D n_remain: -14
 I0.03.880.448 D eval: [ ' I':358 ]
0.03.882.780 D n_past = 14
0.03.897.428 D n_remain: -15
 assist0.03.897.432 D eval: [ ' assist':7789 ]
0.03.899.772 D n_past = 15
0.03.914.247 D n_remain: -16
 you0.03.914.251 D eval: [ ' you':498 ]
0.03.916.592 D n_past = 16
0.03.931.012 D n_remain: -17
 today0.03.931.016 D eval: [ ' today':3351 ]
0.03.933.366 D n_past = 17
0.03.947.784 D n_remain: -18
?0.03.947.788 D eval: [ '?':30 ]
0.03.950.125 D n_past = 18
0.03.964.557 D n_remain: -19
 �0.03.964.561 D eval: [ ' ':26525 ]
0.03.966.898 D n_past = 19
0.03.981.322 D n_remain: -20
�0.03.981.326 D eval: [ '':232 ]
0.03.983.664 D n_past = 20
0.03.997.941 D n_remain: -21
0.03.997.944 D found an EOG token
0.03.997.982 D formatted: '<|Assistant|><think>
</think>
Hello! How can I assist you today? 😊<|end▁of▁sentence|>'
0.03.997.983 D waiting for user input
> 
0.05.157.487 I llama_perf_sampler_print:    sampling time =       0,74 ms /    20 runs   (    0,04 ms per token, 27173,91 tokens per second)
0.05.157.492 I llama_perf_context_print:        load time =    1075,25 ms
0.05.157.493 I llama_perf_context_print: prompt eval time =      53,68 ms /     5 tokens (   10,74 ms per token,    93,14 tokens per second)
0.05.157.494 I llama_perf_context_print:        eval time =     261,94 ms /    15 runs   (   17,46 ms per token,    57,27 tokens per second)
0.05.157.494 I llama_perf_context_print:       total time =    3951,45 ms /    20 tokens
Interrupted by user

Can you post your output on latest master and add the -lv 1 flag at the end?
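
(For reference, a minimal invocation with that flag might look like the following; the binary and model paths are taken from the report above, so adjust them to your setup.)

./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 99 -lv 1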

@ggerganov
Member

I think the patch that you are using to disable VMM is not correct. Instead, you should build with cmake -DGGML_CUDA_NO_VMM=ON ... like in my command.
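
(A sketch of that configure/build step, assuming a MUSA build via the GGML_MUSA CMake option and reusing the build directory name that appears later in this thread; treat it as an example rather than the exact command referenced above:)

# configure with VMM disabled through the CMake option instead of patching CMakeLists.txt
cmake -B build-cuda-no-vmm -DGGML_MUSA=ON -DGGML_CUDA_NO_VMM=ON
# build the tools, e.g. llama-cli
cmake --build build-cuda-no-vmm --config Release -j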

@yeahdongcn
Collaborator Author
yeahdongcn commented May 26, 2025

Can you post your output on latest master and add the -lv 1 flag at the end?

Please see the logs (updated with export LLAMA_TRACE=1) below:

root@f7cd9f1a2456:/ws# ./build-cuda-no-vmm/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 99 -lv 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
  Device 0: MTT S80, compute capability 2.1, VMM: no
register_backend: registered backend MUSA (1 devices)
register_device: registered device MUSA0 (MTT S80)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i5-12400)
load_backend: failed to find ggml_backend_init in /ws/build-cuda-no-vmm/bin/libggml-musa.so
load_backend: failed to find ggml_backend_init in /ws/build-cuda-no-vmm/bin/libggml-cpu.so
build: 5490 (fef693dc6) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15723 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor split  0:                    output.weight q6_K     [  3584, 152064,     1,     1 ]   426.36 MiB
llama_model_loader: - tensor split  0:               output_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:                token_embd.weight q4_K     [  3584, 152064,     1,     1 ]   292.36 MiB
llama_model_loader: - tensor split  0:                blk.0.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.0.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.0.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.0.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.0.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.0.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.0.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.0.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.0.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.0.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.0.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.0.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.1.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.1.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.1.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.1.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.1.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.1.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.1.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.1.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.1.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.1.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.1.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.1.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.2.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.2.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.2.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.2.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.2.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.2.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.2.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.2.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.2.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.2.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.2.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.2.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.3.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.3.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.3.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.3.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.3.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.3.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.3.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.3.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.3.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.3.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.3.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.3.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.4.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.4.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.4.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.4.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.4.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.4.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.4.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.4.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.4.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.4.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.4.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.4.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.5.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.5.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.5.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.5.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.5.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.5.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.5.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.5.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.5.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.5.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.5.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.5.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.6.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.6.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.6.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.6.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.6.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.6.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.6.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.6.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.6.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.6.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.6.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.6.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.7.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.7.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.7.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.7.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.7.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.7.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.7.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.7.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.7.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.7.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.7.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.7.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.8.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.8.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.8.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.8.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.8.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.8.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.8.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.8.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:            blk.8.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:            blk.8.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.8.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.8.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:                blk.9.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.9.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.9.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:         blk.9.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.9.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.9.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:                blk.9.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:              blk.9.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:            blk.9.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.9.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:            blk.9.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:              blk.9.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.10.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.10.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.10.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.10.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.10.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.10.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.10.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.10.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.10.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.10.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.10.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.10.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.11.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.11.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.11.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.11.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.11.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.11.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.11.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.11.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.11.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.11.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.11.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.11.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.12.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.12.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.12.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.12.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.12.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.12.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.12.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.12.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.12.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.12.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.12.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.12.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.13.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.13.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.13.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.13.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.13.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.13.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.13.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.13.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.13.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.13.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.13.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.13.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.14.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.14.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.14.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.14.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.14.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.14.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.14.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.14.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.14.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.14.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.14.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.14.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.15.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.15.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.15.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.15.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.15.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.15.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.15.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.15.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.15.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.15.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.15.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.15.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.16.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.16.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.16.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.16.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.16.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.16.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.16.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.16.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.16.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.16.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.16.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.16.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.17.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.17.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.17.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.17.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.17.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.17.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.17.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.17.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.17.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.17.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.17.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.17.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.18.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.18.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.18.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.18.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.18.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.18.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.18.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.18.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.18.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.18.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.18.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.18.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.19.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.19.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.19.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.19.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.19.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.19.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.19.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.19.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.19.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.19.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.19.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.19.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.20.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.20.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.20.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.20.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.20.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.20.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.20.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.20.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.20.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.20.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.20.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.20.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.21.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.21.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.21.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.21.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.21.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.21.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.21.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.21.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.21.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.21.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.21.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.21.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.22.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.22.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.22.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.22.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.22.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.22.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.22.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.22.attn_v.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:           blk.22.ffn_down.weight q4_K     [ 18944,  3584,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.22.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.22.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.22.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.23.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.23.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.23.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.23.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.23.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.23.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.23.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.23.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.23.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.23.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.23.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.23.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.24.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.24.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.24.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.24.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.24.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.24.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.24.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.24.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.24.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.24.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.24.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.24.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.25.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.25.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.25.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.25.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.25.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.25.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.25.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.25.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.25.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.25.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.25.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.25.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.26.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.26.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.26.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.26.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.26.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.26.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.26.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.26.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.26.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.26.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.26.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.26.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:               blk.27.attn_k.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.27.attn_k.weight q4_K     [  3584,   512,     1,     1 ]     0.98 MiB
llama_model_loader: - tensor split  0:          blk.27.attn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:        blk.27.attn_output.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.27.attn_q.bias f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.27.attn_q.weight q4_K     [  3584,  3584,     1,     1 ]     6.89 MiB
llama_model_loader: - tensor split  0:               blk.27.attn_v.bias f32      [   512,     1,     1,     1 ]     0.00 MiB
llama_model_loader: - tensor split  0:             blk.27.attn_v.weight q6_K     [  3584,   512,     1,     1 ]     1.44 MiB
llama_model_loader: - tensor split  0:           blk.27.ffn_down.weight q6_K     [ 18944,  3584,     1,     1 ]    53.12 MiB
llama_model_loader: - tensor split  0:           blk.27.ffn_gate.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: - tensor split  0:           blk.27.ffn_norm.weight f32      [  3584,     1,     1,     1 ]     0.01 MiB
llama_model_loader: - tensor split  0:             blk.27.ffn_up.weight q4_K     [  3584, 18944,     1,     1 ]    36.42 MiB
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv   3:                       general.organization str              = Deepseek Ai
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv   5:                         general.size_label str              = 7B
llama_model_loader: - kv   6:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   7:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv   8:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   9:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv  10:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv  11:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv  12:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  13:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = deepseek-r1-qwen
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151654
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q4_K:  169 tensors
llama_model_loader: - type q6_K:   29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.36 GiB (4.91 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = DeepSeek R1 Distill Qwen 7B
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151654 '<|vision_pad|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device MUSA0, is_swa = 0
load_tensors: layer   1 assigned to device MUSA0, is_swa = 0
load_tensors: layer   2 assigned to device MUSA0, is_swa = 0
load_tensors: layer   3 assigned to device MUSA0, is_swa = 0
load_tensors: layer   4 assigned to device MUSA0, is_swa = 0
load_tensors: layer   5 assigned to device MUSA0, is_swa = 0
load_tensors: layer   6 assigned to device MUSA0, is_swa = 0
load_tensors: layer   7 assigned to device MUSA0, is_swa = 0
load_tensors: layer   8 assigned to device MUSA0, is_swa = 0
load_tensors: layer   9 assigned to device MUSA0, is_swa = 0
load_tensors: layer  10 assigned to device MUSA0, is_swa = 0
load_tensors: layer  11 assigned to device MUSA0, is_swa = 0
load_tensors: layer  12 assigned to device MUSA0, is_swa = 0
load_tensors: layer  13 assigned to device MUSA0, is_swa = 0
load_tensors: layer  14 assigned to device MUSA0, is_swa = 0
load_tensors: layer  15 assigned to device MUSA0, is_swa = 0
load_tensors: layer  16 assigned to device MUSA0, is_swa = 0
load_tensors: layer  17 assigned to device MUSA0, is_swa = 0
load_tensors: layer  18 assigned to device MUSA0, is_swa = 0
load_tensors: layer  19 assigned to device MUSA0, is_swa = 0
load_tensors: layer  20 assigned to device MUSA0, is_swa = 0
load_tensors: layer  21 assigned to device MUSA0, is_swa = 0
load_tensors: layer  22 assigned to device MUSA0, is_swa = 0
load_tensors: layer  23 assigned to device MUSA0, is_swa = 0
load_tensors: layer  24 assigned to device MUSA0, is_swa = 0
load_tensors: layer  25 assigned to device MUSA0, is_swa = 0
load_tensors: layer  26 assigned to device MUSA0, is_swa = 0
load_tensors: layer  27 assigned to device MUSA0, is_swa = 0
load_tensors: layer  28 assigned to device MUSA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type MUSA_Host, using CPU instead
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   292.36 MiB
load_tensors:        MUSA0 model buffer size =  4168.09 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 10000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  MUSA_Host  output buffer size =     0.58 MiB
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: layer   0: dev = MUSA0
llama_kv_cache_unified: layer   1: dev = MUSA0
llama_kv_cache_unified: layer   2: dev = MUSA0
llama_kv_cache_unified: layer   3: dev = MUSA0
llama_kv_cache_unified: layer   4: dev = MUSA0
llama_kv_cache_unified: layer   5: dev = MUSA0
llama_kv_cache_unified: layer   6: dev = MUSA0
llama_kv_cache_unified: layer   7: dev = MUSA0
llama_kv_cache_unified: layer   8: dev = MUSA0
llama_kv_cache_unified: layer   9: dev = MUSA0
llama_kv_cache_unified: layer  10: dev = MUSA0
llama_kv_cache_unified: layer  11: dev = MUSA0
llama_kv_cache_unified: layer  12: dev = MUSA0
llama_kv_cache_unified: layer  13: dev = MUSA0
llama_kv_cache_unified: layer  14: dev = MUSA0
llama_kv_cache_unified: layer  15: dev = MUSA0
llama_kv_cache_unified: layer  16: dev = MUSA0
llama_kv_cache_unified: layer  17: dev = MUSA0
llama_kv_cache_unified: layer  18: dev = MUSA0
llama_kv_cache_unified: layer  19: dev = MUSA0
llama_kv_cache_unified: layer  20: dev = MUSA0
llama_kv_cache_unified: layer  21: dev = MUSA0
llama_kv_cache_unified: layer  22: dev = MUSA0
llama_kv_cache_unified: layer  23: dev = MUSA0
llama_kv_cache_unified: layer  24: dev = MUSA0
llama_kv_cache_unified: layer  25: dev = MUSA0
llama_kv_cache_unified: layer  26: dev = MUSA0
llama_kv_cache_unified: layer  27: dev = MUSA0
llama_kv_cache_unified:      MUSA0 KV buffer size =   224.00 MiB
llama_kv_cache_unified: size =  224.00 MiB (  4096 cells,  28 layers,  1 seqs), K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
ggml_gallocr_reserve_n: reallocating MUSA0 buffer from size 0.00 MiB to 304.00 MiB
ggml_gallocr_reserve_n: reallocating MUSA_Host buffer from size 0.00 MiB to 15.01 MiB
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:      MUSA0 compute buffer size =   304.00 MiB
llama_context:  MUSA_Host compute buffer size =    15.01 MiB
llama_context: graph nodes  = 1098
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
main: llama threadpool init, n_threads = 6
attach_threadpool: call
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

n_ctx: 4096, add_bos: 1
tokenize the prompt
prompt: ""
tokens: [ '<beginofsentence>':151646 ]
recalculate the cached logits (check): embd_inp.size() 1, n_matching_session_tokens 0, embd_inp.size() 1, session_tokens.size() 0
main: interactive mode on.
sampler seed: 612160863
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.
 - Not using system message. To change it, set a different value via -sys PROMPT

embd_inp.size(): 1, n_consumed: 0
waiting for user input

> Hi there
buffer: 'Hi there'
formatted: '<|User|>Hi there<|Assistant|>'
input tokens: [ '<User>':151644, 'Hi':13048, ' there':1052, '<Assistant>':151645 ]
n_remain: -5
eval: [ '<beginofsentence>':151646 ]
n_past = 1
embd_inp.size(): 5, n_consumed: 1
eval: [ '<User>':151644, 'Hi':13048, ' there':1052, '<Assistant>':151645 ]
n_past = 5
n_remain: -6
자eval: [ '':25715 ]
n_past = 6
n_remain: -7
UINTeval: [ 'UINT':25712 ]
n_past = 7
n_remain: -8
UINTeval: [ 'UINT':25712 ]
n_past = 8
n_remain: -9
 insightseval: [ ' insights':25709 ]
n_past = 9
n_remain: -10
 insightseval: [ ' insights':25709 ]
n_past = 10
n_remain: -11
UINTeval: [ 'UINT':25712 ]
n_past = 11
n_remain: -12
Tooltipeval: [ 'Tooltip':25717 ]
n_past = 12
n_remain: -13
UINTeval: [ 'UINT':25712 ]
n_past = 13
n_remain: -14
 insightseval: [ ' insights':25709 ]
n_past = 14
n_remain: -15
UINTeval: [ 'UINT':25712 ]
n_past = 15
n_remain: -16
 insightseval: [ ' insights':25709 ]
n_past = 16
n_remain: -17
 insightseval: [ ' insights':25709 ]
n_past = 17
n_remain: -18
 insightseval: [ ' insights':25709 ]
n_past = 18
n_remain: -19
UINTeval: [ 'UINT':25712 ]
n_past = 19
n_remain: -20
UINTeval: [ 'UINT':25712 ]
n_past = 20
n_remain: -21
UINTeval: [ 'UINT':25712 ]
n_past = 21
n_remain: -22
UINTeval: [ 'UINT':25712 ]
n_past = 22
n_remain: -23
UINTeval: [ 'UINT':25712 ]
n_past = 23
n_remain: -24
 insightseval: [ ' insights':25709 ]
n_past = 24
n_remain: -25
UINTwaiting for user input

>

@yeahdongcn
Collaborator Author

By comparing the two log files, I noticed only one explicit difference:

[image attachment showing the difference between the two logs]
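
For reference, a minimal way to isolate such a difference, assuming the two runs were captured to vmm_on.log and vmm_off.log (hypothetical file names), is to drop the bulky per-tensor dump and diff what remains:

# vmm_on.log / vmm_off.log are hypothetical captures of the VMM-on and VMM-off runs.
# Strip the per-tensor loader dump, then diff the remaining loader/context/system_info lines.
grep -v 'tensor split' vmm_on.log  > on.trimmed
grep -v 'tensor split' vmm_off.log > off.trimmed
diff on.trimmed off.trimmed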
