Eval bug: Output garbled on DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from unsloth using musa backend with VMM off #13788
Comments
Hi @ggerganov Could you please take a look and let me know if you have any insights or suggestions for debugging? Thanks.
On latest:
cmake -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON ..
Downloaded the model from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF/tree/main
Command: ./bin/llama-cli -m DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 99 -lv 1
Can you post your output on latest?
I think the patch that you are using to disable VMM is not correct. Instead, you should build with GGML_CUDA_NO_VMM.
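A minimal sketch of the suggested build, assuming the standard llama.cpp CMake flow (the flags and the build directory name are taken from the commands quoted elsewhere in this thread; everything else is illustrative):

# configure with the CUDA/MUSA backend and VMM disabled via the CMake option
cmake -B build-cuda-no-vmm -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON
cmake --build build-cuda-no-vmm -j
# then run the same repro command against the new binary
./build-cuda-no-vmm/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 99 -lv 1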
Please see the logs below (updated with the suggested build):
root@f7cd9f1a2456:/ws# ./build-cuda-no-vmm/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 99 -lv 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
Device 0: MTT S80, compute capability 2.1, VMM: no
register_backend: registered backend MUSA (1 devices)
register_device: registered device MUSA0 (MTT S80)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (12th Gen Intel(R) Core(TM) i5-12400)
load_backend: failed to find ggml_backend_init in /ws/build-cuda-no-vmm/bin/libggml-musa.so
load_backend: failed to find ggml_backend_init in /ws/build-cuda-no-vmm/bin/libggml-cpu.so
build: 5490 (fef693dc6) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu (debug)
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S80) - 15723 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor split 0: output.weight q6_K [ 3584, 152064, 1, 1 ] 426.36 MiB
llama_model_loader: - tensor split 0: output_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: token_embd.weight q4_K [ 3584, 152064, 1, 1 ] 292.36 MiB
llama_model_loader: - tensor split 0: blk.0.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.0.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.0.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.0.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.0.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.0.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.0.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.0.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.0.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.0.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.0.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.0.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.1.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.1.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.1.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.1.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.1.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.1.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.1.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.1.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.1.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.1.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.1.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.1.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.2.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.2.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.2.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.2.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.2.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.2.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.2.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.2.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.2.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.2.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.2.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.2.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.3.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.3.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.3.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.3.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.3.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.3.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.3.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.3.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.3.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.3.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.3.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.3.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.4.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.4.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.4.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.4.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.4.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.4.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.4.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.4.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.4.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.4.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.4.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.4.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.5.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.5.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.5.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.5.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.5.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.5.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.5.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.5.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.5.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.5.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.5.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.5.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.6.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.6.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.6.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.6.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.6.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.6.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.6.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.6.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.6.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.6.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.6.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.6.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.7.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.7.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.7.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.7.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.7.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.7.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.7.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.7.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.7.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.7.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.7.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.7.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.8.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.8.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.8.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.8.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.8.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.8.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.8.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.8.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.8.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.8.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.8.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.8.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.9.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.9.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.9.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.9.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.9.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.9.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.9.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.9.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.9.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.9.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.9.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.9.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.10.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.10.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.10.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.10.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.10.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.10.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.10.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.10.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.10.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.10.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.10.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.10.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.11.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.11.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.11.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.11.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.11.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.11.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.11.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.11.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.11.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.11.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.11.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.11.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.12.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.12.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.12.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.12.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.12.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.12.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.12.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.12.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.12.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.12.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.12.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.12.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.13.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.13.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.13.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.13.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.13.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.13.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.13.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.13.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.13.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.13.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.13.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.13.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.14.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.14.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.14.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.14.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.14.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.14.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.14.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.14.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.14.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.14.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.14.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.14.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.15.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.15.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.15.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.15.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.15.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.15.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.15.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.15.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.15.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.15.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.15.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.15.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.16.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.16.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.16.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.16.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.16.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.16.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.16.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.16.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.16.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.16.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.16.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.16.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.17.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.17.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.17.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.17.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.17.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.17.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.17.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.17.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.17.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.17.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.17.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.17.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.18.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.18.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.18.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.18.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.18.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.18.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.18.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.18.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.18.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.18.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.18.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.18.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.19.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.19.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.19.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.19.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.19.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.19.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.19.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.19.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.19.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.19.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.19.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.19.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.20.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.20.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.20.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.20.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.20.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.20.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.20.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.20.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.20.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.20.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.20.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.20.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.21.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.21.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.21.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.21.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.21.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.21.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.21.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.21.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.21.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.21.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.21.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.21.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.22.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.22.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.22.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.22.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.22.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.22.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.22.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.22.attn_v.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.22.ffn_down.weight q4_K [ 18944, 3584, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.22.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.22.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.22.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.23.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.23.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.23.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.23.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.23.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.23.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.23.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.23.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.23.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.23.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.23.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.23.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.24.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.24.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.24.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.24.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.24.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.24.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.24.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.24.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.24.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.24.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.24.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.24.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.25.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.25.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.25.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.25.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.25.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.25.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.25.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.25.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.25.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.25.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.25.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.25.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.26.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.26.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.26.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.26.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.26.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.26.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.26.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.26.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.26.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.26.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.26.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.26.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.27.attn_k.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.27.attn_k.weight q4_K [ 3584, 512, 1, 1 ] 0.98 MiB
llama_model_loader: - tensor split 0: blk.27.attn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.27.attn_output.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.27.attn_q.bias f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.27.attn_q.weight q4_K [ 3584, 3584, 1, 1 ] 6.89 MiB
llama_model_loader: - tensor split 0: blk.27.attn_v.bias f32 [ 512, 1, 1, 1 ] 0.00 MiB
llama_model_loader: - tensor split 0: blk.27.attn_v.weight q6_K [ 3584, 512, 1, 1 ] 1.44 MiB
llama_model_loader: - tensor split 0: blk.27.ffn_down.weight q6_K [ 18944, 3584, 1, 1 ] 53.12 MiB
llama_model_loader: - tensor split 0: blk.27.ffn_gate.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: - tensor split 0: blk.27.ffn_norm.weight f32 [ 3584, 1, 1, 1 ] 0.01 MiB
llama_model_loader: - tensor split 0: blk.27.ffn_up.weight q4_K [ 3584, 18944, 1, 1 ] 36.42 MiB
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 Distill Qwen 7B
llama_model_loader: - kv 3: general.organization str = Deepseek Ai
llama_model_loader: - kv 4: general.basename str = DeepSeek-R1-Distill-Qwen
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: qwen2.block_count u32 = 28
llama_model_loader: - kv 7: qwen2.context_length u32 = 131072
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 15: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 18: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - kv 26: general.file_type u32 = 15
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_K: 169 tensors
llama_model_loader: - type q6_K: 29 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.36 GiB (4.91 BPW)
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151645 '<|Assistant|>' is not marked as EOG
load: control token: 151644 '<|User|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151646 '<|begin▁of▁sentence|>' is not marked as EOG
load: control token: 151643 '<|end▁of▁sentence|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151647 '<|EOT|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = DeepSeek R1 Distill Qwen 7B
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token = 151643 '<|end▁of▁sentence|>'
print_info: EOT token = 151643 '<|end▁of▁sentence|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|end▁of▁sentence|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device MUSA0, is_swa = 0
load_tensors: layer 1 assigned to device MUSA0, is_swa = 0
load_tensors: layer 2 assigned to device MUSA0, is_swa = 0
load_tensors: layer 3 assigned to device MUSA0, is_swa = 0
load_tensors: layer 4 assigned to device MUSA0, is_swa = 0
load_tensors: layer 5 assigned to device MUSA0, is_swa = 0
load_tensors: layer 6 assigned to device MUSA0, is_swa = 0
load_tensors: layer 7 assigned to device MUSA0, is_swa = 0
load_tensors: layer 8 assigned to device MUSA0, is_swa = 0
load_tensors: layer 9 assigned to device MUSA0, is_swa = 0
load_tensors: layer 10 assigned to device MUSA0, is_swa = 0
load_tensors: layer 11 assigned to device MUSA0, is_swa = 0
load_tensors: layer 12 assigned to device MUSA0, is_swa = 0
load_tensors: layer 13 assigned to device MUSA0, is_swa = 0
load_tensors: layer 14 assigned to device MUSA0, is_swa = 0
load_tensors: layer 15 assigned to device MUSA0, is_swa = 0
load_tensors: layer 16 assigned to device MUSA0, is_swa = 0
load_tensors: layer 17 assigned to device MUSA0, is_swa = 0
load_tensors: layer 18 assigned to device MUSA0, is_swa = 0
load_tensors: layer 19 assigned to device MUSA0, is_swa = 0
load_tensors: layer 20 assigned to device MUSA0, is_swa = 0
load_tensors: layer 21 assigned to device MUSA0, is_swa = 0
load_tensors: layer 22 assigned to device MUSA0, is_swa = 0
load_tensors: layer 23 assigned to device MUSA0, is_swa = 0
load_tensors: layer 24 assigned to device MUSA0, is_swa = 0
load_tensors: layer 25 assigned to device MUSA0, is_swa = 0
load_tensors: layer 26 assigned to device MUSA0, is_swa = 0
load_tensors: layer 27 assigned to device MUSA0, is_swa = 0
load_tensors: layer 28 assigned to device MUSA0, is_swa = 0
load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type MUSA_Host, using CPU instead
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CPU_Mapped model buffer size = 292.36 MiB
load_tensors: MUSA0 model buffer size = 4168.09 MiB
....................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: MUSA_Host output buffer size = 0.58 MiB
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: layer 0: dev = MUSA0
llama_kv_cache_unified: layer 1: dev = MUSA0
llama_kv_cache_unified: layer 2: dev = MUSA0
llama_kv_cache_unified: layer 3: dev = MUSA0
llama_kv_cache_unified: layer 4: dev = MUSA0
llama_kv_cache_unified: layer 5: dev = MUSA0
llama_kv_cache_unified: layer 6: dev = MUSA0
llama_kv_cache_unified: layer 7: dev = MUSA0
llama_kv_cache_unified: layer 8: dev = MUSA0
llama_kv_cache_unified: layer 9: dev = MUSA0
llama_kv_cache_unified: layer 10: dev = MUSA0
llama_kv_cache_unified: layer 11: dev = MUSA0
llama_kv_cache_unified: layer 12: dev = MUSA0
llama_kv_cache_unified: layer 13: dev = MUSA0
llama_kv_cache_unified: layer 14: dev = MUSA0
llama_kv_cache_unified: layer 15: dev = MUSA0
llama_kv_cache_unified: layer 16: dev = MUSA0
llama_kv_cache_unified: layer 17: dev = MUSA0
llama_kv_cache_unified: layer 18: dev = MUSA0
llama_kv_cache_unified: layer 19: dev = MUSA0
llama_kv_cache_unified: layer 20: dev = MUSA0
llama_kv_cache_unified: layer 21: dev = MUSA0
llama_kv_cache_unified: layer 22: dev = MUSA0
llama_kv_cache_unified: layer 23: dev = MUSA0
llama_kv_cache_unified: layer 24: dev = MUSA0
llama_kv_cache_unified: layer 25: dev = MUSA0
llama_kv_cache_unified: layer 26: dev = MUSA0
llama_kv_cache_unified: layer 27: dev = MUSA0
llama_kv_cache_unified: MUSA0 KV buffer size = 224.00 MiB
llama_kv_cache_unified: size = 224.00 MiB ( 4096 cells, 28 layers, 1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
ggml_gallocr_reserve_n: reallocating MUSA0 buffer from size 0.00 MiB to 304.00 MiB
ggml_gallocr_reserve_n: reallocating MUSA_Host buffer from size 0.00 MiB to 15.01 MiB
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: MUSA0 compute buffer size = 304.00 MiB
llama_context: MUSA_Host compute buffer size = 15.01 MiB
llama_context: graph nodes = 1098
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
set_warmup: value = 0
main: llama threadpool init, n_threads = 6
attach_threadpool: call
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
You are a helpful assistant
<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>
system_info: n_threads = 6 (n_threads_batch = 6) / 12 | MUSA : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
n_ctx: 4096, add_bos: 1
tokenize the prompt
prompt: ""
tokens: [ '<|begin▁of▁sentence|>':151646 ]
recalculate the cached logits (check): embd_inp.size() 1, n_matching_session_tokens 0, embd_inp.size() 1, session_tokens.size() 0
main: interactive mode on.
sampler seed: 612160863
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
embd_inp.size(): 1, n_consumed: 0
waiting for user input
> Hi there
buffer: 'Hi there'
formatted: '<|User|>Hi there<|Assistant|>'
input tokens: [ '<|User|>':151644, 'Hi':13048, ' there':1052, '<|Assistant|>':151645 ]
n_remain: -5
eval: [ '<|begin▁of▁sentence|>':151646 ]
n_past = 1
embd_inp.size(): 5, n_consumed: 1
eval: [ '<|User|>':151644, 'Hi':13048, ' there':1052, '<|Assistant|>':151645 ]
n_past = 5
n_remain: -6
자eval: [ '':25715 ]
n_past = 6
n_remain: -7
UINTeval: [ 'UINT':25712 ]
n_past = 7
n_remain: -8
UINTeval: [ 'UINT':25712 ]
n_past = 8
n_remain: -9
insightseval: [ ' insights':25709 ]
n_past = 9
n_remain: -10
insightseval: [ ' insights':25709 ]
n_past = 10
n_remain: -11
UINTeval: [ 'UINT':25712 ]
n_past = 11
n_remain: -12
Tooltipeval: [ 'Tooltip':25717 ]
n_past = 12
n_remain: -13
UINTeval: [ 'UINT':25712 ]
n_past = 13
n_remain: -14
insightseval: [ ' insights':25709 ]
n_past = 14
n_remain: -15
UINTeval: [ 'UINT':25712 ]
n_past = 15
n_remain: -16
insightseval: [ ' insights':25709 ]
n_past = 16
n_remain: -17
insightseval: [ ' insights':25709 ]
n_past = 17
n_remain: -18
insightseval: [ ' insights':25709 ]
n_past = 18
n_remain: -19
UINTeval: [ 'UINT':25712 ]
n_past = 19
n_remain: -20
UINTeval: [ 'UINT':25712 ]
n_past = 20
n_remain: -21
UINTeval: [ 'UINT':25712 ]
n_past = 21
n_remain: -22
UINTeval: [ 'UINT':25712 ]
n_past = 22
n_remain: -23
UINTeval: [ 'UINT':25712 ]
n_past = 23
n_remain: -24
insightseval: [ ' insights':25709 ]
n_past = 24
n_remain: -25
UINTwaiting for user input
>
Name and Version
root@f7cd9f1a2456:/ws# ./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 MUSA devices:
Device 0: MTT S80, compute capability 2.1, VMM: no
version: 5488 (e121edc)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
MUSA
Hardware
12th Gen Intel(R) Core(TM) i5-12400 + MTT S80
Models
DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf from unsloth
Problem description & steps to reproduce
git reset 2d77d88e70d017cd82c3f1a4517e3102e2028ac4 --hard
With -fa, everything goes fine: DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf can generate tokens correctly. Other models (nvidia-llama-3_1-nemotron-nano-8b-v1-q4_k_m.gguf, qwen3_8b_q4_k_m.gguf) seem not to have this issue. Turning off KV cache offload with --no-kv-offload also makes the issue go away. I also noticed a pp (prompt processing) performance downgrade when the KV cache is offloaded and flash attention is disabled (the default).
First Bad Commit
2d77d88
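A minimal sketch of the two workarounds described above, assuming the model path and build directory used earlier in this thread (-fa and --no-kv-offload are the flags named in the report; the paths are illustrative):

# workaround 1: enable flash attention -- generation is reported correct with this flag
./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 99 -fa
# workaround 2: keep the KV cache off the GPU -- the garbled output also disappears
./build/bin/llama-cli -m /models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf -ngl 99 --no-kv-offload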
Relevant log output