Hello,
PR #1424 was supposed to fix the flash_attn = 0 issue, but I'm still seeing it when loading a model with the llamacpp_HF loader on the latest version (0.2.75):
```
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 8000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1728.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 832.00 MiB
llama_new_context_with_model: KV self size = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model: CUDA0 compute buffer size = 400.01 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 596.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 32.02 MiB
llama_new_context_with_model: graph nodes = 1208
llama_new_context_with_model: graph splits = 3
```
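For reference, here is a minimal sketch of how flash attention is enabled when creating a context through llama-cpp-python (recent versions expose it as a `flash_attn` constructor flag; the model path and other values below are placeholders, not the webui's actual call). If the loader never forwards this flag, the context comes up with `flash_attn = 0` regardless of what the UI shows:

```python
# Minimal sketch, assuming llama-cpp-python >= 0.2.70, where the
# Llama constructor accepts a flash_attn flag. Path and sizes are
# placeholders for illustration only.
from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=-1,   # offload all layers to GPU
    flash_attn=True,   # without this, the startup log prints flash_attn = 0
)
```

When the flag reaches context creation, the corresponding startup line should read `flash_attn = 1` instead.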