Fix: Propagate flash attn to model loader by dthuerck · Pull Request #1424 · abetlen/llama-cpp-python · GitHub
Fix: Propagate flash attn to model loader #1424


Merged
merged 1 commit on May 3, 2024

Conversation

dthuerck (Contributor) commented May 3, 2024

I noticed that even after setting flash_attn to true in my model config file, llama.cpp kept reporting llama_new_context_with_model: flash_attn = 0. This super-small PR fixes that: it turns out the setting wasn't being passed on to the model loader.
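Conceptually, the change is just forwarding the setting when the model is constructed. A minimal sketch of the idea (illustrative only, not the actual diff; the settings dict and load_model helper are made up for this example):

from llama_cpp import Llama

def load_model(settings: dict) -> Llama:
    # Hypothetical loader: builds a Llama instance from a model config.
    return Llama(
        model_path=settings["model_path"],
        n_ctx=settings.get("n_ctx", 2048),
        n_gpu_layers=settings.get("n_gpu_layers", 0),
        # Before this fix, the flash_attn value from the config never reached
        # this constructor call, so llama.cpp always logged flash_attn = 0.
        flash_attn=settings.get("flash_attn", False),
    )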

abetlen (Owner) commented May 3, 2024

@dthuerck thank you!

abetlen merged commit 2138561 into abetlen:main on May 3, 2024
BadisG commented May 8, 2024

I installed the latest version of llama_cpp_python (0.2.70) with this command:

pip install llama-cpp-python \
  --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

But after using it in oobabooga's software (the llama_cpp_hf loader), I still get the flash_attn = 0 issue:

llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 8000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1728.00 MiB
llama_kv_cache_init:      CUDA1 KV buffer size =   832.00 MiB
llama_new_context_with_model: KV self size  = 2560.00 MiB, K (f16): 1280.00 MiB, V (f16): 1280.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.98 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
llama_new_context_with_model:      CUDA0 compute buffer size =   400.01 MiB
llama_new_context_with_model:      CUDA1 compute buffer size =   596.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    32.02 MiB
llama_new_context_with_model: graph nodes  = 1208
llama_new_context_with_model: graph splits = 3
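One way to isolate whether the flag is being dropped in llama-cpp-python itself or in the llama_cpp_hf wrapper is to load the model with llama-cpp-python directly and watch for the same log line. A small sketch (the model path is a placeholder):

from llama_cpp import Llama

llm = Llama(
    model_path="/path/to/model.gguf",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=-1,
    flash_attn=True,
    verbose=True,  # llama.cpp should log flash_attn = 1 if the flag took effect
)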
