docs: Update Llama docs · nivibilla/llama-cpp-python@6308f21 · GitHub


Commit 6308f21

committed
docs: Update Llama docs
1 parent f03a38e commit 6308f21

File tree

1 file changed: +15 -11 lines changed

llama_cpp/llama.py

Lines changed: 15 additions & 11 deletions
@@ -798,17 +798,21 @@ def __init__(
     vocab_only: Only load the vocabulary no weights.
     use_mmap: Use mmap if possible.
     use_mlock: Force the system to keep the model in RAM.
-    seed: Random seed. -1 for random.
-    n_ctx: Context size.
-    n_batch: Batch size for prompt processing (must be >= 32 to use BLAS)
-    n_threads: Number of threads to use. If None, the number of threads is automatically determined.
-    n_threads_batch: Number of threads to use for batch processing. If None, use n_threads.
-    rope_scaling_type: Type of rope scaling to use.
-    rope_freq_base: Base frequency for rope sampling.
-    rope_freq_scale: Scale factor for rope sampling.
-    mul_mat_q: if true, use experimental mul_mat_q kernels
-    f16_kv: Use half-precision for key/value cache.
-    logits_all: Return logits for all tokens, not just the last token.
+    seed: RNG seed, -1 for random
+    n_ctx: Text context, 0 = from model
+    n_batch: Prompt processing maximum batch size
+    n_threads: Number of threads to use for generation
+    n_threads_batch: Number of threads to use for batch processing
+    rope_scaling_type: RoPE scaling type, from `enum llama_rope_scaling_type`. ref: https://github.com/ggerganov/llama.cpp/pull/2054
+    rope_freq_base: RoPE base frequency, 0 = from model
+    rope_freq_scale: RoPE frequency scaling factor, 0 = from model
+    yarn_ext_factor: YaRN extrapolation mix factor, negative = from model
+    yarn_attn_factor: YaRN magnitude scaling factor
+    yarn_beta_fast: YaRN low correction dim
+    yarn_beta_slow: YaRN high correction dim
+    yarn_orig_ctx: YaRN original context size
+    f16_kv: Use fp16 for KV cache, fp32 otherwise
+    logits_all: Return logits for all tokens, not just the last token. Must be True for completion to return logprobs.
     embedding: Embedding mode only.
     last_n_tokens_size: Maximum number of tokens to keep in the last_n_tokens deque.
     lora_base: Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
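The updated docstring entries mirror keyword arguments of `Llama.__init__`. A minimal sketch of how they might be passed follows; the model path and concrete values are hypothetical and only illustrate the parameters documented above, they are not part of this commit.

    from llama_cpp import Llama

    # Hypothetical model path and values, chosen only to illustrate the
    # parameters documented in the diff above.
    llm = Llama(
        model_path="./models/model.gguf",  # assumption: any local GGUF model
        seed=1234,            # RNG seed, -1 for random
        n_ctx=0,              # text context, 0 = take from the model
        n_batch=512,          # prompt processing maximum batch size
        n_threads=8,          # threads used for generation
        n_threads_batch=8,    # threads used for batch processing
        rope_freq_base=0.0,   # 0 = take RoPE base frequency from the model
        rope_freq_scale=0.0,  # 0 = take RoPE frequency scaling factor from the model
        logits_all=True,      # must be True for completions to return logprobs
    )

    # With logits_all=True, a completion can request token logprobs.
    out = llm("Q: What is the capital of France? A:", max_tokens=16, logprobs=5)
    print(out["choices"][0]["logprobs"])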
