```diff
- n_batch: Maximum number of prompt tokens to batch together when calling llama_eval.
  n_gpu_layers: Number of layers to offload to GPU (-ngl). If -1, all layers are offloaded.
- main_gpu: Main GPU to use.
- tensor_split: Optional list of floats to split the model across multiple GPUs. If None, the model is not split.
+ main_gpu: The GPU that is used for scratch and small tensors.
+ tensor_split: How split tensors should be distributed across GPUs. If None, the model is not split.
+ vocab_only: Only load the vocabulary no weights.
+ use_mmap: Use mmap if possible.
+ use_mlock: Force the system to keep the model in RAM.
+ seed: Random seed. -1 for random.
+ n_ctx: Context size.
+ n_batch: Batch size for prompt processing (must be >= 32 to use BLAS)
+ n_threads: Number of threads to use. If None, the number of threads is automatically determined.
+ n_threads_batch: Number of threads to use for batch processing. If None, use n_threads.
+ rope_scaling_type: Type of rope scaling to use.
  rope_freq_base: Base frequency for rope sampling.
  rope_freq_scale: Scale factor for rope sampling.
- low_vram: Use low VRAM mode.
  mul_mat_q: if true, use experimental mul_mat_q kernels
  f16_kv: Use half-precision for key/value cache.
  logits_all: Return logits for all tokens, not just the last token.
- vocab_only: Only load the vocabulary no weights.
- use_mmap: Use mmap if possible.
- use_mlock: Force the system to keep the model in RAM.
  embedding: Embedding mode only.
- n_threads: Number of threads to use. If None, the number of threads is automatically determined.
  last_n_tokens_size: Maximum number of tokens to keep in the last_n_tokens deque.
  lora_base: Optional path to base model, useful if using a quantized base model and you want to apply LoRA to an f16 model.
  lora_path: Path to a LoRA file to apply to the model.
  numa: Enable NUMA support. (NOTE: The initial value of this parameter is used for the remainder of the program as this value is set in llama_backend_init)
  chat_format: String specifying the chat format to use when calling create_chat_completion.
```
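The hunk above only touches the docstring, but the parameters it documents are the keyword arguments of `Llama.__init__` in llama-cpp-python. The following is a minimal sketch of how they fit together; the model path and the `chat_format` value are placeholder assumptions for illustration, not part of this change:

```python
from llama_cpp import Llama

# Sketch of constructing a model with the documented parameters.
llm = Llama(
    model_path="./models/model.gguf",  # placeholder path (assumption): any local GGUF file
    n_gpu_layers=-1,        # offload all layers to the GPU
    main_gpu=0,             # GPU used for scratch and small tensors
    tensor_split=None,      # single GPU: do not split the model across devices
    n_ctx=2048,             # context size
    n_batch=512,            # prompt-processing batch size (>= 32 to use BLAS)
    n_threads=None,         # auto-detect the number of threads
    n_threads_batch=None,   # fall back to n_threads for batch processing
    use_mmap=True,          # memory-map the weights if possible
    use_mlock=False,        # do not force the model to stay in RAM
    seed=-1,                # -1 selects a random seed
    chat_format="llama-2",  # format consulted by create_chat_completion
)

# chat_format determines how these messages are rendered into a prompt.
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What does use_mlock do?"}]
)
print(response["choices"][0]["message"]["content"])
```

On a multi-GPU machine, something like `tensor_split=[0.5, 0.5]` would distribute the model tensors evenly across two devices, with `main_gpu` still holding the scratch and small tensors as described above.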