Set n_batch to default values and reduce thread count: · coderonion/llama-cpp-python@c283edd · GitHub


Commit c283edd

Set n_batch to default values and reduce thread count:
Change the batch size to the llama.cpp default of 8. I've seen issues in llama.cpp where batch size affects the quality of generations (it shouldn't, but in case that's still an issue I changed it to the default). Also set the auto-determined number of threads to 1/2 of the system core count: ggml will sometimes lock cores at 100% while doing nothing. This is being addressed, but it can make for a bad experience if the user's cores are pegged at 100%.
1 parent b9b6dfd commit c283edd
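For context on the new n_threads default, here is a minimal standalone sketch of the half-the-cores fallback. The helper name is hypothetical (the server sets the expression inline as a field default); the point is that on a single-core machine int(cpus / 2) evaluates to 0, so the "or 1" keeps at least one worker thread.

import os

# Hypothetical helper mirroring the new default: use half the detected CPUs,
# but never fewer than one thread.
def default_n_threads() -> int:
    cpus = os.cpu_count() or 1      # os.cpu_count() may return None
    return int(cpus / 2) or 1       # int(1 / 2) == 0, so fall back to 1

print(default_n_threads())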


2 files changed: +5 additions, -5 deletions


examples/high_level_api/fastapi_server.py

Lines changed: 3 additions & 3 deletions
@@ -27,10 +27,10 @@
 class Settings(BaseSettings):
     model: str
     n_ctx: int = 2048
-    n_batch: int = 2048
-    n_threads: int = os.cpu_count() or 1
+    n_batch: int = 8
+    n_threads: int = int(os.cpu_count() / 2) or 1
     f16_kv: bool = True
-    use_mlock: bool = True
+    use_mlock: bool = False  # This causes a silent failure on platforms that don't support mlock (e.g. Windows) took forever to figure out...
     embedding: bool = True
     last_n_tokens_size: int = 64
 
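For reference, a minimal sketch of how these defaults can be overridden without editing the file, assuming BaseSettings here is pydantic's (v1) settings base class, which also reads matching environment variables case-insensitively; something like N_THREADS=8 or USE_MLOCK=1 in the environment would then take precedence over the defaults below. The model default in the sketch is a placeholder for illustration, not part of the diff.

import os
from pydantic import BaseSettings  # pydantic v1-style import

class Settings(BaseSettings):
    model: str = "./models/model.bin"  # placeholder path for this sketch only
    n_ctx: int = 2048
    n_batch: int = 8
    n_threads: int = int(os.cpu_count() / 2) or 1
    f16_kv: bool = True
    use_mlock: bool = False
    embedding: bool = True
    last_n_tokens_size: int = 64

# Environment variables matching field names (case-insensitive) override defaults,
# e.g. run with N_THREADS=8 to pin the thread count explicitly.
settings = Settings()
print(settings.n_threads, settings.n_batch, settings.use_mlock)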

llama_cpp/server/__main__.py

Lines changed: 2 additions & 2 deletions
@@ -28,9 +28,9 @@ class Settings(BaseSettings):
     model: str
     n_ctx: int = 2048
     n_batch: int = 8
-    n_threads: int = os.cpu_count() or 1
+    n_threads: int = int(os.cpu_count() / 2) or 1
     f16_kv: bool = True
-    use_mlock: bool = True
+    use_mlock: bool = False  # This causes a silent failure on platforms that don't support mlock (e.g. Windows) took forever to figure out...
     embedding: bool = True
     last_n_tokens_size: int = 64
 
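The use_mlock comment above describes a portability pitfall: on platforms without working mlock support (e.g. Windows at the time), llama.cpp can fail to lock the model pages without raising, so the setting silently does nothing. One illustrative way a caller could surface that up front, not part of this commit, is a simple platform guard:

import sys

# Illustrative guard only (not from this commit): request mlock on platforms
# where it is plausibly supported, and warn otherwise instead of failing silently.
def mlock_supported() -> bool:
    return sys.platform not in ("win32", "cygwin")

use_mlock = mlock_supported()
if not use_mlock:
    print("warning: mlock not supported on this platform; model will not be pinned in RAM")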
