Eval bug: repeated output for llama-server #12782
from openai import OpenAI

model = 'QwQ-32B'
port = '8007'
prompt = 'write quick sort in python'
key = 'sk-no-key-required'  # placeholder; not defined in the original report

openai_api_base = "http://localhost:{}/v1".format(port)
client = OpenAI(api_key=key, base_url=openai_api_base)
response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
    stop=["<|im_end|>", "<|endoftext|>", "</s>", "<|eot_id|>"],
    stream=True
)

# Print the streamed completion; the repetition shows up in this output.
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
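The same request can also be sent without the Python client; a rough curl equivalent against llama-server's OpenAI-compatible /v1/chat/completions endpoint (same port and prompt as above, non-streaming for readability) would be:

curl http://localhost:8007/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "QwQ-32B",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "write quick sort in python"}
        ],
        "stream": false
      }'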
llama-server output
Hi. Please try moving the temperature sampler to the end of the samplers sequence, i.e.:
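Applied to the --samplers value in the reported command, that reordering would look roughly like this:

--samplers "top_k;top_p;min_p;dry;typ_p;xtc;temperature"

Samplers run in the order they are listed, so putting temperature last means top_k/top_p/min_p, DRY, typ_p and XTC filter the distribution before temperature scaling is applied.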
How was this problem solved?
Name and Version
MODEL_ROOT=/mnt/backup/models
GGUF=Qwen/QwQ-32B-GGUF/qwq-32b-q4_k_m.gguf
PORT=8007
PARALLEL=4   # concurrency; requests beyond this will block
MAX_LEN=8192 # maximum context length
docker run -v $MODEL_ROOT:/models \
  -p $PORT:8007 ghcr.io/ggml-org/llama.cpp:server \
  -m /models/$GGUF \
  --port 8007 --host 0.0.0.0 -n $MAX_LEN \
  --parallel $PARALLEL \
  --threads 32 \
  --ctx-size 16384 \
  --seed 3407 \
  --prio 2 \
  --temp 0.6 \
  --repeat-penalty 1.1 \
  --dry-multiplier 0.5 \
  --min-p 0.01 \
  --top-k 40 \
  --top-p 0.95 \
  --no-cnv \
  --chat-template deepseek3 \
  --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
Operating systems
Linux
GGML backends
CPU
Hardware
9135 + 4090
Pure CPU
Models
Qwen/QwQ-32B-GGUF/qwq-32b-q4_k_m.gguf
Problem description & steps to reproduce
When I run ghcr.io/ggml-org/llama.cpp:server with the command above, I get repeated output.
First Bad Commit
No response
Relevant log output