I need to run a GGUF model on an embedded device with limited resources.
I use the qwen2.5-0.5b-instruct-q4_0.gguf model.
After testing a bunch of combinations, this is the llama-cli command I use: ./llama.cpp/build/bin/llama-cli -m "$MODEL" -sys "$SYS_PROMPT" -p "$PROMPT" -co -c 700
And this is the full .sh script I run:
MODEL="gguf_qwen/qwen2.5-0.5b-instruct-q4_0.gguf"
CONTEXT="$(cat ../data/input_data.txt)"
# Only now build the full prompt
SYS_PROMPT="You are a helpful assistant. Be polite with the user. Use the following context to answer the question. If you can't answer based on the context, say 'Sorry, I am not able to provide this information.'
Context:
$CONTEXT"
./llama.cpp/build/bin/llama-cli -m "$MODEL" -sys "$SYS_PROMPT" -p "Greet the user." -co -c 700
while true; do
printf "> "
read QUESTION
[ "$QUESTION" = "exit" ] && break
PROMPT="Question: $QUESTION"
./llama.cpp/build/bin/llama-cli \
-m "$MODEL" \
-sys "$SYS_PROMPT" \
-p "$PROMPT" \
-co \
-c 700 \
-n 100 \
--repeat-penalty 1.3 \
--repeat-last-n 256 \
--temp 0.6 \
--top-k 40 \
--top-p 0.85
done
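For scale: the whole system prompt, including the contents of input_data.txt, has to fit inside the -c 700 window together with the question and the answer. Here is a crude way I check the prompt size before running; the ~4 characters-per-token ratio is just an assumption, not something I measured for Qwen's tokenizer:

# Crude size check for the system prompt, assuming ~4 characters per
# token (an approximation; the real ratio depends on the tokenizer).
PROMPT_CHARS=$(printf '%s' "$SYS_PROMPT" | wc -c)
echo "system prompt is roughly $((PROMPT_CHARS / 4)) of the 700 context tokens"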
Setting the limit at -c 700 is what works for my device.
The issue I have is that as soon as this limit is exceeded, the model loops on the same token or set of tokens.
Here is an example:
Certainly, here are the details I have on your network:
1. Wi-Fi password: xxxxx
2. Gateway type: xxxx
3. Average RSSI: xxxx
4. Most used band: 2.4.44444444444444444444444444444444444444444444 [the "4" repeats for hundreds of characters until generation is cut off]
I tried a few things to cut this kind of output (the sampling flags in the command above), but none of them work to control the model's behavior.
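One variant I have not been able to test yet would be to make llama-cli stop at the context boundary instead of shifting the window. This sketch assumes my llama.cpp build is recent enough to have the --no-context-shift flag; -n -2 is documented as "generate until the context is filled":

# Untested variant: stop at the context boundary instead of shifting.
# --no-context-shift only exists in newer llama.cpp builds (an assumption
# about the build in use); -n -2 means "until the context is filled".
./llama.cpp/build/bin/llama-cli -m "$MODEL" -sys "$SYS_PROMPT" -p "$PROMPT" \
    -co -c 700 -n -2 --no-context-shift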
Is there something I could do that I haven't thought of?
Thank you in advance.