Qwen3-8B and other models generate garbage output / repeat tokens (GGGGGG...) in llama.cpp via LM Studio Vulkan backend #13310
I hit the same behavior with Qwen3 and GLM-4, also with Vulkan on AMD. It doesn't always happen, but I've noticed that larger models hit this behavior more often. An 8B model that fully fits into my VRAM has no issues, while a model that's only partially offloaded does. GPU: AMD Radeon RX 7800 (16 GB VRAM)
I'm also observing this, with all produced logits having NaN values, on different Nvidia GPUs.
Can somebody share a command line to reproduce this with llama-cli? I tried yesterday but wasn't able to reproduce it.
I have a similar issue with llama-server. The launch flags are --jinja -c 15000 -ngl 99 --port 8000, nothing special in the parameters, so at first I thought the problem was with my setup. I'm using two GPUs – Intel and AMD – and I've noticed that after a short period of inactivity, the model gets unloaded from the Intel GPU into main RAM, and then reloaded back when a new request comes in. Often, the issue with the 'GGGGGGGGGGG' output shows up after a short idle period, usually after two or three good responses. But sometimes it happens right away. Restarting the server usually fixes it.
I have the same issue too, in my case with GLM-4-32B using bartowski's Q4_K_S quant. Some investigation across different HF discussions suggests that for GLM-4-32B specifically the issue comes from batching: using the settings "-b 8 -ub 8" for llama-server appears to fix it for me. This happens to me on both Vulkan and ROCm. I'm not sure if this is actually a genuine fix, but I found it via this discussion on HF, and changing the batch size does fix the issue on both Vulkan and ROCm. Weirdly enough, this doesn't seem to happen when quantizing the KV cache, so it's probably a float issue.
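For reference, the workaround amounts to something like the following; the model path is just a placeholder for whatever quant you load:

```sh
# hypothetical invocation: shrink the batch and micro-batch size to 8
./llama-server -m GLM-4-32B-Q4_K_S.gguf -ngl 99 -b 8 -ub 8 --port 8000
```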
Thanks for pointing this out, though it's not really a fix, but a workaround that comes at a big cost in prompt processing speed.
Edit: seems like 8ae5ebc introduced the failing tests (turns out it's probably not related to this issue, I still get the "GGGGGGGGGGGGGGG" after reverting)
Hi there, everyone. I discovered a weird workaround: after looking at what could make Gemma special compared to all the other models I tried, I noticed that the Gemma quant was Q4_0 while all the other models were Q4_K_M. So I downloaded a Q3_K_L version of the Qwen model, and so far I have not had any issues.
It turns out my problem is most likely not related to this bug. I made a small SYCL utility that constantly pings a small area of memory on the Intel card, preventing VRAM from being swapped to RAM. The server has been running for more than 24 hours without errors.
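The utility itself isn't posted here, but a minimal sketch of the idea, assuming a SYCL 2020 compiler such as DPC++ and that the default GPU selector picks the Intel card (both assumptions, not the author's actual code):

```cpp
// keep_resident.cpp - periodically touches a small device allocation
// so the driver keeps it resident in VRAM instead of evicting it.
#include <sycl/sycl.hpp>
#include <chrono>
#include <thread>

int main() {
    sycl::queue q{sycl::gpu_selector_v};      // assumption: default GPU is the Intel card
    constexpr size_t n = 1 << 20;             // ~4 MB of floats
    float *buf = sycl::malloc_device<float>(n, q);

    for (;;) {
        // Trivial kernel that reads and writes the buffer, keeping it "hot".
        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            buf[i] += 1.0f;
        }).wait();
        std::this_thread::sleep_for(std::chrono::seconds(10));
    }
    // never reached: sycl::free(buf, q);
}
```

Any periodic kernel launch against a device allocation should have the same keep-resident effect.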
I see garbage with all the Qwen3 models I ran, which are:
All of them output garbage. The latest version I know for sure works is 5141 (haven't tried much else though).
I've conducted some tests on my setup, which is open-webui -> llama-swap -> llama.cpp on NixOS 24.11. Here are the results. Work: 5141, 5170, 5174. So 5174 on my end is the last version that works perfectly fine, and 5175 is the first that doesn't. This could really be a different issue, but what happens isn't necessarily immediate garbage output; rather it's a quick degeneration into repetition (sometimes immediately, sometimes after a few dozen tokens), but either way it devolves into repeating garbage. Here's one example I got on, I believe, 5175: Example 1
And this is on, I think, 5178: Example 2
And on 5174 I get proper responses consistently. In all tests my query was "Скажи привет" ("Say hello"), sent via Open WebUI (sometimes I used a system prompt, sometimes I didn't; that didn't seem to have an effect on anything). Edit: forgot to mention that I tried Qwen3 only, the 14B and 30B MoE models. Here are the commands for the primary two models I tried:
(-ts 0,1 because I have two GPUs and, for reasons unknown, llama-swap fails miserably when I add GGML_VK_VISIBLE_DEVICES to the configs via the env: directive)
Thank you for bisecting it. @netrunnereve maybe we missed something in the GCN tune.
@nasrally I went and tried a couple of long generations with Qwen3 4B (that's the only Qwen that I have with me right now) and couldn't get it to repeat like that. Are you seeing this with other models like Mistral and Llama as well? I'm going to download bartowski/Qwen_Qwen3-14B-GGUF:Q6_K to try out, but in the meantime please run a test-backend-ops and post the results.
@nasrally Which mesa version are you using? With a Radeon Pro VII on Mesa 24.2.8 I have a working Qwen3-14B, but Qwen3-30B-A3B is not working with FP16 enabled. If I disable FP16, it starts working. With Mesa 25.1.0 Qwen3-14B is completely broken regardless of fp16, and Qwen3-30B-A3B is also broken either way. I'm not sure if this is the same problem you have, but it's making testing difficult for me. This isn't the first time I've had trouble with it; Mesa has had a streak of breaking my Pro VII and I'm not sure what is causing it. On amdvlk I don't see any of these problems.
@netrunnereve okay, so, I've tested Gemma 3 (google/gemma-3-12b-it-qat-q4_0-gguf:Q4_0) and Mistral (unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL and bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_S), initially on 5175. Gemma 3 QAT seemed okay pretty much all the way; I had like a dozen exchanges and nothing was unusual. Then I thought about it for a second: what if it's just that K quants are broken? So I tried bartowski/google_gemma-3-12b-it-GGUF:Q4_K_S, but it was almost fine too, maybe a little funky, but fine overall. And then, unexpectedly, bartowski/Qwen_Qwen3-14B-GGUF:Q4_0 was working... 🤔 Not well, though: it didn't devolve into anything crazy, it just misunderstood queries more than the Q6_K version, as if it couldn't "see" all of the tokens properly; apart from that it was alright, practically unusable but not crazy. But then... I added GGML_VK_DISABLE_F16=1 to llama-swap's env, and as if by magic, everything started working fine 🙈 So yeah, not that I know what all that implies, but this looks like an "fp16 issue"... Here are the test-backend-ops results: P.S. I tested unsloth's DQ2.0 Mistral 3.1 Q4 some more, and it does seem to perform worse than bartowski's, even on 5174.
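For anyone wanting to try the same thing outside llama-swap, it's just an environment variable on the llama.cpp process; a sketch with a placeholder model path:

```sh
# disable the Vulkan fp16 path for a single run (model path is a placeholder)
GGML_VK_DISABLE_F16=1 ./llama-server -m Qwen3-14B-Q6_K.gguf -ngl 99 --port 8000
```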
@0cc4m Mesa is 24.2.8, and yeah, disabling fp16 solves it for me too 🧐. I think I do use amdvlk but I'm not really sure.
Look at the device info line, and check whether it mentions radv or amdvlk.
It's radv, hmm. But on my NixOS install I do have amdvlk available. Okay, now I have to figure out how to enforce amdvlk; for now vulkaninfo fails when I force it with VK_ICD_FILENAME.
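A sketch of what forcing a specific ICD usually looks like (the manifest path is an assumption and varies per distro; newer Vulkan loaders use VK_DRIVER_FILES instead):

```sh
# point the Vulkan loader at the amdvlk ICD manifest, then check which driver is reported
export VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/amd_icd64.json
vulkaninfo | grep -i driver
```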
amdvlk no longer supports GCN, so you would need an old version.
So, okay, the results are good but also not so much. I was able to get amdvlk 2023.Q3.3 running and it kinda sucks somewhat: I had to lower the number of offloaded layers for Qwen3 30B from 41 to 35, otherwise I got
which never happened on radv. Because of that it now also runs a tad slower, but on 5175 without FP16 disabled it works perfectly fine. I guess it was in fact a different problem... Well, at least now I know that radv sometimes sucks ass. I may be fine with sticking to radv with GGML_VK_DISABLE_F16=1 for now, but NixOS 25.05 is coming in a week and I'll get Mesa 25.0.6, which I assume will stop working :(
I also went and tested the Q6_K 14B Qwen and couldn't get it to repeat like that or generate garbage, though it seems like that model doesn't set the EOS properly, as it doesn't stop after finishing its response. So it'll think and answer the question all over again, but it doesn't mess up the output with gibberish. Interestingly enough, prompt processing is super slow on my W8100 + RX 470 at 15 t/s, but that's another issue and I'm not going to worry about it here. BTW, if you aren't sure whether an issue is due to the model or the backend, you can try running it on CPU only with the same prompt. There's no guarantee of course, but the CPU implementation should be correct... right?
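A CPU-only run is easy to sketch: offload zero layers (or use a CPU-only build); the model path and prompt below are placeholders:

```sh
# keep all layers on the CPU so the Vulkan path isn't involved in the matmuls
./llama-cli -m Qwen3-14B-Q6_K.gguf -ngl 0 -p "Write a long message about the sun." -n 256
```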
No, this is a RADV problem, related to fp16 on GCN. Someone has to open an issue upstream.
The strange part though is that the results are apparently more coherent before 5175, even though test-backend-ops reports the same number of errors. At the core this is a driver issue, but the tune might have some effect as well.
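For anyone following along, the numbers being compared come from llama.cpp's op test binary; roughly like this (the backend name and the -b/-o filters are my assumptions about the tool's flags):

```sh
# run the backend op tests against the first Vulkan device, limited to matrix multiplication
./test-backend-ops test -b Vulkan0 -o MUL_MAT
```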
Someone opened a relevant issue on mesa's gitlab literally like yesterday. I'm trying to register there to post more info but I can't get a confirmation email 🙄. Maybe some of you could also throw your two cents in there, as you know way more than I do?
Thank you, that's good to know. I'll keep an eye on that issue and help if needed.
I see the
Name and Version
When using llama.cpp (via LM Studio with Vulkan backend and GGUF models), some models — specifically Qwen3-8B, Mistral, Hermes, and LLaMA — repeatedly output GGGGGGGGGGGG or loop indefinitely, regardless of prompt. This does not happen with Gemma models.
Model Format:
GGUF Q4_K_M, Q5_K_M, Q6_K
Tokenizer: auto-loaded by LM Studio
What I've Tried:
Adding <|im_start|> formatting for Qwen
Using [INST]...[/INST] for Mistral/LLaMA
Setting stop sequences
Lowering temp, top-k, top-p, repeat penalty
Testing via both UI and raw curl
Re-downloading models from official sources
Hypothesis:
Could be:
A tokenizer mismatch
Broken prompt template injection
Quantization compatibility issue with Vulkan
Missing EOS/stop condition on some models
Operating systems
Windows
GGML backends
Vulkan
Hardware
LM Studio v0.3.15 (Build 11)
llama.cpp backend via Vulkan
GPU: AMD Radeon RX 5700 (7.98 GB VRAM)
RAM: 32 GB
OS: Windows 10 x64
Models
Qwen3-8B.Q4_K_M.gguf https://huggingface.co/lmstudio-community/Qwen3-8B-GGUF
Qwen2.5-7B-Instruct-1M-GGUF https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF
Problem description & steps to reproduce
Load Qwen3-8B.Q4_K_M.gguf or Mistral-7B-Instruct-v0.1.Q4_K_M.gguf
Run this prompt:
<|im_start|>user
Write a long message about the sun.<|im_end|>
<|im_start|>assistant
Output returns:
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...
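The "raw curl" testing mentioned earlier amounts to something like this (the port and model identifier here assume LM Studio's usual local-server defaults):

```sh
# send the same prompt to the local OpenAI-compatible endpoint
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3-8b",
        "messages": [{"role": "user", "content": "Write a long message about the sun."}],
        "max_tokens": 256
      }'
```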
✅ Works Fine:
gemma-3-4b-it-qat
gemma-3-12b-it-qat
First Bad Commit
No response
Relevant log output