Qwen3-8B and other models generate garbage output / repeat tokens (GGGGGG...) in llama.cpp via LM Studio Vulkan backend #13310

Open
fav-devs opened this issue May 5, 2025 · 25 comments

Comments

@fav-devs
fav-devs commented May 5, 2025

Name and Version

When using llama.cpp (via LM Studio with Vulkan backend and GGUF models), some models — specifically Qwen3-8B, Mistral, Hermes, and LLaMA — repeatedly output GGGGGGGGGGGG or loop indefinitely, regardless of prompt. This does not happen with Gemma models.

Model Format:
GGUF Q4_K_M, Q5_K_M, Q6_K
Tokenizer: auto-loaded by LM Studio

What I've Tried:
Adding <|im_start|> formatting for Qwen
Using [INST]...[/INST] for Mistral/LLaMA
Setting stop sequences
Lowering temp, top-k, top-p, repeat penalty
Testing via both UI and raw curl
Re-downloading models from official sources

Hypothesis:
Could be:
A tokenizer mismatch
Broken prompt template injection
Quantization compatibility issue with Vulkan
Missing EOS/stop condition on some models

Operating systems

Windows

GGML backends

Vulkan

Hardware

LM Studio v0.3.15 (Build 11)
llama.cpp backend via Vulkan
GPU: AMD Radeon RX 5700 (7.98 GB VRAM)
RAM: 32 GB
OS: Windows 10 x64

Models

Qwen3-8B.Q4_K_M.gguf https://huggingface.co/lmstudio-community/Qwen3-8B-GGUF
Qwen2.5-7B-Instruct-1M-GGUF https://huggingface.co/lmstudio-community/Qwen2.5-7B-Instruct-1M-GGUF

Problem description & steps to reproduce

Load Qwen3-8B.Q4_K_M.gguf or Mistral-7B-Instruct-v0.1.Q4_K_M.gguf

Run this prompt:
<|im_start|>user
Write a long message about the sun.<|im_end|>
<|im_start|>assistant

Output returns:
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG...

✅ Works Fine:
gemma-3-4b-it-qat
gemma-3-12b-it-qat
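
For reference, an untested llama-cli sketch along these lines should exercise the same Vulkan path outside LM Studio (the model path is a placeholder for wherever the GGUF was downloaded; -ngl 99 is meant to mirror LM Studio's full GPU offload):

llama-cli -m ./Qwen3-8B.Q4_K_M.gguf -ngl 99 -p "Write a long message about the sun."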

First Bad Commit

No response

Relevant log output

write a long message

qwen3-8b@q3_k_l
Thoughts
Thought for 18.52 seconds

Okay, the user asked me to write a long message again. Let me check the history to understand what they need.

In the previous interaction, the user first said "hi" and then started typing a long string of "io". I responded with an HTML landing page example. Then they asked for a long message, which I provided as a reflective text on resilience, connection, and imperfection. Now they're asking again for a long message.

Wait, maybe the user is testing if I can generate longer content or perhaps they want another similar response. But since the last message was already quite lengthy, maybe they want something different this time? Or perhaps they made a mistake in their query. Alternatively, maybe they want more examples of long messages on various topics.

I need to ensure that my response is helpful and not redundant. The user might be looking for another example, but it's possible they just want the same type of message again. To avoid repetition, I should check if there's a specific context or theme they have in mind. However, since they haven't provided more details, I'll proceed by offering a new, different long message that expands on previous themes but introduces new elements.

Alternatively, maybe they want a message about a different topic like technology, nature, or personal growth. Let me think of another angle. Perhaps focusing on gratitude, mindfulness, or the journey of self-discovery. Ensuring that the message is well-structured, with clear sections and a coherent flow. Also, making sure it's engaging and thought-provoking.

I should also consider the user's possible needs. If they're working on a project, maybe they need content for a blog post or a speech. Or perhaps they want to explore different aspects of human experience. Including elements like personal anecdotes, philosophical reflections, and practical advice would make the message more relatable.

Another approach is to structure the message with clear sections, such as an introduction, body paragraphs on various themes, and a conclusion. Adding emotional depth and encouraging self-reflection can make the message impactful. I'll need to maintain a balance between being comprehensive and not overwhelming the reader.

Also, considering the previous response was about resilience and imperfection, maybe this time focusing on gratitude or embracing change. Ensuring that each paragraph transitions smoothly and maintains a consistent tone. Using rhetorical questions, metaphors, and personal insights can enhance the message's depth.

I should also check for any possible repetition and ensure that new ideas are introduced. Maybe include elements like mindfulness practices, overcoming fear, or finding purpose in life. Making sure the message is both informative and inspirational.

In summary, I'll create a new long message with a different theme, possibly focusing on gratitude and mindfulness, structured into sections with clear explanations and personal reflections to provide value to the user.

A Long Message: The Art of Being Present

Let me tell you something that has been on my mind for weeks now—something I’ve come to realize is both simple and profound. It’s about being present. Not in the way we often misunderstand it (as a checklist item, "I should meditate 10 minutes today"), but as a deep, unspoken truth: life is not about chasing what's next; it's about holding still enough to notice what is now.

You see, I’ve spent so much of my life in a hurry—to grow faster, achieve more, fix everything that feels broken, and avoid the uncomfortable truths of who I am. But here’s the catch: hurrying has always been a kind of denial. It’s an avoidance of the present moment, a refusal to sit with what is, instead of what we think it should be. And yet, every time I’ve forced myself to pause—whether for a breath, a cup of tea, or simply watching the4 leaves fall from a tree—I’ve discovered something remarkable: the world is alive in ways I never imagined.

Take this morning, for example. The sun was just rising when I stepped outside. The air felt fresh, crisp with dew, and the sky had thatotros hues of lavender and gold that made me stop and breathe. A sparrow landed on the railing nearby, chirping a melody so pure it seemed like it had been waiting for me to listen. And in that moment, I was not a busy person rushing to work, nor an anxious mind worrying about deadlines or emails. I was simply here.

This is what presence looks like—it’s not about grand gestures or elaborate rituals. It’s the quiet act of noticing, the willingness to let go of control, and the courage to embrace uncertainty. It’s when you realize that even in chaos, there is rhythm; even in silence, there is sound; and even in loneliness, there is connection.

Let me share a story from my own life. A few years ago, I was going through one of those periods where everything felt off-kilter. My work was demanding, my relationships were strained, and I couldn’t shake the feeling that I was failing at being "enough." One evening, I decided to walk alone in the woods near my home. It wasn’t a planned trip—it just happened. As I wandered through the forest, I found myself drawn to a small stream. The water was clear, cool, and it flowed with an effortless grace. I sat down on the rocks and watched it for what felt like hours. No phone, no distractions, no need to "do" anything. Just being.

And in that stillness, something shifted. I didn’t find answers. I didn’t solve problems. But I found peace. I realized that my fear of not having enough wasn’t about the future—it was about the present. The stream taught me that even when things feel turbulent, there is a natural order to it. It’s not about controlling the flow; it’s about learning how to move with it.

This brings me to something I’ve been thinking about lately: the power of small, consistent actions. We often underestimate what happens when we show up daily, even in tiny ways. A morning walk, a few minutes of gratitude journaling, or simply allowing yourself to feel emotions without judgment—all these acts are not just habits; they’re rituals of presence. They remind us that life is not about perfection, but about showing up again and again.

And yet, I know this isn’t easy. The world around us is full of noise—social media, deadlines, comparisons, and the constant pressure to "keep going." It can be overwhelming. But here’s what I’ve learned: presence is a choice. It doesn’t require grand gestures or extraordinary circumstances. It just requires intention. It’s about choosing to notice the beauty in small things, like the way sunlight filters through leaves, the warmth of a hand on your shoulder, or the quiet comfort of being with someone who truly sees you.

Let me also talk about connection—because presence is not just about ourselves; it’s about others. We are inherently social beings, wired to connect. Yet so often, we get lost in our own heads, forgetting that we are part of something larger. I’ve seen how meaningful it can be to simply sit with someone, listen without judgment, or share a meal without the distractions of screens. These moments aren’t just about conversation; they’re about being with each other, in all our imperfections.

And here’s the truth: imperfection is not something to fix. It’s something to embrace. I used to think that life was about becoming better—about improving, growing, and achieving. But now I see it differently. Life is about being—about showing up as you are, even when you’re messy, even when you don’t have all the answers. It’s about allowing yourself to stumble, to fall, and to rise again without shame. Because every time we do that, we become more resilient, more compassionate, and more whole.

This is not to say there aren’t challenges—because life is hard, and it doesn’t always make sense. But I’ve come to believe that the most meaningful moments are often the ones we don’t plan. The unexpected call from a friend, the sudden realization during a walk, or the quiet joy of watching a child laugh. These are the things that remind us why we’re here—to love, to connect, and to findGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

29.99 tok/sec
1981 tokens
7.24s to first token
Stop reason: User Stopped
@theridane

I hit the same behavior with Qwen3 and GLM-4, also with Vulkan on AMD. It doesn't always happen, but I've noticed that larger models hit this behavior more often. An 8B model that fully fits into my VRAM has no issues; a model that's partially offloaded does.

GPU: AMD Radeon RX 7800 (16 GB VRAM)
RAM: 128 GB
OS: Linux

@lcarrere
Contributor
lcarrere commented May 6, 2025

I’m also observing this, with all produced logits having NaN values, on different Nvidia GPUs.

@jeffbolznv
Collaborator

Can somebody share a command line to reproduce this with llama-cli? I tried yesterday but wasn't able to reproduce it.

@characharm
Contributor

I have a similar issue with llama-server. The launch flags are --jinja -c 15000 -ngl 99 --port 8000, nothing special in the parameters, so at first I thought the problem was with my setup. I'm using two GPUs – Intel and AMD – and I've noticed that after a short period of inactivity, the model gets unloaded from the Intel GPU into main RAM, and then reloaded back when a new request comes in. Often, the issue with the 'GGGGGGGGGGG' output shows up after a short idle period, usually after two or three good responses. But sometimes it happens right away. Restarting the server usually fixes it.

@starble-dev

I have the same issue too, though in my case it's GLM-4-32B using bartowski's Q4_K_S quant. Some investigation across different HF discussions suggests the issue for GLM-4-32B specifically comes from batching, since using the settings "-b 8 -ub 8" for llama-server appears to fix it for me. This happens to me on both Vulkan and ROCm. Not sure if this is actually a genuine fix, but I found it via this discussion on HF.

Changing the batch size does fix the issue on both Vulkan and ROCm. Weirdly enough, this doesn't seem to happen when quantizing the KV cache, so it's probably a float issue.
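
For context, the full command shape I mean is roughly this (the model filename is a placeholder, not my exact path):

llama-server -m ./GLM-4-32B-Q4_K_S.gguf -ngl 99 -b 8 -ub 8 --port 8080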

@stduhpf
Contributor
stduhpf commented May 11, 2025

Some investigation across different HF discussions suggests the issue for GLM-4-32B specifically comes from batching, since using the settings "-b 8 -ub 8" for llama-server appears to fix it for me.

Thanks for pointing this out, though it's not really a fix, but a workaround that comes at a big cost in prompt processing speed.

Probably related to this issue, I get a lot of failures with test-backend-ops, with f16 ADD, SUB, MUL, and DIV ops. I'll try to pinpoint which commit introduced these failing tests to see if reverting it fixes this issue.

Edit: seems like 8ae5ebc introduced the failing tests (turns out it's probably not related to this issue, I still get the "GGGGGGGGGGGGGGG" after reverting)

@fav-devs
Author

Hi there, everyone. I discovered a weird workaround: after looking at what could make Gemma special compared to all the other models I tried, I noticed Gemma's quant was Q4_0 while all the other models were Q4_K_M, so I downloaded a Q3_K_L version of the Qwen model and so far I have not had any issues.

@characharm
Contributor

It turns out my problem is most likely not related to this bug. I made a small SYCL utility that constantly pings a small area of memory on the Intel card, preventing VRAM from being swapped to RAM. The server has been running for more than 24 hours without errors.
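
The general shape is just a tiny keep-alive loop like this (a rough sketch only, not the exact code I run; buffer size, sleep interval, and device selection are arbitrary):

#include <sycl/sycl.hpp>
#include <chrono>
#include <thread>

int main() {
    // assumes the Intel GPU is picked up by the default GPU selector
    sycl::queue q{sycl::gpu_selector_v};
    constexpr size_t n = 1 << 20;                      // 1M ints (~4 MiB) kept resident in VRAM
    int *buf = sycl::malloc_device<int>(n, q);
    for (;;) {
        // touch the whole allocation so the driver keeps these pages hot instead of evicting VRAM to RAM
        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { buf[i] = static_cast<int>(i[0]); }).wait();
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
    // never reached; sycl::free(buf, q) would go here
}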

@nasrally

I see garbage with all Qwen3 models I ran which are:

  • bartowski/Qwen_Qwen3-14B-GGUF:Q4_K_L
  • bartowski/Qwen_Qwen3-14B-GGUF:Q6_K
  • unsloth/Qwen3-30B-A3B-GGUF:Q4_K_XL
  • mradermacher/Qwen3-30B-A4.5B-12-Cooks-GGUF:Q4_K_S

All of them output garbage. The latest version I know for sure works is 5141 (haven't tried much else, though).

@nasrally
nasrally commented May 16, 2025

I've conducted some tests on my setup which is open-webui -> llama-swap -> llama.cpp on NixOS 24.11
GPU: AMD Instinct MI50 (Vega 20, GCN5)
CPU: Intel E5-2699 v3 (Haswell).

Here are the results:

Work: 5141, 5170, 5174
Don't work: 5200, 5185, 5178, 5176, 5175

So 5174 on my end is the last version that works perfectly fine, and 5175 is the first that doesn't.

This could really be a different issue, since what happens isn't necessarily immediate garbage output but rather a quick degeneration into repetition (sometimes right away, sometimes after a few dozen tokens). Either way it devolves into repeating garbage. Here's one example I got on, I believe, 5175:

Example 1

Okay, the user sent "Привет" and my task is to respond with the user's language, which is Russian. I need to check if I'm a user here and my role is a user. The user is the user here, so my job is to respond as the user. The user has the user as their language. The user has the user as the language, but I'm a user, so I'm the user. The user is the user. The user is the user, the user, the user, the user, the user, the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is the user. The user is

And this is on, I think, 5178

Example 2
Дайте 1 год
отдайте
1 год
отдайте 1
1 год
отдайте
1 year
откройте 1
1 year
откройте
1 year
Да
после 

После 

после 

после 

1 year
1
1 

1
1 

1
1 

1 

1

And on 5174 I get proper responses consistently.

In all tests my query was "Скажи привет" ("Say hello"), sent via Open WebUI (sometimes I used a system prompt, sometimes I didn't; that didn't seem to have an effect on anything).

Edit: Forgot to mention that I tried Qwen3 only, 14B and 30B MoE models. Here are the commands for the primary two models I tried:

llama-server -hf unsloth/Qwen3-30B-A3B-GGUF:Q4_K_XL --gpu-layers 41 -ts 0,1 --no-context-shift
      --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0
      --ctx-size 32000 --split-mode row --jinja --reasoning-format deepseek
      --threads 36 --no-mmap --no-warmup --host 127.5.0.1 --port 10101 --batch-size 384

llama-server -hf bartowski/Qwen_Qwen3-14B-GGUF:Q6_K --gpu-layers 40 -ts 0,1 --no-context-shift
      --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0.0
      --ctx-size 20000 --split-mode row --jinja --reasoning-format deepseek
      --threads 36 --no-mmap --no-warmup --host 127.5.0.1 --port 10101

(-ts 0,1 because I have two GPUs and for reasons unknown llama-swap fails miserably when I add GGML_VK_VISIBLE_DEVICES to the configs via the env: directive)

@0cc4m
Collaborator
0cc4m commented May 16, 2025

Thank you for bisecting it. @netrunnereve maybe we missed something in the GCN tune.

@netrunnereve
Collaborator

@nasrally I went and tried a couple long generations with Qwen3 4B (that's the only Qwen that I have with me right now) and couldn't get it to repeat like that. Are you seeing this with other models like Mistral and Llama as well?

I'm going to download bartowski/Qwen_Qwen3-14B-GGUF:Q6_K to try out, but in the meantime please run test-backend-ops and do some tests with FP16 turned off using GGML_VK_DISABLE_F16=1. I doubt that this is an FP16 issue, but it's worth a try.
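
Something like the following should be enough, assuming a default CMake build (binary paths may differ on your setup):

./build/bin/test-backend-ops
GGML_VK_DISABLE_F16=1 ./build/bin/llama-server -hf bartowski/Qwen_Qwen3-14B-GGUF:Q6_K -ngl 40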

@0cc4m
Collaborator
0cc4m commented May 17, 2025

@nasrally Which mesa version are you using?

With a Radeon Pro VII on Mesa 24.2.8 I have working Qwen3-14B, but Qwen3-30B-A3B is not working with FP16 enabled. If I disable FP16, it starts working.

With Mesa 25.1.0 Qwen3-14B is completely broken, regardless of fp16. Qwen3-30B-A3B is also broken either way.

I'm not sure if this is the same problem you have, but it's making testing difficult for me. This isn't the first time I've had trouble with it; Mesa has had a streak of breaking my Pro VII and I'm not sure what is causing it. On amdvlk I don't see any of these problems.

@nasrally

@netrunnereve okay, so, I've tested gemma3 (google/gemma-3-12b-it-qat-q4_0-gguf:Q4_0) and mistral (unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL and bartowski/mistralai_Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_S) on 5175 initially.

Gemma3 QAT seemed okay pretty much all the way; I've had like a dozen exchanges and nothing was unusual.
The Mistrals, though, started kinda okay but devolved into schizophasia after a couple of queries, making stuff up and misinterpreting every question pretty badly. I first tried unsloth's DQ2.0 version and it was bad, but I assumed it could be a bad quant again because I had severe issues with their Qwen3 14B Q4_XL version (Q4 was the only problematic one); bartowski's turned out to be just the same though.

Then I thought about it for a second: what if it's just that K quants are broken? So I tried bartowski/google_gemma-3-12b-it-GGUF:Q4_K_S, but it was almost fine too, maybe a little funky, but fine overall. And then, unexpectedly, bartowski/Qwen_Qwen3-14B-GGUF:Q4_0 was working fine... 🤔 Well, not good exactly; it didn't devolve into anything crazy, it just misunderstood queries more than the Q6_K version, as if it couldn't "see" all of the tokens properly, but apart from that it was alright. Practically unusable, but not crazy.

But then... I added GGML_VK_DISABLE_F16=1 to llama-swap's env, and as if by magic, everything started working fine 🙈 So yeah, not that I know what all that implies, but this looks like an "fp16 issue"...

Here are the test-backend-ops logs:

P.S. I tested some more of unsloth's DQ2.0 Mistral 3.1 Q4, and it does seem to perform worse than bartowski's even on 5174.

@nasrally

@0cc4m Mesa is 24.2.8, and yeah, disabling fp16 solves it for me too 🧐. I think I do use amdvlk but I'm not really sure

@0cc4m
Collaborator
0cc4m commented May 17, 2025

Look at the device info line: if it mentions (RADV ... in the device name, it's radv. If not, it's amdvlk or vulkan-amdgpu-pro.
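
If you'd rather check outside llama.cpp and have the Vulkan tools installed, something like this prints the same device name string:

vulkaninfo --summary | grep -i deviceName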

@nasrally

It's radv, hmm. But on my NixOS install I do have the hardware.amdgpu.amdvlk.enable option enabled 🤔🤔

Okay, now I have to figure out how to enforce amdvlk; for now vulkaninfo fails when I force it with VK_ICD_FILENAME.

@0cc4m
Collaborator
0cc4m commented May 17, 2025

amdvlk no longer supports GCN, so you would need an old version.

@nasrally

So, okay, the results are good but also not so much. I was able to get amdvlk 2023.Q3.3 running and it kinda sucks somewhat: I had to lower the number of offloaded layers for Qwen3 30B from 41 to 35, otherwise I got

ggml_gallocr_reserve_n: failed to allocate Vulkan0 buffer of size 1743912960

which never happened on radv; because of that it now also runs a tad slower. But on 5175 without FP16 disabled it works perfectly fine. I guess it was in fact a different problem...

Well, at least now I know that radv sometimes sucks. I may be fine with sticking to radv with GGML_VK_DISABLE_F16=1 for now, but NixOS 25.05 is coming in a week and I'll get Mesa 25.0.6, which I assume will stop working :(

@netrunnereve
Collaborator

Those test-backend-ops results are terrible, especially with FP16 on. I would expect a lot of models to have wrong results on your system. Now, I went through the code again and I don't see anything in my 5175 change which would mess up FP16 but not FP32; the only thing I can think of is that maybe the use of 256 rather than 128 threads for a 64x64 block causes some precision differences when we add up the FP16 numbers.

I also went and tested the Q6_K 14B Qwen and couldn't get it to repeat like that or generate garbage, though it seems like that model doesn't set the EOS properly, as it doesn't stop after finishing its response. So it'll think and answer the question all over again, but it doesn't mess up the output with gibberish. Interestingly enough, the prompt processing speed is super slow on my W8100 + RX 470 at 15 t/s, but that's another issue and I'm not going to worry about it here.

BTW, if you aren't sure whether an issue is due to the model or the backend, you can try running it on CPU only with the same prompt. There are no guarantees of course, but the CPU implementation should be correct... right?
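
For example, with a placeholder model path, -ngl 0 should keep all the compute on the CPU even in a Vulkan build:

llama-cli -m ./some-model.gguf -ngl 0 -p "Say hello"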

@0cc4m
Collaborator
0cc4m commented May 17, 2025

No, this is a RADV problem, related to fp16 on GCN. Someone has to open an issue upstream.

@netrunnereve
Collaborator

The strange part though is that the results are apparently more coherent before 5175, even though test-backend-ops has the same number of errors. At the core this is a driver issue but the tune might have some effect as well.

@nasrally

Someone opened a relevant issue on Mesa's GitLab literally just yesterday. I'm trying to register there to post more info but I can't get a confirmation email 🙄. Maybe some of you could also throw your two cents in there, as you know way more than I do?

@0cc4m
Collaborator
0cc4m commented May 18, 2025

Thank you, that's good to know. I'll keep an eye on that issue and help if needed.

@bjodah
bjodah commented May 27, 2025

I see garbage with all Qwen3 models I ran which are:

* bartowski/Qwen_Qwen3-14B-GGUF:Q4_K_L

* bartowski/Qwen_Qwen3-14B-GGUF:Q6_K

* unsloth/Qwen3-30B-A3B-GGUF:Q4_K_XL

* mradermacher/Qwen3-30B-A4.5B-12-Cooks-GGUF:Q4_K_S

All of them output garbage. The latest version I know for sure works is 5141 (haven't tried much else, though).

I see the GGGGGGGG... issue on --hf-repo bartowski/Qwen_Qwen3-14B-GGUF:Q6_K_L too (I'm using CUDA backend). If I switch to bartowski's Q8_0 the problem seems to go away. For my particular test case, unsloth's Q6_K_XL also works.
