Eval bug: repeated output for llama-server #12782
from openai import OpenAI

model = 'QwQ-32B'
port = '8007'
prompt = 'write quick sort in python'
key = 'sk-no-key-required'  # placeholder; not defined in the original report

openai_api_base = "http://localhost:{}/v1".format(port)
client = OpenAI(api_key=key, base_url=openai_api_base)
response = client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ],
    stop=["<|im_end|>", "<|endoftext|>", "</s>", "<|eot_id|>"],
    stream=True
)

# Print the streamed completion; the repetition shows up in this output.
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
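The same request can also be sent without the Python client; a rough curl equivalent against llama-server's OpenAI-compatible /v1/chat/completions endpoint (same port and prompt as above, non-streaming for readability) would be:

curl http://localhost:8007/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "QwQ-32B",
        "messages": [
          {"role": "system", "content": "You are a helpful assistant."},
          {"role": "user", "content": "write quick sort in python"}
        ],
        "stream": false
      }'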
llama-server output
Hi. Please try moving the temperature sampler to the end of the samplers sequence, i.e.:
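Applied to the --samplers value in the reported command, that reordering would look roughly like this:

--samplers "top_k;top_p;min_p;dry;typ_p;xtc;temperature"

Samplers run in the order they are listed, so putting temperature last means top_k/top_p/min_p, DRY, typ_p and XTC filter the distribution before temperature scaling is applied.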
How was this problem solved?
Name and Version
MODEL_ROOT=/mnt/backup/models
GGUF=Qwen/QwQ-32B-GGUF/qwq-32b-q4_k_m.gguf
PORT=8007
PARALLEL=4   # concurrency; requests beyond this will block
MAX_LEN=8192 # maximum context length
docker run -v $MODEL_ROOT:/models \
  -p $PORT:8007 ghcr.io/ggml-org/llama.cpp:server \
  -m /models/$GGUF \
  --port 8007 --host 0.0.0.0 -n $MAX_LEN \
  --parallel $PARALLEL \
  --threads 32 \
  --ctx-size 16384 \
  --seed 3407 \
  --prio 2 \
  --temp 0.6 \
  --repeat-penalty 1.1 \
  --dry-multiplier 0.5 \
  --min-p 0.01 \
  --top-k 40 \
  --top-p 0.95 \
  --no-cnv \
  --chat-template deepseek3 \
  --samplers "top_k;top_p;min_p;temperature;dry;typ_p;xtc"
Operating systems
Linux
GGML backends
CPU
Hardware
9135 + 4090
Pure CPU
Models
Qwen/QwQ-32B-GGUF/qwq-32b-q4_k_m.gguf
Problem description & steps to reproduce
When I run ghcr.io/ggml-org/llama.cpp:server with the command above, I get repeated output.
First Bad Commit
No response
Relevant log output