Misc. bug: ./llama-server API max_completion_tokens Parameter Not Working #13700

Open
ZV-Liu opened this issue May 22, 2025 · 2 comments

ZV-Liu commented May 22, 2025

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
error: invalid argument: -V

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

./llama-server -m /output/Qwen3-8B-Q4_K_M.gguff --host 0.0.0.0 --port 8080

Problem description & steps to reproduce

(The problem description was provided as a screenshot in the original issue; not reproduced here.)
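Since the screenshot is unavailable, the following is only a sketch of a likely reproduction based on the issue title (the endpoint, prompt, and token limit are assumptions): a request to llama-server's OpenAI-compatible chat endpoint that passes max_completion_tokens, where the generated reply is not capped at the requested number of tokens.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Tell me a long story."}],
        "max_completion_tokens": 50
      }'
# Expected: the completion is limited to roughly 50 tokens.
# Observed (per the report): the limit appears to be ignored.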

First Bad Commit

No response

Relevant log output

@prd-tuong-nguyen

I think it should be "max_tokens": 50
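A minimal sketch of that suggestion, reusing the assumed request from above: renaming the field to max_tokens, which llama-server's OpenAI-compatible layer reportedly honors, should cap the completion length.

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Tell me a long story."}],
        "max_tokens": 50
      }'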

ZV-Liu commented May 22, 2025

max_tokens has been deprecated by the OpenAI API in favor of max_completion_tokens
