Misc. bug: Overflow in Cast ( · Issue #13722 · ggml-org/llama.cpp

Closed

TheDarkTrumpet opened this issue May 23, 2025 · 3 comments

@TheDarkTrumpet

Name and Version

Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
Device 1: NVIDIA RTX 6000 Ada Generation, compute capability 8.9, VMM: yes
version: 5433 (759e37b)
built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Command line

Problem description & steps to reproduce

Error

gguf-py/gguf/lazy.py:217: RuntimeWarning: overflow encountered in cast

Description

Good morning. I've used Axolotl to train a 32B LoRA model and then merged the adapter into one large model. I have a "discussion" on axolotl-ai-cloud/axolotl#2705 where I go into this a bit. The fine-tuning went fine; the configuration is below. When I try to convert the merged model to GGUF, it throws this runtime warning, and the subsequent quantization then fails entirely.

At first I thought this might have been hardware, maybe a bad disk, but I think I was wrong: I bought a new NVMe drive and started fresh, with a fresh build of llama.cpp and everything. Nothing fixed it.
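A minimal sketch of how NumPy produces this class of warning (not from this report, just made-up values): a float32 value larger than the finite float16 maximum (65504) overflows to inf when cast, which is what emits "RuntimeWarning: overflow encountered in cast".

# Minimal sketch, not from this report: casting a float32 value beyond the
# finite float16 maximum (65504) overflows to inf and triggers
# "RuntimeWarning: overflow encountered in cast".
import numpy as np

weights = np.array([1.5, 70000.0], dtype=np.float32)  # 70000 > float16 max
as_f16 = weights.astype(np.float16)                   # emits the RuntimeWarning
print(as_f16)                                         # [1.5 inf]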

Axolotl Config:

# Originally taken from: https://github.com/axolotl-ai-cloud/axolotl/blob/a27b909c5c1c2c561a8d503024b89afcce15226f/examples/qwen3/32b-qlora.yaml
base_model: Qwen/Qwen2.5-32B

plugins:
  - axolotl.integrations.cut_cross_entropy.CutCrossEntropyPlugin
strict: false

chat_template: qwen_25
datasets:
  - path: ./no_git/17_suggested_changes_with_new.csv
    type: alpaca
val_set_size: 0
eval_sample_packing: true  

output_dir: ./no_git/17_qwenmspe_changes/
dataset_prepared_path: last_run_prepared

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

load_in_4bit: true
adapter: qlora
lora_r: 32 # up from 16
lora_alpha: 64 # Up from 32
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - down_proj
  - up_proj
lora_mlp_kernel: true
lora_qkv_kernel: true
lora_o_kernel: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 1
num_epochs: 10
optimizer: adamw_torch_4bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: auto
tf32: true

gradient_checkpointing: offload
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_steps: 50 # From 10
evals_per_epoch: 4
saves_per_epoch: 1
weight_decay: 0.01 # up from 0
special_tokens:
use_tensorboard: true

Commands for merge and after:

python3 -m axolotl.cli.merge_lora 17-train-qwen32b-lora.yaml --lora_model_dir="./no_git/17_qwenmspe_changes"

# v-- Breaks here with the warning I mentioned.
python3 convert_hf_to_gguf.py ~/path/to/17_qwenmspe_changes/merged/

llama-quantize ~/path/to/17_qwenmspe_changes/merged/Merged-33B-F16.gguf ~/path/to/17_qwenmspe_changes/msom-qwen32b-mspe-suggestions-Q5.gguf Q5_0

I'm at a bit of a loss as to what could have caused all this. I had successful runs in the past, but now something else is going on.

First Bad Commit

No response

Relevant log output

@slaren
Member
slaren commented May 23, 2025

Maybe your model has values that cannot be represented in F16. Try adding --outtype bf16 or --outtype f32 to the convert_hf_to_gguf.py command line.
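One way to check whether that is what is happening is to scan the merged checkpoint for values outside the finite F16 range before converting. The sketch below assumes the merged model is stored as safetensors shards; the "merged/*.safetensors" path and the use of the safetensors library are assumptions, not details from this thread.

# Hedged sketch: look for tensor values outside the finite F16 range (+/- 65504).
# The glob pattern and the use of safetensors are assumptions about how the
# merged model is stored on disk.
import glob
from safetensors.torch import load_file

F16_MAX = 65504.0  # largest finite float16 value

for shard in sorted(glob.glob("merged/*.safetensors")):
    for name, tensor in load_file(shard).items():
        amax = tensor.abs().max().item()
        if amax > F16_MAX:
            print(f"{shard}: {name} has max |value| {amax}, outside the F16 range")

If any tensor trips this check, converting with --outtype bf16 or --outtype f32 as suggested keeps those values finite, since both formats share float32's exponent range.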

@TheDarkTrumpet
Author

Thanks for the reply and help. I'm in the middle of another test, checking whether it's something different between my original training run and this one (basically, I had a much earlier iteration that this one was based off of when I encountered the issue). That old version ran fine; this one didn't. It's training now, and if it throws a similar error I'll try those options as well. I'll update today with the results. Thanks again.

@TheDarkTrumpet
Author

I did some more testing. I think it has to do with a bad drive in the machine and nothing wrong with the code; two different sets of tests appear to make this the case. Closing the issue. Thanks @slaren for helping. I'll keep those options in mind if I need them in the future.
