E408 [Bug] Extremely high CPU/VRAM usage and slow training with Qwen3.5 · Issue #4188 · unslothai/unsloth · GitHub

@XueNianOfficial


Description:

1. Describe the bug / issue:
I am experiencing unexpectedly high CPU usage, nearly maxed-out VRAM, and extremely slow training when fine-tuning a Qwen3.5-9B-based model with Unsloth.

During training:

VRAM usage: reaches ~31.3 GB of the 32 GB available.
CPU usage: spikes to ~78%, which is unusually high.
Speed: training is extremely slow; I suspect it is hitting unified-memory / swap limits over PCIe.
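For scale, a back-of-envelope estimate (my assumption: bf16 weights, consistent with `load_in_4bit = False` in the script below) already puts the frozen weights at over half the card, before gradients, optimizer state, or 16k-token activations:

```python
# Back-of-envelope VRAM estimate, assuming the 9B model loads in bf16
# (2 bytes per parameter). An estimate only, not a measured profile.
params = 9e9
bytes_per_param = 2  # bf16
weights_gb = params * bytes_per_param / 1024**3
print(f"bf16 weights alone: {weights_gb:.1f} GB")  # ~16.8 GB
```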
2. Hardware & Environment:

OS: Windows 11
GPU: NVIDIA GeForce RTX 5090 (32 GB VRAM)
System RAM: 96 GB (53 GB used during training)
Library versions: latest unsloth, trl, torch, and bitsandbytes
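("Latest" is vague; the exact installed versions can be pinned down with the stdlib `importlib.metadata` module. The package names below are my assumption of the relevant distributions.)

```python
from importlib.metadata import PackageNotFoundError, version

def report_versions(pkgs=("unsloth", "trl", "torch", "bitsandbytes")):
    """Return {distribution name: installed version} for the given packages."""
    out = {}
    for pkg in pkgs:
        try:
            out[pkg] = version(pkg)
        except PackageNotFoundError:
            out[pkg] = "not installed"
    return out

print(report_versions())
```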
3. To Reproduce:
Here is the minimal training script I am using:

import os
os.environ['HF_HUB_OFFLINE'] = '1'

from unsloth import FastLanguageModel
import bitsandbytes as bnb
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

max_seq_length = 16384

dataset_path = r"E:\Unsloth\datasets.jsonl"
dataset = load_dataset("json", data_files=dataset_path, split="train")

base_model_path = r"E:\HuggingFace\hub\models--trohrbaugh--Qwen3.5-9B-heretic-v2\snapshots\c242553de7b8c09f95e8613429ab9faeb811eb6e"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model_path,
    max_seq_length = max_seq_length,
    load_in_4bit = False,
    local_files_only = True,
    trust_remote_code = True,
)

tokenizer.pad_token_id = tokenizer.eos_token_id

def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return { "text" : texts }

dataset = dataset.map(formatting_prompts_func, batched=True)

model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    max_seq_length = max_seq_length,
)

sft_config = SFTConfig(
    dataset_text_field="text",
    max_seq_length = max_seq_length,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    num_train_epochs=1,
    learning_rate=2e-4,
    logging_steps=1,
    output_dir="outputs_qwen35",
    optim="adamw_8bit",
    seed=3407,
    dataset_num_proc=1,
)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = sft_config,
)

trainer.train()
4. Expected behavior:
With an RTX 5090 (32 GB VRAM) and Unsloth's optimizations, I expected training to be much faster.
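For comparison, a mitigation sketch I would expect to lower the footprint (an assumption on my part, untested on this setup): the same `from_pretrained` call with 4-bit quantization enabled, which should roughly quarter the frozen-weight memory:

```python
# Hypothetical variant of the load in the script above (untested here):
# QLoRA-style 4-bit quantization of the frozen base weights via bitsandbytes.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = base_model_path,   # same local snapshot as in the script
    max_seq_length = max_seq_length,
    load_in_4bit = True,            # was False in the failing run
    local_files_only = True,
    trust_remote_code = True,
)
```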
