Training/Fine-tuning fails with PyTorch 2.8 + 4x 5090 GPUs using DDP/FSDP/DeepSpeed #150734
Comments
Hey @felixliufei, do you have more details on the errors you're hitting / ideally a small repro script? This will help us help you - thanks!
Can confirm this is happening when using nn.DataParallel with a 5090 and 4x4090s
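For readers unfamiliar with that setup, a minimal `nn.DataParallel` sketch along these lines (illustrative only, not the commenter's actual code) is enough to exercise the same path:

```python
import torch
import torch.nn as nn

# Replicate a tiny model across all visible GPUs (e.g. one 5090 plus four 4090s)
# and run a single training step; DataParallel scatters the batch in forward
# and gathers gradients back onto GPU 0 in backward.
model = nn.DataParallel(nn.Linear(1024, 1024)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
loss.backward()
optimizer.step()
print(loss.item())
```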
Hey @jbschlosser I am also facing this issue on my local 2x 5090 setup. I have CUDA 12.8 installed and a fresh conda environment with:
I am also able to replicate this using the scripts from https://github.com/The-AI-Summer/pytorch-ddp/tree/main on a cloud 2x 5090 setup from vast.ai, using their Docker image https://hub.docker.com/r/vastai/pytorch/ and running the pip install above for the latest torch install.
I have noticed that, for whatever reason, NCCL is running with CUDA 12.2, which I think is incompatible with the 5090s; I don't know if/why that would cause the illegal memory access though. I made sure that I have NCCL built for CUDA 12.8 on my setup, but I believe torch ships with its own version, so it ignores my CUDA setup.
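A quick way to check which CUDA and NCCL versions the installed torch wheel actually bundles (plain PyTorch APIs, offered here as a suggestion rather than something from the original comment):

```python
import torch

print(torch.__version__)          # installed PyTorch build
print(torch.version.cuda)         # CUDA toolkit the wheel was built against
print(torch.cuda.nccl.version())  # NCCL version torch itself links, as a tuple
```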
Same problem here. Any solution yet?
I'm facing a similar issue with 2x 5090. The issue seems to be a function of parameter size. The script below reproduces the issue on my machine.

```python
import os
import argparse

import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler


class SimpleModel(nn.Module):
    def __init__(self, hidden_size=128, ffn_hidden_size=512):
        super().__init__()
        self.proj1 = nn.Linear(32, hidden_size)
        self.proj2 = nn.Linear(hidden_size, ffn_hidden_size)
        self.proj3 = nn.Linear(ffn_hidden_size, 1)

    def forward(self, x):
        y = self.proj1(x)
        y = self.proj2(y)
        y = self.proj3(y)
        return y


class DummyDataset(Dataset):
    def __init__(self, size=1000):
        self.data = torch.randn(size, 32)
        self.targets = torch.randn(size, 1)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.targets[idx]


def setup(rank, world_size):
    os.environ.update({'MASTER_ADDR': 'localhost', 'MASTER_PORT': '12355'})
    dist.init_process_group("nccl", rank=rank, world_size=world_size)


def train(rank=0, world_size=1, use_ddp=False, hidden_size=128, ffn_hidden_size=512):
    if use_ddp:
        setup(rank, world_size)
        device = f"cuda:{rank}"
        torch.cuda.set_device(device)
    else:
        device = "cuda:0"
        rank = 0

    model = SimpleModel(hidden_size, ffn_hidden_size).to(device)
    model = DDP(model, device_ids=[rank]) if use_ddp else model
    print(model)

    dataset = DummyDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank) if use_ddp else None
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler, shuffle=not use_ddp)

    optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
    criterion = nn.MSELoss()
    num_epochs = 4

    for epoch in range(num_epochs):
        if use_ddp:
            sampler.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(device), target.to(device)
            if use_ddp:
                # extra collective to exercise NCCL
                gathered_targets = [torch.zeros_like(target) for _ in range(world_size)]
                dist.all_gather(gathered_targets, target)
            optimizer.zero_grad()
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = criterion(model(data), target)
            loss.backward()
            optimizer.step()
            if batch_idx % 10 == 0:
                print(f"Rank {rank}, Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")

    if use_ddp:
        dist.destroy_process_group()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--use-ddp', action='store_true', help='Use DDP for multi-GPU training')
    parser.add_argument('--hidden-size', type=int, default=128, help='Hidden size')
    parser.add_argument('--ffn-hidden-size', type=int, default=512, help='FFN hidden size')
    args = parser.parse_args()

    if args.use_ddp:
        world_size = torch.cuda.device_count()
        if world_size < 2:
            print("DDP mode requires at least 2 GPUs")
            return
        # mp.spawn passes the rank as the first argument to train()
        mp.spawn(train, args=(world_size, True, args.hidden_size, args.ffn_hidden_size), nprocs=world_size, join=True)
    else:
        if not torch.cuda.is_available():
            print("No GPU available for training")
            return
        train(hidden_size=args.hidden_size, ffn_hidden_size=args.ffn_hidden_size)


if __name__ == "__main__":
    main()
```

Usage:
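A plausible invocation, assuming the script is saved as `repro.py` (the filename is an assumption; the flags follow the argparse options above):

```bash
# single-GPU baseline
python repro.py --hidden-size 128 --ffn-hidden-size 512

# multi-GPU DDP run that reproduces the failure on the reporter's machine
python repro.py --use-ddp --hidden-size 128 --ffn-hidden-size 512
```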
For reference, here is the successful output when using
Environment info:
Will leave this to the distributed experts; thanks for the repro script :)
`pip install --upgrade nvidia-nccl-cu12`

The latest NCCL cleared the memory error for me, though I still can't go beyond 4 cards: `CUDA_VISIBLE_DEVICES=0,1,2,3 python -c "import torch; torch._C._cuda_getDeviceCount()"` works fine, but anything more (e.g. `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`) gives me: `UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu.`
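To confirm the upgraded NCCL is actually being picked up across all cards, a small all-reduce smoke test like the following can help (a sketch, not from the original comment; `nccl_smoke.py` is a hypothetical filename, launched with torchrun):

```python
# Run with: torchrun --nproc_per_node=<num_gpus> nccl_smoke.py
import torch
import torch.distributed as dist

dist.init_process_group("nccl")   # torchrun supplies rank/world size via env vars
rank = dist.get_rank()
torch.cuda.set_device(rank)

t = torch.ones(1024, device=f"cuda:{rank}")
dist.all_reduce(t)                # a single NCCL collective across all ranks
print(f"rank {rank}: sum={t[0].item()}, nccl={torch.cuda.nccl.version()}")

dist.destroy_process_group()
```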
@youngmae thank you! Just upgraded and confirmed working with 5 GPUs: 4x 4090 and 1x 5090!
No VM, but using a conda env. I pip installed PyTorch within the environment though. Are you using a virtual environment as well, or just installing straight to the OS?

Edit: just tried without a venv, still the same issue.
🐛 Describe the bug
Hi everyone,
I seem to have hit a roadblock and could use some help or clarification.
Environment:
Problem Description:
I am currently unable to successfully run training or fine-tuning jobs when using data parallelism on a system equipped with 4 NVIDIA 5090 GPUs and PyTorch 2.8. I have attempted to use standard DistributedDataParallel (DDP), FullyShardedDataParallel (FSDP), and also integrated DeepSpeed, but all attempts fail during the training/fine-tuning phase.
Interestingly, running inference tasks on the same multi-GPU setup works without issues. The problem appears specifically related to the training/fine-tuning process combined with data parallelism libraries.
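For context, a minimal sketch of the kind of FSDP training loop being described (illustrative only; the module sizes and filename are assumptions, not the author's actual fine-tuning code):

```python
# Run with: torchrun --nproc_per_node=4 fsdp_sketch.py  (hypothetical filename)
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
model = FSDP(model)  # shard parameters, gradients, and optimizer state across the GPUs
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

data = torch.randn(8, 1024, device=f"cuda:{rank}")
target = torch.randn(8, 1024, device=f"cuda:{rank}")

for _ in range(10):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(data), target)
    loss.backward()   # backward triggers the gradient reduce-scatter collectives
    optimizer.step()

dist.destroy_process_group()
```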
Question:
Is there a known limitation or incompatibility with PyTorch 2.8 (or the associated libraries like DDP, FSDP, DeepSpeed) that prevents data parallel training/fine-tuning on a 4x NVIDIA 5090 configuration? Or could there be other configuration issues I might be overlooking?
Any insights, confirmation of compatibility, or suggestions for troubleshooting would be greatly appreciated. If specific error messages or a minimal reproducible code example would be helpful, please let me know, and I can try to provide them.
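One quick configuration check worth running (standard PyTorch APIs, suggested here rather than taken from the report) is whether the installed wheel actually ships kernels for the 5090's compute capability (sm_120):

```python
import torch

print(torch.__version__, torch.version.cuda)  # PyTorch build and its CUDA toolkit
print(torch.cuda.get_device_capability(0))    # an RTX 5090 reports (12, 0)
print(torch.cuda.get_arch_list())             # must include 'sm_120' for Blackwell cards
```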
Thanks for your help
Versions
```bash
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360