Training/Fine-tuning fails with PyTorch 2.8 + 4x 5090 GPUs using DDP/FSDP/DeepSpeed · Issue #150734 · pytorch/pytorch

Training/Fine-tuning fails with PyTorch 2.8 + 4x 5090 GPUs using DDP/FSDP/DeepSpeed #150734

Open
felixliufei opened this issue Apr 5, 2025 · 9 comments
Labels
module: ddp Issues/PRs related to distributed data parallel training module: fsdp oncall: distributed Add this issue/PR to distributed oncall triage queue triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@felixliufei
felixliufei commented Apr 5, 2025

🐛 Describe the bug

Hi everyone,
I seem to have hit a roadblock and could use some help or clarification.
Environment:

  • PyTorch Version: 2.8 (Is this correct? Please confirm the exact version)
  • GPUs: 4 x NVIDIA 5090
  • Parallelism Strategy Tried: DistributedDataParallel (DDP), FullyShardedDataParallel (FSDP), DeepSpeed
  • Task: Training / Fine-tuning (Inference works fine)
  • Other relevant environment details (Please add if possible):
    • Operating System: [Ubuntu 22.04]
    • CUDA Version: [12.8]
    • NVIDIA Driver Version: [570]
    • Python Version: [3.10]
Problem Description:
I am currently unable to successfully run training or fine-tuning jobs when using data parallelism on a system equipped with 4 NVIDIA 5090 GPUs and PyTorch 2.8. I have attempted to use standard DistributedDataParallel (DDP), FullyShardedDataParallel (FSDP), and also integrated DeepSpeed, but all attempts fail during the training/fine-tuning phase.
Interestingly, running inference tasks on the same multi-GPU setup works without issues. The problem appears specifically related to the training/fine-tuning process combined with data parallelism libraries.
Question:
Is there a known limitation or incompatibility with PyTorch 2.8 (or the associated libraries like DDP, FSDP, DeepSpeed) that prevents data parallel training/fine-tuning on a 4x NVIDIA 5090 configuration? Or could there be other configuration issues I might be overlooking?
Any insights, confirmation of compatibility, or suggestions for troubleshooting would be greatly appreciated. If specific error messages or a minimal reproducible code example would be helpful, please let me know, and I can try to provide them.
Thanks for your help
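
For reference, here is a minimal sketch of the kind of DDP training loop being described (hypothetical, not my actual fine-tuning code), launched with something like torchrun --nproc_per_node=4 repro.py (file name arbitrary):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK / MASTER_ADDR / MASTER_PORT
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(32, 1).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(64, 32, device=f"cuda:{local_rank}")
        y = torch.randn(64, 1, device=f"cuda:{local_rank}")
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()  # gradient all-reduce over NCCL happens here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()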

Versions

wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py

For security purposes, please check the contents of collect_env.py before running it.

python collect_env.py

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @zhaojuanmao @mrshenli @rohan-varma @chauhang @mori360

@jbschlosser
Contributor

Hey @felixliufei, do you have more details on the errors you're hitting / ideally a small repro script? This will help us help you - thanks!

@jbschlosser jbschlosser added needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user oncall: distributed Add this issue/PR to distributed oncall triage queue labels Apr 7, 2025
@Enlux
Enlux commented Apr 16, 2025

Can confirm this is happening when using nn.DataParallel with a 5090 and 4x 4090s.
No matter the combination of GPUs I use with the 5090, I'm always getting a NCCL WARN Cuda failure 700 'an illegal memory access was encountered' error, and the GPU keeps spinning at 100% even though the Python script has finished execution. The only thing that fixes it is restarting.
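
A minimal sketch of the nn.DataParallel pattern I mean (hypothetical, not my actual script); the failure shows up as soon as the first forward pass triggers NCCL traffic between the GPUs:

import torch
import torch.nn as nn

# Hypothetical toy model; any module reproduces the pattern.
model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 1)).cuda()
model = nn.DataParallel(model)          # replicates across all visible GPUs

x = torch.randn(256, 32, device="cuda")
y = torch.randn(256, 1, device="cuda")

out = model(x)                          # parameters are broadcast to the other GPUs here
loss = nn.functional.mse_loss(out, y)
loss.backward()
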
Logs:

NCCL INFO cudaDriverVersion 12080
dev:10869:10869 [0] NCCL INFO Bootstrap: Using enp7s18:192.168.1.103<0>
dev:10869:10869 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
dev:10869:11496 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
dev:10869:11496 [1] NCCL INFO NET/IB : No device found.
dev:10869:11496 [1] NCCL INFO NET/IB : Using [RO]; OOB enp7s18:192.168.1.103<0>
dev:10869:11496 [1] NCCL INFO NET/Socket : Using [0]enp7s18:192.168.1.103<0> [1]enp7s19:10.10.10.2<0> [2]veth83e1755:fe80::2c37:67ff:fe66:f9df%veth83e1755<0> [3]veth2a2a9e5:fe80::dc5a:f5ff:fe6f:2f98%veth2a2a9e5<0>
dev:10869:11496 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
dev:10869:11496 [1] NCCL INFO Using network Socket
dev:10869:11499 [4] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
dev:10869:11499 [4] NCCL INFO Using network Socket
dev:10869:11497 [2] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
dev:10869:11497 [2] NCCL INFO Using network Socket
dev:10869:11498 [3] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
dev:10869:11498 [3] NCCL INFO Using network Socket
dev:10869:11495 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
dev:10869:11495 [0] NCCL INFO Using network Socket
dev:10869:11496 [1] NCCL INFO ncclCommInitAll comm 0x1401152e0 rank 1 nranks 5 cudaDev 1 nvmlDev 2 busId 3000 commId 0x6a86e01cf77b68ee - Init START
dev:10869:11497 [2] NCCL INFO ncclCommInitAll comm 0x140198620 rank 2 nranks 5 cudaDev 2 nvmlDev 4 busId 5000 commId 0x6a86e01cf77b68ee - Init START
dev:10869:11498 [3] NCCL INFO ncclCommInitAll comm 0x14021bae0 rank 3 nranks 5 cudaDev 3 nvmlDev 0 busId 1000 commId 0x6a86e01cf77b68ee - Init START
dev:10869:11495 [0] NCCL INFO ncclCommInitAll comm 0x140094270 rank 0 nranks 5 cudaDev 0 nvmlDev 1 busId 2000 commId 0x6a86e01cf77b68ee - Init START
dev:10869:11499 [4] NCCL INFO ncclCommInitAll comm 0x14029eef0 rank 4 nranks 5 cudaDev 4 nvmlDev 3 busId 4000 commId 0x6a86e01cf77b68ee - Init START
dev:10869:11497 [2] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
dev:10869:11498 [3] NCCL INFO Bootstrap timings total 0.001235 (create 0.000041, send 0.000155, recv 0.000644, ring 0.000203, delay 0.000000)
dev:10869:11499 [4] NCCL INFO Bootstrap timings total 0.000900 (create 0.000046, send 0.000160, recv 0.000403, ring 0.000114, delay 0.000000)
dev:10869:11497 [2] NCCL INFO Bootstrap timings total 0.001281 (create 0.000042, send 0.000156, recv 0.000394, ring 0.000253, delay 0.000000)
dev:10869:11496 [1] NCCL INFO Bootstrap timings total 0.001343 (create 0.000041, send 0.000174, recv 0.000257, ring 0.000238, delay 0.000001)
dev:10869:11495 [0] NCCL INFO Bootstrap timings total 0.001287 (create 0.000039, send 0.000157, recv 0.000462, ring 0.000120, delay 0.000000)
dev:10869:11499 [4] NCCL INFO NVLS multicast support is not available on dev 4
dev:10869:11495 [0] NCCL INFO NVLS multicast support is not available on dev 0
dev:10869:11496 [1] NCCL INFO NVLS multicast support is not available on dev 1
dev:10869:11498 [3] NCCL INFO NVLS multicast support is not available on dev 3
dev:10869:11497 [2] NCCL INFO NVLS multicast support is not available on dev 2
dev:10869:11497 [2] NCCL INFO comm 0x140198620 rank 2 nRanks 5 nNodes 1 localRanks 5 localRank 2 MNNVL 0
dev:10869:11498 [3] NCCL INFO comm 0x14021bae0 rank 3 nRanks 5 nNodes 1 localRanks 5 localRank 3 MNNVL 0
dev:10869:11499 [4] NCCL INFO comm 0x14029eef0 rank 4 nRanks 5 nNodes 1 localRanks 5 localRank 4 MNNVL 0
dev:10869:11495 [0] NCCL INFO comm 0x140094270 rank 0 nRanks 5 nNodes 1 localRanks 5 localRank 0 MNNVL 0
dev:10869:11497 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1
dev:10869:11496 [1] NCCL INFO comm 0x1401152e0 rank 1 nRanks 5 nNodes 1 localRanks 5 localRank 1 MNNVL 0
dev:10869:11497 [2] NCCL INFO P2P Chunksize set to 131072
dev:10869:11498 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2
dev:10869:11498 [3] NCCL INFO P2P Chunksize set to 131072
dev:10869:11495 [0] NCCL INFO Channel 00/04 : 0 1 2 3 4
dev:10869:11495 [0] NCCL INFO Channel 01/04 : 0 1 2 3 4
dev:10869:11495 [0] NCCL INFO Channel 02/04 : 0 1 2 3 4
dev:10869:11495 [0] NCCL INFO Channel 03/04 : 0 1 2 3 4
dev:10869:11495 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1
dev:10869:11495 [0] NCCL INFO P2P Chunksize set to 131072
dev:10869:11499 [4] NCCL INFO Trees [0] -1/-1/-1->4->3 [1] -1/-1/-1->4->3 [2] -1/-1/-1->4->3 [3] -1/-1/-1->4->3
dev:10869:11499 [4] NCCL INFO P2P Chunksize set to 131072
dev:10869:11496 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0
dev:10869:11496 [1] NCCL INFO P2P Chunksize set to 131072
dev:10869:11495 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 1
dev:10869:11539 [3] NCCL INFO [Proxy Service] Device 3 CPU core 117
dev:10869:11544 [3] NCCL INFO [Proxy Service UDS] Device 3 CPU core 119
dev:10869:11541 [0] NCCL INFO [Proxy Service] Device 0 CPU core 110
dev:10869:11542 [2] NCCL INFO [Proxy Service UDS] Device 2 CPU core 118
dev:10869:11543 [4] NCCL INFO [Proxy Service] Device 4 CPU core 115
dev:10869:11540 [1] NCCL INFO [Proxy Service] Device 1 CPU core 108
dev:10869:11545 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 120
dev:10869:11538 [2] NCCL INFO [Proxy Service] Device 2 CPU core 116
dev:10869:11546 [4] NCCL INFO [Proxy Service UDS] Device 4 CPU core 109
dev:10869:11547 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 121
dev:10869:11497 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
dev:10869:11497 [2] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dev:10869:11498 [3] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
dev:10869:11498 [3] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dev:10869:11499 [4] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
dev:10869:11499 [4] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dev:10869:11496 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
dev:10869:11496 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dev:10869:11495 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 512 | 512
dev:10869:11495 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
dev:10869:11495 [0] NCCL INFO CC Off, workFifoBytes 1048576
dev:10869:11498 [3] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
dev:10869:11498 [3] NCCL INFO ncclCommInitAll comm 0x14021bae0 rank 3 nranks 5 cudaDev 3 nvmlDev 0 busId 1000 commId 0x6a86e01cf77b68ee - Init COMPLETE
dev:10869:11498 [3] NCCL INFO Init timings - ncclCommInitAll: rank 3 nranks 5 total 34.08 (kernels 33.97, alloc 0.07, bootstrap 0.00, allgathers 0.00, topo 0.03, graphs 0.00, connections 0.01, rest 0.00)
dev:10869:11496 [1] NCCL INFO ncclCommInitAll comm 0x1401152e0 rank 1 nranks 5 cudaDev 1 nvmlDev 2 busId 3000 commId 0x6a86e01cf77b68ee - Init COMPLETE
dev:10869:11496 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 5 total 34.08 (kernels 33.97, alloc 0.07, bootstrap 0.00, allgathers 0.00, topo 0.03, graphs 0.00, connections 0.01, rest 0.00)
dev:10869:11497 [2] NCCL INFO ncclCommInitAll comm 0x140198620 rank 2 nranks 5 cudaDev 2 nvmlDev 4 busId 5000 commId 0x6a86e01cf77b68ee - Init COMPLETE
dev:10869:11497 [2] NCCL INFO Init timings - ncclCommInitAll: rank 2 nranks 5 total 34.08 (kernels 33.97, alloc 0.07, bootstrap 0.00, allgathers 0.00, topo 0.03, graphs 0.00, connections 0.01, rest 0.00)
dev:10869:11499 [4] NCCL INFO ncclCommInitAll comm 0x14029eef0 rank 4 nranks 5 cudaDev 4 nvmlDev 3 busId 4000 commId 0x6a86e01cf77b68ee - Init COMPLETE
dev:10869:11499 [4] NCCL INFO Init timings - ncclCommInitAll: rank 4 nranks 5 total 34.08 (kernels 33.97, alloc 0.07, bootstrap 0.00, allgathers 0.01, topo 0.01, graphs 0.00, connections 0.01, rest 0.00)
dev:10869:11495 [0] NCCL INFO ncclCommInitAll comm 0x140094270 rank 0 nranks 5 cudaDev 0 nvmlDev 1 busId 2000 commId 0x6a86e01cf77b68ee - Init COMPLETE
dev:10869:11495 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 5 total 34.08 (kernels 33.97, alloc 0.07, bootstrap 0.00, allgathers 0.00, topo 0.03, graphs 0.00, connections 0.01, rest 0.00)
dev:10869:11548 [4] NCCL INFO Channel 00 : 4[3] -> 0[1] via SHM/direct/direct
dev:10869:11549 [3] NCCL INFO Channel 00 : 3[0] -> 4[3] via SHM/direct/direct
dev:10869:11550 [2] NCCL INFO Channel 00 : 2[4] -> 3[0] via SHM/direct/direct
dev:10869:11551 [1] NCCL INFO Channel 00 : 1[2] -> 2[4] via SHM/direct/direct
dev:10869:11548 [4] NCCL INFO Channel 01 : 4[3] -> 0[1] via SHM/direct/direct
dev:10869:11552 [0] NCCL INFO Channel 00 : 0[1] -> 1[2] via SHM/direct/direct
dev:10869:11549 [3] NCCL INFO Channel 01 : 3[0] -> 4[3] via SHM/direct/direct
dev:10869:11550 [2] NCCL INFO Channel 01 : 2[4] -> 3[0] via SHM/direct/direct
dev:10869:11551 [1] NCCL INFO Channel 01 : 1[2] -> 2[4] via SHM/direct/direct
dev:10869:11548 [4] NCCL INFO Channel 02 : 4[3] -> 0[1] via SHM/direct/direct
dev:10869:11552 [0] NCCL INFO Channel 01 : 0[1] -> 1[2] via SHM/direct/direct
dev:10869:11549 [3] NCCL INFO Channel 02 : 3[0] -> 4[3] via SHM/direct/direct
dev:10869:11550 [2] NCCL INFO Channel 02 : 2[4] -> 3[0] via SHM/direct/direct
dev:10869:11551 [1] NCCL INFO Channel 02 : 1[2] -> 2[4] via SHM/direct/direct
dev:10869:11548 [4] NCCL INFO Channel 03 : 4[3] -> 0[1] via SHM/direct/direct
dev:10869:11552 [0] NCCL INFO Channel 02 : 0[1] -> 1[2] via SHM/direct/direct
dev:10869:11549 [3] NCCL INFO Channel 03 : 3[0] -> 4[3] via SHM/direct/direct
dev:10869:11550 [2] NCCL INFO Channel 03 : 2[4] -> 3[0] via SHM/direct/direct
dev:10869:11551 [1] NCCL INFO Channel 03 : 1[2] -> 2[4] via SHM/direct/direct
dev:10869:11552 [0] NCCL INFO Channel 03 : 0[1] -> 1[2] via SHM/direct/direct
dev:10869:11548 [4] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
dev:10869:11549 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
dev:10869:11552 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
dev:10869:11551 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
dev:10869:11550 [2] NCCL INFO Connected all rings, use ring PXN 0 GDR 1

[2025-04-16 15:00:23] dev:10869:10869 [2] enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
dev:10869:10869 [2] NCCL INFO group.cc:241 -> 1
dev:10869:10869 [2] NCCL INFO group.cc:478 -> 1
dev:10869:10869 [2] NCCL INFO group.cc:581 -> 1
dev:10869:10869 [2] NCCL INFO group.cc:106 -> 1

@younader

Hey @jbschlosser, I am also facing this issue on my local 2x 5090 setup. I have CUDA 12.8 installed and a fresh conda environment with

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128 -U

I am also able to replicate this using the scripts from https://github.com/The-AI-Summer/pytorch-ddp/tree/main on a cloud 2x 5090 setup from vast.ai, using their docker image https://hub.docker.com/r/vastai/pytorch/ and running the pip install above to get the latest torch.
Here are the logs from running:
NCCL_DEBUG=INFO python train_dp.py

/venv/main/lib/python3.10/site-packages/torchvision/models/_utils.py:135: UserWarning: Using 'weights' as positional parameter(s) is deprecated since 0.13 and may be removed in the future. Please use keyword parameter(s) instead.
  warnings.warn(
/venv/main/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
  warnings.warn(msg)
Let's use 2 GPUs!
Start training...
f3ea4d093b5b:3043:3043 [0] NCCL INFO cudaDriverVersion 12080
f3ea4d093b5b:3043:3043 [0] NCCL INFO Bootstrap: Using eth0:172.17.0.2<0>
f3ea4d093b5b:3043:3043 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
f3ea4d093b5b:3043:3196 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
f3ea4d093b5b:3043:3196 [0] NCCL INFO Failed to open libibverbs.so[.1]
f3ea4d093b5b:3043:3196 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
f3ea4d093b5b:3043:3196 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
f3ea4d093b5b:3043:3196 [0] NCCL INFO Using network Socket
f3ea4d093b5b:3043:3197 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
f3ea4d093b5b:3043:3197 [1] NCCL INFO Using network Socket
f3ea4d093b5b:3043:3197 [1] NCCL INFO ncclCommInitAll comm 0x55735b49f440 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 21000 commId 0xa96a92d5a20ea16e - Init START
f3ea4d093b5b:3043:3196 [0] NCCL INFO ncclCommInitAll comm 0x557357df6190 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0xa96a92d5a20ea16e - Init START
f3ea4d093b5b:3043:3197 [1] NCCL INFO RAS client listening socket at ::1<28028>
f3ea4d093b5b:3043:3197 [1] NCCL INFO Bootstrap timings total 0.001284 (create 0.000052, send 0.000143, recv 0.000344, ring 0.000035, delay 0.000000)
f3ea4d093b5b:3043:3196 [0] NCCL INFO Bootstrap timings total 0.001234 (create 0.000037, send 0.000131, recv 0.000464, ring 0.000031, delay 0.000000)
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3197 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,ffffffff,ffffffff,00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff,ffffffff
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3196 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,ffffffff,ffffffff,00000000,00000000,00000000,00000000,ffffffff,ffffffff,ffffffff,ffffffff
f3ea4d093b5b:3043:3197 [1] NCCL INFO comm 0x55735b49f440 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
f3ea4d093b5b:3043:3196 [0] NCCL INFO comm 0x557357df6190 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
f3ea4d093b5b:3043:3196 [0] NCCL INFO Channel 00/04 : 0 1
f3ea4d093b5b:3043:3196 [0] NCCL INFO Channel 01/04 : 0 1
f3ea4d093b5b:3043:3197 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P Chunksize set to 131072
f3ea4d093b5b:3043:3196 [0] NCCL INFO Channel 02/04 : 0 1
f3ea4d093b5b:3043:3196 [0] NCCL INFO Channel 03/04 : 0 1
f3ea4d093b5b:3043:3197 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3196 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P Chunksize set to 131072
f3ea4d093b5b:3043:3196 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3196 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 1
f3ea4d093b5b:3043:3526 [0] NCCL INFO [Proxy Service] Device 0 CPU core 14
f3ea4d093b5b:3043:3525 [1] NCCL INFO [Proxy Service] Device 1 CPU core 15
f3ea4d093b5b:3043:3527 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 272
f3ea4d093b5b:3043:3528 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 23
f3ea4d093b5b:3043:3196 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
f3ea4d093b5b:3043:3196 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
f3ea4d093b5b:3043:3197 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
f3ea4d093b5b:3043:3197 [1] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
f3ea4d093b5b:3043:3196 [0] NCCL INFO CC Off, workFifoBytes 1048576
f3ea4d093b5b:3043:3197 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
f3ea4d093b5b:3043:3197 [1] NCCL INFO ncclCommInitAll comm 0x55735b49f440 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 21000 commId 0xa96a92d5a20ea16e - Init COMPLETE
f3ea4d093b5b:3043:3196 [0] NCCL INFO ncclCommInitAll comm 0x557357df6190 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0xa96a92d5a20ea16e - Init COMPLETE
f3ea4d093b5b:3043:3197 [1] NCCL INFO Init timings - ncclCommInitAll: rank 1 nranks 2 total 427.26 (kernels 427.24, alloc 0.00, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
f3ea4d093b5b:3043:3196 [0] NCCL INFO Init timings - ncclCommInitAll: rank 0 nranks 2 total 427.26 (kernels 427.24, alloc 0.01, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
f3ea4d093b5b:3043:3529 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3530 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3529 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3530 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3529 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3529 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3530 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3529 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3530 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3529 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
f3ea4d093b5b:3043:3529 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3529 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
f3ea4d093b5b:3043:3529 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3529 [1] NCCL INFO Channel 02 : 1[1] -> 0[0] via SHM/direct/direct
f3ea4d093b5b:3043:3529 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3529 [1] NCCL INFO Channel 03 : 1[1] -> 0[0] via SHM/direct/direct
f3ea4d093b5b:3043:3530 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3530 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
f3ea4d093b5b:3043:3530 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3530 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
f3ea4d093b5b:3043:3530 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3530 [0] NCCL INFO Channel 02 : 0[0] -> 1[1] via SHM/direct/direct
f3ea4d093b5b:3043:3530 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
f3ea4d093b5b:3043:3530 [0] NCCL INFO Channel 03 : 0[0] -> 1[1] via SHM/direct/direct
f3ea4d093b5b:3043:3529 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
f3ea4d093b5b:3043:3530 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1

[2025-04-20 00:53:04] f3ea4d093b5b:3043:3043 [1] enqueue.cc:1592 NCCL WARN Cuda failure 'an illegal memory access was encountered'
f3ea4d093b5b:3043:3043 [1] NCCL INFO group.cc:250 -> 1
f3ea4d093b5b:3043:3043 [1] NCCL INFO group.cc:478 -> 1
f3ea4d093b5b:3043:3043 [1] NCCL INFO group.cc:581 -> 1
f3ea4d093b5b:3043:3043 [1] NCCL INFO group.cc:106 -> 1
Traceback (most recent call last):
  File "/root/train_dp.py", line 112, in <module>
    train(net, trainloader)
  File "/root/train_dp.py", line 60, in train
    outputs = net(images)
  File "/venv/main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/venv/main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/venv/main/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 193, in forward
    replicas = self.replicate(self.module, self.device_ids[: len(inputs)])
  File "/venv/main/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 200, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "/venv/main/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 126, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "/venv/main/lib/python3.10/site-packages/torch/nn/parallel/replicate.py", line 95, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "/venv/main/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/venv/main/lib/python3.10/site-packages/torch/nn/parallel/_functions.py", line 23, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/venv/main/lib/python3.10/site-packages/torch/nn/parallel/comm.py", line 66, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 1: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x88 (0x7ff96f9806a8 in /venv/main/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x55 (0x7ff96f91c223 in /venv/main/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3e2 (0x7ff96fda9402 in /venv/main/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1bf9f (0x7ff96fd73f9f in /venv/main/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x1d046 (0x7ff96fd75046 in /venv/main/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x1d3da (0x7ff96fd753da in /venv/main/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x45f782 (0x7ff962572782 in /venv/main/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7ff96f95d549 in /venv/main/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x723998 (0x7ff962836998 in /venv/main/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x723db1 (0x7ff962836db1 in /venv/main/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: <unknown function> + 0x29d90 (0x7ff970621d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: __libc_start_main + 0x80 (0x7ff970621e40 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted 

I have noticed that, for whatever reason, NCCL is running with CUDA 12.2, which I think is incompatible with the 5090s; I don't know if/why that would cause the illegal memory access though. I made sure that I have NCCL built against CUDA 12.8 on my setup, but I believe torch ships with its own version, so it ignores my system install.
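
For completeness, a quick way to check which CUDA and NCCL builds the installed torch wheel reports (a small sketch using standard torch APIs; the example values in the comments are just what the logs above suggest):

import torch

print(torch.__version__)          # e.g. a 2.8 nightly build
print(torch.version.cuda)         # CUDA toolkit torch was built against, e.g. 12.8
print(torch.cuda.nccl.version())  # bundled NCCL version, e.g. (2, 26, 2)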

@fduwjj fduwjj added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 23, 2025
@Yingshu-Li

Same problem here. Any solution now?

@TidalPaladin

I'm facing a similar issue with 2x5090. The issue seems to be a function of parameter size. The script below reproduces the issue on my machine.

import os
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler
import argparse


class SimpleModel(nn.Module):
    def __init__(self, hidden_size=128, ffn_hidden_size=512):
        super().__init__()
        self.proj1 = nn.Linear(32, hidden_size)
        self.proj2 = nn.Linear(hidden_size, ffn_hidden_size)
        self.proj3 = nn.Linear(ffn_hidden_size, 1)

    def forward(self, x):
        y = self.proj1(x)
        y = self.proj2(y)
        y = self.proj3(y)
        return y

class DummyDataset(Dataset):
    def __init__(self, size=1000):
        self.data = torch.randn(size, 32)
        self.targets = torch.randn(size, 1)

    def __len__(self): return len(self.data)
    def __getitem__(self, idx): return self.data[idx], self.targets[idx]

def setup(rank, world_size):
    os.environ.update({'MASTER_ADDR': 'localhost', 'MASTER_PORT': '12355'})
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

def train(rank=0, world_size=1, use_ddp=False, hidden_size=128, ffn_hidden_size=512):
    if use_ddp:
        setup(rank, world_size)
        device = f"cuda:{rank}"
        torch.cuda.set_device(device)
    else:
        device = "cuda:0"
        rank = 0
    
    model = SimpleModel(hidden_size, ffn_hidden_size).to(device)
    model = DDP(model, device_ids=[rank]) if use_ddp else model
    print(model)
    
    dataset = DummyDataset()
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank) if use_ddp else None
    dataloader = DataLoader(dataset, batch_size=32, sampler=sampler, shuffle=not use_ddp)
    
    optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
    criterion = nn.MSELoss()
    num_epochs = 4

    for epoch in range(num_epochs):
        if use_ddp: sampler.set_epoch(epoch)
        for batch_idx, (data, target) in enumerate(dataloader):
            data, target = data.to(device), target.to(device)

            if use_ddp:
                gathered_targets = [torch.zeros_like(target) for _ in range(world_size)]
                dist.all_gather(gathered_targets, target)

            optimizer.zero_grad()
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = criterion(model(data), target)
                loss.backward()
            optimizer.step()

            if batch_idx % 10 == 0:
                print(f"Rank {rank}, Epoch {epoch}, Batch {batch_idx}, Loss: {loss.item():.4f}")
    
    if use_ddp: dist.destroy_process_group()

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--use-ddp', action='store_true', help='Use DDP for multi-GPU training')
    parser.add_argument('--hidden-size', type=int, default=128, help='Hidden size')
    parser.add_argument('--ffn-hidden-size', type=int, default=512, help='FFN hidden size')
    args = parser.parse_args()

    if args.use_ddp:
        world_size = torch.cuda.device_count()
        if world_size < 2:
            print("DDP mode requires at least 2 GPUs")
            return
        mp.spawn(train, args=(world_size, True, args.hidden_size, args.ffn_hidden_size), nprocs=world_size, join=True)
    else:
        if not torch.cuda.is_available():
            print("No GPU available for training")
            return
        train(hidden_size=args.hidden_size, ffn_hidden_size=args.ffn_hidden_size)

if __name__ == "__main__":
    main() 

Usage:

  • python script.py - No error (single GPU)
  • python script.py --use-ddp --hidden-size=64 --ffn-hidden-size=64 - No error
  • python script.py --use-ddp --hidden-size=128 --ffn-hidden-size=512 - Error below
obsidian:130930:130930 [0] NCCL INFO Bootstrap: Using eno1np0:192.168.0.152<0>
obsidian:130930:130930 [0] NCCL INFO cudaDriverVersion 12080
obsidian:130930:130930 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
obsidian:130930:130930 [0] NCCL INFO Comm config Blocking set to 1
obsidian:130931:130931 [1] NCCL INFO cudaDriverVersion 12080
obsidian:130931:130931 [1] NCCL INFO Bootstrap: Using eno1np0:192.168.0.152<0>
obsidian:130931:130931 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
obsidian:130931:130931 [1] NCCL INFO Comm config Blocking set to 1
obsidian:130931:131093 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
obsidian:130930:131092 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
obsidian:130930:131092 [0] NCCL INFO NET/IB : Using [0]bnxt_re0:1/RoCE [RO]; OOB eno1np0:192.168.0.152<0>
obsidian:130930:131092 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
obsidian:130930:131092 [0] NCCL INFO Using network IB
obsidian:130930:131092 [0] NCCL INFO ncclCommInitRankConfig comm 0xf2e5500 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 81000 commId 0x8b0af5d13cf149b6 - Init START
obsidian:130931:131093 [1] NCCL INFO NET/IB : Using [0]bnxt_re0:1/RoCE [RO]; OOB eno1np0:192.168.0.152<0>
obsidian:130931:131093 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
obsidian:130931:131093 [1] NCCL INFO Using network IB
obsidian:130931:131093 [1] NCCL INFO ncclCommInitRankConfig comm 0x42b173c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId c1000 commId 0x8b0af5d13cf149b6 - Init START
obsidian:130930:131092 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
obsidian:130931:131093 [1] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
obsidian:130931:131093 [1] NCCL INFO Bootstrap timings total 0.000858 (create 0.000036, send 0.000168, recv 0.000284, ring 0.000048, delay 0.000000)
obsidian:130930:131092 [0] NCCL INFO Bootstrap timings total 0.009204 (create 0.000036, send 0.000172, recv 0.008539, ring 0.000070, delay 0.000000)
obsidian:130930:131092 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131093 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131093 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131093 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131093 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131093 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131093 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131093 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131093 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO comm 0xf2e5500 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
obsidian:130931:131093 [1] NCCL INFO comm 0x42b173c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
obsidian:130930:131092 [0] NCCL INFO Channel 00/02 : 0 1
obsidian:130930:131092 [0] NCCL INFO Channel 01/02 : 0 1
obsidian:130931:131093 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
obsidian:130930:131092 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
obsidian:130931:131093 [1] NCCL INFO P2P Chunksize set to 131072
obsidian:130930:131092 [0] NCCL INFO P2P Chunksize set to 131072
obsidian:130931:131093 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131092 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
obsidian:130931:131252 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 10
obsidian:130930:131253 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 127
obsidian:130931:131250 [1] NCCL INFO [Proxy Service] Device 1 CPU core 42
obsidian:130930:131251 [0] NCCL INFO [Proxy Service] Device 0 CPU core 27
obsidian:130931:131093 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
obsidian:130931:131093 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
obsidian:130930:131092 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
obsidian:130930:131092 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
obsidian:130930:131092 [0] NCCL INFO CC Off, workFifoBytes 1048576
obsidian:130931:131093 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
obsidian:130931:131093 [1] NCCL INFO ncclCommInitRankConfig comm 0x42b173c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId c1000 commId 0x8b0af5d13cf149b6 - Init COMPLETE
obsidian:130931:131093 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 33.34 (kernels 33.32, alloc 0.02, bootstrap 0.00, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
obsidian:130930:131092 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
obsidian:130930:131092 [0] NCCL INFO ncclCommInitRankConfig comm 0xf2e5500 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 81000 commId 0x8b0af5d13cf149b6 - Init COMPLETE
obsidian:130930:131092 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 33.35 (kernels 33.32, alloc 0.00, bootstrap 0.01, allgathers 0.00, topo 0.01, graphs 0.00, connections 0.00, rest 0.00)
obsidian:130930:131254 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131255 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131254 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131255 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131254 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131255 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131254 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
obsidian:130930:131254 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130931:131255 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
obsidian:130931:131255 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:130930:131254 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
obsidian:130931:131255 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
obsidian:130931:131255 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
obsidian:130930:131254 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1

[2025-04-30 01:23:35] obsidian:130930:130930 [0] enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
obsidian:130930:130930 [0] NCCL INFO group.cc:241 -> 1
obsidian:130930:130930 [0] NCCL INFO group.cc:478 -> 1
obsidian:130930:130930 [0] NCCL INFO group.cc:581 -> 1
obsidian:130930:130930 [0] NCCL INFO enqueue.cc:2299 -> 1
terminate called after throwing an instance of 'c10::Error'

[2025-04-30 01:23:35] obsidian:130931:130931 [1] enqueue.cc:1556 NCCL WARN Cuda failure 700 'an illegal memory access was encountered'
obsidian:130931:130931 [1] NCCL INFO group.cc:241 -> 1
obsidian:130931:130931 [1] NCCL INFO group.cc:478 -> 1
obsidian:130931:130931 [1] NCCL INFO group.cc:581 -> 1
obsidian:130931:130931 [1] NCCL INFO enqueue.cc:2299 -> 1
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x7d4e617785e8 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x7d4e6170d4a2 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x7d4e6dfa5422 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e79f (0x7d4e6df6d79f in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20060 (0x7d4e6df6f060 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x2028c (0x7d4e6df6f28c in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x44d142 (0x7d4e6084d142 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7d4e61752f39 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x1629e70 (0x7d4e4d629e70 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x13661f2 (0x7d4e4d3661f2 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xc337a0 (0x7d4e610337a0 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x37f17d (0x7d4e6077f17d in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: <unknown function> + 0x2a1ca (0x7d4e6ec2a1ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #24: __libc_start_main + 0x8b (0x7d4e6ec2a28b in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: _start + 0x29 (0x6000a9 in /home/chase/Documents/mit-ub-mammo/.venv/bin/python)

  what():  CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:43 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x98 (0x704d31d785e8 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xe0 (0x704d31d0d4a2 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3c2 (0x704d3e5a5422 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1e79f (0x704d3e56d79f in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x20060 (0x704d3e56f060 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #5: <unknown function> + 0x2028c (0x704d3e56f28c in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10_cuda.so)
frame #6: <unknown function> + 0x44d142 (0x704d30e4d142 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x704d31d52f39 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x1629e70 (0x704d1dc29e70 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x13661f2 (0x704d1d9661f2 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_cpu.so)
frame #10: <unknown function> + 0xc337a0 (0x704d316337a0 in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x37f17d (0x704d30d7f17d in /home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: <unknown function> + 0x2a1ca (0x704d3f22a1ca in /lib/x86_64-linux-gnu/libc.so.6)
frame #24: __libc_start_main + 0x8b (0x704d3f22a28b in /lib/x86_64-linux-gnu/libc.so.6)
frame #25: _start + 0x29 (0x6000a9 in /home/chase/Documents/mit-ub-mammo/.venv/bin/python)

W0430 01:23:36.837000 130859 .venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py:169] Terminating process 130931 via signal SIGTERM
Traceback (most recent call last):
  File "/home/chase/Documents/mit-ub-mammo/ddp_repro.py", line 98, in <module>
    main()
    ^^^^^^
  File "/home/chase/Documents/mit-ub-mammo/ddp_repro.py", line 90, in main
    mp.spawn(train, args=(world_size, True, args.hidden_size, args.ffn_hidden_size), nprocs=world_size, join=True)
  File "/home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
              ^^^^^^^^^^^^^^
  File "/home/chase/Documents/mit-ub-mammo/.venv/lib/python3.12/site-packages/torch/multiprocessing/spawn.py", line 196, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT

For reference, here is the successful output when using hidden_size=ffn_hidden_size=64:

obsidian:132212:132212 [0] NCCL INFO Bootstrap: Using eno1np0:192.168.0.152<0>
obsidian:132212:132212 [0] NCCL INFO cudaDriverVersion 12080
obsidian:132212:132212 [0] NCCL INFO NCCL version 2.26.2+cuda12.2
obsidian:132214:132214 [1] NCCL INFO cudaDriverVersion 12080
obsidian:132212:132212 [0] NCCL INFO Comm config Blocking set to 1
obsidian:132214:132214 [1] NCCL INFO Bootstrap: Using eno1np0:192.168.0.152<0>
obsidian:132214:132214 [1] NCCL INFO NCCL version 2.26.2+cuda12.2
obsidian:132214:132214 [1] NCCL INFO Comm config Blocking set to 1
obsidian:132214:132385 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
obsidian:132212:132384 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal net plugin.
obsidian:132212:132384 [0] NCCL INFO NET/IB : Using [0]bnxt_re0:1/RoCE [RO]; OOB eno1np0:192.168.0.152<0>
obsidian:132212:132384 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
obsidian:132212:132384 [0] NCCL INFO Using network IB
obsidian:132212:132384 [0] NCCL INFO ncclCommInitRankConfig comm 0x10828290 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 81000 commId 0xc2f3aa3f7847f0e7 - Init START
obsidian:132214:132385 [1] NCCL INFO NET/IB : Using [0]bnxt_re0:1/RoCE [RO]; OOB eno1np0:192.168.0.152<0>
obsidian:132214:132385 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
obsidian:132214:132385 [1] NCCL INFO Using network IB
obsidian:132214:132385 [1] NCCL INFO ncclCommInitRankConfig comm 0x129767c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId c1000 commId 0xc2f3aa3f7847f0e7 - Init START
obsidian:132212:132384 [0] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
obsidian:132214:132385 [1] NCCL INFO RAS client listening socket at 127.0.0.1<28028>
obsidian:132212:132384 [0] NCCL INFO Bootstrap timings total 0.010850 (create 0.000032, send 0.000139, recv 0.010224, ring 0.000053, delay 0.000000)
obsidian:132214:132385 [1] NCCL INFO Bootstrap timings total 0.000855 (create 0.000035, send 0.000154, recv 0.000306, ring 0.000042, delay 0.000000)
obsidian:132212:132384 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132385 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132385 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132385 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132385 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132385 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132385 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132385 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132385 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO comm 0x10828290 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
obsidian:132214:132385 [1] NCCL INFO comm 0x129767c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
obsidian:132212:132384 [0] NCCL INFO Channel 00/02 : 0 1
obsidian:132212:132384 [0] NCCL INFO Channel 01/02 : 0 1
obsidian:132214:132385 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
obsidian:132212:132384 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
obsidian:132214:132385 [1] NCCL INFO P2P Chunksize set to 131072
obsidian:132212:132384 [0] NCCL INFO P2P Chunksize set to 131072
obsidian:132214:132385 [1] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132384 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 0 directMode 0
obsidian:132214:132520 [1] NCCL INFO [Proxy Service] Device 1 CPU core 22
obsidian:132214:132522 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 7
obsidian:132212:132523 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 45
obsidian:132212:132521 [0] NCCL INFO [Proxy Service] Device 0 CPU core 56
obsidian:132214:132385 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
obsidian:132214:132385 [1] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
obsidian:132212:132384 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
obsidian:132212:132384 [0] NCCL INFO 2 coll channels, 2 collnet channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
obsidian:132212:132384 [0] NCCL INFO CC Off, workFifoBytes 1048576
obsidian:132214:132385 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
obsidian:132214:132385 [1] NCCL INFO ncclCommInitRankConfig comm 0x129767c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId c1000 commId 0xc2f3aa3f7847f0e7 - Init COMPLETE
obsidian:132214:132385 [1] NCCL INFO Init timings - ncclCommInitRankConfig: rank 1 nranks 2 total 33.33 (kernels 33.30, alloc 0.02, bootstrap 0.00, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.00, rest 0.00)
obsidian:132212:132384 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
obsidian:132212:132384 [0] NCCL INFO ncclCommInitRankConfig comm 0x10828290 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 81000 commId 0xc2f3aa3f7847f0e7 - Init COMPLETE
obsidian:132212:132384 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 2 total 33.33 (kernels 33.31, alloc 0.00, bootstrap 0.01, allgathers 0.00, topo 0.00, graphs 0.00, connections 0.00, rest 0.00)
obsidian:132214:132524 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132525 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132524 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132525 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132525 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132524 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132525 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
obsidian:132212:132525 [0] NCCL INFO P2P is disabled between connected GPUs 0 and 1. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132214:132524 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
obsidian:132214:132524 [1] NCCL INFO P2P is disabled between connected GPUs 1 and 0. You can repress this message with NCCL_IGNORE_DISABLED_P2P=1.
obsidian:132212:132525 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
obsidian:132214:132524 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
obsidian:132212:132525 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
obsidian:132214:132524 [1] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
DistributedDataParallel(
  (module): SimpleModel(
    (proj1): Linear(in_features=32, out_features=64, bias=True)
    (proj2): Linear(in_features=64, out_features=64, bias=True)
    (proj3): Linear(in_features=64, out_features=1, bias=True)
  )
)
DistributedDataParallel(
  (module): SimpleModel(
    (proj1): Linear(in_features=32, out_features=64, bias=True)
    (proj2): Linear(in_features=64, out_features=64, bias=True)
    (proj3): Linear(in_features=64, out_features=1, bias=True)
  )
)
Rank 0, Epoch 0, Batch 0, Loss: 0.9754
Rank 1, Epoch 0, Batch 0, Loss: 1.0029
Rank 1, Epoch 0, Batch 10, Loss: 1.1040
Rank 0, Epoch 0, Batch 10, Loss: 0.9561
Rank 1, Epoch 1, Batch 0, Loss: 1.4073
Rank 0, Epoch 1, Batch 0, Loss: 1.1108
Rank 1, Epoch 1, Batch 10, Loss: 1.0181
Rank 0, Epoch 1, Batch 10, Loss: 1.2757
Rank 1, Epoch 2, Batch 0, Loss: 1.0212
Rank 0, Epoch 2, Batch 0, Loss: 0.8153
Rank 1, Epoch 2, Batch 10, Loss: 0.9735
Rank 0, Epoch 2, Batch 10, Loss: 0.9293
Rank 1, Epoch 3, Batch 0, Loss: 1.0410
Rank 0, Epoch 3, Batch 0, Loss: 0.6550
Rank 1, Epoch 3, Batch 10, Loss: 1.2370
Rank 0, Epoch 3, Batch 10, Loss: 0.8594
[W430 01:26:19.732150588 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
obsidian:132212:132541 [0] NCCL INFO misc/socket.cc:64 -> 3
obsidian:132212:132541 [0] NCCL INFO misc/socket.cc:80 -> 3
obsidian:132212:132541 [0] NCCL INFO misc/socket.cc:829 -> 3
obsidian:132212:132541 [0] NCCL INFO misc/socket.cc:64 -> 3
obsidian:132212:132541 [0] NCCL INFO misc/socket.cc:80 -> 3
obsidian:132212:132541 [0] NCCL INFO misc/socket.cc:829 -> 3
obsidian:132212:132521 [0] NCCL INFO misc/socket.cc:881 -> 3
obsidian:132214:132543 [1] NCCL INFO misc/socket.cc:64 -> 3
obsidian:132214:132543 [1] NCCL INFO misc/socket.cc:80 -> 3
obsidian:132214:132543 [1] NCCL INFO misc/socket.cc:829 -> 3
obsidian:132212:132521 [0] NCCL INFO misc/socket.cc:881 -> 3
obsidian:132214:132543 [1] NCCL INFO misc/socket.cc:64 -> 3
obsidian:132214:132543 [1] NCCL INFO misc/socket.cc:80 -> 3
obsidian:132214:132543 [1] NCCL INFO misc/socket.cc:829 -> 3
obsidian:132214:132520 [1] NCCL INFO misc/socket.cc:881 -> 3
obsidian:132212:132541 [0] NCCL INFO comm 0x10828290 rank 0 nranks 2 cudaDev 0 busId 81000 - Abort COMPLETE
obsidian:132214:132543 [1] NCCL INFO comm 0x129767c0 rank 1 nranks 2 cudaDev 1 busId c1000 - Abort COMPLETE
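Side note: the shutdown warning above (and the socket.cc errors that follow it) go away once the process group is torn down explicitly before the script exits. A minimal sketch of that cleanup, assuming the script is launched with torchrun; this is illustrative, not the actual repro script:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK (and RANK/WORLD_SIZE) for each worker process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    model = DDP(torch.nn.Linear(32, 1).cuda(local_rank), device_ids=[local_rank])
    # ... training loop goes here ...

    dist.barrier()                # wait until every rank is done
    dist.destroy_process_group()  # explicit cleanup silences the shutdown warning

if __name__ == "__main__":
    main()
```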

Environment info:

Collecting environment information...
PyTorch version: 2.7.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: 18.1.3 (1ubuntu1)
CMake version: version 3.28.3
Libc version: glibc-2.39

Python version: 3.12.9 (main, Feb 12 2025, 14:50:50) [Clang 19.1.6 ] (64-bit runtime)
Python platform: Linux-6.8.0-58-generic-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 12.8.93
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 5090
GPU 1: NVIDIA GeForce RTX 5090

Nvidia driver version: 570.133.20
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.8.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.8.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7763 64-Core Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   2
Core(s) per socket:                   64
Socket(s):                            1
Stepping:                             1
Frequency boost:                      enabled
CPU(s) scaling MHz:                   53%
CPU max MHz:                          3529.0520
CPU min MHz:                          1500.0000
BogoMIPS:                             4899.89
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap
Virtualization:                       AMD-V
L1d cache:                            2 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             32 MiB (64 instances)
L3 cache:                             256 MiB (8 instances)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; Safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] flake8==7.2.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.2.5
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pytorch-lightning==2.5.1.post0
[pip3] torch==2.7.0+cu128
[pip3] torchmetrics==1.7.1
[pip3] torchvision==0.22.0+cu128
[pip3] transformer_engine_torch==2.2.0
[pip3] triton==3.3.0
[conda] Could not collect

@jbschlosser jbschlosser added module: ddp Issues/PRs related distributed data parallel training module: fsdp and removed needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user labels Apr 30, 2025
@jbschlosser
Copy link
Contributor

Will leave this to the distributed experts; thanks for the repro script :)


@youngmae
Copy link

pip install --upgrade nvidia-nccl-cu12

The latest NCCL cleared the memory error for me - though I still can't go beyond 4 cards:

CUDA_VISIBLE_DEVICES=0,1,2,3 python -c "import torch; torch._C._cuda_getDeviceCount()" works fine, but anything more (e.g. CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7) gives me: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu.
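For anyone else trying this: a quick way to confirm that the upgraded NCCL wheel is actually the one PyTorch loads, and that all visible devices initialize (just a sanity-check snippet, not part of the original repro):

```python
import torch

print("torch:", torch.__version__)
print("NCCL:", torch.cuda.nccl.version())        # NCCL version the running torch process uses
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```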

@Enlux
Copy link
Enlux commented May 14, 2025

@youngmae thank you! Just upgraded and can confirm it works with 5 GPUs: 4 x 4090 and 1 x 5090!
I had the issue you're describing on Proxmox with GPU passthrough. I switched to bare metal and the issue disappeared. Are you by any chance passing the GPUs through to a VM?
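Related to the passthrough question: virtualized setups often lose GPU peer-to-peer access, which would match the "P2P is disabled" lines in the log above. A quick way to check peer access from PyTorch (just a sanity-check sketch, assuming at least two visible GPUs):

```python
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            print(f"GPU {i} -> GPU {j} peer access:",
                  torch.cuda.can_device_access_peer(i, j))
```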

@youngmae
Copy link
youngmae commented May 15, 2025

No VM, but I'm using a conda env; I pip-installed PyTorch inside that environment. Are you using a virtual environment as well, or installing straight to the OS?

edit: just tried without a venv, still the same issue
