NCCL init hits CUDA failure 'invalid argument' on 12.2 driver #150852
Comments
Is it possible to get a complete |
Hi @kiskra-nvidia thanks! I added the following Pass/Fail matrix:
@kiskra-nvidia here is the INFO log with SUBSYS=ALL. It is an 8 x H100 machine with NVSwitch. I ran the test on 4 GPUs.
Thanks! No obvious clues in these logs, unfortunately. Sometimes we see weird NVLS errors like this if the devices were not (re-)initialized in the right order (such as if the fabric manager is restarted without restarting the GPUs). Can you make sure that the initialization follows the order specified in NVIDIA Fabric Manager, section 4.3? Also, when you say that it works with one driver version but not another, is that on the same node?
No, they are on different nodes.
Demilestoning the issue, as it has a workaround and there isn't much that can be done on the PyTorch side.
Hi, this should be fixed now in the latest nightlies with NCCL 2.26.5. @kwen2501, would you mind rechecking whether this issue persists? Thanks.
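For anyone rechecking, a minimal sketch of how one might confirm that an installed build bundles NCCL 2.26.5 or newer. The helper names (`parse_nccl_version`, `nccl_at_least`, `bundled_nccl_version`) are hypothetical and not part of PyTorch; the only PyTorch call used is `torch.cuda.nccl.version()`, which is assumed here to return a `(major, minor, patch)` tuple, as it does in recent releases.

```python
def parse_nccl_version(text):
    """Turn a dotted version string such as '2.26.5' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def nccl_at_least(found, minimum=(2, 26, 5)):
    """Return True if the found NCCL version is at least `minimum`.

    The default minimum is the version named in this thread as carrying the fix.
    """
    return tuple(found) >= tuple(minimum)


def bundled_nccl_version():
    """Best-effort lookup of the NCCL version bundled with the installed torch.

    Assumption: torch.cuda.nccl.version() returns a tuple of ints.
    Returns None if torch is missing, NCCL is unavailable, or the
    return value has an unexpected shape.
    """
    try:
        import torch
        return tuple(torch.cuda.nccl.version())
    except Exception:
        return None


if __name__ == "__main__":
    found = bundled_nccl_version()
    if found is None:
        print("Could not determine the bundled NCCL version")
    else:
        print("NCCL", found, "has the fix" if nccl_at_least(found) else "predates the fix")
```

A passing version check only says that the build includes the NCCL release mentioned above; the repro still needs to be rerun on the affected 12.2-driver node to confirm the error is gone.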
🐛 Describe the bug
Error seen with nightly build, e.g. torch==2.8.0.dev20250327+cu126
Mini repro:
Command line:
Fails with 12.2 driver:
Driver Version: 535.154.05 CUDA Version: 12.2
Works with 12.4 driver:
Driver Version: 550.90.07 CUDA Version: 12.4
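When collecting results from several nodes, the two header lines above can be parsed mechanically. A minimal sketch, assuming the lines have the same "Driver Version: ... CUDA Version: ..." shape as quoted in this report; the function name `parse_smi_header` and the `OBSERVED` table are hypothetical, and the table holds only the two data points reported here, not a general compatibility rule.

```python
import re

# Matches the header fragment quoted in this report, e.g.
# "Driver Version: 535.154.05 CUDA Version: 12.2"
_HEADER = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)")

# The only two observations in this issue, keyed by the driver's CUDA version.
OBSERVED = {
    (12, 2): "fails",
    (12, 4): "works",
}


def parse_smi_header(line):
    """Extract (driver_version, cuda_version) as tuples of ints, or None if no match."""
    match = _HEADER.search(line)
    if match is None:
        return None
    driver = tuple(int(part) for part in match.group(1).split("."))
    cuda = tuple(int(part) for part in match.group(2).split("."))
    return driver, cuda


if __name__ == "__main__":
    for line in (
        "Driver Version: 535.154.05 CUDA Version: 12.2",
        "Driver Version: 550.90.07 CUDA Version: 12.4",
    ):
        driver, cuda = parse_smi_header(line)
        print(driver, cuda, OBSERVED.get(cuda, "not tested in this issue"))
```

Driver versions not listed in `OBSERVED` are deliberately reported as untested rather than guessed.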
Line 254 in nvls.cc:
Versions
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k