Add NVSHMEM to PYTORCH_EXTRA_INSTALL_REQUIREMENTS by kwen2501 · Pull Request #154568 · pytorch/pytorch · GitHub

Add NVSHMEM to PYTORCH_EXTRA_INSTALL_REQUIREMENTS #154568

Closed
wants to merge 6 commits into from

kwen2501
Contributor
@kwen2501 kwen2501 commented May 28, 2025

@kwen2501 kwen2501 requested a review from a team as a code owner May 28, 2025 21:55
pytorch-bot bot commented May 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154568

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 51f8222 with merge base 241f8dc (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label May 28, 2025
@kwen2501 kwen2501 requested review from Skylion007, atalman and malfet May 28, 2025 22:17
@kwen2501
Contributor Author

@Skylion007 @atalman do you mind having a look?
@atalman could you please help upload the wheels to S3?
Thanks a lot!

@kwen2501 kwen2501 added release notes: distributed (c10d) release notes category and removed topic: not user facing topic category labels May 28, 2025
@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label May 28, 2025
kwen2501 added a commit that referenced this pull request May 28, 2025
ghstack-source-id: e471569
Pull Request resolved: #154568
@kwen2501 kwen2501 added the ciflow/trunk Trigger trunk jobs on your pull request label May 29, 2025
@@ -53,6 +53,7 @@
"nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu11==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
Contributor

Hi @kwen2501, perhaps we should not modify the 11.8 builds at this point; we are planning on dropping support for them (#147383).

Contributor Author

Thanks, removed from 11.8
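
For readers unfamiliar with the format in the hunk above: the value is a single string of pip-style requirement entries separated by ` | `, each carrying an environment marker after the `;`. The following is a minimal sketch of how such a string could be split apart for inspection. The `Requirement` type and `split_requirements` helper are hypothetical names introduced here for illustration, not part of the PyTorch build scripts, and the parsing is deliberately simplistic (it only understands `name==version; marker`).

```python
from typing import List, NamedTuple, Optional


class Requirement(NamedTuple):
    """Hypothetical container for one entry of the requirements string."""
    name: str
    version: Optional[str]
    marker: Optional[str]


def split_requirements(raw: str) -> List[Requirement]:
    """Split a '|'-separated requirements string into its entries."""
    parsed = []
    for entry in raw.split("|"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing separator
        # Split off the environment marker first, so that the '==' inside
        # the marker is never mistaken for the version pin.
        spec, _, marker = entry.partition(";")
        name, _, version = spec.strip().partition("==")
        parsed.append(
            Requirement(name.strip(), version.strip() or None, marker.strip() or None)
        )
    return parsed


# The two entries below are copied from the diff hunk in this PR.
RAW = (
    "nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
    "nvidia-nvshmem-cu11==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
)

REQS = split_requirements(RAW)
for req in REQS:
    print(req.name, req.version, "|", req.marker)
```

The marker is what restricts each extra dependency to Linux on x86_64; on other platforms pip would skip the entry.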

@Skylion007
Collaborator
Skylion007 commented May 29, 2025

@kwen2501 You also need to modify the rpaths in: .ci/manywheel/build_cuda.sh. See this PR: #138547

@kwen2501
Contributor Author

@Skylion007 thanks! Added rpath.
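
As background on the rpath request: when a wheel depends on shared libraries shipped in separate pip packages, the binary's rpath needs one search entry per package so the dynamic linker can find them at load time. The sketch below only illustrates how such a colon-separated rpath string is assembled. The `$ORIGIN/../../nvidia/<package>/lib` layout and the package directory names are assumptions made for illustration, not values quoted from `.ci/manywheel/build_cuda.sh`, and `build_rpath` is a hypothetical helper.

```python
from typing import List


def build_rpath(packages: List[str], prefix: str = "$ORIGIN/../../nvidia") -> str:
    """Join one '<prefix>/<package>/lib' entry per package with ':'.

    '$ORIGIN' is expanded by the dynamic linker to the directory that
    contains the binary being loaded.
    """
    return ":".join(f"{prefix}/{pkg}/lib" for pkg in packages)


# Hypothetical package directory names; the last one stands in for the
# kind of entry that a newly added dependency would need.
PACKAGES = ["cublas", "cudnn", "nccl", "nvshmem"]

RPATH = build_rpath(PACKAGES)
print(RPATH)
```

If an entry is missing, the library may still be found through `LD_LIBRARY_PATH` or a system install, which tends to hide the problem until the wheel is used on a clean machine.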

kwen2501 added a commit that referenced this pull request May 29, 2025
ghstack-source-id: c81bf59
Pull Request resolved: #154568
Contributor
@atalman atalman left a comment

Hi @kwen2501, I believe it needs to be added to the "Bundling with cudnn and cublas." case in .ci/manywheel/build_cuda.sh as well.

atalman added a commit to pytorch/test-infra that referenced this pull request Jun 3, 2025
@kwen2501
Contributor Author
kwen2501 commented Jun 4, 2025

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

angelayi pushed a commit to angelayi/pytorch that referenced this pull request Jun 5, 2025
@youkaichao
Collaborator

Hmm, there are often scenarios where people patch and build NVSHMEM themselves. Would PyTorch bringing in a native dependency on NVSHMEM break such usage?

For example, vLLM's recipe for using NVSHMEM is:

https://github.com/vllm-project/vllm/blob/5c8d34a42cff68dde652128726f7450032b8f474/tools/ep_kernels/install_python_libraries.sh#L33

@kwen2501

@kwen2501
Contributor Author
kwen2501 commented Jul 10, 2025

@youkaichao thanks for raising the concern. Are those patches improvements / extensions to NVSHMEM? If so, would DeepEP be interested in upstreaming them to NVSHMEM? (It would be easier for DeepEP to maintain their codebase too.)
cc @albanD

@youkaichao
Collaborator

@kwen2501 I think NVSHMEM 3.3 has integrated these patches. I haven't fully understood yet what would happen if multiple versions / instances of NVSHMEM exist in the same program.
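
One way to at least observe whether more than one copy is loaded is to inspect the process's memory map on Linux. This is a rough diagnostic sketch, not something proposed in this thread: it reads `/proc/self/maps` and collects distinct file paths whose basename contains `nvshmem`. The helper names and the sample paths are made up for illustration, and the check says nothing about how two copies would behave together.

```python
import os
from pathlib import Path
from typing import Set


def loaded_libraries(maps_text: str, needle: str) -> Set[str]:
    """Return distinct mapped file paths whose basename contains `needle`."""
    found = set()
    for line in maps_text.splitlines():
        # Format: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) == 6:  # anonymous mappings have no pathname field
            path = fields[5].strip()
            if needle in os.path.basename(path):
                found.add(path)
    return found


def nvshmem_copies_in_this_process() -> Set[str]:
    """Linux only: inspect the memory map of the current process."""
    maps = Path("/proc/self/maps")
    return loaded_libraries(maps.read_text(), "nvshmem") if maps.exists() else set()


# Made-up sample in /proc/<pid>/maps format: one copy from a pip-installed
# wheel and one from a custom build (both paths are hypothetical).
SAMPLE = """\
7f1c2a000000-7f1c2a200000 r-xp 00000000 08:01 1001 /opt/venv/lib/python3.12/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3
7f1c2a200000-7f1c2a300000 rw-p 00200000 08:01 1001 /opt/venv/lib/python3.12/site-packages/nvidia/nvshmem/lib/libnvshmem_host.so.3
7f1c2b000000-7f1c2b200000 r-xp 00000000 08:01 2002 /opt/custom/nvshmem/lib/libnvshmem_host.so.3
7f1c2c000000-7f1c2c100000 rw-p 00000000 00:00 0
"""

COPIES = loaded_libraries(SAMPLE, "nvshmem")
for path in sorted(COPIES):
    print(path)
```

Calling `nvshmem_copies_in_this_process()` after importing the relevant extensions would list whatever is actually mapped in a real process; more than one distinct path would indicate the situation raised above.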

@seth-howell

We have already incorporated the changes done by DeepEP in NVSHMEM 3.3. There was one change about "receive queue support" that was ABI breaking but it was recently confirmed that they are not using that feature anymore and that it can be removed (deepseek-ai/DeepEP#147).

We do need to create a patch for DeepEP to get rid of those changes and use upstreamed NVSHMEM directly instead. I am working on that. Once that is done, the PyTorch integration and DeepEP usage should be just fine. I will post a link to the PR I open to keep you updated.

Other than this, DeepEP carries their own version of the device-side ibgda_device.cu file which uses an internal NVSHMEM IBGDA API to be able to do QP selection. In NVSHMEM 3.4, we are working on exposing a standard NVSHMEM API for doing QP selection. DeepEP will then be free to use the exposed API rather than their internal implementation. But this does not impact PyTorch integration.

@seth-howell

FYI - Opened Friday: deepseek-ai/DeepEP#295

Labels
ciflow/trunk (Trigger trunk jobs on your pull request), Merged, release notes: distributed (c10d) (release notes category), topic: not user facing (topic category)
6 participants