8000 Define USE_C10D_XCCL and USE_XCCL in pytorch by Chao1Han · Pull Request #147593 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Define USE_C10D_XCCL and USE_XCCL in pytorch #147593

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

Chao1Han
Copy link
Contributor
@Chao1Han Chao1Han commented Feb 21, 2025

Motivation:

Add USE_XCCL and USE_C10D_XCCL to enable support of XCCL backend building in stock PyTorch, similar to USE_NCCL and USE_C10D_NCCL.
By default, USE_XCCL is OFF and allowed set to ON explicitly.

cc @gujinghui @EikanWang @fengyuan14 @guangyey

Copy link
pytorch-bot bot commented Feb 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/147593

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

✅ No Failures

As of commit 373f4df with merge base 014726d (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@guangyey guangyey marked this pull request as draft February 21, 2025 06:22
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Feb 21, 2025
Copy link
pytorch-bot bot commented Feb 21, 2025

To add the ciflow label ciflow/xpu please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label Feb 21, 2025
@guangyey guangyey added the release notes: xpu release notes category label Feb 21, 2025
@guangyey
Copy link
Collaborator
guangyey commented Feb 21, 2025

I'm not sure if this change will pass the CI build without modifying torch-xpu-ops.

@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Feb 21, 2025
@pkourdis
Copy link

I was wondering if both the XCCL and NCCL summaries should be grouped inside the if below:

  message(STATUS "  USE_DISTRIBUTED       : ${USE_DISTRIBUTED}")
  if(${USE_DISTRIBUTED})
    message(STATUS "    USE_MPI               : ${USE_MPI}")
    message(STATUS "    USE_GLOO              : ${USE_GLOO}")
    message(STATUS "    USE_GLOO_WITH_OPENSSL : ${USE_GLOO_WITH_OPENSSL}")
    message(STATUS "    USE_TENSORPIPE        : ${USE_TENSORPIPE}")
  endif()

@Chao1Han
Copy link
Contributor Author

I was wondering if both the XCCL and NCCL summaries should be grouped inside the if below:

  message(STATUS "  USE_DISTRIBUTED       : ${USE_DISTRIBUTED}")
  if(${USE_DISTRIBUTED})
    message(STATUS "    USE_MPI               : ${USE_MPI}")
    message(STATUS "    USE_GLOO              : ${USE_GLOO}")
    message(STATUS "    USE_GLOO_WITH_OPENSSL : ${USE_GLOO_WITH_OPENSSL}")
    message(STATUS "    USE_TENSORPIPE        : ${USE_TENSORPIPE}")
  endif()

I agree with outputting the USE_XCCL status here. The XCCL library path might not be something users care about, so there's no need to print that information separately. However, for NCCL and UCC, USE_SYSTEM_NCCL determines whether NCCL is built from source, which might not be suitable to group together in the same place. What’s your opinion? @zhangxiaoli73

@guangyey guangyey marked this pull request as ready for review February 24, 2025 01:17
@zhangxiaoli73
Copy link
Contributor

I was wondering if both the XCCL and NCCL summaries should be grouped inside the if below:

  message(STATUS "  USE_DISTRIBUTED       : ${USE_DISTRIBUTED}")
  if(${USE_DISTRIBUTED})
    message(STATUS "    USE_MPI               : ${USE_MPI}")
    message(STATUS "    USE_GLOO              : ${USE_GLOO}")
    message(STATUS "    USE_GLOO_WITH_OPENSSL : ${USE_GLOO_WITH_OPENSSL}")
    message(STATUS "    USE_TENSORPIPE        : ${USE_TENSORPIPE}")
  endif()

Currently, what we are doing for XCCL backend is on par with NCCL backend. If you think it's better to group all those information under this if condition, I think we can have a separate PR to track.

@pkourdis
Copy link

I was wondering if both the XCCL and NCCL summaries should be grouped inside the if below:

  message(STATUS "  USE_DISTRIBUTED       : ${USE_DISTRIBUTED}")
  if(${USE_DISTRIBUTED})
    message(STATUS "    USE_MPI               : ${USE_MPI}")
    message(STATUS "    USE_GLOO              : ${USE_GLOO}")
    message(STATUS "    USE_GLOO_WITH_OPENSSL : ${USE_GLOO_WITH_OPENSSL}")
    message(STATUS "    USE_TENSORPIPE        : ${USE_TENSORPIPE}")
  endif()

Currently, what we are doing for XCCL backend is on par with NCCL backend. If you think it's better to group all those information under this if condition, I think we can have a separate PR to track.

That works also - thank you for the feedback.

@Chao1Han
Copy link
Contributor Author
Chao1Han commented Mar 5, 2025

@malfet @atalman @albanD
Could you help review this pr?

Copy link
Contributor
@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code should be generated thru some sort of a template, but sure, why not

@guangyey
Copy link
Collaborator
guangyey commented Mar 5, 2025

@pytorchbot rebase

@guangyey
Copy link
Collaborator
guangyey commented Mar 5, 2025

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Mar 5, 2025
@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased xccl-cmake onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xccl-cmake && git pull --rebase)

@pytorch-bot pytorch-bot bot removed ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks labels Mar 5, 2025
@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased xccl-cmake onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xccl-cmake && git pull --rebase)

@pytorch-bot pytorch-bot bot removed the ciflow/xpu Run XPU CI tasks label May 7, 2025
github-merge-queue bot pushed a commit to intel/torch-xpu-ops that referenced this pull request May 8, 2025
for pytorch/pytorch#147593 pass ci

---------

Co-authored-by: Yutao Xu <yutao.xu@intel.com>
Co-authored-by: mengfei25 <mengfei.li@Intel.com>
@Chao1Han
Copy link
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased xccl-cmake onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xccl-cmake && git pull --rebase)

@Chao1Han
Copy link
Contributor Author

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 15, 2025
@cyyever
Copy link
Collaborator
cyyever commented May 15, 2025

@pytorchmergebot rebase -b main

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased xccl-cmake onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout xccl-cmake && git pull --rebase)

@pytorch-bot pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label May 15, 2025
@cyyever
Copy link
Collaborator
cyyever commented May 15, 2025

@pytorchbot merge -i

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label May 15, 2025
@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Copy link
Collaborator

The merge job was canceled or timed out. This most often happen if two merge requests were issued for the same PR, or if merge job was waiting for more than 6 hours for tests to finish. In later case, please do not hesitate to reissue the merge command
For more information see pytorch-bot wiki.

@pytorchmergebot
Copy link
Collaborator

Merge started

Your change will be merged while ignoring the following 0 checks:

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@cyyever cyyever added the module: xpu Intel XPU related issues label May 15, 2025
@github-project-automation github-project-automation bot moved this from Approved to Done in PyTorch Intel May 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged module: xpu Intel XPU related issues open source release notes: xpu release notes category
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

9 participants
0