avoid redundant event records and event blocks by taozhiwei · Pull Request #119359 · pytorch/pytorch


Closed
wants to merge 1 commit

Conversation

@taozhiwei (Contributor) commented Feb 7, 2024

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/ivalue_inl.h#L952

 void markCompleted(
     IValue value,
     c10::optional<std::vector<WeakStorage>> storages = c10::nullopt) {
   // ...
   for (const c10::Device& device : usedDevices) {
     c10::Event event(impl_.type());
     event.record(impl_.getStream(device));
     events_.push_back(std::move(event));
   }
   // ...
 }

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/ivalue_inl.h#L1197

 void synchronizeWithCurrentStreams() {
   for (c10::Event& event : events_) {
     event.block(impl_.getStream(event.device()));
   }
   // ...
 }

When usedDevices contains the same device that impl_ is on, record and block both execute on the same stream, which is redundant (see the standalone illustration below).
Also, I would like to know under what circumstances different devices will be used? Are the record and block calls necessary here?
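
To make the redundancy concrete, here is a standalone sketch in the raw CUDA runtime API (my illustration, not PyTorch code): work on a stream executes in issue order, so making a stream wait on an event it recorded itself adds no ordering.

 #include <cuda_runtime.h>
 #include <cstdio>

 int main() {
   cudaStream_t stream;
   cudaEvent_t event;
   cudaStreamCreate(&stream);
   cudaEventCreateWithFlags(&event, cudaEventDisableTiming);

   cudaEventRecord(event, stream);         // "record", as in markCompleted()
   cudaStreamWaitEvent(stream, event, 0);  // "block" on the SAME stream: a no-op,
                                           // later work on `stream` already runs
                                           // after the record point
   cudaStreamSynchronize(stream);
   printf("same-stream wait finished; it imposed no extra ordering\n");

   cudaEventDestroy(event);
   cudaStreamDestroy(stream);
   return 0;
 }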

Additionally, I reported a suspected bug related to this:
#119266

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @ptrblck

@pytorch-bot (bot) commented Feb 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119359

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit e52245b with merge base becfda0:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@cpuhrsch cpuhrsch requested a review from wconstab February 8, 2024 18:48
@cpuhrsch cpuhrsch added the oncall: distributed, triaged, and module: cuda labels Feb 8, 2024
@cpuhrsch cpuhrsch requested a review from ptrblck February 8, 2024 18:56
@kwen2501 (Contributor) commented Feb 8, 2024

Also, I would like to know under what circumstances different devices will be used?

Related to this comment: we are removing multi-device support from ProcessGroupNCCL:
#119099
#119421

@wconstab (Contributor) commented Feb 9, 2024

This looks OK to me, but I'm not very familiar with this code. Maybe @ezyang can tag someone more familiar.

@ezyang ezyang requested a review from IvanKobzarev February 9, 2024 03:46
@IvanKobzarev (Contributor) commented:
Looks good to me.
My feeling is that this is a double check/double sync and could be converted to an assert that the devices match.
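
One way to read that suggestion, as a hypothetical sketch (not the actual patch; it assumes the future also stores the stream each event was recorded on, which c10::Event does not expose today):

 struct RecordedEvent {
   c10::Event event;        // the recorded event
   c10::Stream recorded_on; // stream it was recorded on (hypothetical field)
 };

 // In markCompleted():
 //   recorded_.push_back({std::move(event), impl_.getStream(device)});

 // In synchronizeWithCurrentStreams(): skip the wait when it would be a no-op,
 // or assert the invariant instead of silently re-synchronizing.
 for (RecordedEvent& re : recorded_) {
   c10::Stream current = impl_.getStream(re.recorded_on.device());
   if (current != re.recorded_on) {
     re.event.block(current);
   }
   // Stricter variant per the comment above:
   // TORCH_INTERNAL_ASSERT(current == re.recorded_on,
   //     "synchronizing on a different stream than the event was recorded on");
 }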

@wconstab (Contributor) commented:

cc @kwen2501

@ezyang (Contributor) commented Feb 19, 2024

Not sure if the test failures are real, but they do look suspicious.

@ezyang (Contributor) commented Feb 19, 2024

@pytorchbot merge -r

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Feb 19, 2024
@pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator) commented:

Successfully rebased tzwfeature2 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tzwfeature2 && git pull --rebase)

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job.

@ezyang ezyang added the topic: not user facing label Feb 19, 2024
@ezyang (Contributor) commented Feb 19, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@taozhiwei (Contributor, Author) commented Mar 7, 2024

https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2394

 {
    c10::cuda::CUDAMultiStreamGuard streamGuard(ncclStream);
    std::vector<at::Device> devices{device};
    work->future_ = c10::make_intrusive<at::ivalue::Future>(
        c10::ListType::create(c10::TensorType::get()), devices);

    // Add a callback that runs profiling end callbacks. wrapCallback() in CUDA
    // future blocks the stream this callback runs on the corresponding
    // ncclEndEvents_ ensuring appropriate synchronization.
    if (work->recordFunctionEndCallback_) {
      work->future_->addCallback(
          [work](at::ivalue::Future& /* unused */) {
            work->recordFunctionEndCallback_();
          },
          // uses_future = false allows us to skip synchronization in
          // ivalue::Future, but is only valid as long as the lambda doesn't use
          // the "Future" argument.
          /*uses_future=*/false);
    }
    work->future_->markCompleted(at::IValue(*work->outputs_));
  }
CUDAMultiStreamGuard changes the current stream, so record and block end up on different streams.
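
For reference, a minimal sketch of the guard behavior being described, using the single-stream c10::cuda::CUDAStreamGuard (my illustration; the snippet above uses the multi-stream variant, which restores the previous streams the same way on destruction):

 #include <c10/cuda/CUDAGuard.h>
 #include <c10/cuda/CUDAStream.h>
 #include <c10/util/Exception.h>

 void illustrate() {
   // Stand-ins for the streams involved; the names are mine, not from the PR.
   c10::cuda::CUDAStream ncclStream = c10::cuda::getStreamFromPool();
   c10::cuda::CUDAStream before = c10::cuda::getCurrentCUDAStream();
   {
     c10::cuda::CUDAStreamGuard guard(ncclStream);
     // Inside the guard, the current stream is ncclStream, so the event in
     // markCompleted() is recorded on ncclStream.
     TORCH_INTERNAL_ASSERT(c10::cuda::getCurrentCUDAStream() == ncclStream);
   }
   // On destruction the guard restores the previous current stream, so a later
   // synchronizeWithCurrentStreams() blocks on `before`, not on ncclStream:
   // record and block target different streams, and the block is not redundant.
   TORCH_INTERNAL_ASSERT(c10::cuda::getCurrentCUDAStream() == before);
 }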

Labels
ciflow/trunk · module: cuda · oncall: distributed · open source · topic: not user facing · triaged