avoid redundant event records and event blocks by taozhiwei · Pull Request #119359 · pytorch/pytorch


Closed
wants to merge 1 commit

Conversation

@taozhiwei (Contributor) commented Feb 7, 2024

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/ivalue_inl.h#L952

 void markCompleted(
     IValue value,
     c10::optional<std::vector<WeakStorage>> storages = c10::nullopt) {
   // ...
   for (const c10::Device& device : usedDevices) {
     c10::Event event(impl_.type());
     event.record(impl_.getStream(device));
     events_.push_back(std::move(event));
   }
   // ...
 }

https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/ivalue_inl.h#L1197

 void synchronizeWithCurrentStreams() {
   for (c10::Event& event : events_) {
     event.block(impl_.getStream(event.device()));
   }
   // ...
 }

When usedDevices contains the same device that impl_ is on, record and block both execute on the same stream, which is redundant (see the standalone illustration below).
Also, I would like to know under what circumstances different devices will be used? Are the record and block calls necessary here?
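
To make the redundancy concrete, here is a standalone sketch in the raw CUDA runtime API (my illustration, not PyTorch code): work on a stream executes in issue order, so making a stream wait on an event it recorded itself adds no ordering.

 #include <cuda_runtime.h>
 #include <cstdio>

 int main() {
   cudaStream_t stream;
   cudaEvent_t event;
   cudaStreamCreate(&stream);
   cudaEventCreateWithFlags(&event, cudaEventDisableTiming);

   cudaEventRecord(event, stream);         // "record", as in markCompleted()
   cudaStreamWaitEvent(stream, event, 0);  // "block" on the SAME stream: a no-op,
                                           // later work on `stream` already runs
                                           // after the record point
   cudaStreamSynchronize(stream);
   printf("same-stream wait finished; it imposed no extra ordering\n");

   cudaEventDestroy(event);
   cudaStreamDestroy(stream);
   return 0;
 }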

Additionally, I reported a suspected bug related to this:
#119266

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @ptrblck

@pytorch-bot (bot) commented Feb 7, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/119359

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit e52245b with merge base becfda0:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@cpuhrsch cpuhrsch requested a review from wconstab February 8, 2024 18:48
@cpuhrsch cpuhrsch added the oncall: distributed, triaged, and module: cuda labels Feb 8, 2024
@cpuhrsch cpuhrsch requested a review from ptrblck February 8, 2024 18:56
@kwen2501 (Contributor) commented Feb 8, 2024

Also, I would like to know under what circumstances different devices will be used?

Related to this comment: we are removing multi-device support from ProcessGroupNCCL:
#119099
#119421

@wconstab (Contributor) commented Feb 9, 2024

This looks OK to me, but I'm not very familiar with this code. Maybe @ezyang can tag someone more familiar.

@ezyang ezyang requested a review from IvanKobzarev February 9, 2024 03:46
@IvanKobzarev (Contributor) commented:
Looks good to me.
My feeling is that this is a double check/double sync and could be converted to an assert that the devices match.
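
One way to read that suggestion, as a hypothetical sketch (not the actual patch; it assumes the future also stores the stream each event was recorded on, which c10::Event does not expose today):

 struct RecordedEvent {
   c10::Event event;        // the recorded event
   c10::Stream recorded_on; // stream it was recorded on (hypothetical field)
 };

 // In markCompleted():
 //   recorded_.push_back({std::move(event), impl_.getStream(device)});

 // In synchronizeWithCurrentStreams(): skip the wait when it would be a no-op,
 // or assert the invariant instead of silently re-synchronizing.
 for (RecordedEvent& re : recorded_) {
   c10::Stream current = impl_.getStream(re.recorded_on.device());
   if (current != re.recorded_on) {
     re.event.block(current);
   }
   // Stricter variant per the comment above:
   // TORCH_INTERNAL_ASSERT(current == re.recorded_on,
   //     "synchronizing on a different stream than the event was recorded on");
 }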

@wconstab (Contributor) commented:

cc @kwen2501

@ezyang (Contributor) commented Feb 19, 2024

Not sure if the test failures are real, but they do look suspicious.

@ezyang (Contributor) commented Feb 19, 2024

@pytorchbot merge -r

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Feb 19, 2024
@pytorchmergebot (Collaborator) commented:

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot (Collaborator) commented:

Successfully rebased tzwfeature2 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout tzwfeature2 && git pull --rebase)

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Details for Dev Infra team: raised by workflow job.

@ezyang ezyang added the topic: not user facing label Feb 19, 2024
@ezyang (Contributor) commented Feb 19, 2024

@pytorchbot merge

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@pytorchmergebot (Collaborator) commented:

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

@taozhiwei (Contributor, Author) commented Mar 7, 2024

https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L2394

 {
    c10::cuda::CUDAMultiStreamGuard streamGuard(ncclStream);
    std::vector<at::Device> devices{device};
    work->future_ = c10::make_intrusive<at::ivalue::Future>(
        c10::ListType::create(c10::TensorType::get()), devices);

    // Add a callback that runs profiling end callbacks. wrapCallback() in CUDA
    // future blocks the stream this callback runs on the corresponding
    // ncclEndEvents_ ensuring appropriate synchronization.
    if (work->recordFunctionEndCallback_) {
      work->future_->addCallback(
          [work](at::ivalue::Future& /* unused */) {
            work->recordFunctionEndCallback_();
          },
          // uses_future = false allows us to skip synchronization in
          // ivalue::Future, but is only valid as long as the lambda doesn't use
          // the "Future" argument.
          /*uses_future=*/false);
    }
    work->future_->markCompleted(at::IValue(*work->outputs_));
  }
CUDAMultiStreamGuard changes the current stream, so record and block end up on different streams.
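
For reference, a minimal sketch of the guard behavior being described, using the single-stream c10::cuda::CUDAStreamGuard (my illustration; the snippet above uses the multi-stream variant, which restores the previous streams the same way on destruction):

 #include <c10/cuda/CUDAGuard.h>
 #include <c10/cuda/CUDAStream.h>
 #include <c10/util/Exception.h>

 void illustrate() {
   // Stand-ins for the streams involved; the names are mine, not from the PR.
   c10::cuda::CUDAStream ncclStream = c10::cuda::getStreamFromPool();
   c10::cuda::CUDAStream before = c10::cuda::getCurrentCUDAStream();
   {
     c10::cuda::CUDAStreamGuard guard(ncclStream);
     // Inside the guard, the current stream is ncclStream, so the event in
     // markCompleted() is recorded on ncclStream.
     TORCH_INTERNAL_ASSERT(c10::cuda::getCurrentCUDAStream() == ncclStream);
   }
   // On destruction the guard restores the previous current stream, so a later
   // synchronizeWithCurrentStreams() blocks on `before`, not on ncclStream:
   // record and block target different streams, and the block is not redundant.
   TORCH_INTERNAL_ASSERT(c10::cuda::getCurrentCUDAStream() == before);
 }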

Labels
ciflow/trunk · module: cuda · oncall: distributed · open source · topic: not user facing · triaged