8000 Enable XPUEvent elapsed_time function by guangyey · Pull Request #134666 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Enable XPUEvent elapsed_time function #134666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 8 commits into from

Conversation

guangyey
Copy link
Collaborator
@guangyey guangyey commented Aug 28, 2024

Stack from ghstack (oldest at bottom):

Motivation

This PR aims to enable elapsed_time function for XPUEvent.

Additional Context

This PR depends on toolchain oneAPI 2025.0.

cc @gujinghui @EikanWang @fengyuan14

Copy link
pytorch-bot bot commented Aug 28, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134666

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4952ae0 with merge base 565a794 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

guangyey added a commit that referenced this pull request Aug 28, 2024
ghstack-source-id: f3ca80d
Pull Request resolved: #134666
@guangyey guangyey marked this pull request as draft August 28, 2024 10:14
@guangyey guangyey added the ciflow/xpu Run XPU CI tasks label Aug 28, 2024
[ghstack-poisoned]
guangyey added a commit that referenced this pull request Sep 11, 2024
ghstack-source-id: e70cf67
Pull Request resolved: #134666
void assignEvent(sycl::queue& queue) {
if (enable_timing_) {
event_ = std::make_unique<sycl::event>(
sycl::ext::oneapi::experimental::submit_profiling_tag(queue));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know when they will move this API from experiment 8000 al namespace?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume it should be a long-term plan. Let me confirm with the compiler team.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler team has a plan to move them from experimental since 2025.1

@@ -128,7 +125,7 @@ struct TORCH_XPU_API XPUEvent {
}
}

float elapsed_time(const XPUEvent& other) const {
double elapsed_time(const XPUEvent& other) const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why double? align with other device type?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Align to frontend API. refer to

return PyFloat_FromDouble(self->xpu_event.elapsed_time(other->xpu_event));

[ghstack-poisoned]
@ZzEeKkAa
Copy link
ZzEeKkAa commented Sep 12, 2024

Was this code tested somehow? I'm getting runtime crashes for pytorch with 2025 compiler (not only related to this particular branch).

UPD: my issue was related to #135818

Comment on lines 143 to 144
double elapsedTime(void* event1, void* event2, const DeviceIndex device_index)
const override {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the two events have been associated with a particular device, why do we require device_index here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cuda's elapsedTime needs to know device_index to switch the context to destroy cuda event. We have to align it as this function is an implementation of the same interface.

Comment on lines 145 to 147
TORCH_CHECK(
event1 && event2,
"Both events must be recorded before calculating elapsed time.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the two events need to be checked if they are associated with the device_index.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two events will be checked in

TORCH_CHECK(
. And it is legal to measure time between events in different device index.

@guangyey guangyey added the release notes: xpu release notes category label Nov 11, 2024
@guangyey guangyey added this to the 2.6.0 milestone Nov 11, 2024
@guangyey guangyey marked this pull request as ready for review November 11, 2024 03:34
@guangyey guangyey added the module: xpu Intel XPU related issues label Nov 11, 2024
event_ = std::make_unique<sycl::event>(
sycl::ext::oneapi::experimental::submit_profiling_tag(queue));
} else {
event_ = std::make_unique<sycl::event>(queue.ext_oneapi_submit_barrier());
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We plan to use ext_oneapi_get_last_event to replace ext_oneapi_submit_barrier when recording an event without the enable_timing feature.
This code change will happen while CI uplifts to 2025.0.

guangyey added a commit that referenced this pull request Nov 11, 2024
ghstack-source-id: b7ce1b3
Pull Request resolved: #134666
[ghstack-poisoned]
[ghstack-poisoned]
@guangyey
Copy link
Collaborator Author

@pytorchbot revert "It seems to raise an internal failure."

Copy link
pytorch-bot bot commented Nov 16, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -m/--message, -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@guangyey
Copy link
Collaborator Author

@pytorchbot revert -m "It seems to raise an internal failure."

Copy link
pytorch-bot bot commented Nov 16, 2024

❌ 🤖 pytorchbot command failed:

@pytorchbot revert: error: the following arguments are required: -c/--classification

usage: @pytorchbot revert -m MESSAGE -c
                          {nosignal,ignoredsignal,landrace,weird,ghfirst}

Try @pytorchbot --help for more info.

@guangyey
Copy link
Collaborator Author

@pytorchbot revert -m "It seems to raise an internal failure." -c weird

@pytorchmergebot
Copy link
Collaborator

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

@pytorchmergebot
Copy link
Collaborator

Reverting PR 134666 failed

Reason: Command git -C /home/runner/work/pytorch/pytorch revert --no-edit 4bbd6da33101a8d709f1d2921ad8ae6f9b0dc166 returned non-zero exit code 1

Auto-merging aten/src/ATen/xpu/XPUEvent.h
CONFLICT (content): Merge conflict in aten/src/ATen/xpu/XPUEvent.h
Auto-merging c10/xpu/impl/XPUGuardImpl.h
CONFLICT (content): Merge conflict in c10/xpu/impl/XPUGuardImpl.h
Auto-merging test/test_xpu.py
error: could not revert 4bbd6da3310... Enable XPUEvent elapsed_time function (#134666)
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git revert --continue".
hint: You can instead skip this commit with "git revert --skip".
hint: To abort and get back to the state before "git revert",
hint: run "git revert --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Details for Dev Infra team Raised by workflow job

pytorchmergebot pushed a commit that referenced this pull request Nov 18, 2024
# Motivation
This PR raises an internal UT failure on XPU.
This reverts commit 4bbd6da.
# Additional Context
refer to #140814

Pull Request resolved: #140872
Approved by: https://github.com/EikanWang
youssef62 pushed a commit to youssef62/pytorch that referenced this pull request Nov 23, 2024
…rch#140872)

# Motivation
This PR raises an internal UT failure on XPU.
This reverts commit 4bbd6da.
# Additional Context
refer to pytorch#140814

Pull Request resolved: pytorch#140872
Approved by: https://github.com/EikanWang
pytorchmergebot pushed a commit that referenced this pull request Nov 28, 2024
# Motivation
This PR intends to reland #134666 that has been reverted in #140872
We reverted it because I forgot to support `elapsed_time` for `XPUGuardImpl`, which resulted in `c10::Event` not supporting' elapsed_time' and blocking XPU CI.

# Additional Context
We split #134666 into two parts: one part, PR #140865, supports `elapsed_time` for `torch.Event` and another one, this PR, supports for `torch.xpu.elapsed_time`.

Pull Request resolved: #140873
Approved by: https://github.com/gujinghui
ghstack dependencies: #140865
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
# Motivation
This PR aims to enable `elapsed_time` function for `XPUEvent`.

# Additional Context
This PR depends on toolchain oneAPI 2025.0.

Pull Request resolved: pytorch#134666
Approved by: https://github.com/EikanWang, https://github.com/ezyang
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
…rch#140872)

# Motivation
This PR raises an internal UT failure on XPU.
This reverts commit 4bbd6da.
# Additional Context
refer to pytorch#140814

Pull Request resolved: pytorch#140872
Approved by: https://github.com/EikanWang
pobin6 pushed a commit to pobin6/pytorch that referenced this pull request Dec 5, 2024
# Motivation
This PR intends to reland pytorch#134666 that has been reverted in pytorch#140872
We reverted it because I forgot to support `elapsed_time` for `XPUGuardImpl`, which resulted in `c10::Event` not supporting' elapsed_time' and blocking XPU CI.

# Additional Context
We split pytorch#134666 into two parts: one part, PR pytorch#140865, supports `elapsed_time` for `torch.Event` and another one, this PR, supports for `torch.xpu.elapsed_time`.

Pull Request resolved: pytorch#140873
Approved by: https://github.com/gujinghui
ghstack dependencies: pytorch#140865
@github-actions github-actions bot deleted the gh/guangyey/65/head branch December 18, 2024 02:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks Merged module: xpu Intel XPU related issues open source release notes: xpu release notes category
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

8 participants
0