# Enable XPUEvent elapsed_time function #134666
## Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/134666

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures as of commit 4952ae0 with merge base 565a794.

This comment was automatically generated by Dr. CI and updates every 15 minutes.
```cpp
void assignEvent(sycl::queue& queue) {
  if (enable_timing_) {
    event_ = std::make_unique<sycl::event>(
        sycl::ext::oneapi::experimental::submit_profiling_tag(queue));
```
Do we know when they will move this API out of the experimental namespace?
I assume it should be a long-term plan. Let me confirm with the compiler team.
The compiler team plans to move them out of the experimental namespace starting with oneAPI 2025.1.
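For background, here is a minimal sketch of how two profiling-tag events can be turned into an elapsed time in milliseconds. The helper name `elapsed_ms` is illustrative rather than the actual PyTorch implementation, and both events are assumed to have completed:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

// Illustrative helper (not PyTorch code): compute elapsed milliseconds
// between two events returned by submit_profiling_tag(). Both events must
// have completed (e.g. via wait()) before profiling info is queried.
double elapsed_ms(const sycl::event& start, const sycl::event& stop) {
  const uint64_t t0 =
      start.get_profiling_info<sycl::info::event_profiling::command_end>();
  const uint64_t t1 =
      stop.get_profiling_info<sycl::info::event_profiling::command_end>();
  return 1e-6 * static_cast<double>(t1 - t0);  // timestamps are in nanoseconds
}
```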
```diff
@@ -128,7 +125,7 @@ struct TORCH_XPU_API XPUEvent {
     }
   }

-  float elapsed_time(const XPUEvent& other) const {
+  double elapsed_time(const XPUEvent& other) const {
```
Why double? Is this to align with other device types?
It aligns with the frontend API: `PyFloat_FromDouble` takes a C `double`, so returning `double` end-to-end avoids losing precision in a `float` round trip. Refer to `torch/csrc/xpu/Event.cpp`, line 100 at commit 26e5572:

```cpp
return PyFloat_FromDouble(self->xpu_event.elapsed_time(other->xpu_event));
```
Was this code tested somehow? I'm getting runtime crashes for PyTorch with the 2025 compiler (not only related to this particular branch). UPD: my issue was related to #135818.
`c10/xpu/impl/XPUGuardImpl.h` (outdated)

```cpp
double elapsedTime(void* event1, void* event2, const DeviceIndex device_index)
    const override {
```
Since the two events have been associated with a particular device, why do we require `device_index` here?
CUDA's `elapsedTime` needs to know `device_index` so it can switch the context before destroying a CUDA event. We have to keep the same signature because this function implements the same interface.
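A hedged sketch of that constraint: the XPU implementation overrides a backend-agnostic virtual interface, so its signature must match even when the argument is unused on XPU. The names below are illustrative stand-ins, not the exact c10 declarations:

```cpp
#include <cstdint>

using DeviceIndex = int8_t;  // illustrative stand-in for c10::DeviceIndex

// Illustrative backend-agnostic interface: every backend overrides the same
// virtual signature, so XPU must accept device_index even if it is unused.
struct GuardImplInterface {
  virtual ~GuardImplInterface() = default;
  virtual double elapsedTime(
      void* event1, void* event2, DeviceIndex device_index) const = 0;
};

struct XPUGuardImplSketch final : GuardImplInterface {
  double elapsedTime(
      void* event1, void* event2, DeviceIndex /*device_index*/) const override {
    // ... query profiling info from the two SYCL events behind the pointers ...
    return 0.0;  // placeholder
  }
};
```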
`c10/xpu/impl/XPUGuardImpl.h` (outdated)
```cpp
TORCH_CHECK(
    event1 && event2,
    "Both events must be recorded before calculating elapsed time.");
```
I think the two events need to be checked to confirm they are associated with `device_index`.
These two events will be checked by the `TORCH_CHECK` in `c10/core/impl/InlineEvent.h`, line 114 at commit 565a794.
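For reference, a sketch of the kind of validation done one level up; the shape below is assumed from this discussion, not copied from `InlineEvent.h`:

```cpp
// Illustrative checks (assumed shape, not verbatim c10 code): both events
// must have been recorded, on matching devices, before computing elapsed time.
TORCH_CHECK(
    was_marked_for_recording_ && other.was_marked_for_recording_,
    "Both events must be recorded before calculating elapsed time.");
TORCH_CHECK(
    device_type_ == other.device_type_,
    "Both events must be associated with the same device type.");
```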
```cpp
    event_ = std::make_unique<sycl::event>(
        sycl::ext::oneapi::experimental::submit_profiling_tag(queue));
  } else {
    event_ = std::make_unique<sycl::event>(queue.ext_oneapi_submit_barrier());
```
We plan to replace `ext_oneapi_submit_barrier` with `ext_oneapi_get_last_event` when recording an event without the `enable_timing` feature. This change will land once CI uplifts to oneAPI 2025.0.
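A sketch of what that replacement might look like, assuming a form of `ext_oneapi_get_last_event` that returns `std::optional<sycl::event>` (the exact API shape and the helper name are assumptions here):

```cpp
// Sketch of the planned non-timing path (API shape assumed, see above):
// reuse the queue's last submitted event instead of submitting a barrier.
void assignEventWithoutTiming(sycl::queue& queue) {
  if (auto last = queue.ext_oneapi_get_last_event()) {
    event_ = std::make_unique<sycl::event>(*last);
  } else {
    // Nothing submitted yet: fall back to a barrier event.
    event_ = std::make_unique<sycl::event>(queue.ext_oneapi_submit_barrier());
  }
}
```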
Patch conflict due to pytorch/pytorch#134666.

Benchmark CI:
* https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11855565969 (elapsed_time failed)
* https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11856129996 (legacy profiler works)

Inductor CI: https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11856999355 (passed)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
@pytorchbot revert "It seems to raise an internal failure." |
❌ 🤖 pytorchbot command failed:
Try |
@pytorchbot revert -m "It seems to raise an internal failure."

❌ 🤖 pytorchbot command failed:
Try …
@pytorchbot revert -m "It seems to raise an internal failure." -c weird

@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 134666 failed. Reason: Command …

Details for Dev Infra team: Raised by workflow job.
# Motivation
This PR raises an internal UT failure on XPU. This reverts commit 4bbd6da.

# Additional Context
Refer to #140814.

Pull Request resolved: #140872
Approved by: https://github.com/EikanWang
# Motivation
This PR intends to reland #134666, which was reverted in #140872. We reverted it because I forgot to support `elapsed_time` for `XPUGuardImpl`, which resulted in `c10::Event` not supporting `elapsed_time` and blocked XPU CI.

# Additional Context
We split #134666 into two parts: PR #140865 supports `elapsed_time` for `torch.Event`, and this PR supports `torch.xpu.elapsed_time`.

Pull Request resolved: #140873
Approved by: https://github.com/gujinghui
ghstack dependencies: #140865
# Motivation
This PR aims to enable the `elapsed_time` function for `XPUEvent`.

# Additional Context
This PR depends on toolchain oneAPI 2025.0.

Pull Request resolved: pytorch#134666
Approved by: https://github.com/EikanWang, https://github.com/ezyang
Stack from ghstack (oldest at bottom):

# Motivation
This PR aims to enable the `elapsed_time` function for `XPUEvent`.

# Additional Context
This PR depends on toolchain oneAPI 2025.0.

cc @gujinghui @EikanWang @fengyuan14
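For completeness, a hypothetical usage sketch of the timing path this PR enables. `XPUEvent` and `elapsed_time` come from the diff above; the surrounding code, the header path, and the exact method set are assumptions:

```cpp
#include <ATen/xpu/XPUEvent.h>  // assumed header location

// Hypothetical sketch: record two timing-enabled events around some work on
// the current XPU stream, then read back the elapsed milliseconds.
double time_region() {
  at::xpu::XPUEvent start(/*enable_timing=*/true);
  at::xpu::XPUEvent stop(/*enable_timing=*/true);
  start.record();      // record on the current XPU stream
  // ... enqueue the work to be measured ...
  stop.record();
  stop.synchronize();  // profiling info is valid only after the event completes
  return start.elapsed_time(stop);  // returns double, per this PR
}
```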