[Intel GPU] trigger tf32 no-gpu warn only when setting true by ZhiweiYan-96 · Pull Request #149926 · pytorch/pytorch · GitHub

[Intel GPU] trigger tf32 no-gpu warn only when setting true #149926


Closed
wants to merge 6 commits

Conversation

ZhiweiYan-96
Collaborator
@ZhiweiYan-96 ZhiweiYan-96 commented Mar 25, 2025

Fix issue #149829

Detail

During the torch.export initialization stage, the context variables of torch.backends.mkldnn are initialized by the function _ignore_backend_decomps in torch/export/_trace.py.

It is wrong to trigger the no-GPU warning when setting the value to False in a CPU-only environment. The correct behavior is to raise the warning only when the user tries to turn the feature on without a GPU.
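
For illustration, a minimal repro sketch of the two cases on a CPU-only build (the torch._C._set_onednn_allow_tf32 binding name is an assumption based on the setter in torch/csrc/Module.cpp; it is not part of the original report):

import torch

# CPU-only build, before this fix: both calls emit
# "TF32 acceleration on top of oneDNN is available for Intel GPUs. ..."
torch._C._set_onednn_allow_tf32(False)  # spurious: merely restores the default
torch._C._set_onednn_allow_tf32(True)   # legitimate: requests a GPU-only feature

# After this fix, only the second call warns.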

Stack from ghstack (oldest at bottom):

pytorch-bot bot commented Mar 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149926

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit f15d95b with merge base 86dcdf9:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ZhiweiYan-96 added a commit that referenced this pull request Mar 25, 2025
@ZhiweiYan-96 ZhiweiYan-96 requested a review from EikanWang March 25, 2025 05:29
@justinchuby
Collaborator

Thanks! Would it be possible to cherry pick this into 2.7?

@ZhiweiYan-96
Collaborator Author

Thanks! Would it be possible to cherry pick this into 2.7?

@justinchuby Sure, I will cherry-pick it once this PR finishes review. Thank you again for pointing out this issue and helping us improve the quality.

@guangyey guangyey added ciflow/trunk Trigger trunk jobs on your pull request and removed ciflow/inductor labels Mar 25, 2025
@guangyey guangyey moved this from Pre-Review Required to Review Required in PyTorch Intel Mar 25, 2025
@EikanWang EikanWang marked this pull request as ready for review March 25, 2025 06:58
@ZhiweiYan-96
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/ZhiweiYan-96/57/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/149926)

pytorchmergebot pushed a commit that referenced this pull request Mar 25, 2025
@guangyey
Collaborator

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/ZhiweiYan-96/57/orig onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/149926)

pytorchmergebot pushed a commit that referenced this pull request Mar 25, 2025
@@ -145,7 +145,9 @@ void Context::setAllowTF32OneDNN(bool b){
 #ifdef USE_XPU
   allow_tf32_onednn = b;
 #else
-  TORCH_WARN("TF32 acceleration on top of oneDNN is available for Intel GPUs. The current Torch version does not have Intel GPU Support.");
+  if(b){
Collaborator

Suggested change:
-if(b){
+if (b) {

Collaborator

Please fix the lint.

Collaborator
@EikanWang EikanWang Mar 25, 2025

Why is the lint issue not captured?

Collaborator Author
@ZhiweiYan-96 ZhiweiYan-96 Mar 25, 2025

Please fix the lint.

I wrongly thought you used some Git button to fix the lint issue since you resolved the comment... hah, I will fix it, thanks for the reminder.

Why is the lint issue not captured?

@EikanWang this file is not covered by .lintrunner.toml. (You may see other if(b)-like cases in this file.) Maybe this is a choice of other maintainers...

Collaborator Author

Fixed in the new commit.

Collaborator
@guangyey guangyey Mar 26, 2025

No. Refer to pytorch/.lintrunner.toml, lines 55 to 57 in a8d0c5c:

code = 'CLANGFORMAT'
include_patterns = [
    'aten/src/ATen/*.h',

All .cpp files in aten/src/ATen are NOT included in the CLANGFORMAT linter.

Collaborator

Please fix the lint.

I wrongly thought you used some Git button to fix the lint issue since you resolved the comment... hah, I will fix it, thanks for the reminder.

Haha, just a reminder. Using the Git button would break the ghstack merge, so I reverted my patch.

ZhiweiYan-96 added a commit that referenced this pull request Mar 25, 2025
@guangyey guangyey requested a review from malfet March 26, 2025 03:37

@guangyey
Collaborator

@malfet May I know if this hotfix looks reasonable to you? It will suppress the unexpected warning message.

@justinchuby justinchuby added this to the 2.7.0 milestone Mar 26, 2025
@ZhiweiYan-96
Collaborator Author

Hi @malfet, could you please take a look at this PR? Thanks!

@guangyey guangyey requested a review from atalman March 28, 2025 03:10
@guangyey
Collaborator

@malfet @atalman This PR aims to suppress an unexpected warning message and is targeted for the release branch. May I know if you have any comments?

@ZhiweiYan-96
Collaborator Author

Hi @malfet @atalman, could you take a look at this PR when you have time? We would appreciate your suggestions.
This PR introduces a minor and straightforward change to the warning logic for the TF32 setting.
It fixes #149829, and it would be nice to land it in the 2.7 release. Thanks.

@malfet
Contributor
malfet commented Mar 28, 2025

@guangyey do you know why this code is used in CPU-only code?

@ZhiweiYan-96
Collaborator Author
ZhiweiYan-96 commented Mar 28, 2025

@guangyey do you know why this code is used in CPU-only code?

Hi @malfet, it is because torch.export initializes the backends' flags, and the backend initialization code is shared by CPU and XPU. The following is a simple backtrace from pdb.

-> return _export(
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1072)wrapper()
-> ep = fn(*args, **kwargs)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/exported_program.py(122)wrapper()
-> return fn(*args, **kwargs)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(2111)_export()
-> ep = _export_for_training(
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1072)wrapper()
-> ep = fn(*args, **kwargs)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/exported_program.py(122)wrapper()
-> return fn(*args, **kwargs)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1973)_export_for_training()
-> export_artifact = export_func(
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1916)_non_strict_export()
-> aten_export_artifact = _to_aten_func(  # type: ignore[operator]
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1696)_export_to_aten_ir_make_fx()
-> with torch.nn.utils.stateless._reparametrize_module(
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/contextlib.py(142)__exit__()
-> next(self.gen)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(173)_ignore_backend_decomps()
-> torch.backends.mkldnn.set_flags(*orig_mkldnn_flag)
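
For reference, a minimal sketch of the save/restore pattern behind this backtrace (simplified from torch/export/_trace.py:_ignore_backend_decomps; the real helper may reset additional backend flags):

import contextlib
import torch

@contextlib.contextmanager
def _ignore_backend_decomps():
    # set_flags returns the previous flags as a tuple; restoring them on
    # exit re-invokes the oneDNN TF32 setter, which warned unconditionally
    # on CPU-only builds before this PR.
    orig_mkldnn_flag = torch.backends.mkldnn.set_flags(False)
    try:
        yield
    finally:
        torch.backends.mkldnn.set_flags(*orig_mkldnn_flag)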

@malfet
Contributor
malfet commented Mar 28, 2025

@ZhiweiYan-96 IMO a better solution would have been for torch._C._get_onednn_allow_tf32() to return None if compiled without XPU:

% git diff
diff --git a/torch/csrc/Module.cpp b/torch/csrc/Module.cpp
index 9d39a5872b6..bdb720377d4 100644
--- a/torch/csrc/Module.cpp
+++ b/torch/csrc/Module.cpp
@@ -965,10 +965,14 @@ static PyObject* THPModule_setAllowTF32OneDNN(
 static PyObject* THPModule_allowTF32OneDNN(
     PyObject* _unused,
     PyObject* noargs) {
+#ifdef USE_XPU
   if (at::globalContext().allowTF32OneDNN())
     Py_RETURN_TRUE;
   else
     Py_RETURN_FALSE;
+#else
+  Py_RETURN_NONE;
+#endif
 }

Also, if you want to cherry-pick into 2.7, please add a regression test.
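
One possible shape for such a test (a sketch only; it assumes the private torch._C._set_onednn_allow_tf32 binding from Module.cpp, that TORCH_WARN surfaces as a Python UserWarning, and that torch.xpu.is_available() is present in the build):

import unittest
import warnings

import torch

class TestOneDNNTF32Warning(unittest.TestCase):
    @unittest.skipIf(torch.xpu.is_available(), "only meaningful on CPU-only builds")
    def test_no_warn_when_disabling(self):
        # Restoring the default (False) must stay silent on CPU-only builds.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            torch._C._set_onednn_allow_tf32(False)
        self.assertFalse(any("Intel GPU" in str(w.message) for w in caught))

    @unittest.skipIf(torch.xpu.is_available(), "only meaningful on CPU-only builds")
    def test_warn_when_enabling(self):
        # Asking for TF32 without an XPU build should still warn.
        with self.assertWarnsRegex(UserWarning, "Intel GPU"):
            torch._C._set_onednn_allow_tf32(True)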

@ZhiweiYan-96
Collaborator Author

@malfet Thanks for your suggestions; that seems more reasonable than the current change. I will check your advice and push a commit, thanks.

@malfet
Contributor
malfet commented Mar 31, 2025

@ZhiweiYan-96 any updates?

@ZhiweiYan-96
Collaborator Author

Hi @malfet, I saw you pushed #150358 to close the issue. I missed the tracking here due to some urgent matters. Sincere appreciation for your help.

@github-project-automation github-project-automation bot moved this from Review Required to Done in PyTorch Intel Apr 1, 2025
@ZhiweiYan-96
Collaborator Author

The issue has been fixed in #150358

@github-actions github-actions bot deleted the gh/ZhiweiYan-96/57/head branch May 2, 2025 02:16
Labels
ciflow/trunk (Trigger trunk jobs on your pull request), ciflow/xpu (Run XPU CI tasks), open source, topic: not user facing (topic category)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

TF32 acceleration on top of oneDNN is available for Intel GPUs. The current Torch version does not have Intel GPU Support
7 participants