[Intel GPU] trigger tf32 no-gpu warn only when setting true by ZhiweiYan-96 · Pull Request #149926 · pytorch/pytorch · GitHub

[Intel GPU] trigger tf32 no-gpu warn only when setting true #149926


Closed
wants to merge 6 commits

Conversation

ZhiweiYan-96
Collaborator
@ZhiweiYan-96 ZhiweiYan-96 commented Mar 25, 2025

Fix issue #149829

Detail

During the torch.export initialization stage, the context variables of torch.backends.mkldnn are initialized by the function _ignore_backend_decomps in torch/export/_trace.py.

It is wrong to trigger the no-GPU warning when setting the value to False in a CPU-only environment. The correct behavior is to raise the warning only when the user tries to turn the feature on without a GPU.
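
For illustration, a minimal repro sketch of the two cases on a CPU-only build (the torch._C._set_onednn_allow_tf32 binding name is an assumption based on the setter in torch/csrc/Module.cpp; it is not part of the original report):

import torch

# CPU-only build, before this fix: both calls emit
# "TF32 acceleration on top of oneDNN is available for Intel GPUs. ..."
torch._C._set_onednn_allow_tf32(False)  # spurious: merely restores the default
torch._C._set_onednn_allow_tf32(True)   # legitimate: requests a GPU-only feature

# After this fix, only the second call warns.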

Stack from ghstack (oldest at bottom):

pytorch-bot bot commented Mar 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/149926

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit f15d95b with merge base 86dcdf9:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ZhiweiYan-96 added a commit that referenced this pull request Mar 25, 2025
@ZhiweiYan-96 ZhiweiYan-96 requested a review from EikanWang March 25, 2025 05:29
@justinchuby
Collaborator

Thanks! Would it be possible to cherry pick this into 2.7?

@ZhiweiYan-96
Collaborator Author

Thanks! Would it be possible to cherry pick this into 2.7?

@justinchuby Sure, I will cherry-pick it once this PR finishes review. Thank you again for pointing out this issue and helping us improve the quality.

@guangyey guangyey added ciflow/trunk Trigger trunk jobs on your pull request and removed ciflow/inductor labels Mar 25, 2025
@guangyey guangyey moved this from Pre-Review Required to Review Required in PyTorch Intel Mar 25, 2025
@EikanWang EikanWang marked this pull request as ready for review March 25, 2025 06:58
@ZhiweiYan-96
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/ZhiweiYan-96/57/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/149926)

pytorchmergebot pushed a commit that referenced this pull request Mar 25, 2025
@guangyey
Collaborator

@pytorchbot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased gh/ZhiweiYan-96/57/orig onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/149926)

pytorchmergebot pushed a commit that referenced this pull request Mar 25, 2025
@@ -145,7 +145,9 @@ void Context::setAllowTF32OneDNN(bool b){
 #ifdef USE_XPU
   allow_tf32_onednn = b;
 #else
-  TORCH_WARN("TF32 acceleration on top of oneDNN is available for Intel GPUs. The current Torch version does not have Intel GPU Support.");
+  if(b){
Collaborator

Suggested change:
-if(b){
+if (b) {

Collaborator

Please fix the lint.

Collaborator
@EikanWang EikanWang Mar 25, 2025

Why is the lint issue not captured?

Collaborator Author
@ZhiweiYan-96 ZhiweiYan-96 Mar 25, 2025

Please fix the lint.

I wrongly thought you used some Git button to fix the lint issue since you resolved the comment... hah, I will fix it, thanks for the reminder.

Why is the lint issue not captured?

@EikanWang this file is not covered by .lintrunner.toml. (You may see other if(b)-like cases in this file.) Maybe this is a choice of other maintainers...

Collaborator Author

Fixed in the new commit.

Collaborator
@guangyey guangyey Mar 26, 2025

No. Refer to pytorch/.lintrunner.toml, lines 55 to 57 in a8d0c5c:

code = 'CLANGFORMAT'
include_patterns = [
    'aten/src/ATen/*.h',

All .cpp files in aten/src/ATen are NOT included in the CLANGFORMAT linter.

Collaborator

Please fix the lint.

I wrongly thought you used some Git button to fix the lint issue since you resolved the comment... hah, I will fix it, thanks for the reminder.

Haha, just a reminder. Using the Git button would break the ghstack merge, so I reverted my patch.

ZhiweiYan-96 added a commit that referenced this pull request Mar 25, 2025
@guangyey guangyey requested a review from malfet March 26, 2025 03:37

@guangyey
Collaborator

@malfet May I know if this hotfix looks reasonable to you? It will suppress the unexpected warning message.

@justinchuby justinchuby added this to the 2.7.0 milestone Mar 26, 2025
@ZhiweiYan-96
Collaborator Author

Hi @malfet, could you please take a look at this PR? Thanks!

@guangyey guangyey requested a review from atalman March 28, 2025 03:10
@guangyey
Collaborator

@malfet @atalman This PR aims to suppress an unexpected warning message and is targeted for the release branch. May I know if you have any comments?

@ZhiweiYan-96
Collaborator Author

Hi @malfet @atalman, could you take a look at this PR when you have time? We would appreciate your suggestions.
This PR introduces a minor and straightforward change to the warning logic for the TF32 setting.
It fixes #149829, and it would be nice to land it in the 2.7 release. Thanks.

@malfet
Contributor
malfet commented Mar 28, 2025

@guangyey do you know why this code is used in CPU-only code?

@ZhiweiYan-96
Collaborator Author
ZhiweiYan-96 commented Mar 28, 2025

@guangyey do you know why this code is used in CPU-only code?

Hi @malfet, it is because torch.export initializes the backends' flags, and the backend initialization code is shared by CPU and XPU. The following is a simple backtrace from pdb.

-> return _export(
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1072)wrapper()
-> ep = fn(*args, **kwargs)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/exported_program.py(122)wrapper()
-> return fn(*args, **kwargs)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(2111)_export()
-> ep = _export_for_training(
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1072)wrapper()
-> ep = fn(*args, **kwargs)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/exported_program.py(122)wrapper()
-> return fn(*args, **kwargs)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1973)_export_for_training()
-> export_artifact = export_func(
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1916)_non_strict_export()
-> aten_export_artifact = _to_aten_func(  # type: ignore[operator]
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(1696)_export_to_aten_ir_make_fx()
-> with torch.nn.utils.stateless._reparametrize_module(
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/contextlib.py(142)__exit__()
-> next(self.gen)
  /4T-720/conda_envs/zhiwei-int4/lib/python3.10/site-packages/torch/export/_trace.py(173)_ignore_backend_decomps()
-> torch.backends.mkldnn.set_flags(*orig_mkldnn_flag)
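
For reference, a minimal sketch of the save/restore pattern behind this backtrace (simplified from torch/export/_trace.py:_ignore_backend_decomps; the real helper may reset additional backend flags):

import contextlib
import torch

@contextlib.contextmanager
def _ignore_backend_decomps():
    # set_flags returns the previous flags as a tuple; restoring them on
    # exit re-invokes the oneDNN TF32 setter, which warned unconditionally
    # on CPU-only builds before this PR.
    orig_mkldnn_flag = torch.backends.mkldnn.set_flags(False)
    try:
        yield
    finally:
        torch.backends.mkldnn.set_flags(*orig_mkldnn_flag)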

@malfet
Contributor
malfet commented Mar 28, 2025

@ZhiweiYan-96 IMO a better solution would have been for torch._C._get_onednn_allow_tf32() to return None if compiled without XPU:

% git diff
diff --git a/torch/csrc/Module.cpp b/torch/csrc/Module.cpp
index 9d39a5872b6..bdb720377d4 100644
--- a/torch/csrc/Module.cpp
+++ b/torch/csrc/Module.cpp
@@ -965,10 +965,14 @@ static PyObject* THPModule_setAllowTF32OneDNN(
 static PyObject* THPModule_allowTF32OneDNN(
     PyObject* _unused,
     PyObject* noargs) {
+#ifdef USE_XPU
   if (at::globalContext().allowTF32OneDNN())
     Py_RETURN_TRUE;
   else
     Py_RETURN_FALSE;
+#else
+  Py_RETURN_NONE;
+#endif
 }

Also, if you want to cherry-pick into 2.7, please add a regression test.
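
One possible shape for such a test (a sketch only; it assumes the private torch._C._set_onednn_allow_tf32 binding from Module.cpp, that TORCH_WARN surfaces as a Python UserWarning, and that torch.xpu.is_available() is present in the build):

import unittest
import warnings

import torch

class TestOneDNNTF32Warning(unittest.TestCase):
    @unittest.skipIf(torch.xpu.is_available(), "only meaningful on CPU-only builds")
    def test_no_warn_when_disabling(self):
        # Restoring the default (False) must stay silent on CPU-only builds.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            torch._C._set_onednn_allow_tf32(False)
        self.assertFalse(any("Intel GPU" in str(w.message) for w in caught))

    @unittest.skipIf(torch.xpu.is_available(), "only meaningful on CPU-only builds")
    def test_warn_when_enabling(self):
        # Asking for TF32 without an XPU build should still warn.
        with self.assertWarnsRegex(UserWarning, "Intel GPU"):
            torch._C._set_onednn_allow_tf32(True)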

@ZhiweiYan-96
Collaborator Author

@malfet Thanks for your suggestions; that seems more reasonable than the current change. I will check your advice and push a commit, thanks.

@malfet
Contributor
malfet commented Mar 31, 2025

@ZhiweiYan-96 any updates?

@ZhiweiYan-96
Collaborator Author

Hi @malfet, I saw you pushed #150358 to close the issue. I missed the tracking here due to some urgent matters. Sincere appreciation for your help.

@github-project-automation github-project-automation bot moved this from Review Required to Done in PyTorch Intel Apr 1, 2025
@ZhiweiYan-96
Collaborator Author

The issue has been fixed in #150358

@github-actions github-actions bot deleted the gh/ZhiweiYan-96/57/head branch May 2, 2025 02:16
Labels
ciflow/trunk (Trigger trunk jobs on your pull request), ciflow/xpu (Run XPU CI tasks), open source, topic: not user facing (topic category)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

TF32 acceleration on top of oneDNN is available for Intel GPUs. The current Torch version does not have Intel GPU Support
7 participants