update xeon launch script for Intel(R) Xeon 6 support #133835

jingxu10 · 2024-08-19T05:30:23Z

Add support for Intel(R) Xeon 6 (GNR) and Intel Core CPUs with E and P cores.

Add a shortcut executable, torch-xeon-launcher, for python -m torch.backends.xeon.run_cpu. Both commands work.
Simplify the launcher script arguments. Deprecated arguments keep working and emit deprecation warnings, for backward compatibility.
Restructure CPU information collection. CPU topology detection and the instance-to-core mapping algorithm are now modular, to support GNR and the E and P cores of Intel Core platforms. The new implementation is more robust.

The BC Lint error is a false positive.
Hi @albanD @seemethere, this PR contains a restructured implementation of the xeon launch script, so it cannot be split. The PR Size Check doesn't apply.

pytorch-bot · 2024-08-19T05:30:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/133835

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures

As of commit ca03c05 with merge base a777dea:

NEW FAILURES - The following jobs have failed:

BC Lint / bc_linter (gh)
##[error]Process completed with exit code 1.
Check Labels / Check labels (gh)
# This PR needs a release notes: label
Lint / pr-sanity-checks (gh)
##[error]Process completed with exit code 1.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

jingxu10 · 2024-08-22T02:20:19Z

Hi @malfet @albanD @kiukchung, any suggestions on how we should move forward? Thank you.

malfet · 2024-10-16T23:05:53Z

torch/backends/xeon/_cpu_info.py

+        if platform.system() == "Windows":
+            raise RuntimeError("Windows platform is not supported!!!")
+        elif platform.system() == "Linux":


IMO this should be `if platform.system() == "Linux" and platform.machine() == "x86_64": ... else: raise RuntimeError(f"Unsupported platform {platform.system()}")`, so that it covers Windows and macOS as well as Linux on ARM.
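A minimal sketch of the suggested check, written as a standalone helper so it can be exercised without the rest of the launcher. The function name `ensure_supported_platform` and its optional `system`/`machine` parameters are hypothetical and not part of the PR; including the machine name in the error message is also an addition for illustration.

```python
import platform


def ensure_supported_platform(system=None, machine=None):
    """Raise RuntimeError unless running on x86_64 Linux.

    `system` and `machine` default to the values reported by the
    `platform` module; they are parameters here only so the check
    can be tested on any host (hypothetical helper, not PR code).
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    if system == "Linux" and machine == "x86_64":
        return
    # Covers Windows, macOS, and Linux on non-x86_64 machines alike.
    raise RuntimeError(f"Unsupported platform {system} ({machine})")
```

Using an allow-list like this means any platform that is not explicitly handled fails early with a clear message, instead of falling through an `elif` chain.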

malfet · 2024-10-16T23:06:14Z

torch/backends/xeon/_cpu_info.py

+            if lscpu_txt.strip() == "":
+                args = ["lscpu", "--all", "--extended"]
+                my_env = os.environ.copy()
+                my_env["LC_ALL"] = "C"
+                lscpu_info = subprocess.check_output(
+                    args, env=my_env, universal_newlines=True
+                )


Why is this information queried from lscpu rather than from cpuinfo?

github-actions · 2024-10-17T20:50:36Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

malfet

My biggest challenge with this change is that the `lscpu --extended` output format is not really defined anywhere.

For example, on my machine it produces the following output:

 lscpu --extend --all
CPU CLUSTER CORE L1d:L1i:L2 ONLINE
  0       0    0 0:0:0         yes
  1       0    1 1:1:0         yes
  2       0    2 2:2:0         yes
  3       0    3 3:3:0         yes
  4       0    4 4:4:0         yes
  5       0    5 5:5:0         yes
  6       0    6 6:6:0         yes
  7       0    7 7:7:0         yes
  8       0    8 8:8:0         yes
  9       0    9 9:9:0         yes

whereas `lscpu --parse=CPU,Core,Socket,Node` always produces results in the same format.

`lscpu --extended --json` seems to be a good middle ground.
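To illustrate why the `--parse` form is easier to consume, here is a rough sketch of a parser for its output: comment lines starting with `#` are skipped and each remaining line is a comma-separated record in the requested column order. The names `parse_lscpu_parse_output` and `query_lscpu` are hypothetical, and mapping empty fields to `None` is an assumption about how missing values should be handled; this is not the PR's implementation.

```python
import os
import subprocess

# Column order requested from lscpu; the output follows this order.
COLUMNS = ("CPU", "Core", "Socket", "Node")


def parse_lscpu_parse_output(text, columns=COLUMNS):
    """Turn `lscpu --parse=...` text into a list of dicts (one per CPU)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Header and explanatory lines are emitted as '#' comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split(",")
        if len(fields) != len(columns):
            raise ValueError(f"Unexpected lscpu line: {line!r}")
        # Empty fields (value not reported) become None.
        rows.append(
            {
                name.lower(): (int(value) if value != "" else None)
                for name, value in zip(columns, fields)
            }
        )
    return rows


def query_lscpu(columns=COLUMNS):
    """Run lscpu with a fixed locale and parse its output (Linux only)."""
    env = dict(os.environ, LC_ALL="C")
    out = subprocess.check_output(
        ["lscpu", "--parse=" + ",".join(columns)],
        env=env,
        universal_newlines=True,
    )
    return parse_lscpu_parse_output(out, columns)
```

Because the caller chooses the columns, the parser does not depend on which columns a given machine happens to show in the `--extended` table.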

github-actions · 2024-12-17T02:08:36Z

Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

pytorchbot added the open source label Aug 19, 2024

jingxu10 force-pushed the jingxu10/xeon_launch branch from f693c87 to 1cdb6a2 Compare August 19, 2024 07:22

jingxu10 marked this pull request as draft August 19, 2024 08:47

jingxu10 force-pushed the jingxu10/xeon_launch branch 2 times, most recently from 6369d4f to 99de183 Compare August 19, 2024 10:27

jingxu10 marked this pull request as ready for review August 19, 2024 10:28

jingxu10 force-pushed the jingxu10/xeon_launch branch 3 times, most recently from 5bc6c70 to 48e382b Compare August 20, 2024 23:13

soulitzer requested review from albanD and seemethere August 21, 2024 00:03

soulitzer added the triaged label Aug 21, 2024

jingxu10 force-pushed the jingxu10/xeon_launch branch 4 times, most recently from 0440dba to 51158c8 Compare August 22, 2024 00:09

albanD requested review from malfet and removed request for albanD August 23, 2024 16:15

jingxu10 changed the title ~~update xeon launch script for GNR support~~ update xeon launch script for Intel Xeon 6 support Oct 16, 2024

malfet reviewed Oct 16, 2024

jingxu10 force-pushed the jingxu10/xeon_launch branch from 3774dd2 to c807c1e Compare October 17, 2024 20:50

jingxu10 force-pushed the jingxu10/xeon_launch branch from c807c1e to ca03c05 Compare October 17, 2024 20:54

jingxu10 changed the title ~~update xeon launch script for Intel Xeon 6 support~~ update xeon launch script for Intel(R) Xeon 6 support Oct 17, 2024

update xeon launch script for Intel(R) Xeon 6 support

ca03c05

malfet requested changes Oct 18, 2024

github-actions bot added the Stale label Dec 17, 2024

github-actions bot closed this Jan 16, 2025

jingxu10 mentioned this pull request Jun 23, 2025

NUMA binding integration with elastic agent and torchrun #149334

Closed

4 participants
