[AOTI] Fix #140546 and support AOTI package load for Intel GPU. by etaf · Pull Request #140664 · pytorch/pytorch · GitHub

[AOTI] Fix #140546 and support AOTI package load for Intel GPU. #140664


Closed
etaf wants to merge 31 commits

Conversation

[ghstack-poisoned]
pytorch-bot bot commented Nov 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/140664

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure

As of commit 2abadda with merge base f84e533:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

[ghstack-poisoned]
@etaf etaf marked this pull request as ready for review November 14, 2024 08:38
@etaf etaf requested a review from desertfire November 14, 2024 08:39
@etaf etaf added the ciflow/xpu Run XPU CI tasks label Nov 14, 2024
const std::string& model_so_path,
size_t num_models,
const std::string& device_str,
const std::string& cubin_dir) {
Collaborator

Suggested change
const std::string& cubin_dir) {
const std::string& kernel_bin_dir) {

Collaborator Author

resolved


std::vector<at::Tensor> run(
virtual std::vector<at::Tensor> run(const std::vector<at::Tensor>& inputs);
Collaborator

This changes the API definition. According to the issue description, a null stream means the current stream for CUDA; the same semantics should also apply to Intel GPU. Since this PR has fixed the virtual-function routine, the runner instance can dispatch the run function to the right device-specific container runner.

Collaborator Author

This relates to how we define the run API. At first I thought it would be more generic to define it as std::vector<at::Tensor> run(const std::vector<at::Tensor>& inputs, AOTInductorStreamHandle stream_handle = nullptr);. But that means model_container_runner_cuda.h, model_container_runner_xpu.h, and model_container_runner_cpu.h would each need to define this override as well, and since AOTIModelContainerRunnerCpu/Xpu/Cuda are used in pybind, we get a compilation error:

torch/csrc/inductor/aoti_runner/pybind.cpp:21:11: required from here
/usr/include/c++/11/type_traits:1422:38: error: invalid use of incomplete type ‘struct AOTInductorStreamOpaque’
1422 | : public integral_constant<bool, __is_base_of(_Base, _Derived)>
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from /home/xinanlin/xinanlin/pytorch/torch/csrc/inductor/aoti_runner/model_container_runner.h:5,
from torch/csrc/inductor/aoti_runner/model_container_runner_cpu.h:4,
from torch/csrc/inductor/aoti_runner/pybind.cpp:1:
torch/csrc/inductor/aoti_runtime/interface.h:16:8: note: forward declaration of ‘struct AOTInductorStreamOpaque’

The current design of this PR defines the interface of the cpu/xpu/cuda container runners as std::vector<at::Tensor> run(const std::vector<at::Tensor>& inputs);, which is the same as before this PR, so I think there is no issue with landing it.

And if we do prefer the run interface to be std::vector<at::Tensor> run(const std::vector<at::Tensor>& inputs, AOTInductorStreamHandle stream_handle = nullptr);, we can create an adapter between the pybind interface and the AOTI container runner to resolve the above compilation issue in a new PR.

Contributor

Agree with @EikanWang, this is an API-breaking change. I think if you declare stream_handle as void*, it should solve your compilation issue.

Collaborator Author

Hi @desertfire, I've updated the PR, could you please take another look? BTW, could you please also review this series of PRs? They are all for AOTI XPU and are ready for review.

@etaf etaf added the topic: not user facing topic category label Nov 14, 2024
[ghstack-poisoned]
[ghstack-poisoned]

etaf added 2 commits November 19, 2024 22:27
[ghstack-poisoned]
[ghstack-poisoned]
@etaf etaf requested review from desertfire and EikanWang November 20, 2024 07:54
etaf added 2 commits November 20, 2024 00:03
[ghstack-poisoned]
[ghstack-poisoned]
pytorchmergebot added a commit that referenced this pull request Dec 9, 2024
…U. (#140664)"

This reverts commit 91d3054.

Reverted #140664 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](#140269 (comment)))
@pytorchmergebot pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Dec 9, 2024
… GPU."

Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* #140686
* __->__ #140664
* #140269
* #140268
* #135320
* #135318
* #139026 


Fix #140546



cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang aakhundov

[ghstack-poisoned]
desertfire added a commit that referenced this pull request Dec 9, 2024
desertfire added a commit that referenced this pull request Dec 9, 2024
@desertfire
Contributor

Relanding, see #140269 (comment)

@desertfire
Contributor

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@pytorchmergebot
Collaborator

Merge failed

Reason: 2 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team: raised by workflow job.

Failing merge rule: Core Maintainers

desertfire added a commit that referenced this pull request Dec 9, 2024
desertfire added a commit that referenced this pull request Dec 9, 2024
@etaf
Collaborator Author
etaf commented Dec 10, 2024

The failed job: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu) (gh)
inductor/test_torchinductor_opinfo.py::TestInductorOpInfoXPU::test_comprehensive_masked_cumprod_xpu_float16

is a known issue #141861 and has been fixed on the main branch by #142348.

Please ignore it.

@desertfire can we land this PR after CI finishes?

@desertfire
Contributor

> The failed job: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu) (gh) inductor/test_torchinductor_opinfo.py::TestInductorOpInfoXPU::test_comprehensive_masked_cumprod_xpu_float16
>
> is a known issue #141861 and has been fixed on the main branch by #142348.
>
> Please ignore it.
>
> @desertfire can we land this PR after CI finishes?

SGTM

@etaf
Collaborator Author
etaf commented Dec 10, 2024

@pytorchbot merge -i

@pytorchmergebot
Collaborator

Merge started

Your change will be merged while ignoring the following 1 checks: xpu / linux-jammy-xpu-2025.0-py3.9 / test (default, 4, 4, linux.idc.xpu)

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

mori360 pushed a commit to mori360/pytorch that referenced this pull request Dec 11, 2024
bluenote10 pushed a commit to bluenote10/pytorch that referenced this pull request Dec 14, 2024
bluenote10 pushed a commit to bluenote10/pytorch that referenced this pull request Dec 14, 2024
Esquains pushed a commit to Esquains/study1 that referenced this pull request Dec 15, 2024
@github-actions github-actions bot deleted the gh/etaf/63/head branch January 11, 2025 02:08
Labels
ci-no-td (Do not run TD on this PR), ciflow/inductor, ciflow/trunk (Trigger trunk jobs on your pull request), ciflow/xpu (Run XPU CI tasks), Merged, module: inductor, open source, Reverted, topic: not user facing (topic category)
Projects
Status: Done
5 participants