10000 Fix `USE_STATIC_MKL` lost functionality by xuhancn · Pull Request #138996 · pytorch/pytorch · GitHub
[go: up one dir, main page]

Skip to content

Fix USE_STATIC_MKL lost functionality #138996

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

xuhancn
Copy link
Collaborator
@xuhancn xuhancn commented Oct 26, 2024

Currently, USE_STATIC_MKL is lost functionality to control static or shared link mkl of PyTorch. The reason is cmake/Modules/FindMKL.cmake code ignore USE_STATIC_MKL cmake variable. And search MKL libraries with many work around.

This PR is target to fix this issue. It is important to PyTorch XPU version build, we expected that:

  1. In CPU and CUDA build, link MKL staticly.
  2. In XPU build, link MKL shared link. We would have oneAPI environment, we can re-use shared MKL binaries.

The MKL config, we can reference to Intel official online tool: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.htm

OS Link Type Linked MKL Binaries
Windows static mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
Windows shared mkl_intel_lp64_dll.lib mkl_intel 8000 _thread_dll.lib mkl_core_dll.lib libiomp5md.lib
Linux static libmkl_intel_lp64.a libmkl_core.a libpthread.a libm.so libdl.a
Linux shared libmkl_intel_lp64.so libmkl_gnu_thread.so libmkl_core.so libpthread.a libm.so libdl.a

After fixed USE_STATIC_MKL option, we need to install correctly MKL version. Otherwise, it shouldn't find MKL binaries. To install MKL:
Install MKL static version on Windows/Linux:

pip install mkl-include mkl-static

Install MKL shared version on Windows/Linux:

pip install mkl mkl-devel mkl-include 

Changes:

  1. Fix USE_STATIC_MKL lost functionality on Linux.
  2. Fix USE_STATIC_MKL lost functionality on Windows.
  3. Set USE_STATIC_MKL default value to ON, we recommanded to link MKL statically.
  4. Add related document to ReadMe.

TODO:
Setup correct USE_STATIC_MKL to CI system.

  1. Merge print USE_STATIC_MKL for further debug. #138902 to help debug CI.
  2. Setup USE_STATIC_MKL correctly in CI, need to match correct installed MKL version.
  3. Merge this PR after all CI passed.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

Copy link
pytorch-bot bot commented Oct 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138996

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

❌ 1 New Failure, 1 Unrelated Failure

As of commit 0743eb8 with merge base 56e1c23 (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@xuhancn xuhancn added module: mkl Related to our MKL support ciflow/trunk Trigger trunk jobs on your pull request intel This tag is for PR from Intel topic: not user facing topic category labels Oct 26, 2024
@xuhancn xuhancn requested a review from EikanWang October 27, 2024 01:53
@xuhancn xuhancn marked this pull request as ready for review October 27, 2024 08:38
@xuhancn xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from a83f021 to b920061 Compare October 28, 2024 01:58
@xuhancn xuhancn requested a review from chuanqi129 October 28, 2024 05:40
@xuhancn xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch 2 times, most recently from dffb3eb to 6661809 Compare October 28, 2024 15:08
@xuhancn xuhancn requested a review from a team as a code owner October 28, 2024 15:08
@xuhancn xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from 6661809 to 03e7cd2 Compare October 28, 2024 15:52
CMakeLists.txt Outdated
@@ -330,7 +330,7 @@ cmake_dependent_option(
set(MKLDNN_ENABLE_CONCURRENT_EXEC ${USE_MKLDNN})
cmake_dependent_option(USE_MKLDNN_CBLAS "Use CBLAS in MKLDNN" OFF "USE_MKLDNN"
OFF)
option(USE_STATIC_MKL "Prefer to link with MKL statically (Unix only)" OFF)
option(USE_STATIC_MKL "Prefer to link with MKL statically (recommanded)." ON)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rec911ended

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ezyang not understand it. And I'm still fixing CI now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a typo. But also, we don't want static MKL, this is a deliberate decision

@ezyang ezyang requested a review from malfet October 28, 2024 16:41
@ezyang ezyang added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Oct 28, 2024
@xuhancn xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from 03e7cd2 to a2799f3 Compare October 28, 2024 17:50
@xuhancn xuhancn marked this pull request as draft October 28, 2024 17:55
Copy link
Contributor
@malfet malfet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before changing the default, let's discuss benefits/drawback of doing it one way vs another. If I'm building locally, dynamic is always preferred, isn't it?
The only time you want to link with MKL statically if you ship binaries, as that makes your life much easier, but downside is bulkier releases, and considering PyPI size limit we want to rely on https://pypi.org/project/mkl/ which requires dynamic linking

@xuhancn
Copy link
Collaborator Author
xuhancn commented Oct 28, 2024

Before changing the default, let's discuss benefits/drawback of doing it one way vs another. If I'm building locally, dynamic is always preferred, isn't it? The only time you want to link with MKL statically if you ship binaries, as that makes your life much easier, but downside is bulkier releases, and considering PyPI size limit we want to rely on https://pypi.org/project/mkl/ which requires dynamic linking

Thanks for your reply, then let keep using MKL shared as default config.

@xuhancn xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch 2 times, most recently from b26c81b to 9f66386 Compare October 29, 2024 02:48
pytorchmergebot pushed a commit that referenced this pull request Oct 30, 2024
Fixes #138994

We can turn off `USE_MIMALLOC_ON_MKL` temporary. Due to it caused #138994

For totally fixed, we need fix `USE_STATIC_MKL` lost functionality issue: #138996, and then get the correctly MKL linking type(shared/static). It still need some time to pass all CI and builder scripts.

Pull Request resolved: #139204
Approved by: https://github.com/ezyang
rahulsingh-intel pushed a commit to rahulsingh-intel/pytorch that referenced this pull request Nov 5, 2024
Fixes pytorch#138994

We can turn off `USE_MIMALLOC_ON_MKL` temporary. Due to it caused pytorch#138994

For totally fixed, we need fix `USE_STATIC_MKL` lost functionality issue: pytorch#138996, and then get the correctly MKL linking type(shared/static). It still need some time to pass all CI and builder scripts.

Pull Request resolved: pytorch#139204
Approved by: https://github.com/ezyang
@xuhancn
Copy link
Collaborator Author
xuhancn commented Nov 14, 2024

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased xu_fix_USE_STATIC_MKL_lost_functionality onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xu_fix_USE_STATIC_MKL_lost_functionality && git pull --rebase)

@xuhancn xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch 2 times, most recently from 6a705bf to 00d0a9b Compare February 27, 2025 08:13
@xuhancn
Copy link
Collaborator Author
xuhancn commented Mar 11, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased xu_fix_USE_STATIC_MKL_lost_functionality onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xu_fix_USE_STATIC_MKL_lost_functionality && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from 0a552d4 to af28920 Compare March 11, 2025 01:52
@xuhancn
Copy link
Collaborator Author
xuhancn commented Mar 11, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Successfully rebased xu_fix_USE_STATIC_MKL_lost_functionality onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xu_fix_USE_STATIC_MKL_lost_functionality && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from af28920 to 26e1837 Compare March 11, 2025 10:21
@xuhancn
Copy link
Collaborator Author
xuhancn commented Mar 28, 2025

After some debug work, we fixed almost CI issues, except pytorch_xla.
For this remaining one pytorch_xla issue, I did some debug and make a status below.

pytorch_xla is failed at custom_op cmake test project build.
image
The error shows it can't find the MKL related libraries.

image
For MKL:

  1. CMake outputs shows it can't find MKL, and then it uses default configuration.
  2. I add some CMake message print shows, TORCH_LIBRARIES didn't involve MKL lib.

Due to I didn't have pytorch_xla env, I use pytorch_linux environment to simulate it.
image
On my Linux environment, I found that:

  1. There no MKL involve information like pytorch_xla
  2. TORCH_LIBRARIES as same as pytorch_xla(expected).

For the difference information as below:

-- MKL_ARCH: None, set to ` intel64` by default
-- MKL_ROOT /opt/conda
-- MKL_LINK: None, set to ` dynamic` by default
-- MKL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_THREADING: None, set to ` intel_thread` by default
-- MKL_MPI: None, set to ` intelmpi` by default

I found it is expected print from MKLconfig.cmake.
image

For MKLconfig.cmake, I discussed our MKL expert @CuiYifeng , It is only expected existing in Intel oneAPI sourced environment. (pip installed MKL would not carry this file.)
So, the question is that, why pytorch_xla environment has MKLconfig.cmake file, which is expected in Intel oneAPI sourced environment???

@chuanqi129 will help us check the CI/CD environment for MKLconfig.cmake in pytorch_xla, let's wait for a while.

@xuhancn
Copy link
Collaborator Author
xuhancn commented May 16, 2025

@pytorchbot rebase

@pytorchmergebot
Copy link
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Copy link
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/138996/head returned non-zero exit code 1

Rebasing (1/15)
Rebasing (2/15)
Auto-merging .ci/manywheel/build_common.sh
CONFLICT (modify/delete): .ci/pytorch/windows/condaenv.bat deleted in HEAD and modified in 17e88ba9c92 (setup USE_STATIC_MKL=1).  Version 17e88ba9c92 (setup USE_STATIC_MKL=1) of .ci/pytorch/windows/condaenv.bat left in tree.
error: could not apply 17e88ba9c92... setup USE_STATIC_MKL=1
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 17e88ba9c92... setup USE_STATIC_MKL=1

Raised by https://github.com/pytorch/pytorch/actions/runs/15062453526

@xuhancn xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch 2 times, most recently from a095c9a to 0743eb8 Compare May 16, 2025 20:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/binaries Trigger all binary build and upload jobs on the PR ciflow/trunk Trigger trunk jobs on your pull request ciflow/xpu Run XPU CI tasks intel This tag is for PR from Intel module: mkl Related to our MKL support open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
0