Fix `USE_STATIC_MKL` lost functionality #138996

xuhancn · 2024-10-26T19:16:45Z

Currently, USE_STATIC_MKL has lost its ability to control whether PyTorch links MKL statically or as a shared library. The reason is that cmake/Modules/FindMKL.cmake ignores the USE_STATIC_MKL CMake variable and instead searches for the MKL libraries through a number of workarounds.

This PR aims to fix this issue. It is important for the PyTorch XPU build, where we expect that:

In CPU and CUDA builds, MKL is linked statically.
In XPU builds, MKL is linked as a shared library. A oneAPI environment is available there, so we can reuse its shared MKL binaries.

For the MKL configuration, we can refer to Intel's official online tool: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.htm

OS	Link Type	Linked MKL Binaries
Windows	static	mkl_intel_lp64.lib mkl_intel_thread.lib mkl_core.lib libiomp5md.lib
Windows	shared	mkl_intel_lp64_dll.lib mkl_intel_thread_dll.lib mkl_core_dll.lib libiomp5md.lib
Linux	static	libmkl_intel_lp64.a libmkl_core.a libpthread.a libm.so libdl.a
Linux	shared	libmkl_intel_lp64.so libmkl_gnu_thread.so libmkl_core.so libpthread.a libm.so libdl.a
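
As a rough illustration of the mapping in the table, the lookup could be sketched in Python as below. The library names are taken from the table; the dictionary layout and the `mkl_libraries` helper are hypothetical and are not how FindMKL.cmake is implemented.

```python
from __future__ import annotations

# Library names copied from the table in the PR description. The structure
# of this mapping is an illustrative assumption, not FindMKL.cmake code.
MKL_LINK_LINES = {
    ("Windows", "static"): [
        "mkl_intel_lp64.lib", "mkl_intel_thread.lib",
        "mkl_core.lib", "libiomp5md.lib",
    ],
    ("Windows", "shared"): [
        "mkl_intel_lp64_dll.lib", "mkl_intel_thread_dll.lib",
        "mkl_core_dll.lib", "libiomp5md.lib",
    ],
    ("Linux", "static"): [
        "libmkl_intel_lp64.a", "libmkl_core.a",
        "libpthread.a", "libm.so", "libdl.a",
    ],
    ("Linux", "shared"): [
        "libmkl_intel_lp64.so", "libmkl_gnu_thread.so", "libmkl_core.so",
        "libpthread.a", "libm.so", "libdl.a",
    ],
}


def mkl_libraries(os_name: str, use_static_mkl: bool) -> list[str]:
    """Return the MKL binaries the table lists for an OS and link type."""
    link_type = "static" if use_static_mkl else "shared"
    try:
        # Return a copy so callers cannot mutate the shared table.
        return list(MKL_LINK_LINES[(os_name, link_type)])
    except KeyError:
        raise ValueError(f"no MKL link line recorded for {os_name!r}") from None
```

For example, `mkl_libraries("Linux", use_static_mkl=True)` returns the Linux static row of the table.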

With the USE_STATIC_MKL option fixed, the matching MKL packages must be installed; otherwise the build will not find the MKL binaries. To install MKL:
Install the static MKL version on Windows/Linux:

pip install mkl-include mkl-static

Install the shared MKL version on Windows/Linux:

pip install mkl mkl-devel mkl-include
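
A minimal sketch of how a build script might pick the matching pip packages from USE_STATIC_MKL is shown below. The package lists come from the two commands above; the helper names and the set of truthy spellings are assumptions made for illustration only.

```python
import os

# Spellings treated as "on" for the flag. This set is an assumption for
# illustration; check the build scripts for the values actually accepted.
_TRUTHY = {"ON", "1", "YES", "TRUE", "Y"}


def use_static_mkl(environ=None):
    """Read USE_STATIC_MKL from an environment mapping (default: off)."""
    env = os.environ if environ is None else environ
    return env.get("USE_STATIC_MKL", "OFF").upper() in _TRUTHY


def mkl_pip_packages(static):
    """Return the pip packages listed in the PR for the chosen link type."""
    if static:
        return ["mkl-include", "mkl-static"]
    return ["mkl", "mkl-devel", "mkl-include"]


def pip_install_command(static):
    """Build the pip command line for the chosen link type."""
    return "pip install " + " ".join(mkl_pip_packages(static))


if __name__ == "__main__":
    print(pip_install_command(use_static_mkl()))
```

Keeping the flag and the installed package set in one place is one way to avoid the mismatch described above, where the build cannot find the MKL binaries.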

Changes:

Fix USE_STATIC_MKL lost functionality on Linux.
Fix USE_STATIC_MKL lost functionality on Windows.
Set the USE_STATIC_MKL default value to ON; we recommend linking MKL statically.
Add related documentation to the README.

TODO:
Set up the correct USE_STATIC_MKL value in the CI system.

Merge "print USE_STATIC_MKL for further debug" (#138902) to help debug CI.
Set USE_STATIC_MKL correctly in CI so that it matches the installed MKL version.
Merge this PR after all CI passes.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot · 2024-10-26T19:16:49Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138996

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

CUDA not found in NVIDIA runners

❌ 1 New Failure, 1 Unrelated Failure

As of commit 0743eb8 with merge base 56e1c23 ():

NEW FAILURE - The following job has failed:

pull / linux-focal-py3_9-clang9-xla / build (gh)
/var/lib/jenkins/workspace/third_party/googletest/googlemock/src/gmock-internal-utils.cc:186:36: error: too few arguments to function call, expected 2, have 1

FLAKY - The following job failed but was likely due to flakiness present on trunk:

xpu / linux-jammy-xpu-2025.1-py3.9 / test (default, 5, 6, linux.idc.xpu) (gh) (disabled by #153608 but the issue was closed recently and a rebase is needed to make it pass)
inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_checkpointing_without_reentrant_dataparallel

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ezyang · 2024-10-28T16:41:30Z

CMakeLists.txt

@@ -330,7 +330,7 @@ cmake_dependent_option(
 set(MKLDNN_ENABLE_CONCURRENT_EXEC ${USE_MKLDNN})
 cmake_dependent_option(USE_MKLDNN_CBLAS "Use CBLAS in MKLDNN" OFF "USE_MKLDNN"
                       OFF)
-option(USE_STATIC_MKL "Prefer to link with MKL statically (Unix only)" OFF)
+option(USE_STATIC_MKL "Prefer to link with MKL statically (recommanded)." ON)


recommended

@ezyang I don't understand this comment. I'm still fixing CI at the moment.

There is a typo. But also, we don't want static MKL, this is a deliberate decision

malfet

Before changing the default, let's discuss the benefits and drawbacks of doing it one way vs. another. If I'm building locally, dynamic is always preferred, isn't it?
The only time you want to link with MKL statically is if you ship binaries, as that makes your life much easier, but the downside is bulkier releases, and considering the PyPI size limit we want to rely on https://pypi.org/project/mkl/ which requires dynamic linking.

xuhancn · 2024-10-28T18:14:12Z

Before changing the default, let's discuss the benefits and drawbacks of doing it one way vs. another. If I'm building locally, dynamic is always preferred, isn't it? The only time you want to link with MKL statically is if you ship binaries, as that makes your life much easier, but the downside is bulkier releases, and considering the PyPI size limit we want to rely on https://pypi.org/project/mkl/ which requires dynamic linking.

Thanks for your reply. Then let's keep shared MKL as the default configuration.

Fixes #138994. We can turn off `USE_MIMALLOC_ON_MKL` temporarily, because it caused #138994. For a complete fix, we need to fix the `USE_STATIC_MKL` lost-functionality issue (#138996) and then pick the correct MKL link type (shared/static). It will still take some time to pass all CI and builder scripts. Pull Request resolved: #139204. Approved by: https://github.com/ezyang

xuhancn · 2024-11-14T13:57:27Z

@pytorchbot rebase

pytorchmergebot · 2024-11-14T13:59:09Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2024-11-14T13:59:13Z

Successfully rebased xu_fix_USE_STATIC_MKL_lost_functionality onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xu_fix_USE_STATIC_MKL_lost_functionality && git pull --rebase)

xuhancn · 2025-03-11T01:50:34Z

@pytorchbot rebase

pytorchmergebot · 2025-03-11T01:51:59Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-03-11T01:52:03Z

Successfully rebased xu_fix_USE_STATIC_MKL_lost_functionality onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xu_fix_USE_STATIC_MKL_lost_functionality && git pull --rebase)

xuhancn · 2025-03-11T10:19:56Z

@pytorchbot rebase

pytorchmergebot · 2025-03-11T10:21:26Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-03-11T10:21:30Z

Successfully rebased xu_fix_USE_STATIC_MKL_lost_functionality onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout xu_fix_USE_STATIC_MKL_lost_functionality && git pull --rebase)

xuhancn · 2025-03-28T02:42:22Z

After some debugging, we have fixed almost all CI issues except pytorch_xla.
For the remaining pytorch_xla issue, I did some debugging and summarize the status below.

pytorch_xla fails while building the custom_op CMake test project.

The error shows that it can't find the MKL-related libraries.

For MKL:

The CMake output shows that it can't find MKL and then falls back to the default configuration.
I added some CMake message prints, which show that TORCH_LIBRARIES does not include the MKL libraries.

Since I don't have a pytorch_xla environment, I used the pytorch_linux environment to simulate it.

In my Linux environment, I found that:

There is no MKL-related output like the one seen in pytorch_xla.
TORCH_LIBRARIES is the same as in pytorch_xla (expected).

The output that differs is below:

-- MKL_ARCH: None, set to ` intel64` by default
-- MKL_ROOT /opt/conda
-- MKL_LINK: None, set to ` dynamic` by default
-- MKL_INTERFACE_FULL: None, set to ` intel_ilp64` by default
-- MKL_THREADING: None, set to ` intel_thread` by default
-- MKL_MPI: None, set to ` intelmpi` by default

I found that this is the expected output from MKLconfig.cmake.
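
To spot these lines in a CI log automatically, something like the following Python sketch could be used. The regular expression is written against the six lines quoted above only; the `parse_mkl_defaults` name is hypothetical, and other versions of the MKL CMake config may print a different format.

```python
import re

# Matches lines such as:
#   -- MKL_LINK: None, set to ` dynamic` by default
# The pattern is derived from the log excerpt in this comment only.
_DEFAULT_LINE = re.compile(
    r"^-- (MKL_\w+): None, set to `\s*([^`]+?)\s*` by default$"
)


def parse_mkl_defaults(log_text):
    """Collect MKL variables that the config script set to a default value."""
    defaults = {}
    for line in log_text.splitlines():
        match = _DEFAULT_LINE.match(line.strip())
        if match:
            defaults[match.group(1)] = match.group(2)
    return defaults
```

A non-empty result suggests that an MKL CMake config file was picked up during configuration, which is the situation being investigated here.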

As for MKLconfig.cmake, I discussed it with our MKL expert @CuiYifeng: it is only expected to exist in an environment where Intel oneAPI has been sourced. (A pip-installed MKL does not ship this file.)
So the question is: why does the pytorch_xla environment have an MKLconfig.cmake file, which is only expected in an Intel oneAPI sourced environment?

@chuanqi129 will help us check the CI/CD environment for MKLconfig.cmake in pytorch_xla, let's wait for a while.
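
For the environment check, a small Python sketch like the one below could list any MKL CMake config files under a prefix such as the /opt/conda root shown in the log. The `find_mkl_config` helper is hypothetical, and the file name is matched case-insensitively because the exact capitalization on disk is an assumption.

```python
from pathlib import Path


def find_mkl_config(prefix):
    """Return sorted paths of MKL CMake config files found under prefix."""
    root = Path(prefix)
    if not root.is_dir():
        return []
    # Compare lower-cased names so both MKLConfig.cmake and MKLconfig.cmake
    # spellings are found.
    return sorted(
        path for path in root.rglob("*.cmake")
        if path.name.lower() == "mklconfig.cmake"
    )


if __name__ == "__main__":
    for path in find_mkl_config("/opt/conda"):
        print(path)
```

An empty result on a pip-only environment and a non-empty one on the pytorch_xla image would be consistent with the explanation above.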

xuhancn · 2025-05-16T06:49:37Z

@pytorchbot rebase

pytorchmergebot · 2025-05-16T06:51:13Z

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

pytorchmergebot · 2025-05-16T06:51:15Z

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/138996/head returned non-zero exit code 1

Rebasing (1/15)
Rebasing (2/15)
Auto-merging .ci/manywheel/build_common.sh
CONFLICT (modify/delete): .ci/pytorch/windows/condaenv.bat deleted in HEAD and modified in 17e88ba9c92 (setup USE_STATIC_MKL=1).  Version 17e88ba9c92 (setup USE_STATIC_MKL=1) of .ci/pytorch/windows/condaenv.bat left in tree.
error: could not apply 17e88ba9c92... setup USE_STATIC_MKL=1
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 17e88ba9c92... setup USE_STATIC_MKL=1

Raised by https://github.com/pytorch/pytorch/actions/runs/15062453526

xuhancn added the module: mkl, ciflow/trunk, intel, and topic: not user facing labels Oct 26, 2024

pytorchbot added the open source label Oct 26, 2024

xuhancn requested a review from EikanWang October 27, 2024 01:53

xuhancn marked this pull request as ready for review October 27, 2024 08:38

xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from a83f021 to b920061 Compare October 28, 2024 01:58

xuhancn requested a review from chuanqi129 October 28, 2024 05:40

xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch 2 times, most recently from dffb3eb to 6661809 Compare October 28, 2024 15:08

xuhancn requested a review from a team as a code owner October 28, 2024 15:08

xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from 6661809 to 03e7cd2 Compare October 28, 2024 15:52

ezyang reviewed Oct 28, 2024

ezyang requested a review from malfet October 28, 2024 16:41

ezyang added the triaged label Oct 28, 2024

xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from 03e7cd2 to a2799f3 Compare October 28, 2024 17:50

xuhancn marked this pull request as draft October 28, 2024 17:55

malfet requested changes Oct 28, 2024

xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch 2 times, most recently from b26c81b to 9f66386 Compare October 29, 2024 02:48

This was referenced Oct 29, 2024

[Windows][cpu] mkl use mimalloc as allocator on Windows #138419

Closed

turn off USE_MIMALLOC_ON_MKL temporary. #139204

Closed

xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch 2 times, most recently from 6a705bf to 00d0a9b Compare February 27, 2025 08:13

pytorchmergebot force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from 0a552d4 to af28920 Compare March 11, 2025 01:52

pytorchmergebot force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch from af28920 to 26e1837 Compare March 11, 2025 10:21

xuhancn added 14 commits May 16, 2025 14:52

Fix USE_STATIC_MKL

6c62bbe

* fix USE_STATIC_MKL on Linux. * fix USE_STATIC_MKL on Windows. * keep set USE_STATIC_MKL off. * fix shared mkl version number.

setup USE_STATIC_MKL=1

321a672

update code.

705e8fc

clean env

e007242

setup USE_STATIC_MKL=1 for wheel build.

bc02622

update code.

5922525

try to fix torch_cuda_linalg mkl dependency.

6e98405

update code.

cca60e6

test code.

05b5b36

try to fix issue 146551

ecf23bc

use static mkl for XPU build.

4e98ef5

try to fix xla.

c3fef53

add comments for torch_cuda_linalg fixing.

0743eb8

remove debug log. Work around MKL for CUDA.

xuhancn force-pushed the xu_fix_USE_STATIC_MKL_lost_functionality branch 2 times, most recently from a095c9a to 0743eb8 Compare May 16, 2025 20:34
