[Windows][cpu] mkl use mimalloc as allocator on Windows by xuhancn · Pull Request #138419 · pytorch/pytorch · GitHub
[Windows][cpu] mkl use mimalloc as allocator on Windows #138419


Closed
xuhancn wants to merge 2 commits from the xu_mkl_use_mimalloc branch

Conversation

xuhancn
Collaborator
@xuhancn xuhancn commented Oct 20, 2024

We did a lot of optimization for PyTorch Windows and have made good progress. But some models still show a performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, we found that ResNet50 is a typical case.

Let's focus on ResNet50 and collect a profiling log:

(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624

We found that the major kernel consuming CPU time is aten::mkldnn_convolution, which is dispatched to MKLDNN.
Actually, we had already optimized memory allocation by integrating mimalloc into PyTorch's C10 module. That helped PyTorch Windows a lot, but it does not cover MKL's and MKLDNN's intermediate temporary memory.
We still have potential to improve PyTorch Windows performance by optimizing MKL's and MKLDNN's intermediate temporary memory.

So I discussed with the Intel MKL team and got a method to register high performance memory allocation APIs with MKL, which helps MKL boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html

This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:

  1. Add cmake option USE_MIMALLOC_ON_MKL; it is a sub-option of USE_MIMALLOC.
  2. Wrap and export the mi_malloc APIs in C10 when USE_MIMALLOC_ON_MKL is ON.
  3. Add MklAllocationHelp.cpp to register the allocation APIs with MKL when USE_MIMALLOC_ON_MKL is ON.

For oneDNN, it is still tracking in this proposal: uxlfoundation/oneDNN#1898

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot bot commented Oct 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138419

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 1 Unrelated Failure

As of commit 8094594 with merge base 675e16e (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@xuhancn xuhancn added module: windows Windows support for PyTorch module: mkl Related to our MKL support ciflow/trunk Trigger trunk jobs on your pull request topic: not user facing topic category intel This tag is for PR from Intel module: cpu CPU specific problem (e.g., perf, algorithm) labels Oct 20, 2024
@xuhancn xuhancn changed the title [Windows][cpu] mkl use mimalloc [Windows][cpu] mkl use mimalloc as allocator on Windows Oct 20, 2024
@xuhancn xuhancn force-pushed the xu_mkl_use_mimalloc branch from 3a8736e to 6d601a9 Compare October 20, 2024 10:53
@xuhancn xuhancn requested a review from jgong5 October 20, 2024 13:01
* add cmake option: USE_MIMALLOC_ON_MKL
* wrap and export mi_malloc APIs in C10.
* add MklAllocationHelp.cpp to register allocation APIs to MKL.
@xuhancn xuhancn force-pushed the xu_mkl_use_mimalloc branch from bc7aeee to bb746b1 Compare October 21, 2024 14:59
Comment on lines +12 to +20
#ifdef USE_MIMALLOC_ON_MKL
namespace mi_malloc_wrapper {
C10_API void* c10_mi_malloc(size_t size);
C10_API void* c10_mi_calloc(size_t count, size_t size);
C10_API void* c10_mi_realloc(void* p, size_t newsize);
C10_API void* c10_mi_malloc_aligned(size_t size, size_t alignment);
C10_API void c10_mi_free(void* p);
} // namespace mi_malloc_wrapper
#endif
Collaborator

Why do we need this?

Collaborator Author

We integrated mimalloc into c10.dll in a previous PR. In this PR we need to optimize the kernels in libtorch_cpu.dll.
This C++ header exports the mimalloc functions from c10.dll so that libtorch_cpu.dll can import them.

Collaborator

What kernels in libtorch_cpu.dll need these exported API? I don't see the dependencies in this PR.

Collaborator

OK, it looks good. I thought the allocations are all registered in c10.

@xuhancn xuhancn requested a review from jgong5 October 22, 2024 04:47

@xuhancn xuhancn requested review from malfet and ezyang October 22, 2024 08:43
@xuhancn
Collaborator Author
xuhancn commented Oct 24, 2024

hi @malfet, @ezyang
Could you please help review this PR?

Contributor
@ezyang ezyang left a comment

This is kind of annoying but I guess we can do Windows crimes

@ezyang
Contributor
ezyang commented Oct 24, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@xuhancn
Collaborator Author
xuhancn commented Oct 24, 2024

@pytorchbot merge -f "ignore unrelated failure."

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@xuhancn xuhancn deleted the xu_mkl_use_mimalloc branch October 24, 2024 05:31
@gshimansky

Which version of oneAPI are you using to build pytorch? I am using oneAPI binaries from https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html and I am getting linking errors:

   Creating library lib\torch_cpu.lib and object lib\torch_cpu.exp
MklAllocationHelper.cpp.obj : error LNK2019: unresolved external symbol i_malloc referenced in function "void __cdecl `dynamic initializer for 'g_b_registered_mkl_alloction''(void)" (??__Eg_b_registered_mkl_alloction@@YAXXZ)
MklAllocationHelper.cpp.obj : error LNK2019: unresolved external symbol i_calloc referenced in function "void __cdecl `dynamic initializer for 'g_b_registered_mkl_alloction''(void)" (??__Eg_b_registered_mkl_alloction@@YAXXZ)
MklAllocationHelper.cpp.obj : error LNK2019: unresolved external symbol i_realloc referenced in function "void __cdecl `dynamic initializer for 'g_b_registered_mkl_alloction''(void)" (??__Eg_b_registered_mkl_alloction@@YAXXZ)
MklAllocationHelper.cpp.obj : error LNK2019: unresolved external symbol i_free referenced in function "void __cdecl `dynamic initializer for 'g_b_registered_mkl_alloction''(void)" (??__Eg_b_registered_mkl_alloction@@YAXXZ)
bin\torch_cpu.dll : fatal error LNK1120: 4 unresolved externals

Looking at CMakeFiles/torch_cpu.rsp I see that mkl_core_dll.lib is used for linking and it doesn't define these i_* symbols.

@xuhancn
Collaborator Author
xuhancn commented Oct 26, 2024


Hi @gshimansky,
Please comment out these lines temporarily:

i_malloc = c10::mi_malloc_wrapper::c10_mi_malloc;
i_calloc = c10::mi_malloc_wrapper::c10_mi_calloc;
i_realloc = c10::mi_malloc_wrapper::c10_mi_realloc;
i_free = c10::mi_malloc_wrapper::c10_mi_free;

I'm refactoring the MKL cmake file and expect to fix this properly later.

@gshimansky

For now I've turned this feature off in cmake with USE_MIMALLOC OFF. Please let me know when it works with dynamic MKL.

@xuhancn
Collaborator Author
xuhancn commented Oct 29, 2024


Hi @gshimansky, you can set just USE_MIMALLOC_ON_MKL=OFF and keep the other mimalloc optimizations for better performance.
Currently the USE_STATIC_MKL option has lost its functionality, so we can't detect the existing MKL build/link type and handle this issue. We will fix it after #138996 is merged.
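For anyone hitting the same link error with dynamic MKL, a hedged sketch of the suggested workaround (this assumes PyTorch's setup.py forwards USE_* environment variables to CMake, as it does for the other USE_ flags; USE_MIMALLOC_ON_MKL is the option added by this PR):

```shell
# Keep the general mimalloc integration in c10, but disable only the
# MKL hook that needs the i_malloc symbols from static MKL.
export USE_MIMALLOC=1
export USE_MIMALLOC_ON_MKL=0
python setup.py develop
```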

@xuhancn
Copy link
Collaborator Author
xuhancn commented Oct 31, 2024

@gshimansky We disabled it temporarily: #139204

Labels
ciflow/trunk Trigger trunk jobs on your pull request intel This tag is for PR from Intel Merged module: cpu CPU specific problem (e.g., perf, algorithm) module: mkl Related to our MKL support module: windows Windows support for PyTorch open source topic: not user facing topic category
6 participants