[Windows][cpu] mkl use mimalloc as allocator on Windows by xuhancn · Pull Request #138419 · pytorch/pytorch · GitHub
[Windows][cpu] mkl use mimalloc as allocator on Windows #138419


Closed
xuhancn wants to merge 2 commits from the xu_mkl_use_mimalloc branch

Conversation

xuhancn
Collaborator
@xuhancn xuhancn commented Oct 20, 2024

We did a lot of optimization for PyTorch Windows and have made good progress. But some models still show a performance gap between PyTorch Windows and PyTorch Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
From the blog's conclusion, we found that ResNet50 is a typical case.

Let's focus on ResNet50 and collect a profiling log:

(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624

We found that the major kernel consuming CPU time is aten::mkldnn_convolution, which is dispatched to MKLDNN.
Actually, we had already optimized memory allocation by integrating mimalloc into PyTorch's C10 module. That helped PyTorch Windows a lot, but it does not cover MKL's and MKLDNN's intermediate temporary memory.
We still have potential to improve PyTorch Windows performance by optimizing MKL's and MKLDNN's intermediate temporary memory.

So I discussed with the Intel MKL team and got a method to register high performance memory allocation APIs with MKL, which helps MKL boost memory performance. Please check the online document: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html

This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:

  1. Add cmake option USE_MIMALLOC_ON_MKL; it is a sub-option of USE_MIMALLOC.
  2. Wrap and export the mi_malloc APIs in C10 when USE_MIMALLOC_ON_MKL is ON.
  3. Add MklAllocationHelp.cpp to register the allocation APIs with MKL when USE_MIMALLOC_ON_MKL is ON.

For oneDNN, it is still tracking in this proposal: uxlfoundation/oneDNN#1898

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10

pytorch-bot bot commented Oct 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/138419

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 New Failure, 1 Unrelated Failure

As of commit 8094594 with merge base 675e16e (image):

NEW FAILURE - The following job has failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@xuhancn xuhancn added module: windows Windows support for PyTorch module: mkl Related to our MKL support ciflow/trunk Trigger trunk jobs on your pull request topic: not user facing topic category intel This tag is for PR from Intel module: cpu CPU specific problem (e.g., perf, algorithm) labels Oct 20, 2024
@xuhancn xuhancn changed the title [Windows][cpu] mkl use mimalloc [Windows][cpu] mkl use mimalloc as allocator on Windows Oct 20, 2024
@xuhancn xuhancn force-pushed the xu_mkl_use_mimalloc branch from 3a8736e to 6d601a9 Compare October 20, 2024 10:53
@xuhancn xuhancn requested a review from jgong5 October 20, 2024 13:01
* add cmake option: USE_MIMALLOC_ON_MKL
* wrap and export mi_malloc APIs in C10.
* add MklAllocationHelp.cpp to register allocation APIs to MKL.
@xuhancn xuhancn force-pushed the xu_mkl_use_mimalloc branch from bc7aeee to bb746b1 Compare October 21, 2024 14:59
Comment on lines +12 to +20
#ifdef USE_MIMALLOC_ON_MKL
namespace mi_malloc_wrapper {
C10_API void* c10_mi_malloc(size_t size);
C10_API void* c10_mi_calloc(size_t count, size_t size);
C10_API void* c10_mi_realloc(void* p, size_t newsize);
C10_API void* c10_mi_malloc_aligned(size_t size, size_t alignment);
C10_API void c10_mi_free(void* p);
} // namespace mi_malloc_wrapper
#endif
Collaborator

Why do we need this?

Collaborator Author

We integrated mimalloc into c10.dll in a previous PR. In this PR we need to optimize the kernels in libtorch_cpu.dll.
This C++ header exports the mimalloc functions from c10.dll so that libtorch_cpu.dll can import them.

Collaborator

What kernels in libtorch_cpu.dll need these exported API? I don't see the dependencies in this PR.

Collaborator

OK, it looks good. I thought the allocations are all registered in c10.

@xuhancn xuhancn requested a review from jgong5 October 22, 2024 04:47

@xuhancn xuhancn requested review from malfet and ezyang October 22, 2024 08:43
@xuhancn
Collaborator Author
xuhancn commented Oct 24, 2024

hi @malfet, @ezyang
Could you please help review this PR?

Contributor
@ezyang ezyang left a comment

This is kind of annoying but I guess we can do Windows crimes

@ezyang
Contributor
ezyang commented Oct 24, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@pytorchmergebot
Collaborator

Merge failed

Reason: 1 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@xuhancn
Collaborator Author
xuhancn commented Oct 24, 2024

@pytorchbot merge -f "ignore unrelated failure."

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@xuhancn xuhancn deleted the xu_mkl_use_mimalloc branch October 24, 2024 05:31
@gshimansky

Which version of oneAPI are you using to build pytorch? I am using oneAPI binaries from https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html and I am getting linking errors:

   Creating library lib\torch_cpu.lib and object lib\torch_cpu.exp
MklAllocationHelper.cpp.obj : error LNK2019: unresolved external symbol i_malloc referenced in function "void __cdecl `dynamic initializer for 'g_b_registered_mkl_alloction''(void)" (??__Eg_b_registered_mkl_alloction@@YAXXZ)
MklAllocationHelper.cpp.obj : error LNK2019: unresolved external symbol i_calloc referenced in function "void __cdecl `dynamic initializer for 'g_b_registered_mkl_alloction''(void)" (??__Eg_b_registered_mkl_alloction@@YAXXZ)
MklAllocationHelper.cpp.obj : error LNK2019: unresolved external symbol i_realloc referenced in function "void __cdecl `dynamic initializer for 'g_b_registered_mkl_alloction''(void)" (??__Eg_b_registered_mkl_alloction@@YAXXZ)
MklAllocationHelper.cpp.obj : error LNK2019: unresolved external symbol i_free referenced in function "void __cdecl `dynamic initializer for 'g_b_registered_mkl_alloction''(void)" (??__Eg_b_registered_mkl_alloction@@YAXXZ)
bin\torch_cpu.dll : fatal error LNK1120: 4 unresolved externals

Looking at CMakeFiles/torch_cpu.rsp I see that mkl_core_dll.lib is used for linking and it doesn't define these i_* symbols.

@xuhancn
Collaborator Author
xuhancn commented Oct 26, 2024


Hi @gshimansky,
Please comment out these lines temporarily:

i_malloc = c10::mi_malloc_wrapper::c10_mi_malloc;
i_calloc = c10::mi_malloc_wrapper::c10_mi_calloc;
i_realloc = c10::mi_malloc_wrapper::c10_mi_realloc;
i_free = c10::mi_malloc_wrapper::c10_mi_free;

I'm refactoring the MKL cmake file and expect to fix this properly later.

@gshimansky

For now I've turned this feature off in cmake with USE_MIMALLOC OFF. Please let me know when it works with dynamic MKL.

@xuhancn
Collaborator Author
xuhancn commented Oct 29, 2024


Hi @gshimansky, you can set just USE_MIMALLOC_ON_MKL=OFF and keep the other mimalloc optimizations for better performance.
Currently the USE_STATIC_MKL option has lost its functionality, so we can't detect the existing MKL build/link type and handle this issue. We will fix it after #138996 is merged.
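For anyone hitting the same link error with dynamic MKL, a hedged sketch of the suggested workaround (this assumes PyTorch's setup.py forwards USE_* environment variables to CMake, as it does for the other USE_ flags; USE_MIMALLOC_ON_MKL is the option added by this PR):

```shell
# Keep the general mimalloc integration in c10, but disable only the
# MKL hook that needs the i_malloc symbols from static MKL.
export USE_MIMALLOC=1
export USE_MIMALLOC_ON_MKL=0
python setup.py develop
```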

@xuhancn
Copy link
Collaborator Author
xuhancn commented Oct 31, 2024

@gshimansky We disabled it temporarily: #139204

Labels
ciflow/trunk Trigger trunk jobs on your pull request intel This tag is for PR from Intel Merged module: cpu CPU specific problem (e.g., perf, algorithm) module: mkl Related to our MKL support module: windows Windows support for PyTorch open source topic: not user facing topic category
6 participants