Python 3.10 + intel-openmp failed to use numactl after import torch._C #136307


Open
WeizhuoZhang-intel opened this issue Sep 19, 2024 · 12 comments
Labels
module: cpu CPU specific problem (e.g., perf, algorithm) module: intel Specific to x86 architecture module: openmp Related to OpenMP (omp) support in PyTorch needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@WeizhuoZhang-intel
Contributor
WeizhuoZhang-intel commented Sep 19, 2024

🐛 Describe the bug

Insert debug code in torch/__init__.py:

 366     if USE_GLOBAL_DEPS:
 367         _load_global_deps()
 368     import os
 369     print("Before import torch._C")
 370     os.system("numactl -C 1 ls")
 371     from torch._C import *  # noqa: F403
 372     print("After import torch._C")
 373     os.system("numactl -C 1 ls")

How to reproduce:

LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so KMP_AFFINITY=granularity=fine,compact,1,0 python -c "import torch"

Output:

(pytorch_3.10) [root@d2a4b224fd20 workspace]# LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so KMP_AFFINITY=granularity=fine,compact,1,0 python -c "import torch"
Before import torch._C
DeepSpeed  log  oneCCL  test.py  torch-ccl  vision  whls  whls.zip
 

After import torch._C
libnuma: Warning: cpu argument 1 is out of range
<1> is invalid
usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]
               [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]
               [--membind= | -m <nodes>] [--localalloc | -l] command args ...
       numactl [--show | -s]
       numactl [--hardware | -H]
       numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
               [--strict | -t]
               [--shmid | -I <id>] --shm | -S <shmkeyfile>
               [--shmid | -I <id>] --file | -f <tmpfsfile>
               [--huge | -u] [--touch | -T]
               memory policy | --dump | -d | --dump-nodes | -D
memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<nodes> is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
  netdev:DEV the node connected to network device DEV
  file:PATH  the node the block device of path is connected to
  ip:HOST    the node of the network device host routes through
  block:PATH the node of block device path
  pci:[seg:]bus:dev[:func] The node of a PCI device
<cpus> is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
<length> can have g (GB), m (MB) or k (KB) suffixes

Versions

Python 3.10
Intel-openmp: 2024

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @frank-wei

@malfet malfet added needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user module: cpu CPU specific problem (e.g., perf, algorithm) module: openmp Related to OpenMP (omp) support in PyTorch module: intel Specific to x86 architecture triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Sep 19, 2024
@malfet
Contributor
malfet commented Sep 19, 2024

Not sure I understand what is the problem/ask here...

@WeizhuoZhang-intel
Contributor Author

The problem is that when we use the torch launcher for CPU tests, it relies on numactl for core binding. We found that on Python 3.10 the core binding via numactl does not work. Digging deeper, we found that os.system("numactl -C 1 ls") fails once it runs after import torch._C: numactl is unable to bind the CPU core successfully.

This case occurs when all of the following conditions are satisfied:

  1. LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so
  2. KMP_AFFINITY=granularity=fine,compact,1,0
  3. numactl is used in the torch launcher (see the affinity-check sketch after this list).
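To make the symptom concrete, here is a minimal affinity check. This is a sketch, not part of the original report, and it assumes a Linux host where os.sched_getaffinity is available: it compares the parent process's CPU mask before and after import torch under the same preload. If the mask shrinks to a single core after the import, any numactl -C <non-zero core> launched from that process will fail exactly as shown above.

export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so
export KMP_AFFINITY=granularity=fine,compact,1,0

# Affinity mask of a plain Python process (expected: all cores visible).
python -c 'import os; print("plain python:", sorted(os.sched_getaffinity(0)))'

# Affinity mask after importing torch; if the issue reproduces, only one core
# remains, which is why the numactl child cannot bind to core 1.
python -c 'import torch, os; print("after import torch:", sorted(os.sched_getaffinity(0)))'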

@LifengWang
Contributor

Not sure I understand what is the problem/ask here...

Hi @malfet. Something seems wrong with from torch._C import * when specifying the Intel OpenMP shared library.

How to reproduce:

conda create -n pt_310 python=3.10
conda activate pt_310
pip install  torch
pip install intel-openmp numpy

Without specifying the intel openmp shared library:
[screenshot]

With specifying the intel openmp shared library:
LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so
[screenshot]

I don't know why specifying the Intel OpenMP shared library is causing a segmentation fault.

@LifengWang
Contributor
LifengWang commented Dec 4, 2024

Here is the latest finding: after setting the KMP_AFFINITY parameter, it seems that numactl can recognize only one CPU core.

Prepare the env:

conda create -y -n pt_310 python=3.10
conda activate pt_310
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip3 install intel-openmp numpy

The reproduce shell script:

export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so

# Do not set KMP_AFFINITY
export KMP_AFFINITY=
echo "=======================physcpubind 1 + KMP_AFFINITY=N/A======================================"
python -c 'import os; from torch._C import *; os.system("numactl -C 1 -s")'

# Set the KMP_AFFINITY
export KMP_AFFINITY=granularity=fine,compact,1,0
echo "=======================physcpubind 1 + KMP_AFFINITY=granularity=fine,compact,1,0==========================="
python -c 'import os; from torch._C import *; os.system("numactl -C 1 -s")'

echo "=======================physcpubind 0 + KMP_AFFINITY=granularity=fine,compact,1,0==========================="
python -c 'import os; from torch._C import *; os.system("numactl -C 0 -s")'

The test results are as follows: after setting the KMP_AFFINITY parameter, numactl can bind only to core 0. It seems that numactl can recognize only one CPU core after setting KMP_AFFINITY=granularity=fine,compact,1,0.

=======================physcpubind 1 + KMP_AFFINITY=N/A======================================
policy: default
preferred node: current
physcpubind: 1
cpubind: 0
nodebind: 0
membind: 0 1
=======================physcpubind 1 + KMP_AFFINITY=granularity=fine,compact,1,0===========================
libnuma: Warning: cpu argument 1 is out of range

<1> is invalid
usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]
               [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]
               [--membind= | -m <nodes>] [--localalloc | -l] command args ...
       numactl [--show | -s]
       numactl [--hardware | -H]
       numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
               [--strict | -t]
               [--shmid | -I <id>] --shm | -S <shmkeyfile>
               [--shmid | -I <id>] --file | -f <tmpfsfile>
               [--huge | -u] [--touch | -T]
               memory policy | --dump | -d | --dump-nodes | -D

memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<nodes> is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
  netdev:DEV the node connected to network device DEV
  file:PATH  the node the block device of path is connected to
  ip:HOST    the node of the network device host routes through
  block:PATH the node of block device path
  pci:[seg:]bus:dev[:func] The node of a PCI device
<cpus> is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
<length> can have g (GB), m (MB) or k (KB) suffixes
=======================physcpubind 0 + KMP_AFFINITY=granularity=fine,compact,1,0===========================
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1

@yuchengliu1
Contributor
yuchengliu1 commented Feb 18, 2025

Update on the latest findings:
This issue might be related to GOMP. I compiled PyTorch from source on a machine without GOMP, and the bug did not occur; the PyTorch libraries were linked against llvm-openmp. However, when I installed PyTorch with pip on the same machine, the issue reappeared, because the pip-installed PyTorch libraries link against a GOMP library that is downloaded together with PyTorch.

To record some debugging progress along the way: I tried using the build option INTEL_OMP_DIR to point to llvm-openmp when building on a machine that has both GOMP and llvm-openmp, but the resulting PyTorch was still partially linked against GOMP and still triggered this bug.
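For reference, one way to check which OpenMP runtimes a given PyTorch build links against is to run ldd on libtorch_cpu.so. This is a sketch; the exact library path depends on the install (shown here for a typical pip/conda environment on Linux).

# Locate the torch library directory of the active environment.
TORCH_LIB=$(python -c 'import torch, os; print(os.path.join(os.path.dirname(torch.__file__), "lib"))')
# List the OpenMP runtimes (libgomp, libiomp5, libomp) pulled in by the CPU backend.
ldd "$TORCH_LIB"/libtorch_cpu.so | grep -i omp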

@mwlon
Contributor
mwlon commented Feb 19, 2025

I think it is expected that linking/dlopening multiple lib*omp implementations will cause problems. Last I checked, pytorch's MKL DNN component links to gomp regardless of whether you have it on the machine, so I suspect you'd also need to disable MKL DNN in the custom build.

@vpirogov

Last I checked, pytorch's MKL DNN component links to gomp regardless of whether you have it on the machine

The flavor of OpenMP MKL DNN uses is defined by the compiler. GCC builds will always link to libgomp, CLANG builds will always link to libomp.

@chunyuan-w
Collaborator

(quoting @LifengWang above) I don't know why specifying the Intel OpenMP shared library is causing a segmentation fault.

This seg fault issue can be fixed by the next Intel OpenMP release (ETA end of March).
As for the original issue in the description, @yuchengliu1 will continue to look into it.

@yuchengliu1
Contributor

This issue is caused by intel-openmp and is not related to PyTorch. It can be reproduced in 3 steps:

  1. Use intel-openmp and set KMP_AFFINITY=granularity=fine,compact,1,0.
  2. Start a process that calls an OMP function (e.g. omp_get_max_threads).
  3. Create a subprocess and bind it to any non-zero core with numactl.

The MKL used by PyTorch calls some OMP functions during initialization. Intel OpenMP has already bound the process to a core in step 2, so numactl can find only one core in step 3.
One way to mitigate this is to add the reset modifier to the KMP_AFFINITY value (a sketch follows below):
KMP_AFFINITY=reset,granularity=fine,compact,1,0
It still needs to be checked whether binding the core in step 2 is a bug.
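A sketch of that mitigation applied to the reproducer from the earlier comments (same environment; the reset modifier is the only change, and whether it fully restores the mask is exactly what remains to be verified):

export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so
# The reset modifier asks the Intel OpenMP runtime to restore the original
# affinity mask, so the numactl child should again be able to bind core 1.
export KMP_AFFINITY=reset,granularity=fine,compact,1,0
python -c 'import os; from torch._C import *; os.system("numactl -C 1 -s")'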

@chunyuan-w
Collaborator

(quoting @LifengWang above) I don't know why specifying the Intel OpenMP shared library is causing a segmentation fault.

This seg fault issue should have been fixed by the latest intel-openmp release, 2025.1.0, which is already available. @LifengWang, you could try whether intel-openmp==2025.1.0 fixes this seg fault error.
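A sketch of the suggested verification, using the version named above and the reproducer from the issue description:

pip install intel-openmp==2025.1.0
# Re-run the original reproducer with the updated runtime.
LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so \
KMP_AFFINITY=granularity=fine,compact,1,0 python -c "import torch"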

@LifengWang
Contributor

Hi @chunyuan-w. I have verified that using intel-openmp==2025.1.0 with the PyTorch 0427 nightly wheels resolves the segmentation fault issue.

@yuchengliu1
Contributor

(quoting the earlier analysis above) This issue is caused by intel-openmp and is not related to PyTorch; the mitigation is to add the reset modifier, i.e. KMP_AFFINITY=reset,granularity=fine,compact,1,0.

This bug will be fixed by the Intel OpenMP shipped with oneAPI 2025.2 (ETA end of June).
