Python 3.10 + intel-openmp failed to use numactl after import torch._C #136307


Open
WeizhuoZhang-intel opened this issue Sep 19, 2024 · 12 comments
Labels
module: cpu CPU specific problem (e.g., perf, algorithm) module: intel Specific to x86 architecture module: openmp Related to OpenMP (omp) support in PyTorch needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@WeizhuoZhang-intel
Contributor
WeizhuoZhang-intel commented Sep 19, 2024

🐛 Describe the bug

Insert debug code in torch/__init__.py:

 366     if USE_GLOBAL_DEPS:
 367         _load_global_deps()
 368     import os
 369     print("Before import torch._C")
 370     os.system("numactl -C 1 ls")
 371     from torch._C import *  # noqa: F403
 372     print("After import torch._C")
 373     os.system("numactl -C 1 ls")

How to reproduce:

LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so KMP_AFFINITY=granularity=fine,compact,1,0 python -c "import torch"

Output:

(pytorch_3.10) [root@d2a4b224fd20 workspace]# LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so KMP_AFFINITY=granularity=fine,compact,1,0 python -c "import torch"
Before import torch._C
DeepSpeed  log  oneCCL  test.py  torch-ccl  vision  whls  whls.zip
 

After import torch._C
libnuma: Warning: cpu argument 1 is out of range
<1> is invalid
usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]
               [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]
               [--membind= | -m <nodes>] [--localalloc | -l] command args ...
       numactl [--show | -s]
       numactl [--hardware | -H]
       numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
               [--strict | -t]
               [--shmid | -I <id>] --shm | -S <shmkeyfile>
               [--shmid | -I <id>] --file | -f <tmpfsfile>
               [--huge | -u] [--touch | -T]
               memory policy | --dump | -d | --dump-nodes | -D
memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<nodes> is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
  netdev:DEV the node connected to network device DEV
  file:PATH  the node the block device of path is connected to
  ip:HOST    the node of the network device host routes through
  block:PATH the node of block device path
  pci:[seg:]bus:dev[:func] The node of a PCI device
<cpus> is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
<length> can have g (GB), m (MB) or k (KB) suffixes

Versions

Python 3.10
Intel-openmp: 2024

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @frank-wei

@malfet malfet added needs reproduction Someone else needs to try reproducing the issue given the instructions. No action needed from user module: cpu CPU specific problem (e.g., perf, algorithm) module: openmp Related to OpenMP (omp) support in PyTorch module: intel Specific to x86 architecture triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Sep 19, 2024
@malfet
Contributor
malfet commented Sep 19, 2024

Not sure I understand what is the problem/ask here...

@WeizhuoZhang-intel
Contributor Author

The problem is that when we use the torch launcher for CPU tests, it relies on numactl for core binding. We found that on Python 3.10 the core binding via numactl does not work. Digging deeper, we found that os.system("numactl -C 1 ls") fails once it runs after import torch._C: numactl is unable to bind the CPU core successfully.

This case occurs when all of the following conditions are satisfied:

  1. LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so
  2. KMP_AFFINITY=granularity=fine,compact,1,0
  3. numactl is used in the torch launcher (see the affinity-check sketch after this list).
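To make the symptom concrete, here is a minimal affinity check. This is a sketch, not part of the original report, and it assumes a Linux host where os.sched_getaffinity is available: it compares the parent process's CPU mask before and after import torch under the same preload. If the mask shrinks to a single core after the import, any numactl -C <non-zero core> launched from that process will fail exactly as shown above.

export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so
export KMP_AFFINITY=granularity=fine,compact,1,0

# Affinity mask of a plain Python process (expected: all cores visible).
python -c 'import os; print("plain python:", sorted(os.sched_getaffinity(0)))'

# Affinity mask after importing torch; if the issue reproduces, only one core
# remains, which is why the numactl child cannot bind to core 1.
python -c 'import torch, os; print("after import torch:", sorted(os.sched_getaffinity(0)))'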

@LifengWang
Contributor

Not sure I understand what is the problem/ask here...

Hi @malfet. Something seems wrong with from torch._C import * when specifying the Intel OpenMP shared library.

How to reproduce:

conda create -n pt_310 python=3.10
conda activate pt_310
pip install  torch
pip install intel-openmp numpy

Without specifying the intel openmp shared library:
[screenshot]

With specifying the intel openmp shared library:
LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so
[screenshot]

I don't know why specifying the Intel OpenMP shared library is causing a segmentation fault.

@LifengWang
Contributor
LifengWang commented Dec 4, 2024

Here is the latest finding: after setting the KMP_AFFINITY parameter, it seems that numactl can recognize only one CPU core.

Prepare the env:

conda create -y -n pt_310 python=3.10
conda activate pt_310
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip3 install intel-openmp numpy

The reproduce shell script:

export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so

# Do not set KMP_AFFINITY
export KMP_AFFINITY=
echo "=======================physcpubind 1 + KMP_AFFINITY=N/A======================================"
python -c 'import os; from torch._C import *; os.system("numactl -C 1 -s")'

# Set the KMP_AFFINITY
export KMP_AFFINITY=granularity=fine,compact,1,0
echo "=======================physcpubind 1 + KMP_AFFINITY=granularity=fine,compact,1,0==========================="
python -c 'import os; from torch._C import *; os.system("numactl -C 1 -s")'

echo "=======================physcpubind 0 + KMP_AFFINITY=granularity=fine,compact,1,0==========================="
python -c 'import os; from torch._C import *; os.system("numactl -C 0 -s")'

The test results are as follows: after setting the KMP_AFFINITY parameter, numactl can bind only to core 0. It seems that numactl can recognize only one CPU core after setting KMP_AFFINITY=granularity=fine,compact,1,0.

=======================physcpubind 1 + KMP_AFFINITY=N/A======================================
policy: default
preferred node: current
physcpubind: 1
cpubind: 0
nodebind: 0
membind: 0 1
=======================physcpubind 1 + KMP_AFFINITY=granularity=fine,compact,1,0===========================
libnuma: Warning: cpu argument 1 is out of range

<1> is invalid
usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]
               [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]
               [--membind= | -m <nodes>] [--localalloc | -l] command args ...
       numactl [--show | -s]
       numactl [--hardware | -H]
       numactl [--length | -l <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]
               [--strict | -t]
               [--shmid | -I <id>] --shm | -S <shmkeyfile>
               [--shmid | -I <id>] --file | -f <tmpfsfile>
               [--huge | -u] [--touch | -T]
               memory policy | --dump | -d | --dump-nodes | -D

memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l
<nodes> is a comma delimited list of node numbers or A-B ranges or all.
Instead of a number a node can also be:
  netdev:DEV the node connected to network device DEV
  file:PATH  the node the block device of path is connected to
  ip:HOST    the node of the network device host routes through
  block:PATH the node of block device path
  pci:[seg:]bus:dev[:func] The node of a PCI device
<cpus> is a comma delimited list of cpu numbers or A-B ranges or all
all ranges can be inverted with !
all numbers and ranges can be made cpuset-relative with +
the old --cpubind argument is deprecated.
use --cpunodebind or --physcpubind instead
<length> can have g (GB), m (MB) or k (KB) suffixes
=======================physcpubind 0 + KMP_AFFINITY=granularity=fine,compact,1,0===========================
policy: default
preferred node: current
physcpubind: 0
cpubind: 0
nodebind: 0
membind: 0 1

@yuchengliu1
Contributor
yuchengliu1 commented Feb 18, 2025

Update on the latest findings:
This issue might be related to GOMP. I compiled PyTorch from source on a machine without GOMP, and the bug did not occur; the PyTorch libraries were linked against llvm-openmp. However, when I installed PyTorch with pip on the same machine, the issue reappeared, because the pip-installed PyTorch libraries link against a GOMP library that is downloaded together with PyTorch.

To record some debugging progress along the way: I tried using the build option INTEL_OMP_DIR to point to llvm-openmp when building on a machine that has both GOMP and llvm-openmp, but the resulting PyTorch was still partially linked against GOMP and still triggered this bug.
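For reference, one way to check which OpenMP runtimes a given PyTorch build links against is to run ldd on libtorch_cpu.so. This is a sketch; the exact library path depends on the install (shown here for a typical pip/conda environment on Linux).

# Locate the torch library directory of the active environment.
TORCH_LIB=$(python -c 'import torch, os; print(os.path.join(os.path.dirname(torch.__file__), "lib"))')
# List the OpenMP runtimes (libgomp, libiomp5, libomp) pulled in by the CPU backend.
ldd "$TORCH_LIB"/libtorch_cpu.so | grep -i omp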

@mwlon
Contributor
mwlon commented Feb 19, 2025

I think it is expected that linking/dlopening multiple lib*omp implementations will cause problems. Last I checked, pytorch's MKL DNN component links to gomp regardless of whether you have it on the machine, so I suspect you'd also need to disable MKL DNN in the custom build.

@vpirogov

Last I checked, pytorch's MKL DNN component links to gomp regardless of whether you have it on the machine

The flavor of OpenMP MKL DNN uses is defined by the compiler. GCC builds will always link to libgomp, CLANG builds will always link to libomp.

@chunyuan-w
Collaborator

(quoting @LifengWang above) I don't know why specifying the Intel OpenMP shared library is causing a segmentation fault.

This seg fault issue can be fixed by the next Intel OpenMP release (ETA end of March).
As for the original issue in the description, @yuchengliu1 will continue to look into it.

@yuchengliu1
Contributor

This issue is caused by intel-openmp and is not related to PyTorch. It can be reproduced in 3 steps:

  1. Use intel-openmp and set KMP_AFFINITY=granularity=fine,compact,1,0.
  2. Start a process that calls an OMP function (e.g. omp_get_max_threads).
  3. Create a subprocess and bind it to any non-zero core with numactl.

The MKL used by PyTorch calls some OMP functions during initialization. Intel OpenMP has already bound the process to a core in step 2, so numactl can find only one core in step 3.
One way to mitigate this is to add the reset modifier to the KMP_AFFINITY value (a sketch follows below):
KMP_AFFINITY=reset,granularity=fine,compact,1,0
It still needs to be checked whether binding the core in step 2 is a bug.
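A sketch of that mitigation applied to the reproducer from the earlier comments (same environment; the reset modifier is the only change, and whether it fully restores the mask is exactly what remains to be verified):

export LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so
# The reset modifier asks the Intel OpenMP runtime to restore the original
# affinity mask, so the numactl child should again be able to bind core 1.
export KMP_AFFINITY=reset,granularity=fine,compact,1,0
python -c 'import os; from torch._C import *; os.system("numactl -C 1 -s")'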

@chunyuan-w
Collaborator

(quoting @LifengWang above) I don't know why specifying the Intel OpenMP shared library is causing a segmentation fault.

This seg fault issue should have been fixed by the latest intel-openmp release, 2025.1.0, which is already available. @LifengWang, you could try whether intel-openmp==2025.1.0 fixes this seg fault error.
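A sketch of the suggested verification, using the version named above and the reproducer from the issue description:

pip install intel-openmp==2025.1.0
# Re-run the original reproducer with the updated runtime.
LD_PRELOAD=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}/lib/libiomp5.so \
KMP_AFFINITY=granularity=fine,compact,1,0 python -c "import torch"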

@LifengWang
Contributor

Hi @chunyuan-w. I have verified that using intel-openmp==2025.1.0 with the PyTorch 0427 nightly wheels resolves the segmentation fault issue.

@yuchengliu1
Contributor

(quoting the earlier analysis above) This issue is caused by intel-openmp and is not related to PyTorch; the mitigation is to add the reset modifier, i.e. KMP_AFFINITY=reset,granularity=fine,compact,1,0.

This bug will be fixed by the Intel OpenMP shipped with oneAPI 2025.2 (ETA end of June).
