nn.init.orthogonal_ doesn't work with multiprocessing · Issue #21956 · pytorch/pytorch · GitHub

Open
yiwan-rl opened this issue Jun 19, 2019 · 18 comments
Labels
module: dependency bug (Problem is not caused by us, but caused by an upstream library we use)
module: initialization (Related to weight initialization on operators)
module: multiprocessing (Related to torch.multiprocessing)
module: nn (Related to torch.nn)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

yiwan-rl commented Jun 19, 2019

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

import torch.nn as nn
import torch.multiprocessing as mp
import time
def test():
    print("test in")
    layer = nn.Linear(5 * 5 * 64, 64)
    nn.init.orthogonal_(layer.weight.data)
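    # in the forked child, execution never gets past this call ("test out" is not printed)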
    print("test out")
processes = []
test()
for rank in range(0, 1):
    p = mp.Process(target=test)
    p.start()
    processes.append(p)
    time.sleep(0.1)
for p in processes:
    time.sleep(0.1)
    p.join()
print('t')

output:
test in
test out
test in
t

cc @albanD @mruberry

vishwakftw (Contributor)

Could you please copy-paste the output from the environment collection script found here?

yiwan-rl (Author)

Collecting environment information...
PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 10.14
GCC version: Could not collect
CMake version: version 3.3.2

Python version: 3.5
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.14.5
[pip3] torch==0.4.1
[pip3] torchvision==0.2.1
[conda] torch 0.4.1
[conda] torchvision 0.2.1

vishwakftw (Contributor)

Seems to be related: numpy/numpy#654

yiwan-rl (Author)

Does PyTorch internally use OpenBLAS?

vishwakftw (Contributor)

PyTorch internally uses a flavor of BLAS: MKL, OpenBLAS, and so on. The orthogonal_ initialization internally uses the QR decomposition (which in turn is implemented with geqrf and orgqr, which are LAPACK routines), which is why this issue has popped up.

If you have NumPy installed, could you copy paste the output of numpy.__config__.show()?

I ran the script that you provided and it seems to work fine for me. I use a Linux machine running Ubuntu 18.10 with CUDA 9.2.
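
For reference, a rough sketch of the idea behind the orthogonal initialization (not the actual library code; it assumes a recent PyTorch where torch.linalg.qr is available, while the 0.4.1 build above would use torch.qr): sample a Gaussian matrix and keep the Q factor of its QR decomposition, which is exactly where geqrf/orgqr and the threaded LAPACK backend get involved.

import torch

def orthogonal_sketch(rows, cols):
    # Rough re-implementation for illustration only (hypothetical helper,
    # not the code in torch/nn/init.py).
    flat = torch.randn(rows, cols)
    if rows < cols:
        flat = flat.t()                       # QR wants a tall (or square) matrix
    q, r = torch.linalg.qr(flat)              # backed by LAPACK geqrf/orgqr
    q = q * torch.sign(torch.diagonal(r))     # fix signs to make Q unique
    if rows < cols:
        q = q.t()
    return q.contiguous()

w = orthogonal_sketch(64, 5 * 5 * 64)         # same shape as the Linear weight above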

vishwakftw added the triage review, needs reproduction (Someone else needs to try reproducing the issue given the instructions; no action needed from user), module: nn (Related to torch.nn), and module: operators labels on Jun 19, 2019
yiwan-rl (Author) commented Jun 19, 2019

Thanks for your response. I am using a MacBook and conda 4.4.3.

Here is the output:

lapack_mkl_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3']
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
blis_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE

If I run this code outside of conda environment, I get following:

lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE

but the problem still exists when I run the problematic code outside of conda.

vishwakftw (Contributor)

Outside of conda, can you try setting OPENBLAS_NUM_THREADS=1 and see if the problem still persists?
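
For example (a sketch; the variable has to be set before OpenBLAS is loaded, so either export it in the shell or set it at the very top of the script before importing torch):

import os
# Must be set before torch (and hence OpenBLAS) is imported, otherwise the
# BLAS thread pool may already have been created with the default size.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import torch
import torch.nn as nn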

yiwan-rl (Author)

still persists...

nmathew commented Jun 20, 2019

Try this

import torch.nn as nn
import torch.multiprocessing as mp
import time
def test():
    print("test in")
    layer = nn.Linear(5 * 5 * 64, 64)
    nn.init.orthogonal_(layer.weight.data)
    print("test out")
processes = []

if __name__ == '__main__':
    test()
    for rank in range(0, 1):
        p = mp.Process(target=test)
        p.start()
        processes.append(p)
        time.sleep(0.1)
    for p in processes:
        time.sleep(0.1)
        p.join()
    print('t')

yiwan-rl (Author)

Thanks, but the problem still exists.

yiwan-rl (Author)

Actually, if I reduce the size of the nn.Linear layer, the problem goes away.

izdeby added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jul 3, 2019
ezyang (Contributor) commented Jul 8, 2019

Unfortunately, we're going to need some sort of reproducer if we're going to be able to fix this.

If anyone else is having this problem, please shout!

obilaniu commented May 30, 2020

@ezyang I have seen this problem before. It's not just nn.init.orthogonal_(); even torch.zeros() can provoke this for a "sufficiently large" tensor. This lines up well with yiwan-rl's report that reducing the size of a tensor makes the problem go away; in fact, that is how I first encountered this problem.
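
A minimal sketch of that torch.zeros() variant (the sizes are made up; whether it actually hangs depends on which BLAS/OpenMP runtime the local PyTorch build links against and on the platform's default start method):

import torch
import torch.multiprocessing as mp

def child():
    print("child in")
    torch.zeros(10000, 10000)          # "sufficiently large" op in the forked child
    print("child out")                 # not reached if the OpenMP runtime is wedged

if __name__ == '__main__':
    torch.zeros(10000, 10000)          # initializes the OpenMP/BLAS runtime in the parent
    p = mp.Process(target=child)       # uses fork on Linux and older macOS Pythons
    p.start()
    p.join()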

Root Cause: libgomp

I root-caused it to OpenMP, specifically libgomp. When PyTorch or the BLAS it links against requires libgomp, and the libgomp runtime is first initialized (for example by a "sufficiently large" or "sufficiently complex" tensor operation, which QR is) and the process is then fork()ed without an immediate exec(), libgomp chokes.

Diagnosis

The characteristic signs of this condition are:

  • libgomp is linked (you can grep -e libgomp /proc/<PID>/maps on the hung process <PID>; see the sketch after this list)
  • The environment variable OMP_NUM_THREADS=1 solves the problem
  • Hang happens in forked processes
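
The first check can be scripted, e.g. (a sketch, Linux-only since it reads /proc; pass the PID of the hung worker):

import os

def uses_libgomp(pid=None):
    # Returns True if libgomp is mapped into the given process (Linux only).
    pid = os.getpid() if pid is None else pid
    try:
        with open("/proc/{}/maps".format(pid)) as maps:
            return any("libgomp" in line for line in maps)
    except FileNotFoundError:
        return False   # not Linux, or the process is gone

print(uses_libgomp())      # check the current process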

Solutions

Intel MKL's implementation of OpenMP, libiomp5, does not suffer from this problem.
OpenBLAS with the pthreads backend, libopenblas_pthreads.so, does not suffer from this problem.

Both have in common that they are fork-safe, in that they use the pthread_atfork() hack to "properly" shut down their worker threads before forking.

For this reason, it has been my practice, when building PyTorch from source, to:

  • Configure with MKL, if available, otherwise
  • Configure with OpenBLAS with the pthreads backend (libopenblas_pthreads.so, not libopenblas_openmp.so), and specify WITH_OPENMP=0.

References

[1], [2], [3].

ezyang added the module: dependency bug label (Problem is not caused by us, but caused by an upstream library we use) and removed the needs reproduction label on Jun 1, 2020
ezyang (Contributor) commented Jun 1, 2020

@obilaniu Thank you for a really detailed description of the problem. Would you consider this high priority to fix? I can think of two ways we can fix it given your root cause: (1) poison use of omp if we detect we're using a version of a library that doesn't orderly shut down on fork, similar to how we do this for the cuda runtime, (2) make sure our binary packaging doesn't use libgomp (might be easier said than done?)

ezyang (Contributor) commented Jun 1, 2020

cc @malfet

obilaniu commented Jun 1, 2020

@ezyang The problem is difficult to solve.

  • Some OpenMP implementations do work (Intel MKL's in particular)
  • Some BLAS implementations do work depending on configuration (OpenBLAS)
  • Before PyTorch is imported, another package could have pulled in libgomp.so or another unsafe library and initialized its runtime
  • After PyTorch is imported, another package could have pulled in libgomp.so or another unsafe library and initialized its runtime
  • The above two cases can happen even if PyTorch itself studiously avoids pulling in libgomp.so itself.
  • There is no control over fork() except that which pthread_atfork() gives.
  • There is no way to know whether the pthread_atfork() handlers were invoked because of a fork() that will be followed immediately by an exec() (which is what the "spawn" start method or any subprocess call does, and which is safe).

I don't know how one can handle the general case except being careful. There might be easy cases that can be caught, and it might reduce the number of bug reports about inexplicable deadlocks, but catching all cases appears to require too much information for PyTorch to infer from context.
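
For illustration, Python exposes the same limited hook mechanism as os.register_at_fork (3.7+, Unix only); a sketch of the kind of pre-fork cleanup a library can attempt, with the same caveat that the hook cannot distinguish a fork-then-exec from a plain fork:

import os

def _before_fork():
    # A library would try to quiesce its worker threads here, but it cannot
    # tell whether the child will exec() right away (safe) or keep running
    # in-process code (potentially unsafe).
    print("about to fork in pid", os.getpid())

def _after_fork_in_child():
    print("running in forked child pid", os.getpid())

os.register_at_fork(before=_before_fork, after_in_child=_after_fork_in_child)

if os.fork() == 0:
    os._exit(0)        # child exits immediately
else:
    os.wait()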

ezyang (Contributor) commented Jun 2, 2020

Yeah, I don't think we need to try for a complete solution, just catch the most common error cases.

mruberry added the module: multiprocessing and module: initialization labels and removed the module: operators (deprecated) label on Oct 10, 2020
cm9vd commented Aug 29, 2023

In my case, the solution was to set the start method with mp.set_start_method('spawn') at the beginning
of the if __name__ == '__main__' block. It also fixed the hang after I reproduced your bug.
Together it should look something like this:

import torch.nn as nn
import torch.multiprocessing as mp
import time
def test():
    print("test in")
    layer = nn.Linear(5 * 5 * 64, 64)
    nn.init.orthogonal_(layer.weight.data)
    print("test out")
processes = []

if __name__ == '__main__':
    mp.set_start_method('spawn')
    test()
    for rank in range(0, 1):
        p = mp.Process(target=test)
        p.start()
        processes.append(p)
        time.sleep(0.1)
    for p in processes:
        time.sleep(0.1)
        p.join()
    print('t')
