nn.init.orthogonal_ doesn't work with multiprocessing · Issue #21956 · pytorch/pytorch · GitHub

Open
yiwan-rl opened this issue Jun 19, 2019 · 18 comments
Labels
module: dependency bug (Problem is not caused by us, but caused by an upstream library we use)
module: initialization (Related to weight initialization on operators)
module: multiprocessing (Related to torch.multiprocessing)
module: nn (Related to torch.nn)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

yiwan-rl commented Jun 19, 2019

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

import torch.nn as nn
import torch.multiprocessing as mp
import time
def test():
    print("test in")
    layer = nn.Linear(5 * 5 * 64, 64)
    nn.init.orthogonal_(layer.weight.data)
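    # in the forked child, execution never gets past this call ("test out" is not printed)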
    print("test out")
processes = []
test()
for rank in range(0, 1):
    p = mp.Process(target=test)
    p.start()
    processes.append(p)
    time.sleep(0.1)
for p in processes:
    time.sleep(0.1)
    p.join()
print('t')

output:
test in
test out
test in
t

cc @albanD @mruberry

vishwakftw (Contributor)

Could you please copy-paste the output from the environment collection script found here?

yiwan-rl (Author)

Collecting environment information...
PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: None

OS: Mac OSX 10.14
GCC version: Could not collect
CMake version: version 3.3.2

Python version: 3.5
Is CUDA available: No
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA

Versions of relevant libraries:
[pip3] numpy==1.14.5
[pip3] torch==0.4.1
[pip3] torchvision==0.2.1
[conda] torch 0.4.1
[conda] torchvision 0.2.1

vishwakftw (Contributor)

Seems to be related: numpy/numpy#654

yiwan-rl (Author)

Does PyTorch internally use OpenBLAS?

vishwakftw (Contributor)

PyTorch internally uses a flavor of BLAS: MKL, OpenBLAS, and so on. The orthogonal_ initialization internally uses the QR decomposition (which in turn is implemented with geqrf and orgqr, which are LAPACK routines), which is why this issue has popped up.

If you have NumPy installed, could you copy paste the output of numpy.__config__.show()?

I ran the script that you provided and it seems to work fine for me. I use a Linux machine running Ubuntu 18.10 with CUDA 9.2.
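
For reference, a rough sketch of the idea behind the orthogonal initialization (not the actual library code; it assumes a recent PyTorch where torch.linalg.qr is available, while the 0.4.1 build above would use torch.qr): sample a Gaussian matrix and keep the Q factor of its QR decomposition, which is exactly where geqrf/orgqr and the threaded LAPACK backend get involved.

import torch

def orthogonal_sketch(rows, cols):
    # Rough re-implementation for illustration only (hypothetical helper,
    # not the code in torch/nn/init.py).
    flat = torch.randn(rows, cols)
    if rows < cols:
        flat = flat.t()                       # QR wants a tall (or square) matrix
    q, r = torch.linalg.qr(flat)              # backed by LAPACK geqrf/orgqr
    q = q * torch.sign(torch.diagonal(r))     # fix signs to make Q unique
    if rows < cols:
        q = q.t()
    return q.contiguous()

w = orthogonal_sketch(64, 5 * 5 * 64)         # same shape as the Linear weight above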

vishwakftw added the triage review, needs reproduction (Someone else needs to try reproducing the issue given the instructions; no action needed from user), module: nn (Related to torch.nn), and module: operators labels on Jun 19, 2019
yiwan-rl (Author) commented Jun 19, 2019

Thanks for your response. I am using a MacBook and conda 4.4.3.

Here is the output:

lapack_mkl_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
lapack_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3']
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
blas_opt_info:
    extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
    define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
    extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
blis_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE

If I run this code outside of conda environment, I get following:

lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
blis_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
    language = c
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE

but the problem still exists when I run the problematic code outside of conda.

vishwakftw (Contributor)

Outside of conda, can you try setting OPENBLAS_NUM_THREADS=1 and see if the problem still persists?
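
For example (a sketch; the variable has to be set before OpenBLAS is loaded, so either export it in the shell or set it at the very top of the script before importing torch):

import os
# Must be set before torch (and hence OpenBLAS) is imported, otherwise the
# BLAS thread pool may already have been created with the default size.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import torch
import torch.nn as nn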

yiwan-rl (Author)

still persists...

nmathew commented Jun 20, 2019

Try this

import torch.nn as nn
import torch.multiprocessing as mp
import time
def test():
    print("test in")
    layer = nn.Linear(5 * 5 * 64, 64)
    nn.init.orthogonal_(layer.weight.data)
    print("test out")
processes = []

if __name__ == '__main__':
    test()
    for rank in range(0, 1):
        p = mp.Process(target=test)
        p.start()
        processes.append(p)
        time.sleep(0.1)
    for p in processes:
        time.sleep(0.1)
        p.join()
    print('t')

yiwan-rl (Author)

Thanks, but the problem still exists.

yiwan-rl (Author)

Actually, if I reduce the size of the nn.Linear layer, the problem goes away.

izdeby added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) on Jul 3, 2019
ezyang (Contributor) commented Jul 8, 2019

Unfortunately, we're going to need some sort of reproducer if we're going to be able to fix this.

If anyone else is having this problem, please shout!

obilaniu commented May 30, 2020

@ezyang I have seen this problem before. It's not just nn.init.orthogonal_(); even torch.zeros() can provoke this for a "sufficiently large" tensor. This lines up well with yiwan-rl's report that reducing the size of a tensor makes the problem go away; in fact, that is how I first encountered this problem.
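
A minimal sketch of that torch.zeros() variant (the sizes are made up; whether it actually hangs depends on which BLAS/OpenMP runtime the local PyTorch build links against and on the platform's default start method):

import torch
import torch.multiprocessing as mp

def child():
    print("child in")
    torch.zeros(10000, 10000)          # "sufficiently large" op in the forked child
    print("child out")                 # not reached if the OpenMP runtime is wedged

if __name__ == '__main__':
    torch.zeros(10000, 10000)          # initializes the OpenMP/BLAS runtime in the parent
    p = mp.Process(target=child)       # uses fork on Linux and older macOS Pythons
    p.start()
    p.join()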

Root Cause: libgomp

I root-caused it to OpenMP, specifically libgomp. When PyTorch or the BLAS it links against requires libgomp, and the libgomp runtime is first initialized (for example by a "sufficiently large" or "sufficiently complex" tensor operation, which QR is) and the process is then fork()ed without an immediate exec(), libgomp chokes.

Diagnosis

The characteristic signs of this condition are:

  • libgomp is linked (you can grep -e libgomp /proc/<PID>/maps on the hung process <PID>; see the sketch after this list)
  • The environment variable OMP_NUM_THREADS=1 solves the problem
  • Hang happens in forked processes
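
The first check can be scripted, e.g. (a sketch, Linux-only since it reads /proc; pass the PID of the hung worker):

import os

def uses_libgomp(pid=None):
    # Returns True if libgomp is mapped into the given process (Linux only).
    pid = os.getpid() if pid is None else pid
    try:
        with open("/proc/{}/maps".format(pid)) as maps:
            return any("libgomp" in line for line in maps)
    except FileNotFoundError:
        return False   # not Linux, or the process is gone

print(uses_libgomp())      # check the current process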

Solutions

Intel MKL's implementation of OpenMP, libiomp5, does not suffer from this problem.
OpenBLAS with the pthreads backend, libopenblas_pthreads.so, does not suffer from this problem.

Both have in common that they are fork-safe, in that they use the pthread_atfork() hack to "properly" shut down their worker threads before forking.

For this reason, it has been my practice, when building PyTorch from source, to:

  • Configure with MKL, if available, otherwise
  • Configure with OpenBLAS with the pthreads backend (libopenblas_pthreads.so, not libopenblas_openmp.so), and specify WITH_OPENMP=0.

References

[1], [2], [3].

ezyang added the module: dependency bug label (Problem is not caused by us, but caused by an upstream library we use) and removed the needs reproduction label on Jun 1, 2020
ezyang (Contributor) commented Jun 1, 2020

@obilaniu Thank you for a really detailed description of the problem. Would you consider this high priority to fix? I can think of two ways we can fix it given your root cause: (1) poison use of omp if we detect we're using a version of a library that doesn't orderly shut down on fork, similar to how we do this for the cuda runtime, (2) make sure our binary packaging doesn't use libgomp (might be easier said than done?)

ezyang (Contributor) commented Jun 1, 2020

cc @malfet

obilaniu commented Jun 1, 2020

@ezyang The problem is difficult to solve.

  • Some OpenMP implementations do work (Intel MKL's in particular)
  • Some BLAS implementations do work depending on configuration (OpenBLAS)
  • Before PyTorch is imported, another package could have pulled in libgomp.so or another unsafe library and initialized its runtime
  • After PyTorch is imported, another package could have pulled in libgomp.so or another unsafe library and initialized its runtime
  • The above two cases can happen even if PyTorch itself studiously avoids pulling in libgomp.so itself.
  • There is no control over fork() except that which pthread_atfork() gives.
  • There is no way to know whether the pthread_atfork() handlers were invoked because of a fork() that will be followed immediately by an exec() (which is what the "spawn" start method or any subprocess call does, and which is safe).

I don't know how one can handle the general case except being careful. There might be easy cases that can be caught, and it might reduce the number of bug reports about inexplicable deadlocks, but catching all cases appears to require too much information for PyTorch to infer from context.
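
For illustration, Python exposes the same limited hook mechanism as os.register_at_fork (3.7+, Unix only); a sketch of the kind of pre-fork cleanup a library can attempt, with the same caveat that the hook cannot distinguish a fork-then-exec from a plain fork:

import os

def _before_fork():
    # A library would try to quiesce its worker threads here, but it cannot
    # tell whether the child will exec() right away (safe) or keep running
    # in-process code (potentially unsafe).
    print("about to fork in pid", os.getpid())

def _after_fork_in_child():
    print("running in forked child pid", os.getpid())

os.register_at_fork(before=_before_fork, after_in_child=_after_fork_in_child)

if os.fork() == 0:
    os._exit(0)        # child exits immediately
else:
    os.wait()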

ezyang (Contributor) commented Jun 2, 2020

Yeah, I don't think we need to try for a complete solution, just catch the most common error cases.

mruberry added the module: multiprocessing and module: initialization labels and removed the module: operators (deprecated) label on Oct 10, 2020
cm9vd commented Aug 29, 2023

In my case, the solution was to set the start method with mp.set_start_method('spawn') at the beginning
of the if __name__ == '__main__' block. It also fixed the hang after I reproduced your bug.
Together it should look something like this:

import torch.nn as nn
import torch.multiprocessing as mp
import time
def test():
    print("test in")
    layer = nn.Linear(5 * 5 * 64, 64)
    nn.init.orthogonal_(layer.weight.data)
    print("test out")
processes = []

if __name__ == '__main__':
    mp.set_start_method('spawn')
    test()
    for rank in range(0, 1):
        p = mp.Process(target=test)
        p.start()
        processes.append(p)
        time.sleep(0.1)
    for p in processes:
        time.sleep(0.1)
        p.join()
    print('t')
