nn.init.orthogonal_ doesn't work with multiprocessing #21956
Could you please copy-paste the output from the environment collection script found here?
Collecting environment information...
OS: Mac OSX 10.14
Python version: 3.5
Versions of relevant libraries:
Seems to be related: numpy/numpy#654
Does PyTorch internally use OpenBLAS?
PyTorch internally uses a flavor of BLAS: MKL, OpenBLAS, and so on. If you have NumPy installed, could you copy-paste its configuration output as well? I ran the script that you provided and it seems to work fine for me. I use a Linux machine with Ubuntu 18.10 and CUDA 9.2.
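For anyone trying to answer that question for their own install: reasonably recent PyTorch and NumPy versions expose their build configuration, including the BLAS backend, through the calls below.
import torch
import numpy as np

# Prints PyTorch's build configuration, including which BLAS it links
# against (MKL, OpenBLAS, Accelerate, ...) and whether OpenMP is enabled.
print(torch.__config__.show())

# Prints the BLAS/LAPACK libraries NumPy was built with.
np.show_config()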
Thanks for your response. I am using a MacBook and conda 4.4.3. Here is the output:
If I run this code outside of the conda environment, I get the following:
but the problem still exists when I run the problematic code outside of conda.
Outside of conda, can you try setting OPENBLAS_NUM_THREADS=1 and see if the problem still persists?
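For reference, a minimal sketch of applying that suggestion from inside the script rather than the shell (the extra OMP_NUM_THREADS line is an assumption on my part, not part of the suggestion above). The variables only take effect if they are set before the BLAS/OpenMP libraries are first loaded.
import os

# Must run before the first `import numpy` / `import torch`, otherwise the
# BLAS/OpenMP thread pools are already configured and the limits are ignored.
os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
os.environ.setdefault("OMP_NUM_THREADS", "1")

import torch  # imported only after the environment is configured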
still persists...
Try this:
import torch.nn as nn
import torch.multiprocessing as mp
import time

def test():
    print("test in")
    layer = nn.Linear(5 * 5 * 64, 64)
    nn.init.orthogonal_(layer.weight.data)
    print("test out")

processes = []

if __name__ == '__main__':
    test()
    for rank in range(0, 1):
        p = mp.Process(target=test)
        p.start()
        processes.append(p)
        time.sleep(0.1)
    for p in processes:
        time.sleep(0.1)
        p.join()
    print('t')
Thanks, but the problem still exists.
Actually, if I reduce the size of the arguments to nn.Linear, the problem goes away.
Unfortunately, we're going to need some sort of reproducer if we're going to be able to fix this. If anyone else is having this problem, please shout!
@ezyang I have seen this problem before. It's not just PyTorch that is affected (see the NumPy issue linked above).
Root Cause: the OpenMP runtime shipped with the binary packages (GNU libgomp) is not fork-safe. After fork(), the child process inherits the parent's OpenMP state but none of its worker threads, so the first parallel region the child enters (here, the BLAS work inside nn.init.orthogonal_) deadlocks. A distilled sketch of the pattern is below.
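A distilled sketch of the failure mode described above (sizes and names are illustrative, not taken from the original report). Whether it actually hangs depends on the BLAS/OpenMP runtime the wheel was built against and on the multiprocessing start method in use.
import torch
import torch.multiprocessing as mp

def child():
    # First parallel region in the forked child: with a fork-unsafe OpenMP
    # runtime (e.g. libgomp) this can deadlock, because the OpenMP worker
    # threads are not recreated in the child after fork().
    w = torch.empty(64, 5 * 5 * 64)
    torch.nn.init.orthogonal_(w)
    print("child: done")

if __name__ == "__main__":
    # Warm up BLAS/OpenMP in the parent first; this is what leaves the
    # forked child with stale OpenMP state.
    torch.nn.init.orthogonal_(torch.empty(64, 5 * 5 * 64))
    p = mp.Process(target=child)  # assumes the 'fork' start method (the Linux default)
    p.start()
    p.join()  # may never return if the child deadlocks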
@obilaniu Thank you for a really detailed description of the problem. Would you consider this high priority to fix? I can think of two ways we can fix it given your root cause: (1) poison use of OMP if we detect we're using a version of a library that doesn't shut down in an orderly way on fork, similar to how we do this for the CUDA runtime; (2) make sure our binary packaging doesn't use libgomp (might be easier said than done?).
cc @malfet |
@ezyang The problem is difficult to solve.
I don't know how one can handle the general case except being careful. There might be easy cases that can be caught, and it might reduce the number of bug reports about inexplicable deadlocks, but catching all cases appears to require too much information for PyTorch to infer from context.
Yeah, I don't think we need to try for a complete solution, just catch the most common error cases.
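A rough sketch of what catching the common case could look like at the Python level, using os.register_at_fork (available on Unix in Python 3.7 and later). The _openmp_was_used flag and the point at which it would be set are hypothetical, not existing PyTorch internals; the idea mirrors the existing poisoning of the CUDA runtime in forked subprocesses mentioned above.
import os
import warnings

_openmp_was_used = False  # hypothetical flag; would be set after the first OpenMP-backed op

def _mark_openmp_used():
    global _openmp_was_used
    _openmp_was_used = True

def _warn_if_forked_after_openmp():
    # Runs in the child immediately after fork(): if the parent already spun
    # up an OpenMP thread pool, the child's first parallel region may hang.
    if _openmp_was_used:
        warnings.warn(
            "This process was forked after OpenMP was initialized; "
            "parallel operations may deadlock. Consider the 'spawn' start method."
        )

os.register_at_fork(after_in_child=_warn_if_forked_after_openmp)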
In my case, the solution was to set the start method to 'spawn' with mp.set_start_method('spawn'):
import torch.nn as nn
import torch.multiprocessing as mp
import time

def test():
    print("test in")
    layer = nn.Linear(5 * 5 * 64, 64)
    nn.init.orthogonal_(layer.weight.data)
    print("test out")

processes = []

if __name__ == '__main__':
    mp.set_start_method('spawn')  # use 'spawn' instead of the default 'fork'
    test()
    for rank in range(0, 1):
        p = mp.Process(target=test)
        p.start()
        processes.append(p)
        time.sleep(0.1)
    for p in processes:
        time.sleep(0.1)
        p.join()
    print('t')
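A small variation, in case mp.set_start_method cannot be used (it raises a RuntimeError if the start method has already been set elsewhere in the program): use a local 'spawn' context instead. get_context is standard multiprocessing behavior and should also be available through torch.multiprocessing, which wraps the standard library module.
import torch.multiprocessing as mp

if __name__ == '__main__':
    ctx = mp.get_context('spawn')   # local context; leaves the global start method untouched
    p = ctx.Process(target=test)    # `test` as defined in the snippet above
    p.start()
    p.join()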
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
output (the forked child prints "test in" but never "test out", i.e. it hangs inside nn.init.orthogonal_):
test in
test out
test in
t
cc @albanD @mruberry