torch.multiprocessing.Queue Zeroes Out Tensors on Retrieval · Issue #149155 · pytorch/pytorch · GitHub

torch.multiprocessing.Queue Zeroes Out Tensors on Retrieval #149155

Open
ManuelZ opened this issue Mar 13, 2025 · 5 comments
Labels
module: cuda (Related to torch.cuda, and CUDA support in general)
module: multiprocessing (Related to torch.multiprocessing)
module: windows (Windows support for PyTorch)
triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

@ManuelZ
ManuelZ commented Mar 13, 2025

🐛 Describe the bug

When sending a CUDA tensor through a torch.multiprocessing.Queue, the received tensor contains only zeros instead of the expected values.

I reproduced it on Windows 10 with PyTorch 2.5.1 and 2.6.0.
I couldn't reproduce it in Colab with PyTorch 2.5.1.

Minimal reproducible example:

# Uncomment to test it in Colab
# %%writefile bug_report.py

import torch
import torch.multiprocessing as mp


def f1(shared_queue):
    """Send a CUDA tensor through the multiprocessing queue."""
    t = torch.tensor((1, 2), device="cuda:0")
    print("Tensor sent: ", t)
    shared_queue.put(t)


def f2(shared_queue):
    """Retrieve the tensor from the queue and print it."""
    while True:
        if shared_queue.empty():
            continue
        t = shared_queue.get()
        print(f"Tensor received: {t}")
        break


if __name__ == "__main__":

    mp.set_start_method("spawn", True)

    shared_queue = torch.multiprocessing.Queue()
    
    p1 = mp.Process(target=f1, args=(shared_queue,))
    p2 = mp.Process(target=f2, args=(shared_queue,))
    
    p1.start()
    p2.start()
    
    p1.join()
    p2.join()

# Uncomment to test it in Colab, in a new cell
# !python bug_report.py

Output:

Tensor sent:  tensor([1, 2], device='cuda:0')
Tensor received: tensor([0, 0], device='cuda:0')
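
For what it's worth, the torch.multiprocessing documentation notes that a CUDA tensor sent through a queue is backed by memory owned by the sending process, which must keep the tensor alive for as long as the receiving process uses it. Whether that caveat explains the zeros seen here is only a guess, but a minimal sketch of the reproducer with an mp.Event handshake that keeps f1 alive until f2 confirms receipt would look like this:

import torch
import torch.multiprocessing as mp


def f1(shared_queue, received):
    """Send a CUDA tensor and stay alive until the consumer confirms receipt."""
    t = torch.tensor((1, 2), device="cuda:0")
    print("Tensor sent: ", t)
    shared_queue.put(t)
    received.wait()  # keep the producer (and its CUDA allocation) alive


def f2(shared_queue, received):
    """Retrieve the tensor, print it, then signal the producer."""
    t = shared_queue.get()
    print(f"Tensor received: {t}")
    received.set()


if __name__ == "__main__":

    mp.set_start_method("spawn", True)

    shared_queue = mp.Queue()
    received = mp.Event()

    p1 = mp.Process(target=f1, args=(shared_queue, received))
    p2 = mp.Process(target=f2, args=(shared_queue, received))

    p1.start()
    p2.start()

    p1.join()
    p2.join()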

Versions

PyTorch version: 2.6.0+cu126
Is debug build: False
CUDA used to build PyTorch: 12.6
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Home (10.0.19045 64-bit)
GCC version: (Rev6, Built by MSYS2 project) 13.1.0
Clang version: Could not collect
CMake version: version 3.31.0
Libc version: N/A

Python version: 3.11.11 | packaged by conda-forge | (main, Dec  5 2024, 14:06:23) [MSC v.1942 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 12.6.77
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce GTX 1050
Nvidia driver version: 560.94
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Name: Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
Manufacturer: GenuineIntel
Family: 198
Architecture: 9
ProcessorType: 3
DeviceID: CPU0
CurrentClockSpeed: 2800
MaxClockSpeed: 2801
L2CacheSize: 1024
L2CacheSpeed: None
Revision: None

Versions of relevant libraries:
[pip3] efficientnet_pytorch==0.7.1
[pip3] numpy==1.26.4
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] onnx==1.17.0
[pip3] onnxruntime-gpu==1.21.0
[pip3] onnxslim==0.1.48
[pip3] pytorch_toolbelt==0.8.0
[pip3] segmentation_models_pytorch==0.4.0
[pip3] torch==2.6.0+cu126
[pip3] torch-lr-finder==0.2.2
[pip3] torchaudio==2.6.0+cu126
[pip3] torcheval==0.0.7
[pip3] torchinfo==1.8.0
[pip3] torchvision==0.21.0+cu126
[conda] efficientnet-pytorch      0.7.1                    pypi_0    pypi
[conda] libblas                   3.9.0           31_h641d27c_mkl    conda-forge
[conda] libcblas                  3.9.0           31_h5e41251_mkl    conda-forge
[conda] liblapack                 3.9.0           31_h1aa476e_mkl    conda-forge
[conda] mkl                       2024.2.2            h66d3029_15    conda-forge
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.8.90                  pypi_0    pypi
[conda] pytorch-toolbelt          0.8.0                    pypi_0    pypi
[conda] segmentation-models-pytorch 0.4.0                    pypi_0    pypi
[conda] torch                     2.6.0+cu126              pypi_0    pypi
[conda] torch-lr-finder           0.2.2                    pypi_0    pypi
[conda] torchaudio                2.6.0+cu126              pypi_0    pypi
[conda] torcheval                 0.0.7                    pypi_0    pypi
[conda] torchinfo                 1.8.0                    pypi_0    pypi
[conda] torchvision               0.21.0+cu126             pypi_0    pypi

cc @peterjc123 @mszhanyi @skyline75489 @nbcsm @iremyux @Blackhex @VitalyFedyunin @albanD @ptrblck @msaroufim @eqy

@zou3519 added the module: multiprocessing and triaged labels on Mar 17, 2025
@albanD added the module: cuda and module: windows labels on Apr 9, 2025
@albanD
Collaborator
albanD commented Apr 9, 2025

Cannot repro on Linux either.
@ptrblck is that something people on your end could take a look at?

@neurochen

I encountered this issue as well. Usually the data is zeroed out for the first several queue.put calls. It happens under both Windows 11 and Ubuntu 24.04.

@albanD
Collaborator
albanD commented May 12, 2025

Can you give more details on the Ubuntu setup where you were able to reproduce this (running collect_env if possible)? I was not able to reproduce it on my end.

@neurochen
neurochen commented May 12, 2025

> Can you give more details on the Ubuntu setup where you were able to reproduce this (running collect_env if possible)? I was not able to reproduce it on my end.

Thanks for checking on this. I tested Ubuntu 24.04 under WSL2 with the above script shared by @ManuelZ, with the CUDA toolkit 12.8 for WSL2 installed. It's a clean setup: Python 3.11.7, torch 2.0 or later versions. The common observation across both Windows and WSL2, and across all the PyTorch versions I tried, is that the queue works fine for CPU tensors but zeroes out CUDA tensors.
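
A hedged workaround sketch based on that observation (an illustration only, not an official fix): send a CPU copy across the queue and move it back to the GPU on the receiving side. The helper names below are made up for this example.

import torch


def put_via_cpu(shared_queue, t):
    """Put a CPU copy of the tensor on the queue instead of the CUDA tensor."""
    shared_queue.put(t.detach().cpu())


def get_to_cuda(shared_queue, device="cuda:0"):
    """Get the CPU tensor from the queue and move it back to the GPU."""
    return shared_queue.get().to(device)

This trades the zero-copy CUDA IPC path for a device-to-host and host-to-device copy per tensor, but it matches the observation above that CPU tensors survive the queue.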

Actually, this zeroing out happens for the first several attempts to send CUDA tensors. One way to work around the bug is to resend the CUDA tensor if it arrives zeroed, as sketched below. Incorporating this idea into dataloader.py and worker.py, however, sometimes trips another error here:

assert not self._shutdown and self._tasks_outstanding > 0

This makes things more complicated, so ultimately I gave up patching it myself and will wait for an official fix.
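
For illustration, a hedged sketch of that resend idea at the plain queue level; the two-queue acknowledgment protocol and the non-zero check are assumptions made for this example, not the dataloader.py / worker.py patch described above.

import torch
import torch.multiprocessing as mp


def producer(data_queue, ack_queue):
    """Resend the CUDA tensor until the consumer reports a non-zeroed copy."""
    t = torch.tensor((1, 2), device="cuda:0")
    while True:
        data_queue.put(t)
        if ack_queue.get() == "ok":
            break  # consumer got real data; otherwise loop and resend


def consumer(data_queue, ack_queue):
    """Request a resend while the received tensor arrives as all zeros."""
    while True:
        t = data_queue.get()
        if torch.count_nonzero(t) > 0:  # crude check; assumes real data is non-zero
            print("Tensor received:", t)
            ack_queue.put("ok")
            break
        ack_queue.put("resend")


if __name__ == "__main__":

    mp.set_start_method("spawn", True)

    data_queue = mp.Queue()
    ack_queue = mp.Queue()

    p1 = mp.Process(target=producer, args=(data_queue, ack_queue))
    p2 = mp.Process(target=consumer, args=(data_queue, ack_queue))

    p1.start()
    p2.start()

    p1.join()
    p2.join()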

@neurochen

> Can you give more details on the Ubuntu setup where you were able to reproduce this (running collect_env if possible)? I was not able to reproduce it on my end.

Just found a native Ubuntu machine to test this; it works fine there. Thanks for working on this Windows-specific issue.
