Pytorch nightly and openAI/triton cuda · Issue #106144 · pytorch/pytorch · GitHub

Pytorch nightly and openAI/triton cuda #106144


Closed

bhack opened this issue Jul 27, 2023 · 21 comments
Labels
module: binaries - Anything related to official binaries that we release to users
module: cuda - Related to torch.cuda, and CUDA support in general
oncall: pt2
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@bhack (Contributor) commented Jul 27, 2023

🐛 Describe the bug

If I am not wrong, the nightly PyTorch (CUDA 11.8) wheels are not compatible with the pinned Triton commit, as I am seeing something like triton-lang/triton#1955

See more:
https://discuss.pytorch.org/t/any-change-of-using-cuda-12-2/184461/6

If that is true, why is the CI not failing with tests on the PyTorch nightly CUDA 11.8 build?

Versions

nightly

cc @seemethere @malfet @osalpekar @atalman @ptrblck @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @gchanan

@malfet (Contributor) commented Jul 28, 2023

Can you please run python3 -mtorch.utils.collect_env and post your results here?

@malfet added the needs reproduction, triaged, and oncall: pt2 labels on Jul 28, 2023
@bhack (Contributor, Author) commented Jul 29, 2023

I need to check if I can get exactly that env again.
What I had in the log is:

torch._dynamo.convert_frame: [WARNING] mod, func, n_regs, n_spills = fn_load_binary(self.metadata["name"], self.asm[bin_path], self.shared, device)
torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
torch._dynamo.convert_frame: [WARNING] RuntimeError: Triton Error [CUDA]: device kernel image is invalid

The env was mainly:
docker pull pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
followed by the official command:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

@stebix commented Jul 29, 2023

Hi all,

I am seeing similar problems resulting in Triton Error [CUDA]: device kernel image is invalid. Attempting to compile a very simple model or function produces a lot of log warnings via a deep call stack; finally, PyTorch seems to fall back to pre-2.x eager mode without speedups.

My environment was built as a fresh conda environment with the pytorch-nightly installation commands as provided by the main pytorch docs, since I wanted to use torch.compile and python 3.11.

I get the following reason for the compilation failure:

[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING]   File "/home/user/anaconda3/envs/torchnightly/lib/python3.11/site-packages/triton/compiler/compiler.py", line 589, in _init_handles
[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING]     mod, func, n_regs, n_spills = fn_load_binary(self.metadata["name"], self.asm[bin_path], self.shared, device)
[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING] RuntimeError: Triton Error [CUDA]: device kernel image is invalid

This is observed for the following environment:

PyTorch version: 2.1.0.dev20230727
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.31

Python version: 3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.0-8-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.2.152
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB

Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7452 32-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2480.056
CPU max MHz:                     3364.3550
CPU min MHz:                     1500.0000
BogoMIPS:                        4700.08
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-31,64-95
NUMA node1 CPU(s):               32-63,96-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.1.0.dev20230727
[pip3] torchaudio==2.1.0.dev20230727
[pip3] torchvision==0.16.0.dev20230727
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl  
[conda] brotlipy                  0.7.0           py311h9bf148f_1002    pytorch-nightly
[conda] cffi                      1.15.1          py311h9bf148f_3    pytorch-nightly
[conda] cryptography              38.0.4          py311h46ebde7_0    pytorch-nightly
[conda] cudatoolkit               11.8.0              h37601d7_11    conda-forge
[conda] filelock                  3.9.0                   py311_0    pytorch-nightly
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0           py311h9bf148f_0    pytorch-nightly
[conda] mkl_fft                   1.3.1           py311hc796f24_0    pytorch-nightly
[conda] mkl_random                1.2.2           py311hbba84a0_0    pytorch-nightly
[conda] mpmath                    1.2.1                   py311_0    pytorch-nightly
[conda] numpy                     1.24.3          py311hc206e33_0  
[conda] numpy-base                1.24.3          py311hfd5febd_0  
[conda] pillow                    9.3.0           py311h3fd9d12_2    pytorch-nightly
[conda] pysocks                   1.7.1                   py311_0    pytorch-nightly
[conda] pytorch                   2.1.0.dev20230727 py3.11_cuda11.8_cudnn8.7.0_0    pytorch-nightly
[conda] pytorch-cuda              11.8                 h7e8668a_5    pytorch-nightly
[conda] pytorch-mutex             1.0                        cuda    pytorch-nightly
[conda] requests                  2.28.1                  py311_0    pytorch-nightly
[conda] torchaudio                2.1.0.dev20230727     py311_cu118    pytorch-nightly
[conda] torchtriton               2.1.0+9e3e10c5ed           py311    pytorch-nightly
[conda] torchvision               0.16.0.dev20230727     py311_cu118    pytorch-nightly
[conda] urllib3                   1.26.14                 py311_0    pytorch-nightly

The script that reproduces this for my build is rather simple:

import torch
import torch._dynamo
import logging

from torch._dynamo import config

config.verbose = True
# config.log_level = logging.INFO

class Model(torch.nn.Module):

    def __init__(self) -> None:
        super().__init__()

        self.norm = torch.nn.InstanceNorm3d(num_features=1)
        self.conv = torch.nn.Conv3d(in_channels=1, out_channels=3, kernel_size=3)
        self.activation = torch.nn.ReLU(inplace=True)
        self.final = torch.nn.Conv3d(in_channels=3, out_channels=3, kernel_size=3)

    def forward(self, x):
        y = self.norm(x)
        y = self.conv(y)
        y = self.activation(y)
        y = self.final(y)
        return y


def main():
    model = Model()
    model = model.float()
    device = torch.device('cuda:0')
    model = model.to(device)
    opt_model = torch.compile(model, backend='inductor')

    input_shape = (1, 64, 64, 64)
    x = torch.randn(input_shape, dtype=torch.float32, device=device)

    result = opt_model(x)
    print(result.shape)


if __name__ == '__main__':
    main()

@albanD changed the title from "Pytorch nightly and Triton cuda" to "Pytorch nightly and openAI/triton cuda" on Jul 31, 2023
@ptrblck (Collaborator) commented Jul 31, 2023

Not reproducible on V100s.
Install command:

pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu118

Output:

python tmp.py
torch.Size([3, 60, 60, 60])

Env:

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==2.1.0+e6216047b8
[pip3] torch==2.1.0.dev20230731+cu118
[pip3] torch-tensorrt==2.0.0.dev0
[pip3] torchvision==0.16.0.dev20230731+cu118
[pip3] triton==2.1.0
[conda] Could not collect

Will try on A100 next.

@malfet (Contributor) commented Jul 31, 2023

@ptrblck from this comment, it looks like the problem is an old CUDA driver (460.91.03), which is probably incompatible with Triton?
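
For anyone trying to rule this out locally, a minimal sketch (assuming nvidia-smi is on PATH; values in the comments are illustrative) that compares the installed driver with the CUDA version PyTorch was built against:

import subprocess

import torch

# Driver version as reported by nvidia-smi (assumes nvidia-smi is on PATH).
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()

print("NVIDIA driver:", driver)                            # e.g. 460.91.03
print("CUDA used to build PyTorch:", torch.version.cuda)   # e.g. 11.8
# An R460 driver can load CUDA 11.x kernels, but cubins produced by a
# CUDA 12.x ptxas (like the one bundled with Triton here) need a newer
# driver (roughly R525 or later).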

@ptrblck (Collaborator) commented Jul 31, 2023

it looks like the problem is an old CUDA driver (460.91.03), which is probably incompatible with Triton?

Yes, good catch! It would be incompatible with the CUDA 12.x stack, but given we are installing the CUDA 11.8 nightly PyTorch wheels, I had assumed Triton uses CUDA 11.x, too. Do you know if pytorch-triton is using the CUDA 12 stack?

@ptrblck (Collaborator) commented Jul 31, 2023

Also no repro on A100.
@malfet you might be right. This is the ptxas shipped in triton:

/workspace/src# /usr/local/lib/python3.10/dist-packages/triton/third_party/cuda/bin/ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:13:45_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

if I install the CUDA 11.8 PyTorch nightlies.
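
To check which ptxas your own install ships without hunting for the path by hand, a small sketch (the third_party/cuda/bin layout is assumed from the Triton 2.1 wheel paths shown in this thread):

import subprocess
from pathlib import Path

import triton

# Path layout assumed from the Triton 2.1 wheels discussed above.
bundled_ptxas = Path(triton.__file__).parent / "third_party" / "cuda" / "bin" / "ptxas"

if bundled_ptxas.exists():
    out = subprocess.run([str(bundled_ptxas), "--version"], capture_output=True, text=True)
    print(out.stdout)  # e.g. "Cuda compilation tools, release 12.1, V12.1.105"
else:
    print("No bundled ptxas found at", bundled_ptxas)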

@malfet (Contributor) commented Jul 31, 2023

Yes, good catch! It would be incompatible with the CUDA 12.x stack, but given we are installing the CUDA 11.8 nightly PyTorch wheels, I had assumed Triton uses CUDA 11.x, too. Do you know if pytorch-triton is using the CUDA 12 stack?

Yes, Triton always uses CUDA 12.

@bhack (Contributor, Author) commented Jul 31, 2023

As we discussed in the mentioned triton-lang/triton#1955, it seems to be hardcoded, right?

IMHO the main problem is that the CI is currently not covering this case with regular tests.


@malfet (Contributor) commented Jul 31, 2023

@bhack yes, it is. But one can (in theory) ask Triton to use a different ptxas via the TRITON_PTXAS_PATH environment variable, see https://github.com/openai/triton/blob/89b0b79d7578a6e17b2cb7ad7b451c158038bd0b/python/triton/common/backend.py#L107

It's somewhat hard to test something like that in CI, as runners are provisioned with the latest kernel driver in order to be usable with both CUDA-12 and CUDA-11.8. Also, the older driver is less stable, so we ran into multiple hangs/segfaults that were mitigated by installing a newer driver.
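
A minimal sketch of that workaround (untested; the /usr/local/cuda-11.8 path is an assumption, and it only makes sense when the driver is too old for the output of the bundled CUDA 12.1 ptxas):

import os

# Point Triton at the ptxas from a local CUDA 11.8 toolkit instead of the
# CUDA 12.1 ptxas bundled with the wheel. The path is an assumption; adjust
# it to wherever your toolkit is installed.
os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda-11.8/bin/ptxas"

import torch  # import after setting the variable so Inductor/Triton pick it up

model = torch.nn.Linear(8, 8).cuda()
opt_model = torch.compile(model, backend="inductor")
print(opt_model(torch.randn(4, 8, device="cuda")).shape)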

@bhack (Contributor, Author) commented Jul 31, 2023

But one can (in theory) ask Triton to use a different ptxas via the TRITON_PTXAS_PATH environment variable, see

Ok, but it seems that the CI here is not testing this configuration, right? Also, is the Triton CI still testing 11.x on the commit hash we have pinned?

@bhack (Contributor, Author) commented Aug 28, 2023

As we are approaching the release with #108055, can we re-label this one?

@bhack (Contributor, Author) commented Dec 4, 2023

Are we sure that we can deliver reliable 11.x wheels?
See also #115075

@atalman (Contributor) commented Dec 4, 2023

Tried reproducing this on A100 with

torch                     2.1.2+cu118              pypi_0    pypi
torchaudio                2.1.2+cu118              pypi_0    pypi
torchvision               0.16.2+cu118             pypi_0    pypi

using code from comment: #106144 (comment)
I am seeing valid output:

torch.Size([3, 60, 60, 60])

@bhack (Contributor, Author) commented Dec 4, 2023

Can you recheck #106144 (comment)?

@bhack (Contributor, Author) commented Dec 5, 2023

@atalman You need to rerun with nightly:

/opt/conda/pkgs/torchtriton-2.1.0-py310/lib/python3.10/site-packages/triton/third_party/cuda/bin/ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:13:45_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
/usr/local/cuda-11.8/bin/ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:31:59_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@malfet What do you think about the current PyTorch nightly (but also the next stable 2.1.2) CUDA 11.x wheel status?

@malfet added the module: binaries and module: cuda labels and removed the needs reproduction label on Dec 6, 2023
@danpetry commented Dec 8, 2023

We're working on a triton 2.1.0 conda package for defaults, FYI, which will have MLIR and cudatoolkit unvendored. We'll try to use cudatoolkit 11.8 for this, although we haven't checked yet whether triton uses any API entry points that were introduced with 12.x.

@malfet (Contributor) commented Dec 8, 2023

@danpetry for 2.1.0 it does not, but for 2.2.0 we need to cherry-pick the change to make the CUDA-12-specific API call optional.
Please note that, currently, for https://anaconda.org/pytorch/torchtriton we generate meta.yaml using the following script:

if build_conda:
    with open(triton_basedir / "meta.yaml", "w") as meta:
        print(
            f"package:\n name: torchtriton\n version: {version}\n",
            file=meta,
        )
        # ... (snippet truncated)

@danpetry commented Dec 8, 2023

Ok, thanks. It looks like a pure Python package recipe is generated; does your recipe compile triton from source?

@malfet (Contributor) commented Dec 8, 2023

No, it's not a pure Python package.

@bhack (Contributor, Author) commented Feb 8, 2024

Can we cherry-pick or update the commit sha for triton-lang/triton#3053?

Compiling on PyTorch nightly is broken.
