Pytorch nightly and openAI/triton cuda · Issue #106144 · pytorch/pytorch · GitHub

Pytorch nightly and openAI/triton cuda #106144


Closed

bhack opened this issue Jul 27, 2023 · 21 comments
Labels
module: binaries - Anything related to official binaries that we release to users
module: cuda - Related to torch.cuda, and CUDA support in general
oncall: pt2
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@bhack (Contributor) commented Jul 27, 2023

🐛 Describe the bug

If I am not wrong, the nightly PyTorch (CUDA 11.8) wheels are not compatible with the pinned Triton commit, as I am seeing something like triton-lang/triton#1955

See more:
https://discuss.pytorch.org/t/any-change-of-using-cuda-12-2/184461/6

If that is true, why is the CI not failing with tests on the PyTorch nightly CUDA 11.8 build?

Versions

nightly

cc @seemethere @malfet @osalpekar @atalman @ptrblck @ezyang @msaroufim @wconstab @bdhirsh @anijain2305 @zou3519 @gchanan

@malfet (Contributor) commented Jul 28, 2023

Can you please run python3 -mtorch.utils.collect_env and post your results here?

@malfet added the needs reproduction, triaged, and oncall: pt2 labels on Jul 28, 2023
@bhack (Contributor, Author) commented Jul 29, 2023

I need to check if I can get exactly that env again.
What I had in the log is:

torch._dynamo.convert_frame: [WARNING] mod, func, n_regs, n_spills = fn_load_binary(self.metadata["name"], self.asm[bin_path], self.shared, device)
torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
torch._dynamo.convert_frame: [WARNING] RuntimeError: Triton Error [CUDA]: device kernel image is invalid

The env was mainly:
docker pull pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
followed by the official command:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

@stebix commented Jul 29, 2023

Hi all,

I am seeing similar problems resulting in Triton Error [CUDA]: device kernel image is invalid. Attempting to compile a very simple model or function produces a lot of log warnings via a deep call stack; finally, PyTorch seems to fall back to pre-2.x eager mode without speedups.

My environment was built as a fresh conda environment with the pytorch-nightly installation commands as provided by the main pytorch docs, since I wanted to use torch.compile and python 3.11.

I get the following reason for the compilation failure:

[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING]   File "/home/user/anaconda3/envs/torchnightly/lib/python3.11/site-packages/triton/compiler/compiler.py", line 589, in _init_handles
[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING]     mod, func, n_regs, n_spills = fn_load_binary(self.metadata["name"], self.asm[bin_path], self.shared, device)
[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING] torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
[2023-07-29 04:04:26,800] torch._dynamo.convert_frame: [WARNING] RuntimeError: Triton Error [CUDA]: device kernel image is invalid

This is observed for the following environment:

PyTorch version: 2.1.0.dev20230727
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.18.4
Libc version: glibc-2.31

Python version: 3.11.4 (main, Jul  5 2023, 13:45:01) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.10.0-8-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.2.152
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB

Nvidia driver version: 460.91.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   43 bits physical, 48 bits virtual
CPU(s):                          128
On-line CPU(s) list:             0-127
Thread(s) per core:              2
Core(s) per socket:              32
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7452 32-Core Processor
Stepping:                        0
Frequency boost:                 enabled
CPU MHz:                         2480.056
CPU max MHz:                     3364.3550
CPU min MHz:                     1500.0000
BogoMIPS:                        4700.08
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-31,64-95
NUMA node1 CPU(s):               32-63,96-127
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd mba sev ibrs ibpb stibp vmmcall sev_es fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] torch==2.1.0.dev20230727
[pip3] torchaudio==2.1.0.dev20230727
[pip3] torchvision==0.16.0.dev20230727
[pip3] triton==2.1.0
[conda] blas                      1.0                         mkl  
[conda] brotlipy                  0.7.0           py311h9bf148f_1002    pytorch-nightly
[conda] cffi                      1.15.1          py311h9bf148f_3    pytorch-nightly
[conda] cryptography              38.0.4          py311h46ebde7_0    pytorch-nightly
[conda] cudatoolkit               11.8.0              h37601d7_11    conda-forge
[conda] filelock                  3.9.0                   py311_0    pytorch-nightly
[conda] mkl                       2021.4.0           h06a4308_640  
[conda] mkl-service               2.4.0           py311h9bf148f_0    pytorch-nightly
[conda] mkl_fft                   1.3.1           py311hc796f24_0    pytorch-nightly
[conda] mkl_random                1.2.2           py311hbba84a0_0    pytorch-nightly
[conda] mpmath                    1.2.1                   py311_0    pytorch-nightly
[conda] numpy                     1.24.3          py311hc206e33_0  
[conda] numpy-base                1.24.3          py311hfd5febd_0  
[conda] pillow                    9.3.0           py311h3fd9d12_2    pytorch-nightly
[conda] pysocks                   1.7.1                   py311_0    pytorch-nightly
[conda] pytorch                   2.1.0.dev20230727 py3.11_cuda11.8_cudnn8.7.0_0    pytorch-nightly
[conda] pytorch-cuda              11.8                 h7e8668a_5    pytorch-nightly
[conda] pytorch-mutex             1.0                        cuda    pytorch-nightly
[conda] requests                  2.28.1                  py311_0    pytorch-nightly
[conda] torchaudio                2.1.0.dev20230727     py311_cu118    pytorch-nightly
[conda] torchtriton               2.1.0+9e3e10c5ed           py311    pytorch-nightly
[conda] torchvision               0.16.0.dev20230727     py311_cu118    pytorch-nightly
[conda] urllib3                   1.26.14                 py311_0    pytorch-nightly

The script that reproduces this for my build is rather simple:

import torch
import torch._dynamo
import logging

from torch._dynamo import config

config.verbose = True
# config.log_level = logging.INFO

class Model(torch.nn.Module):

    def __init__(self) -> None:
        super().__init__()

        self.norm = torch.nn.InstanceNorm3d(num_features=1)
        self.conv = torch.nn.Conv3d(in_channels=1, out_channels=3, kernel_size=3)
        self.activation = torch.nn.ReLU(inplace=True)
        self.final = torch.nn.Conv3d(in_channels=3, out_channels=3, kernel_size=3)

    def forward(self, x):
        y = self.norm(x)
        y = self.conv(y)
        y = self.activation(y)
        y = self.final(y)
        return y


def main():
    model = Model()
    model = model.float()
    device = torch.device('cuda:0')
    model = model.to(device)
    opt_model = torch.compile(model, backend='inductor')

    input_shape = (1, 64, 64, 64)
    x = torch.randn(input_shape, dtype=torch.float32, device=device)

    result = opt_model(x)
    print(result.shape)


if __name__ == '__main__':
    main()

@albanD changed the title from "Pytorch nightly and Triton cuda" to "Pytorch nightly and openAI/triton cuda" on Jul 31, 2023
@ptrblck (Collaborator) commented Jul 31, 2023

Not reproducible on V100s.
Install command:

pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu118

Output:

python tmp.py
torch.Size([3, 60, 60, 60])

Env:

Versions of relevant libraries:
[pip3] numpy==1.22.2
[pip3] pytorch-quantization==2.1.2
[pip3] pytorch-triton==2.1.0+e6216047b8
[pip3] torch==2.1.0.dev20230731+cu118
[pip3] torch-tensorrt==2.0.0.dev0
[pip3] torchvision==0.16.0.dev20230731+cu118
[pip3] triton==2.1.0
[conda] Could not collect

Will try on A100 next.

@malfet (Contributor) commented Jul 31, 2023

@ptrblck from this comment, it looks like the problem is an old CUDA driver (460.91.03), which is probably incompatible with Triton?
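
For anyone trying to rule this out locally, a minimal sketch (assuming nvidia-smi is on PATH; values in the comments are illustrative) that compares the installed driver with the CUDA version PyTorch was built against:

import subprocess

import torch

# Driver version as reported by nvidia-smi (assumes nvidia-smi is on PATH).
driver = subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True,
).stdout.strip()

print("NVIDIA driver:", driver)                            # e.g. 460.91.03
print("CUDA used to build PyTorch:", torch.version.cuda)   # e.g. 11.8
# An R460 driver can load CUDA 11.x kernels, but cubins produced by a
# CUDA 12.x ptxas (like the one bundled with Triton here) need a newer
# driver (roughly R525 or later).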

@ptrblck (Collaborator) commented Jul 31, 2023

it looks like the problem is an old CUDA driver (460.91.03), which is probably incompatible with Triton?

Yes, good catch! It would be incompatible with the CUDA 12.x stack, but given we are installing the CUDA 11.8 nightly PyTorch wheels, I had assumed Triton uses CUDA 11.x, too. Do you know if pytorch-triton is using the CUDA 12 stack?

@ptrblck (Collaborator) commented Jul 31, 2023

Also no repro on A100.
@malfet you might be right. This is the ptxas shipped in triton:

/workspace/src# /usr/local/lib/python3.10/dist-packages/triton/third_party/cuda/bin/ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:13:45_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

if I install the CUDA 11.8 PyTorch nightlies.
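
To check which ptxas your own install ships without hunting for the path by hand, a small sketch (the third_party/cuda/bin layout is assumed from the Triton 2.1 wheel paths shown in this thread):

import subprocess
from pathlib import Path

import triton

# Path layout assumed from the Triton 2.1 wheels discussed above.
bundled_ptxas = Path(triton.__file__).parent / "third_party" / "cuda" / "bin" / "ptxas"

if bundled_ptxas.exists():
    out = subprocess.run([str(bundled_ptxas), "--version"], capture_output=True, text=True)
    print(out.stdout)  # e.g. "Cuda compilation tools, release 12.1, V12.1.105"
else:
    print("No bundled ptxas found at", bundled_ptxas)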

@malfet (Contributor) commented Jul 31, 2023

Yes, good catch! It would be incompatible with the CUDA 12.x stack, but given we are installing the CUDA 11.8 nightly PyTorch wheels, I had assumed Triton uses CUDA 11.x, too. Do you know if pytorch-triton is using the CUDA 12 stack?

Yes, Triton always uses CUDA 12.

@bhack (Contributor, Author) commented Jul 31, 2023

As we discussed in the mentioned triton-lang/triton#1955, it seems to be hardcoded, right?

IMHO the main problem is that the CI is currently not covering this case with regular tests.


@malfet (Contributor) commented Jul 31, 2023

@bhack yes, it is. But one can (in theory) ask Triton to use a different ptxas via the TRITON_PTXAS_PATH environment variable, see https://github.com/openai/triton/blob/89b0b79d7578a6e17b2cb7ad7b451c158038bd0b/python/triton/common/backend.py#L107

It's somewhat hard to test something like that in CI, as runners are provisioned with the latest kernel driver in order to be usable with both CUDA-12 and CUDA-11.8. Also, the older driver is less stable, so we ran into multiple hangs/segfaults that were mitigated by installing a newer driver.
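
A minimal sketch of that workaround (untested; the /usr/local/cuda-11.8 path is an assumption, and it only makes sense when the driver is too old for the output of the bundled CUDA 12.1 ptxas):

import os

# Point Triton at the ptxas from a local CUDA 11.8 toolkit instead of the
# CUDA 12.1 ptxas bundled with the wheel. The path is an assumption; adjust
# it to wherever your toolkit is installed.
os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda-11.8/bin/ptxas"

import torch  # import after setting the variable so Inductor/Triton pick it up

model = torch.nn.Linear(8, 8).cuda()
opt_model = torch.compile(model, backend="inductor")
print(opt_model(torch.randn(4, 8, device="cuda")).shape)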

@bhack (Contributor, Author) commented Jul 31, 2023

But one can (in theory) ask Triton to use a different ptxas via the TRITON_PTXAS_PATH environment variable, see

Ok, but it seems that the CI here is not testing this configuration, right? Also, is the Triton CI still testing 11.x on the commit hash we have pinned?

@bhack (Contributor, Author) commented Aug 28, 2023

As we are approaching the release with #108055, can we re-label this one?

@bhack (Contributor, Author) commented Dec 4, 2023

Are we sure that we can deliver reliable 11.x wheels?
See also #115075

@atalman (Contributor) commented Dec 4, 2023

Tried reproducing this on A100 with

torch                     2.1.2+cu118              pypi_0    pypi
torchaudio                2.1.2+cu118              pypi_0    pypi
torchvision               0.16.2+cu118             pypi_0    pypi

using code from comment: #106144 (comment)
I am seeing valid output:

torch.Size([3, 60, 60, 60])

@bhack (Contributor, Author) commented Dec 4, 2023

Can you recheck #106144 (comment)?

@bhack (Contributor, Author) commented Dec 5, 2023

@atalman You need to rerun with nightly:

/opt/conda/pkgs/torchtriton-2.1.0-py310/lib/python3.10/site-packages/triton/third_party/cuda/bin/ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:13:45_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
/usr/local/cuda-11.8/bin/ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:31:59_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

@malfet What do you think about the current PyTorch nightly (but also the next stable 2.1.2) CUDA 11.x wheel status?

@malfet added the module: binaries and module: cuda labels and removed the needs reproduction label on Dec 6, 2023
@danpetry commented Dec 8, 2023

We're working on a triton 2.1.0 conda package for defaults, FYI, which will have MLIR and cudatoolkit unvendored. We'll try to use cudatoolkit 11.8 for this, although we haven't checked yet whether triton uses any API entry points that were introduced with 12.x.

@malfet (Contributor) commented Dec 8, 2023

@danpetry for 2.1.0 it does not, but for 2.2.0 we need to cherry-pick the change to make the CUDA-12-specific API call optional.
Please note that, currently, for https://anaconda.org/pytorch/torchtriton we generate meta.yaml using the following script:

if build_conda:
    with open(triton_basedir / "meta.yaml", "w") as meta:
        print(
            f"package:\n name: torchtriton\n version: {version}\n",
            file=meta,
        )
        # ... (snippet truncated)

@danpetry commented Dec 8, 2023

Ok, thanks. It looks like a pure Python package recipe is generated; does your recipe compile triton from source?

@malfet (Contributor) commented Dec 8, 2023

No, it's not a pure Python package.

@bhack (Contributor, Author) commented Feb 8, 2024

Can we cherry-pick or update the commit sha for triton-lang/triton#3053?

Compiling on PyTorch nightly is broken.
