Libtorch CUDA 12.8 Test with --host-linker-script=use-lcs #146084

tinglvv · 2025-01-30T21:42:21Z

#145570

Adding libtorch build to nightlies

Follow up for #145792

Testing @Skylion007's suggestion in #145792 (comment)

cc @atalman @malfet @ptrblck @nWEIdia

pytorch-bot · 2025-01-30T21:42:24Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146084

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 1515b5f with merge base bfaf76b:

NEW FAILURES - The following jobs have failed:

linux-binary-libtorch-cxx11-abi / libtorch-cuda12_8-shared-with-deps-cxx11-abi-build / build (gh)
ninja: build stopped: subcommand failed
linux-binary-libtorch-pre-cxx11 / libtorch-cuda12_8-shared-with-deps-pre-cxx11-build / build (gh)
undefined reference to `log2f@GLIBC_2.27'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Skylion007 · 2025-02-01T16:10:49Z

@eqy @nWEIdia Do you know of any way to fix these CUDA errors with large shared libraries, or a proper fix to replace this workaround? Alternatively, do you know anyone at NVIDIA who might know?

Skylion007 · 2025-02-01T16:17:08Z

Alternatively, we may need to use gen-lcs if PyTorch uses its own linker scripts outside of NVCC:

gen-lcs

    Generates a host linker script that can be passed to host linker manually, in the case where host linker is invoked separately outside of nvcc. This option can be combined with -shared or -r option to generate linker scripts that can be used while generating host shared libraries or host relocatable links respectively.

    The file generated using this option must be provided as the last input file to the host linker.

    The output is generated to stdout by default. Use the option -o filename to specify the output filename.

    A linker script may already be in use and passed to the host linker using the host linker option --script (or -T); in that case the generated host linker script must augment the existing linker script. In such cases, the option -aug-hls must be used to generate a linker script that contains only the augmentation parts. Otherwise, the host linker behaviour is undefined.

    A host linker option, such as -z with a non-default argument, that can modify the default linker script internally, is incompatible with this option and the behavior of any such usage is undefined.
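As a rough illustration of the two-step flow the quoted documentation describes, here is a dry-run sketch that only prints the commands, so it does not need nvcc installed. The file names (`cuda_host.lds`, `a.o`, `b.o`, `libexample.so`) and the use of `g++` as the host link driver are assumptions made for illustration. The nvcc flags are the ones quoted above, but the exact invocation has not been verified against a real toolchain.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands rather than executing them.
set -euo pipefail

# Hypothetical names, chosen only for this example.
LCS_FILE="cuda_host.lds"
OBJECTS=(a.o b.o)
OUTPUT_LIB="libexample.so"

# Step 1: have nvcc emit a host linker script suitable for a shared-library
# link. -o redirects the script from stdout to a file.
gen_lcs_cmd() {
    echo "nvcc --host-linker-script=gen-lcs -shared -o ${LCS_FILE}"
}

# Step 2: link with the host toolchain. Per the quoted documentation, the
# generated script must be the LAST input file given to the host linker.
link_cmd() {
    echo "g++ -shared -o ${OUTPUT_LIB} ${OBJECTS[*]} ${LCS_FILE}"
}

gen_lcs_cmd
link_cmd
```

If the build already passes its own script with `--script` or `-T`, the quoted text says step 1 would additionally need `-aug-hls` so that only the augmentation parts are generated.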

Hmm...

pytorch/tools/setup_helpers/generate_linker_script.py

Line 5 in 0f768c7

def gen_linker_script(

.ci/manywheel/build_cuda.sh

tinglvv · 2025-02-01T16:38:28Z

Hi @Skylion007, the build failure was due to the use of the manylinux container in the pre-cxx11 build. Andrey just removed it yesterday in #146200.

Let's see the result of the cxx11 build for libtorch.

Skylion007 · 2025-02-01T16:43:47Z

Oh we probably need to rebase too then, right?

Skylion007 · 2025-02-01T18:00:28Z

.ci/manywheel/build_cuda.sh

@@ -52,6 +52,10 @@ cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')

 TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
 case ${CUDA_VERSION} in
+    12.8)
+        TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0;10.0;12.0+PTX" #Ripping out 5.0 and 6.0 due to ld error


Suggested change

TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0;10.0;12.0+PTX" #Ripping out 5.0 and 6.0 due to ld error

TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0;10.0;12.0+PTX"
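To see why the trailing comment is misleading, here is a minimal standalone sketch of the `case` logic from the hunk above. The function name `arch_list_for` is hypothetical, and only the 12.8 branch is taken from the diff; other versions simply fall through to the base list in this sketch.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper mirroring the case statement in the diff.
arch_list_for() {
    local cuda_version="$1"
    # Base list as it appears in the hunk context.
    local list="5.0;6.0;7.0;7.5;8.0;8.6"
    case "${cuda_version}" in
        12.8)
            # Appends the newer architectures; nothing is removed here.
            list="${list};9.0;10.0;12.0+PTX"
            ;;
    esac
    echo "${list}"
}

arch_list_for 12.8
```

For 12.8 the result still starts with `5.0;6.0`, so the line does not rip anything out, which is presumably why the suggestion drops the comment.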

Skylion007 · 2025-02-01T18:53:25Z

.ci/manywheel/build_cuda.sh

    export USE_STATIC_CUDNN=0
    # Try parallelizing nvcc as well
-    export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
+    export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2 --host-linker-script=use-lcs"


We should probably modify it to just append --threads 2 to TORCH_NVCC_FLAGS if possible :)
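One possible shape for that, as a sketch only: a small helper that appends a flag to whatever `TORCH_NVCC_FLAGS` already holds instead of overwriting it. The helper name `append_nvcc_flag` is hypothetical and not part of `build_cuda.sh`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: append a flag to TORCH_NVCC_FLAGS unless it is
# already present, preserving anything the caller set beforehand.
append_nvcc_flag() {
    local flag="$1"
    local current="${TORCH_NVCC_FLAGS:-}"
    case " ${current} " in
        *" ${flag} "*) ;;  # already present, nothing to do
        *) current="${current:+${current} }${flag}" ;;
    esac
    export TORCH_NVCC_FLAGS="${current}"
}

export TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
append_nvcc_flag "--threads 2"
append_nvcc_flag "--host-linker-script=use-lcs"
append_nvcc_flag "--threads 2"  # second call is a no-op
echo "${TORCH_NVCC_FLAGS}"
```

The presence check is a plain substring match on space-delimited flags, so it would not notice an equivalent flag spelled differently (for example a short alias).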

Skylion007 · 2025-02-01T18:56:14Z

@tinglvv Seems like linker scripts have fixed all the other wheels aside from the preCXX11 and CXX-ABI ones!

tinglvv · 2025-02-02T02:12:28Z

Rebasing was a bit hard because the file was removed. Starting a new PR.

tinglvv added 12 commits January 27, 2025 15:14

Build CUDA 12.8 nightlies

0968137

Add CUDA 12.8 in build_cuda.sh

8a7e8f1

retrigger test and small edit

9e75e7c

refactor with CUDA_ARCHES

a659716

fix lint 2

002cc88

skip libtorch builds for now

f097b1d

fix lint 3

84a4f6f

fix lint

1700587

restore the modify build part

095d0bc

Fix lint and remove sm50 and sm60

ac0d6b2

Generate manywheel only

0c296a1

Test libtorch with use-lcs flag

2f6da1a

pytorch-bot bot added the release notes: releng label Jan 30, 2025

pytorchbot added the open source label Jan 31, 2025

tinglvv marked this pull request as ready for review January 31, 2025 17:40

tinglvv requested a review from a team as a code owner January 31, 2025 17:40

tinglvv mentioned this pull request Jan 31, 2025

Add CUDA 12.8 manywheel x86 Builds to Binaries Matrix #145792

Closed

colesbury added the triaged label Jan 31, 2025

Skylion007 reviewed Feb 1, 2025


Test if adding it here, fixes it

1515b5f

Skylion007 reviewed Feb 1, 2025


tinglvv closed this Feb 2, 2025
