Libtorch CUDA 12.8 Test with --host-linker-script=use-lcs by tinglvv · Pull Request #146084 · pytorch/pytorch · GitHub

Libtorch CUDA 12.8 Test with --host-linker-script=use-lcs #146084


Closed
tinglvv wants to merge 13 commits

tinglvv
Collaborator
@tinglvv tinglvv commented Jan 30, 2025

#145570

Adding libtorch build to nightlies

Follow up for #145792

Testing @Skylion007 's suggestion in #145792 (comment)

cc @atalman @malfet @ptrblck @nWEIdia

@pytorch-bot pytorch-bot bot added the release notes: releng release notes category label Jan 30, 2025
pytorch-bot bot commented Jan 30, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/146084

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit 1515b5f with merge base bfaf76b:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@tinglvv tinglvv added topic: not user facing topic category ciflow/nightly Trigger all jobs we run nightly (nightly.yml) ciflow/binaries_libtorch Trigger binary build and upload jobs for libtorch on the PR ciflow/binaries Trigger all binary build and upload jobs on the PR and removed release notes: releng release notes category ciflow/nightly Trigger all jobs we run nightly (nightly.yml) ciflow/binaries_libtorch Trigger binary build and upload jobs for libtorch on the PR labels Jan 30, 2025
@tinglvv tinglvv marked this pull request as ready for review January 31, 2025 17:40
@tinglvv tinglvv requested a review from a team as a code owner January 31, 2025 17:40
@colesbury colesbury added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Jan 31, 2025
@Skylion007
Collaborator

@eqy @nWEIdia Do you know of any way to fix these CUDA errors with large shared libs, or to properly fix this workaround? Alternatively, do you know anyone at NVIDIA who might know?

@Skylion007
Collaborator
Skylion007 commented Feb 1, 2025

Alternatively, we may need to use gen-lcs if PyTorch uses its own linker scripts outside of NVCC:

gen-lcs

    Generates a host linker script that can be passed to host linker manually, in the case where host linker is invoked separately outside of nvcc. This option can be combined with -shared or -r option to generate linker scripts that can be used while generating host shared libraries or host relocatable links respectively.

    The file generated using this option must be provided as the last input file to the host linker.

    The output is generated to stdout by default. Use the option -o filename to specify the output filename.

A linker script may already be in use and passed to the host linker using the host linker option --script (or -T); in that case, the generated host linker script must augment the existing linker script. In such cases, the option -aug-hls must be used to generate a linker script that contains only the augmentation parts. Otherwise, the host linker behaviour is undefined.

A host linker option, such as -z with a non-default argument, that can modify the default linker script internally, is incompatible with this option and the behavior of any such usage is undefined.

Hmm...
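
The two-step flow described in the quoted documentation (generate a script with gen-lcs, then hand it to the host linker as the last input) might look roughly like the sketch below. It only prints the commands rather than running them, so it works without nvcc installed. The helper name `gen_lcs_link_plan`, the `.lcs` file name, and the use of `g++` as the host link driver are assumptions for illustration, and whether nvcc needs extra inputs in this mode is not confirmed here.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints a possible gen-lcs link sequence.
# Nothing here is taken from the PyTorch build scripts.
gen_lcs_link_plan() {
    local out_lib="$1"; shift
    # Assumed naming convention for the generated linker script.
    local script="${out_lib%.so}.lcs"

    # Step 1: ask nvcc to emit a host linker script for a shared library
    # (-shared and -o are the combinations mentioned in the quoted docs).
    echo "nvcc --host-linker-script=gen-lcs -shared -o ${script}"

    # Step 2: invoke the host linker driver separately, passing the
    # generated script as the LAST input file, as the docs require.
    echo "g++ -shared -o ${out_lib} $* ${script}"
}

gen_lcs_link_plan libexample.so a.o b.o
```

If the project already passes its own script with --script or -T, the quoted text says -aug-hls would also be needed in step 1 so that only the augmentation parts are generated.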

@tinglvv
Collaborator Author
tinglvv commented Feb 1, 2025

Hi @Skylion007, the build failure was due to the use of the manylinux container in the pre-cxx11 build. Andrey just removed it yesterday: #146200

Let's see the result of the cxx11 build for libtorch.

@Skylion007
Collaborator

Oh we probably need to rebase too then, right?

@@ -52,6 +52,10 @@ cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')

TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.8)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0;10.0;12.0+PTX" #Ripping out 5.0 and 6.0 due to ld error
Collaborator

Suggested change
-TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0;10.0;12.0+PTX" #Ripping out 5.0 and 6.0 due to ld error
+TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0;10.0;12.0+PTX"
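
To see why the trailing comment is dropped in the suggestion, it may help to evaluate what the 12.8 branch actually produces. The sketch below wraps the same logic in a hypothetical function, `arch_list_for`. The fallback branch is an assumption added only to make the example self-contained; the real script's other branches are not shown in this diff.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the case statement in the diff above.
arch_list_for() {
    local cuda_version="$1"
    # Base list taken from the diff context.
    local arch_list="5.0;6.0;7.0;7.5;8.0;8.6"
    case "${cuda_version}" in
        12.8)
            # Appends to the base list, so 5.0 and 6.0 are still present.
            arch_list="${arch_list};9.0;10.0;12.0+PTX"
            ;;
        *)
            # Assumption: other versions are out of scope for this sketch.
            echo "unhandled CUDA version: ${cuda_version}" >&2
            return 1
            ;;
    esac
    echo "${arch_list}"
}

arch_list_for 12.8
```

Because the new entries are appended to the base list, 5.0 and 6.0 remain in the result, so a comment saying they were ripped out would no longer describe the line.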

export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
-export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
+export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2 --host-linker-script=use-lcs"
Collaborator

We should probably modify it to just append --threads 2 to TORCH_NVCC_FLAGS if possible :)
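
One way to read this suggestion is to append to whatever TORCH_NVCC_FLAGS already holds instead of overwriting it. A minimal sketch follows, assuming the variable may have been set earlier by the caller; the helper name `append_nvcc_flags` is hypothetical and not part of the PyTorch build scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append flags to TORCH_NVCC_FLAGS without clobbering
# anything that is already there.
append_nvcc_flags() {
    # ${VAR:+...} expands to the existing value plus a separating space
    # only when the variable is set and non-empty.
    export TORCH_NVCC_FLAGS="${TORCH_NVCC_FLAGS:+${TORCH_NVCC_FLAGS} }$*"
}

# Pretend an earlier step (or the caller) already set some flags.
TORCH_NVCC_FLAGS="-Xfatbin -compress-all"

append_nvcc_flags --threads 2
append_nvcc_flags --host-linker-script=use-lcs

echo "${TORCH_NVCC_FLAGS}"
```

With this shape, each concern (compression, nvcc parallelism, the linker script workaround) can be added or removed on its own line without retyping the full flag string.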

@Skylion007
Collaborator
Skylion007 commented Feb 1, 2025

@tinglvv Seems like linker scripts have fixed all the other wheels aside from the preCXX11 and CXX-ABI ones!

@tinglvv
Copy link
Collaborator Author
tinglvv commented Feb 2, 2025

Rebase was a bit hard because the file was removed. Starting a new PR.

@tinglvv tinglvv closed this Feb 2, 2025
Labels
ciflow/binaries Trigger all binary build and upload jobs on the PR open source topic: not user facing topic category triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module
4 participants