[CUDA][CUBLAS] Explicitly link against `cuBLASLt` by eqy · Pull Request #95094 · pytorch/pytorch · GitHub

[CUDA][CUBLAS] Explicitly link against cuBLASLt #95094

Closed. eqy wants to merge 1 commit.

Conversation

eqy (Collaborator) commented Feb 17, 2023

An issue surfaced recently that revealed we were never explicitly linking against `cuBLASLt`. This PR fixes that by linking explicitly rather than depending on linker magic.

CC @ptrblck @ngimel

pytorch-bot (bot) commented Feb 17, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/95094

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 5f4ad94:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

eqy added the "topic: not user facing", "open source", and "ciflow/trunk" labels Feb 17, 2023
ptrblck (Collaborator) commented Feb 17, 2023

CC @atalman @malfet as we've just discussed it.

A bit more information:
Before this PR, the cublasLt symbols are known, but because we do not explicitly link against cublasLt, the linker tries to resolve them at runtime.

# before
$ nm -gA /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so | grep "cublasCreate\|cublasLtMatmul\>"
...
/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so:                 U cublasLtMatmul

# after: the symbols should point to the right library
$ nm -gA /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so | grep "cublasCreate\|cublasLtMatmul\>"
/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so:                 U cublasCreate_v2@@libcublas.so.11
/usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cuda.so:                 U cublasLtMatmul@@libcublasLt.so.11

This might also be related to #91067, although we have not verified that yet.
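The before/after check above can also be scripted. The following is a minimal sketch, not part of the PR: it parses `nm -gA` output lines of the shape shown in the comment and reports undefined symbols that carry no `@@` version tag. The helper names (`NmSymbol`, `parse_nm_line`, `unversioned_undefined`) are hypothetical, and the text after `@@` is treated as an opaque tag, following the reading given in the comment.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class NmSymbol(NamedTuple):
    library: str            # file that nm was run on
    kind: str               # nm symbol type letter, e.g. "U" for undefined
    name: str               # symbol name without any version suffix
    version: Optional[str]  # text after "@" / "@@", or None if absent


# Matches lines such as
#   /path/libtorch_cuda.so:                 U cublasLtMatmul@@libcublasLt.so.11
# An optional hexadecimal address may precede the type letter.
_NM_LINE = re.compile(
    r"^(?P<lib>.+?):\s*(?:[0-9a-fA-F]+\s+)?(?P<kind>[A-Za-z])\s+(?P<sym>\S+)$"
)


def parse_nm_line(line: str) -> Optional[NmSymbol]:
    """Parse one line of `nm -gA` output; return None if it does not match."""
    match = _NM_LINE.match(line.strip())
    if match is None:
        return None
    name, sep, version = match.group("sym").partition("@")
    tag = version.lstrip("@") if sep else ""
    return NmSymbol(match.group("lib"), match.group("kind"), name, tag or None)


def unversioned_undefined(lines: Iterable[str], prefix: str) -> List[str]:
    """Names of undefined symbols starting with `prefix` that have no version tag."""
    found = []
    for line in lines:
        symbol = parse_nm_line(line)
        if (
            symbol is not None
            and symbol.kind == "U"
            and symbol.name.startswith(prefix)
            and symbol.version is None
        ):
            found.append(symbol.name)
    return found
```

Feeding it the "before" line from the comment reports `cublasLtMatmul` as untagged, while the two "after" lines produce an empty list. In practice the input would come from running `nm -gA` on `libtorch_cuda.so` and splitting the output into lines.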

malfet (Contributor) left a comment

Sure, we should do it

ngimel (Collaborator) commented Feb 17, 2023

cc @peterbell10

eqy (Collaborator, Author) commented Feb 18, 2023

@pytorchmergebot rebase

pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot (Collaborator) commented

Successfully rebased eqy-patch-11 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout eqy-patch-11 && git pull --rebase)

ptrblck (Collaborator) commented Feb 18, 2023

There are a lot of failing tests (30+), but all of them seem to fail with either:

Download action repository 'actions/upload-artifact@v3' (SHA:0b7f8abb1508181956e8e162db84b466c27e18ce)
Warning: Failed to download action 'https://api.github.com/repos/actions/upload-artifact/tarball/0b7f8abb1508181956e8e162db84b466c27e18ce'. Error: Response status code does not indicate success: 401 (Unauthorized).
Warning: Back off 25.499 seconds before retry.
Warning: Failed to download action 'https://api.github.com/repos/actions/upload-artifact/tarball/0b7f8abb1508181956e8e162db84b466c27e18ce'. Error: Response status code does not indicate success: 401 (Unauthorized).
Warning: Back off 24.156 seconds before retry.
Error: Response status code does not indicate success: 401 (Unauthorized).

or

 Prepare all required actions
Getting action download info
Failed to resolve action download info. Error: Internal Server Error

Retrying in 25.391 seconds
Failed to resolve action download info. Error: Internal Server Error

Retrying in 10.731 seconds
Error: Failed to resolve action download info.

@eqy could you try another rebase to re-trigger the CI?

eqy (Collaborator, Author) commented Feb 19, 2023

@pytorchmergebot rebase

pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot (Collaborator) commented

Successfully rebased eqy-patch-11 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout eqy-patch-11 && git pull --rebase)

peterbell10 added the "ciflow/binaries" label Feb 20, 2023
eqy (Collaborator, Author) commented Feb 21, 2023

@pytorchmergebot rebase

pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot (Collaborator) commented

Successfully rebased eqy-patch-11 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout eqy-patch-11 && git pull --rebase)

ptrblck (Collaborator) commented Feb 22, 2023

Some Windows builds fail with:

C:/actions-runner/_work/pytorch/pytorch/builder/windows/pytorch/aten/src/ATen/native/cpu/GridSamplerKernel.cpp(514): error C2672: 'convert_to_int_of_same_size': no matching overloaded function found

which seems to be related to #95170.

atalman (Contributor) commented Feb 22, 2023

@pytorchmergebot rebase

pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot (Collaborator) commented

Successfully rebased eqy-patch-11 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout eqy-patch-11 && git pull --rebase)

atalman (Contributor) commented Feb 24, 2023

@pytorchmergebot rebase

pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot (Collaborator) commented

Successfully rebased eqy-patch-11 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout eqy-patch-11 && git pull --rebase)

atalman (Contributor) commented Feb 24, 2023

@pytorchmergebot rebase

pytorchmergebot (Collaborator) commented

@pytorchbot successfully started a rebase job. Check the current status here

pytorchmergebot (Collaborator) commented

Successfully rebased eqy-patch-11 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout eqy-patch-11 && git pull --rebase)

atalman (Contributor) commented Feb 24, 2023

@pytorchmergebot -f "all required tests are green"

pytorch-bot (bot) commented Feb 24, 2023

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: argument command: invalid choice: 'all required tests are green' (choose from 'merge', 'revert', 'rebase', 'label', 'drci')

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci} ...

Try @pytorchbot --help for more info.

atalman (Contributor) commented Feb 24, 2023

@pytorchmergebot merge -f "all required tests are green"
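The failed command and the corrected one differ only in the `merge` subcommand. The error text above has the shape of argparse-style output, so the behavior can be reproduced with Python's `argparse`. This is a sketch under assumptions, not the bot's actual implementation: only the subcommand names come from the usage line in the error message, while `BotArgumentParser`, `build_parser`, `try_parse`, and the way `-f` takes its message are hypothetical.

```python
import argparse


class BotArgumentParser(argparse.ArgumentParser):
    """ArgumentParser that raises instead of printing usage and exiting."""

    def error(self, message):
        raise ValueError(message)


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the bot's command parser. Only the subcommand
    # names are taken from the usage line quoted in the error message.
    parser = BotArgumentParser(prog="@pytorchbot")
    commands = parser.add_subparsers(dest="command", required=True)

    merge = commands.add_parser("merge")
    # Assumption: -f takes the justification message as its value.
    merge.add_argument("-f", "--force", metavar="MESSAGE")

    for name in ("revert", "rebase", "label", "drci"):
        commands.add_parser(name)
    return parser


def try_parse(argv):
    """Return the parsed namespace, or the error text if parsing fails."""
    try:
        return build_parser().parse_args(argv)
    except ValueError as exc:
        return str(exc)
```

With this sketch, `try_parse(["-f", "all required tests are green"])` returns an error string, because the quoted message lands in the position where a subcommand is expected and is rejected as an invalid choice. `try_parse(["merge", "-f", "all required tests are green"])` returns a namespace whose `command` is `merge` and whose `force` holds the message.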

pytorchmergebot (Collaborator) commented

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes).


cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Feb 25, 2023
An issue surfaced recently that revealed that we were never explicitly linking against `cuBLASLt`, this fixes it by linking explicitly rather than depending on linker magic.

CC @ptrblck @ngimel
Pull Request resolved: pytorch/pytorch#95094
Approved by: https://github.com/malfet, https://github.com/ngimel, https://github.com/atalman
atalman pushed a commit to atalman/pytorch that referenced this pull request Feb 27, 2023
atalman added a commit that referenced this pull request Feb 27, 2023
Co-authored-by: eqy <eddiey@nvidia.com>
cyyever pushed a commit to cyyever/pytorch_private that referenced this pull request Mar 5, 2023
pruthvistony added a commit to ROCm/pytorch that referenced this pull request May 2, 2023
pruthvistony pushed a commit to ROCm/pytorch that referenced this pull request May 3, 2023
Labels: ciflow/binaries, ciflow/trunk, Merged, open source, topic: not user facing
7 participants